MolEval Metrics

HEDGEHOG includes a vendored copy of MolEval (from MolScore v1.9.5, MIT license) to compute intrinsic distribution quality metrics at multiple pipeline stages. These metrics evaluate the diversity and intrinsic quality of molecule sets without requiring a reference dataset.

Active Metrics

Nine metrics are enabled by default. Validity and Uniqueness are disabled because they are always 1.0 after RDKit parsing and deduplication. MCE-18 is computed separately and reported alongside the intrinsic MolEval metrics.

| Metric | What It Measures | Range | Interpretation |
| --- | --- | --- | --- |
| IntDiv1 | Internal diversity (1 - mean pairwise Tanimoto similarity) | 0–1 | Higher = more diverse set. Values near 1.0 mean molecules are structurally dissimilar. |
| IntDiv2 | Internal diversity with p = 2 (1 - root-mean-square pairwise Tanimoto similarity) | 0–1 | Weights highly similar pairs more heavily than IntDiv1, so it is more sensitive to clustering. |
| SEDiv | Sphere exclusion diversity: fraction of chemical space "covered" by the set | 0–1 | Higher = better coverage of Tanimoto space. Measures how well the set fills diversity spheres. |
| ScaffDiv | Scaffold diversity: ratio of unique Murcko scaffolds to total molecules | 0–1 | Higher = more structurally distinct core scaffolds. Low values indicate scaffold redundancy. |
| ScaffUniqueness | Fraction of scaffolds that appear only once | 0–1 | Higher = each scaffold is represented by a single molecule. Low = scaffold over-representation. |
| FG | Functional group diversity ratio | 0–1 | Higher = wider variety of functional groups across the set. |
| RS | Ring system diversity ratio | 0–1 | Higher = more varied ring systems. Low values suggest reliance on a few ring types. |
| Filters | Fraction passing medicinal chemistry filters (MCF + PAINS) | 0–1 | Higher = cleaner set. Molecules failing PAINS or MCF contain known promiscuous substructures. |
| MCE18 | Mean MCE-18 molecular complexity | 0–∞ | Higher = more complex structures (reported as an average over molecules). |
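
To make the relationship between IntDiv1 and IntDiv2 concrete, here is a minimal sketch on toy fingerprints stored as sets of on-bit indices. It assumes the common generalized-mean form 1 - (mean T^p)^(1/p) over distinct pairs. The helper names (`tanimoto`, `internal_diversity`) and the toy data are hypothetical, and the vendored MolEval code may aggregate pairs differently.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def internal_diversity(fps, p=1):
    """Assumed form: 1 - (mean of T^p over distinct pairs)^(1/p)."""
    pairs = list(combinations(fps, 2))
    if not pairs:
        return 0.0
    mean_p = sum(tanimoto(a, b) ** p for a, b in pairs) / len(pairs)
    return 1.0 - mean_p ** (1.0 / p)


# Toy fingerprints (illustration only): two close analogues and one outlier.
toy_fps = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({7, 8, 9})]
int_div1 = internal_diversity(toy_fps, p=1)
int_div2 = internal_diversity(toy_fps, p=2)
```

Because squaring gives the one similar pair more weight, `int_div2` comes out lower than `int_div1` on this toy set.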

Pipeline Checkpoints

Metrics are computed at five stages to track how chemical diversity evolves through the filtering pipeline:

| Checkpoint | Source File | Description |
| --- | --- | --- |
| Input | input/sampled_molecules.csv | Raw generated molecules before any filtering |
| Descriptors | stages/01_descriptors_initial/filtered/filtered_molecules.csv | After descriptor calculation and initial property filtering |
| StructFilters | stages/03_structural_filters_post/filtered_molecules.csv | After all structural filters (PAINS, NIBR, Lilly, medchem, etc.) |
| Synthesis | stages/04_synthesis/filtered_molecules.csv | After retrosynthesis feasibility filtering |
| DockingFilters | stages/06_docking_filters/filtered_molecules.csv | After docking score threshold filtering |
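
A run that stops early will only have some of these files. The sketch below lists which checkpoint files are present, assuming the source paths are relative to a run output directory. That assumption, the `CHECKPOINTS` mapping, and the `available_checkpoints` helper are illustrative and not part of HEDGEHOG's API.

```python
from pathlib import Path

# Checkpoint name -> source file, copied from the checkpoint table.
# Assumption: paths are relative to the run output directory.
CHECKPOINTS = {
    "Input": "input/sampled_molecules.csv",
    "Descriptors": "stages/01_descriptors_initial/filtered/filtered_molecules.csv",
    "StructFilters": "stages/03_structural_filters_post/filtered_molecules.csv",
    "Synthesis": "stages/04_synthesis/filtered_molecules.csv",
    "DockingFilters": "stages/06_docking_filters/filtered_molecules.csv",
}


def available_checkpoints(run_dir):
    """Return {checkpoint: path} for checkpoint files that exist under run_dir."""
    root = Path(run_dir)
    return {
        name: root / rel
        for name, rel in CHECKPOINTS.items()
        if (root / rel).is_file()
    }
```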

The expected trends are:

  • Diversity metrics (IntDiv, SEDiv, ScaffDiv) may decrease as aggressive filtering removes outliers, or may increase if redundant molecules are removed.
  • The Filters metric should increase across stages as problematic substructures are eliminated.
  • ScaffUniqueness typically increases as duplicate scaffolds are filtered out.
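
One way to check these expectations against a run is to look at stage-to-stage changes for a single metric. The sketch below does this for a plain dict of per-stage values. The function names and the numbers are made up for illustration and are not output from a real run.

```python
def stage_deltas(values_by_stage):
    """Return (from_stage, to_stage, change) for consecutive checkpoints.

    Relies on dict insertion order matching pipeline order.
    """
    stages = list(values_by_stage)
    return [
        (a, b, values_by_stage[b] - values_by_stage[a])
        for a, b in zip(stages, stages[1:])
    ]


def is_non_decreasing(values_by_stage, tol=1e-9):
    """True if the metric never drops by more than tol between checkpoints."""
    return all(change >= -tol for _, _, change in stage_deltas(values_by_stage))


# Made-up Filters values, for illustration only.
toy_filters = {"Input": 0.62, "Descriptors": 0.70, "StructFilters": 1.00}
filters_ok = is_non_decreasing(toy_filters)
```

The same check is not appropriate for the diversity metrics, since they may legitimately move in either direction.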

Configuration

MolEval metrics are configured via config_moleval.yml, referenced from the main pipeline config:

    # config_moleval.yml
    run: true
    n_jobs: -1
    device: cpu                 # cpu or cuda:0
    max_molecules: 2000         # subsample for O(N^2) metrics

    # Metric groups (toggle individual metric families)
    validity: false             # Always 1.0 after RDKit parsing
    uniqueness: false           # Always 1.0 after dedup
    internal_diversity: true    # IntDiv1, IntDiv2
    se_diversity: true          # Sphere exclusion diversity
    scaffold_diversity: true    # ScaffDiv, ScaffUniqueness
    functional_groups: true     # FG diversity ratio
    ring_systems: true          # RS diversity ratio
    filters: true               # MCF + PAINS passage rate
    mce18: true                 # Mean MCE-18 molecular complexity

Key Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| run | bool | true | Enable/disable all MolEval metrics |
| n_jobs | int | -1 | Parallel workers for metric computation (-1 = all available cores) |
| device | str | cpu | Compute device (cpu or cuda:0) |
| max_molecules | int | 2000 | Subsample threshold for O(N^2) pairwise metrics |
| mce18 | bool | true | Enable mean MCE-18 molecular complexity reporting |
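
For readers who want to handle these options in their own scripts, the following sketch mirrors the key options and their defaults in a small dataclass. `MolEvalOptions` and `resolved_workers` are hypothetical names, not part of HEDGEHOG, and the validation rule is an assumption about sensible values, not documented behavior.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MolEvalOptions:
    """Hypothetical mirror of the key options, using the documented defaults."""

    run: bool = True
    n_jobs: int = -1
    device: str = "cpu"
    max_molecules: int = 2000
    mce18: bool = True

    def __post_init__(self):
        # Assumption: a non-positive cap makes no sense for subsampling.
        if self.max_molecules <= 0:
            raise ValueError("max_molecules must be positive")

    def resolved_workers(self):
        """Translate n_jobs=-1 into the number of available cores."""
        if self.n_jobs == -1:
            return os.cpu_count() or 1
        return self.n_jobs


defaults = MolEvalOptions()
```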

Deterministic Computation

Large sets are capped by max_molecules before O(N^2) metrics are computed. The shipped config_moleval.yml does not expose a seed key; do not rely on a documented seed parameter for reproducible subsampling unless you add one in code and config.

For reproducibility, keep the input table, sampling step, config snapshot, and max_molecules value fixed.
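
If you do add seeded subsampling yourself, a minimal sketch could look like the following. The `subsample` function and its `seed` parameter are hypothetical: the shipped config defines no such key, and this is not how the vendored code is known to sample.

```python
import random


def subsample(smiles, max_molecules=2000, seed=0):
    """Cap a molecule list at max_molecules with a reproducible random draw.

    `seed` is a hypothetical parameter; the shipped config does not define one.
    """
    items = list(smiles)
    if len(items) <= max_molecules:
        return items
    rng = random.Random(seed)  # local RNG; global random state is untouched
    return rng.sample(items, max_molecules)
```

The result depends on input order as well as the seed, which is one reason to keep the input table fixed.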

Output

MolEval results appear in three places:

  1. HTML report — line chart and heatmap in dedicated sections.
  2. RUN_INFO.md — a Markdown table appended to the run info file with per-stage metric values.
  3. report_data.json — raw metric data under the moleval key.
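
For downstream analysis, the raw values can be read straight from the JSON file. The sketch below only assumes that report_data.json is a JSON object with a top-level moleval key. The layout beneath that key is not documented here, so the hypothetical helper returns it unchanged.

```python
import json
from pathlib import Path


def load_moleval_section(report_path):
    """Return the value stored under the "moleval" key, or None if it is absent."""
    with Path(report_path).open(encoding="utf-8") as handle:
        report = json.load(handle)
    return report.get("moleval")
```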