MolEval Metrics

HEDGEHOG includes a vendored copy of MolEval (from MolScore v1.9.5, MIT license) to compute intrinsic distribution quality metrics at multiple pipeline stages. These metrics evaluate the diversity and intrinsic quality of molecule sets without requiring a reference dataset.

Active Metrics

Nine metrics are enabled by default. Validity and Uniqueness are disabled because they are always 1.0 after RDKit parsing and deduplication. MCE-18 is computed separately and reported alongside the intrinsic MolEval metrics.

Metric	What It Measures	Range	Interpretation
IntDiv1	Internal diversity (1 - mean pairwise Tanimoto similarity)	0–1	Higher = more diverse set. Values near 1.0 mean molecules are structurally dissimilar.
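IntDiv2	Internal diversity with p = 2 (1 - square root of the mean squared pairwise Tanimoto similarity)	0–1	Higher = more diverse set. Squaring weights highly similar pairs more heavily, so IntDiv2 is more sensitive to clusters of near-duplicates than IntDiv1.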
SEDiv	Sphere exclusion diversity: fraction of chemical space “covered” by the set	0–1	Higher = better coverage of Tanimoto space. Measures how well the set fills diversity spheres.
ScaffDiv	Scaffold diversity: ratio of unique Murcko scaffolds to total molecules	0–1	Higher = more structurally distinct core scaffolds. Low values indicate scaffold redundancy.
ScaffUniqueness	Fraction of scaffolds that appear only once	0–1	Higher = more scaffolds are represented by a single molecule. Low = scaffold over-representation.
FG	Functional group diversity ratio	0–1	Higher = wider variety of functional groups across the set.
RS	Ring system diversity ratio	0–1	Higher = more varied ring systems. Low values suggest reliance on a few ring types.
Filters	Fraction passing medicinal chemistry filters (MCF + PAINS)	0–1	Higher = cleaner set. Molecules failing PAINS or MCF contain known promiscuous substructures.
MCE18	Mean MCE-18 molecular complexity	0–∞	Higher = more complex structures (reported as an average over molecules).
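
To make the two internal diversity rows concrete, here is a minimal sketch that assumes the MOSES-style definition IntDiv_p = 1 - (mean of T^p)^(1/p), where T is the pairwise Tanimoto similarity. Fingerprints are represented as plain Python sets of on-bit indices purely for illustration; the function names and the choice to average over distinct pairs only are assumptions, not the vendored MolEval code.

```python
from itertools import combinations


def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def internal_diversity(fingerprints: list[set[int]], p: int = 1) -> float:
    """1 - (mean of T^p over distinct pairs)^(1/p). Illustrative helper."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    mean_power = sum(tanimoto(a, b) ** p for a, b in pairs) / len(pairs)
    return 1.0 - mean_power ** (1.0 / p)


# Toy fingerprints: two near-duplicates and one unrelated molecule.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12, 13}]
int_div1 = internal_diversity(fps, p=1)  # mean similarity 0.2 -> 0.8
int_div2 = internal_diversity(fps, p=2)  # lower: the similar pair dominates
```

On this toy set the single similar pair pulls IntDiv2 below IntDiv1, which is the behavior to expect when a set contains a tight cluster.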

Pipeline Checkpoints

Metrics are computed at five stages to track how chemical diversity evolves through the filtering pipeline:

Checkpoint	Source File	Description
Input	`input/sampled_molecules.csv`	Raw generated molecules before any filtering
Descriptors	`stages/01_descriptors_initial/filtered/filtered_molecules.csv`	After descriptor calculation and initial property filtering
StructFilters	`stages/03_structural_filters_post/filtered_molecules.csv`	After all structural filters (PAINS, NIBR, Lilly, medchem, etc.)
Synthesis	`stages/04_synthesis/filtered_molecules.csv`	After retrosynthesis feasibility filtering
DockingFilters	`stages/06_docking_filters/filtered_molecules.csv`	After docking score threshold filtering
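
The checkpoint table can be turned into a small lookup that reports which source files exist for a given run. This is a sketch only: the relative paths come from the table, while `available_checkpoints`, the `run_dir` argument, and the assumption that a skipped stage simply leaves its file absent are illustrative.

```python
from pathlib import Path

# Checkpoint name -> path relative to the run directory (from the table).
CHECKPOINTS = {
    "Input": "input/sampled_molecules.csv",
    "Descriptors": "stages/01_descriptors_initial/filtered/filtered_molecules.csv",
    "StructFilters": "stages/03_structural_filters_post/filtered_molecules.csv",
    "Synthesis": "stages/04_synthesis/filtered_molecules.csv",
    "DockingFilters": "stages/06_docking_filters/filtered_molecules.csv",
}


def available_checkpoints(run_dir: str) -> dict[str, Path]:
    """Return the checkpoints whose source file exists, in pipeline order."""
    root = Path(run_dir)
    return {
        name: root / rel
        for name, rel in CHECKPOINTS.items()
        if (root / rel).is_file()
    }
```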

The expected trends are:

Diversity metrics (IntDiv, SEDiv, ScaffDiv) can move in either direction: they may decrease when aggressive filtering removes structural outliers, or increase when filtering removes redundant molecules.
The Filters metric should increase across stages as problematic substructures are eliminated.
ScaffUniqueness typically increases as duplicate scaffolds are filtered out.
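
One way to check a run against these trends is to compute the change in a metric between consecutive checkpoints. The sketch below assumes per-stage values are available as a dict keyed by checkpoint name; `stage_deltas` is a hypothetical helper and the numbers are illustrative, not results from a real run.

```python
STAGES = ["Input", "Descriptors", "StructFilters", "Synthesis", "DockingFilters"]


def stage_deltas(values: dict[str, float]) -> list[tuple[str, str, float]]:
    """Change in one metric between consecutive checkpoints that have a value."""
    present = [stage for stage in STAGES if stage in values]
    return [
        (prev, curr, values[curr] - values[prev])
        for prev, curr in zip(present, present[1:])
    ]


# Illustrative numbers only: the Filters metric at three checkpoints.
filters_by_stage = {"Input": 0.5, "StructFilters": 1.0, "Synthesis": 1.0}
deltas = stage_deltas(filters_by_stage)

# Stage transitions where the metric went down (unexpected for Filters).
drops = [(prev, curr) for prev, curr, delta in deltas if delta < 0]
```

Missing checkpoints are skipped rather than treated as zero, so a stage that did not run does not show up as a spurious drop.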

Configuration

MolEval metrics are configured via config_moleval.yml, referenced from the main pipeline config:


# config_moleval.yml
run: true
n_jobs: -1
device: cpu            # cpu or cuda:0
max_molecules: 2000    # subsample for O(N^2) metrics
 
# Metric groups (toggle individual metric families)
validity: false          # Always 1.0 after RDKit parsing
uniqueness: false        # Always 1.0 after dedup
internal_diversity: true # IntDiv1, IntDiv2
se_diversity: true       # Sphere exclusion diversity
scaffold_diversity: true # ScaffDiv, ScaffUniqueness
functional_groups: true  # FG diversity ratio
ring_systems: true       # RS diversity ratio
filters: true            # MCF + PAINS passage rate
mce18: true              # Mean MCE-18 molecular complexity
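
As a sketch of how these toggles might be interpreted, the snippet below mirrors the config as a plain Python dict and lists the enabled metric families. The helper name `enabled_metric_groups`, the treatment of missing keys as disabled, and the rule that `run: false` overrides every toggle are assumptions for illustration, not the pipeline's actual loader.

```python
# Mirrors the shipped config_moleval.yml as a plain dict; in practice the
# values would be parsed from the YAML file.
config = {
    "run": True,
    "n_jobs": -1,
    "device": "cpu",
    "max_molecules": 2000,
    "validity": False,
    "uniqueness": False,
    "internal_diversity": True,
    "se_diversity": True,
    "scaffold_diversity": True,
    "functional_groups": True,
    "ring_systems": True,
    "filters": True,
    "mce18": True,
}

# Keys that toggle a metric family (everything else is a runtime option).
METRIC_TOGGLES = [
    "validity", "uniqueness", "internal_diversity", "se_diversity",
    "scaffold_diversity", "functional_groups", "ring_systems",
    "filters", "mce18",
]


def enabled_metric_groups(cfg: dict) -> list[str]:
    """Metric families to compute; empty when `run` is false (assumed)."""
    if not cfg.get("run", False):
        return []
    return [key for key in METRIC_TOGGLES if cfg.get(key, False)]


enabled = enabled_metric_groups(config)
```

With the defaults, seven families are enabled; `internal_diversity` and `scaffold_diversity` each contribute two metrics, which gives the nine metrics listed under Active Metrics.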

Key Options

Option	Type	Default	Description
`run`	bool	`true`	Enable/disable all MolEval metrics
`n_jobs`	int	`-1`	Parallel workers for metric computation (`-1` = all available cores)
`device`	str	`cpu`	Compute device (`cpu` or `cuda:0`)
`max_molecules`	int	`2000`	Subsample threshold for O(N^2) pairwise metrics
`mce18`	bool	`true`	Enable mean MCE-18 molecular complexity reporting

Deterministic Computation

Sets larger than max_molecules are subsampled down to that size before the O(N^2) pairwise metrics are computed. The shipped config_moleval.yml does not expose a seed key, so do not assume the subsample is reproducible across runs unless you add a seed in both code and config.

For reproducibility, keep the input table, sampling step, config snapshot, and max_molecules value fixed.
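
If you do add a seed, a seeded subsample could look like the sketch below. The `seed` parameter and the `subsample` function are hypothetical: they are not part of the shipped config or code, and this is not a description of how the pipeline currently subsamples.

```python
import random


def subsample(smiles: list[str], max_molecules: int = 2000, seed: int = 0) -> list[str]:
    """Cap a molecule list at max_molecules with a seeded, order-stable sample."""
    if len(smiles) <= max_molecules:
        return list(smiles)
    rng = random.Random(seed)  # local RNG; leaves the global random state alone
    keep = sorted(rng.sample(range(len(smiles)), max_molecules))
    return [smiles[i] for i in keep]
```

Using a local `random.Random` instance keeps the result independent of any other code that touches the global random state, and sorting the sampled indices preserves the input order of the retained molecules.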

Output

MolEval results appear in three places:

HTML report: line chart and heatmap in dedicated sections.
RUN_INFO.md: a Markdown table appended to the run info file with per-stage metric values.
report_data.json: raw metric data under the moleval key.
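
For downstream analysis, the raw values can be read back from report_data.json. The sketch below relies only on the documented moleval key; the layout of the data beneath that key is not specified here, so the helpers return it untouched. Both function names are illustrative.

```python
import json


def extract_moleval(report: dict) -> dict:
    """Return whatever is stored under the `moleval` key, or {} if absent."""
    return report.get("moleval", {})


def load_moleval(report_path: str) -> dict:
    """Read report_data.json and return its `moleval` section."""
    with open(report_path, encoding="utf-8") as handle:
        return extract_moleval(json.load(handle))
```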