MolEval Metrics
HEDGEHOG includes a vendored copy of MolEval (from MolScore v1.9.5, MIT license) to compute intrinsic distribution quality metrics at multiple pipeline stages. These metrics evaluate the diversity and intrinsic quality of molecule sets without requiring a reference dataset.
Active Metrics
Nine metrics are enabled by default. Validity and Uniqueness are disabled because they are always 1.0 after RDKit parsing and deduplication. MCE-18 is computed separately and reported alongside the intrinsic MolEval metrics.
| Metric | What It Measures | Range | Interpretation |
|---|---|---|---|
| IntDiv1 | Internal diversity (1 - mean pairwise Tanimoto similarity) | 0–1 | Higher = more diverse set. Values near 1.0 mean molecules are structurally dissimilar. |
| IntDiv2 | Internal diversity with squared similarities (1 - sqrt(mean pairwise Tanimoto similarity^2)) | 0–1 | Weights highly similar pairs more heavily than IntDiv1, so it is more sensitive to clusters of near-duplicates. |
| SEDiv | Sphere exclusion diversity: fraction of chemical space “covered” by the set | 0–1 | Higher = better coverage of Tanimoto space. Measures how well the set fills diversity spheres. |
| ScaffDiv | Scaffold diversity: ratio of unique Murcko scaffolds to total molecules | 0–1 | Higher = more structurally distinct core scaffolds. Low values indicate scaffold redundancy. |
| ScaffUniqueness | Fraction of scaffolds that appear only once | 0–1 | Higher = more scaffolds are represented by a single molecule. Low = scaffold over-representation. |
| FG | Functional group diversity ratio | 0–1 | Higher = wider variety of functional groups across the set. |
| RS | Ring system diversity ratio | 0–1 | Higher = more varied ring systems. Low values suggest reliance on a few ring types. |
| Filters | Fraction passing medicinal chemistry filters (MCF + PAINS) | 0–1 | Higher = cleaner set. Molecules failing PAINS or MCF contain substructures flagged as promiscuous or otherwise undesirable. |
| MCE18 | Mean MCE-18 molecular complexity | 0–∞ | Higher = more complex structures (reported as an average over molecules). |
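To make the diversity and scaffold ratios concrete, here is a minimal pure-Python sketch. It assumes the MOSES-style definition of internal diversity, 1 - (mean(T^p))^(1/p), and represents each fingerprint as a set of on-bit indices. The function names are illustrative and are not MolEval's API. The sketch averages over distinct pairs only, so the vendored implementation may give slightly different numbers (for example if it includes self-pairs or subsamples).

```python
from collections import Counter
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def internal_diversity(fps, p=1):
    """1 - (mean of similarity**p over distinct pairs) ** (1/p)."""
    sims = [tanimoto(a, b) for a, b in combinations(fps, 2)]
    if not sims:
        return 0.0  # fewer than two molecules: no pairs to compare
    return 1.0 - (sum(s ** p for s in sims) / len(sims)) ** (1.0 / p)


def scaffold_ratios(scaffolds):
    """Return (ScaffDiv, ScaffUniqueness) from one scaffold string per molecule."""
    counts = Counter(scaffolds)
    scaff_div = len(counts) / len(scaffolds)  # unique scaffolds / molecules
    singletons = sum(1 for c in counts.values() if c == 1)
    return scaff_div, singletons / len(counts)  # scaffolds seen once / unique scaffolds


# Toy inputs (made up for illustration, not real fingerprints).
fps = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({7, 8, 9})]
int_div1 = internal_diversity(fps, p=1)
int_div2 = internal_diversity(fps, p=2)
scaff_div, scaff_unique = scaffold_ratios(
    ["c1ccccc1", "c1ccccc1", "c1ccncc1", "C1CCCCC1"]
)
```

Under this definition IntDiv2 never exceeds IntDiv1 for the same set, because the root-mean-square similarity is at least as large as the mean similarity.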
Pipeline Checkpoints
Metrics are computed at five stages to track how chemical diversity evolves through the filtering pipeline:
| Checkpoint | Source File | Description |
|---|---|---|
| Input | input/sampled_molecules.csv | Raw generated molecules before any filtering |
| Descriptors | stages/01_descriptors_initial/filtered/filtered_molecules.csv | After descriptor calculation and initial property filtering |
| StructFilters | stages/03_structural_filters_post/filtered_molecules.csv | After all structural filters (PAINS, NIBR, Lilly, medchem, etc.) |
| Synthesis | stages/04_synthesis/filtered_molecules.csv | After retrosynthesis feasibility filtering |
| DockingFilters | stages/06_docking_filters/filtered_molecules.csv | After docking score threshold filtering |
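A run that stops early will not have every checkpoint file. The sketch below maps the checkpoint names from the table to their files and reports which ones exist. It assumes the source paths are relative to a single run output directory; `run_dir` and `available_checkpoints` are hypothetical names, not part of HEDGEHOG.

```python
from pathlib import Path

# Checkpoint name -> source file, in pipeline order (paths from the table above).
CHECKPOINTS = {
    "Input": "input/sampled_molecules.csv",
    "Descriptors": "stages/01_descriptors_initial/filtered/filtered_molecules.csv",
    "StructFilters": "stages/03_structural_filters_post/filtered_molecules.csv",
    "Synthesis": "stages/04_synthesis/filtered_molecules.csv",
    "DockingFilters": "stages/06_docking_filters/filtered_molecules.csv",
}


def available_checkpoints(run_dir):
    """Return {checkpoint name: path} for the checkpoint files that exist."""
    root = Path(run_dir)
    return {
        name: root / rel
        for name, rel in CHECKPOINTS.items()
        if (root / rel).is_file()
    }
```

Calling `available_checkpoints("path/to/run")` returns the entries in pipeline order, so the result can be iterated directly when plotting or tabulating per-stage metrics.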
Expected trends:
- Diversity metrics (IntDiv, SEDiv, ScaffDiv) may decrease as aggressive filtering removes outliers, or may increase if redundant molecules are removed.
- The Filters metric should increase across stages as problematic substructures are eliminated.
- ScaffUniqueness typically increases as duplicate scaffolds are filtered out.
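One way to check these trends on a finished run is to compute the change in a metric between consecutive checkpoints and flag any drop in a metric that is expected to rise. This is a sketch under assumptions: the input is a plain dictionary of checkpoint name to metric values in pipeline order, and the numbers below are made up for illustration.

```python
def stage_deltas(per_stage, metric):
    """Return (from_stage, to_stage, change) for consecutive stages reporting `metric`."""
    points = [(stage, vals[metric]) for stage, vals in per_stage.items() if metric in vals]
    return [
        (a, b, round(vb - va, 6))
        for (a, va), (b, vb) in zip(points, points[1:])
    ]


def regressions(per_stage, metric):
    """Stage transitions where `metric` decreased."""
    return [(a, b, d) for a, b, d in stage_deltas(per_stage, metric) if d < 0]


# Illustrative values only; not output from a real pipeline run.
per_stage = {
    "Input": {"Filters": 0.62, "IntDiv1": 0.88},
    "Descriptors": {"Filters": 0.70, "IntDiv1": 0.87},
    "StructFilters": {"Filters": 0.95, "IntDiv1": 0.85},
}
filters_drops = regressions(per_stage, "Filters")
intdiv_drops = regressions(per_stage, "IntDiv1")
```

A non-empty result for Filters would be worth investigating, while a drop in a diversity metric is not necessarily a problem, since filtering can legitimately narrow the set.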
Configuration
MolEval metrics are configured via config_moleval.yml, referenced from the main pipeline config:
```yaml
# config_moleval.yml
run: true
n_jobs: -1
device: cpu  # cpu or cuda:0
max_molecules: 2000  # subsample for O(N^2) metrics
# Metric groups (toggle individual metric families)
validity: false  # Always 1.0 after RDKit parsing
uniqueness: false  # Always 1.0 after dedup
internal_diversity: true  # IntDiv1, IntDiv2
se_diversity: true  # Sphere exclusion diversity
scaffold_diversity: true  # ScaffDiv, ScaffUniqueness
functional_groups: true  # FG diversity ratio
ring_systems: true  # RS diversity ratio
filters: true  # MCF + PAINS passage rate
mce18: true  # Mean MCE-18 molecular complexity
```
Key Options
| Option | Type | Default | Description |
|---|---|---|---|
| run | bool | true | Enable/disable all MolEval metrics |
| n_jobs | int | -1 | Parallel workers for metric computation (-1 = all available cores) |
| device | str | cpu | Compute device (cpu or cuda:0) |
| max_molecules | int | 2000 | Subsample threshold for O(N^2) pairwise metrics |
| mce18 | bool | true | Enable mean MCE-18 molecular complexity reporting |
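The defaults in the table can be mirrored in a small helper that merges user overrides and performs basic sanity checks. This is a hedged sketch: `resolve_options` and its validation rules are assumptions for illustration, and HEDGEHOG's own config loader may behave differently. Only the option names, types, defaults, and the meaning of `n_jobs: -1` come from the table.

```python
import os

# Documented defaults for the key options.
DEFAULTS = {
    "run": True,
    "n_jobs": -1,
    "device": "cpu",
    "max_molecules": 2000,
    "mce18": True,
}


def resolve_options(user_cfg):
    """Merge user values over the defaults and sanity-check the result."""
    cfg = {**DEFAULTS, **user_cfg}
    for key, default in DEFAULTS.items():
        # Exact type match, so that True/False is not accepted where an int is expected.
        if type(cfg[key]) is not type(default):
            raise TypeError(f"{key} should be {type(default).__name__}")
    if cfg["max_molecules"] <= 0:
        raise ValueError("max_molecules must be positive")
    if cfg["n_jobs"] == -1:
        cfg["n_jobs"] = os.cpu_count() or 1  # -1 means all available cores
    elif cfg["n_jobs"] < 1:
        raise ValueError("n_jobs must be -1 or a positive integer")
    return cfg
```

For example, `resolve_options({"max_molecules": 500})` keeps every other default and replaces `n_jobs` with the number of cores reported by the operating system.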
Deterministic Computation
Large sets are capped by max_molecules before O(N^2) metrics are computed.
The shipped config_moleval.yml does not expose a seed key; do not rely on a
documented seed parameter for reproducible subsampling unless you add one in code
and config.
For reproducibility, keep the input table, sampling step, config snapshot, and
max_molecules value fixed.
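If you do add a seed in code and config, the subsampling step could look like the sketch below. The `seed` parameter and the `subsample` function are hypothetical and are not part of the shipped configuration. How HEDGEHOG selects the capped subset is not stated here, so treat this as one possible approach.

```python
import random


def subsample(smiles, max_molecules=2000, seed=0):
    """Return at most max_molecules items, chosen reproducibly for a given seed."""
    if len(smiles) <= max_molecules:
        return list(smiles)
    rng = random.Random(seed)  # local RNG; the global random state is untouched
    picked = sorted(rng.sample(range(len(smiles)), max_molecules))
    return [smiles[i] for i in picked]  # original input order is preserved
```

Using a local `random.Random` instance keeps the result independent of any other code that draws from the global generator, which is the usual source of run-to-run drift.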
Output
MolEval results appear in three places:
- HTML report: line chart and heatmap in dedicated sections.
- RUN_INFO.md: a Markdown table appended to the run info file with per-stage metric values.
- report_data.json: raw metric data under the moleval key.
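For downstream analysis, the raw values can be read from report_data.json with the standard library. The sketch assumes the file holds a JSON object at the top level. The layout of the data under the moleval key is not documented here, so inspect it before relying on a particular shape. `load_moleval` is an illustrative name.

```python
import json
from pathlib import Path


def load_moleval(report_path):
    """Return whatever is stored under the "moleval" key, or None if it is absent."""
    with Path(report_path).open(encoding="utf-8") as fh:
        return json.load(fh).get("moleval")
```

A `None` result usually means MolEval was disabled for that run (`run: false`) or the report was written before the metrics were computed.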