Structural Filters
The structural filters stage applies medicinal chemistry rules and structural alert databases to remove molecules with undesirable substructures, excessive complexity, or known toxicophores. It runs at a single position in the pipeline:
- Post-descriptors: the standard position, after descriptor-based filtering. Early normalization and cleanup happen in the Mol Prep stage.
Filter Reference
Database-Based Filters
| Filter | Config Key | What It Detects | Why It Matters |
|---|---|---|---|
| Common Structural Alerts | calculate_common_alerts | Matches against curated SMARTS databases (Dundee, BMS, Glaxo, PAINS, etc.) | Flags reactive, toxic, or assay-interfering substructures from 20+ published rule sets |
| NIBR | calculate_NIBR | Novartis Institutes for BioMedical Research (NIBR) in-house structural filters | Catches chemotypes with poor developability based on Novartis medicinal chemistry experience |
| Lilly Demerits | calculate_lilly | Eli Lilly Medchem Rules with demerit scoring | Assigns numeric demerit scores; molecules exceeding thresholds are rejected |
Rule-Based Filters
| Filter | Config Key | What It Detects | Why It Matters |
|---|---|---|---|
| Bredt Rule | calculate_bredt | Violations of Bredt’s rule at bridgehead positions | Bridgehead double bonds in small bicyclic systems are too strained to be stable, so such structures are not realistic synthesis targets |
| Molecular Complexity | calculate_molcomplexity | Overly complex molecular graphs | Excessively complex molecules are difficult to synthesize and often have poor ADME |
| Molecular Graph Stats | calculate_molgraph_stats | Graph-theoretic severity metrics (severities 1 to 11) | Detects topological anomalies like unusual ring systems or chain branching |
Medchem Functional Filters
These five filters use the medchem.functional API:
| Filter | Config Key | What It Detects | Default | Key Parameters |
|---|---|---|---|---|
| Protecting Groups | calculate_protecting_groups | Common protecting groups (Boc, Fmoc, Cbz, etc.) | On | — |
| Ring Infraction | calculate_ring_infraction | Strained ring systems and infractions | On | ring_infraction_hetcycle_min_size: 4 |
| Stereo Centers | calculate_stereo_center | Excessive or undefined stereocenters | On | stereo_max_centers: 4, stereo_max_undefined: 2 |
| Halogenicity | calculate_halogenicity | Excessive halogen content | On | halogenicity_thresh_F: 6, halogenicity_thresh_Br: 3, halogenicity_thresh_Cl: 3 |
| Symmetry | calculate_symmetry | Highly symmetric molecules | Off | symmetry_threshold: 0.8 |
The symmetry filter is disabled by default because a number of approved drugs are highly symmetric (e.g., pentamidine, disulfiram). Enable it only if your project specifically requires asymmetric scaffolds.
Pipeline Position
Structural filters run after the descriptor stage and write results to 03_structural_filters_post/.
Common Alerts Configuration
The common alerts filter is the most configurable. It uses a curated CSV database of SMARTS patterns organized by rule set.
HEDGEHOG ships three structural alert profiles:
- config_structFilters_strict.yml: conservative, for high-confidence hygiene screening
- config_structFilters_balanced.yml: practical mid-conservatism profile
- config_structFilters_exploration.yml: least conservative, keeps more chemistry diversity
The default config_structFilters.yml uses the exploration profile.
calculate_common_alerts: true
common_alerts_auto_n_jobs: true
common_alerts_small_input_threshold: 1000
common_alerts_small_input_n_jobs: 1
common_alerts_large_input_n_jobs: 12
include_rulesets:
  - Dundee
  - BMS
  - Inpharmatica
  - LD50-Oral
  - Glaxo
  - PAINS
  - AlphaScreen-Hitters
  - Frequent-Hitter
  - Chelator
  - SureChEMBL
  - GST-Hitters
  - HIS-Hitters
  - LuciferaseInhibitor
exclude_descriptions:
  Dundee:
    - Aliphatic long chain
    - isolated alkene
    - triple bond
  Inpharmatica:
    - Filter82_pyridinium
  LD50-Oral:
    - phenylpiperazine
  SureChEMBL:
    - aminothiazole
  HIS-Hitters:
    - Picolylamines_A

When common_alerts_auto_n_jobs: true, worker selection is size-aware:
- < 1000 molecules: use 1 worker
- 1000..9999 molecules: use 12 workers
- >= 10000 molecules: use all available workers
Set common_alerts_auto_n_jobs: false to use regular stage/global n_jobs resolution instead.
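The size-aware selection described above can be sketched in a few lines of Python. This is an illustrative sketch, not HEDGEHOG's actual code: the function name, the `all_workers_threshold` parameter, and the `fallback_n_jobs` stand-in for the stage/global n_jobs resolution are all assumptions. Only the thresholds and worker counts come from the configuration shown here.

```python
import os


def resolve_common_alerts_n_jobs(
    n_molecules: int,
    auto_n_jobs: bool = True,
    small_input_threshold: int = 1000,
    small_input_n_jobs: int = 1,
    large_input_n_jobs: int = 12,
    all_workers_threshold: int = 10_000,  # hypothetical name; only the cutoff value is documented
    fallback_n_jobs: int = 1,  # hypothetical stand-in for stage/global n_jobs resolution
) -> int:
    """Pick a worker count for the common alerts filter from the input size."""
    if not auto_n_jobs:
        # Auto mode is off: defer to whatever the regular resolution yields.
        return fallback_n_jobs
    if n_molecules < small_input_threshold:
        # Small inputs: parallel start-up overhead is likely not worth paying.
        return small_input_n_jobs
    if n_molecules < all_workers_threshold:
        return large_input_n_jobs
    # Very large inputs: use every available worker.
    return os.cpu_count() or 1
```

Under these assumptions, `resolve_common_alerts_n_jobs(500)` yields one worker and `resolve_common_alerts_n_jobs(5000)` yields twelve, while inputs of 10000 molecules or more scale with the machine.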
The exclude_descriptions section lets you whitelist specific alerts within a rule set. For example, if your target requires an imidazole moiety, you can exclude the Toxicophore imidazole alert while keeping all other toxicophore rules active.
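To make the interplay of include_rulesets and exclude_descriptions concrete, here is a minimal Python sketch of how the two settings might narrow the alert list. The function name, the `(rule_set, description)` tuple layout, and exact case-sensitive matching on descriptions are assumptions for illustration; the real matching logic may differ. The "placeholder alert" entries are invented sample data, not real database rows.

```python
from typing import Dict, Iterable, List, Tuple

Alert = Tuple[str, str]  # (rule_set, description); layout is an assumption


def select_active_alerts(
    alerts: Iterable[Alert],
    include_rulesets: Iterable[str],
    exclude_descriptions: Dict[str, List[str]],
) -> List[Alert]:
    """Keep alerts from included rule sets, minus per-rule-set exclusions."""
    included = set(include_rulesets)
    active = []
    for rule_set, description in alerts:
        if rule_set not in included:
            continue  # the whole rule set is switched off
        if description in exclude_descriptions.get(rule_set, []):
            continue  # this specific alert is excluded within an active rule set
        active.append((rule_set, description))
    return active


# Invented sample rows; only the Dundee and SureChEMBL descriptions
# appear in the configuration above.
sample_alerts = [
    ("Dundee", "triple bond"),
    ("Dundee", "placeholder alert A"),
    ("SureChEMBL", "aminothiazole"),
    ("SureChEMBL", "placeholder alert B"),
    ("NotIncluded", "placeholder alert C"),
]

active = select_active_alerts(
    sample_alerts,
    include_rulesets=["Dundee", "SureChEMBL"],
    exclude_descriptions={"Dundee": ["triple bond"], "SureChEMBL": ["aminothiazole"]},
)
```

In this sketch an exclusion only applies within its own rule set, so excluding "triple bond" under Dundee would not suppress a same-named alert in another rule set.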
Use the strict profile when broad toxicology, sensitization, and reactivity rulesets should be active. Use the exploration profile when retaining chemical diversity matters more than conservative early rejection.
Configuration
Full configuration in config_structFilters.yml:
run: true
filter_data: true
parse_input_n_jobs: -1
write_per_filter_outputs: true
generate_plots: true
generate_failure_analysis: true
combine_in_memory: true
parallel_scheduler: processes
# Database-based filters
calculate_common_alerts: true
calculate_NIBR: true
calculate_lilly: true
molgraph_scheduler: processes
nibr_scheduler: processes
lilly_scheduler: threads
# Rule-based filters
calculate_bredt: true
calculate_molgraph_stats: true
calculate_molcomplexity: true
# Medchem functional filters
calculate_protecting_groups: true
calculate_ring_infraction: true
ring_infraction_hetcycle_min_size: 4
calculate_stereo_center: true
stereo_max_centers: 4
stereo_max_undefined: 2
calculate_halogenicity: true
halogenicity_thresh_F: 6
halogenicity_thresh_Br: 3
halogenicity_thresh_Cl: 3
calculate_symmetry: false
symmetry_threshold: 0.8
Output Files
| File | Description |
|---|---|
| {filter_name}/metrics.csv | Per-filter summary statistics |
| {filter_name}/extended.csv | Detailed per-molecule results for each filter |
| {filter_name}/filtered_molecules.csv | Molecules passing each individual filter |
| filtered_molecules.csv | Molecules passing all enabled filters |
| failed_molecules.csv | Molecules failing any filter |
| plots/molecule_counts_comparison.png | Bar chart comparing molecule counts across filters |
| plots/restriction_ratios_comparison.png | Restriction ratio comparison across filters |
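As a quick sanity check on a finished run, the two stage-level CSV files can be counted to get an overall pass rate. The sketch below uses only the Python standard library. It assumes each file has a header row and one molecule per data row, and that a molecule appears at most once in failed_molecules.csv; none of this is confirmed here, so verify it against your own output before relying on the numbers. The function names are hypothetical.

```python
import csv
from pathlib import Path


def count_rows(csv_path: Path) -> int:
    """Count data rows, assuming the first line is a header."""
    with csv_path.open(newline="") as handle:
        return sum(1 for _ in csv.DictReader(handle))


def summarize_stage(stage_dir: Path) -> dict:
    """Summarize pass/fail counts from the stage-level output files."""
    passed = count_rows(stage_dir / "filtered_molecules.csv")
    failed = count_rows(stage_dir / "failed_molecules.csv")
    total = passed + failed
    return {
        "passed": passed,
        "failed": failed,
        # Guard against an empty run so the division never fails.
        "pass_rate": passed / total if total else 0.0,
    }
```

A call such as `summarize_stage(Path("03_structural_filters_post"))` would then be pointed at the stage folder inside your run directory.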
Usage
# Run structural filters as part of the full pipeline
uv run hedgehog
# Run structural filters stage only
uv run hedgehog --stage struct_filters
# Short alias
uv run hedge --stage struct_filters
In single-stage mode (--stage struct_filters), HEDGEHOG reuses existing descriptors output when available. If descriptors output is missing, it falls back to MolPrep output (or sampled input) in the same run folder.
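The single-stage input fallback amounts to a first-match search over an ordered list of candidate files. The Python sketch below illustrates that order only. The relative paths in `CANDIDATE_INPUTS` and the function name are hypothetical placeholders, since the real folder and file names inside a run folder are not specified here.

```python
from pathlib import Path
from typing import Optional, Sequence

# Hypothetical relative paths, listed in priority order:
# descriptors output first, then MolPrep output, then sampled input.
CANDIDATE_INPUTS = (
    "descriptors/filtered_molecules.csv",
    "mol_prep/filtered_molecules.csv",
    "sampled_molecules.csv",
)


def resolve_stage_input(
    run_dir: Path,
    candidates: Sequence[str] = CANDIDATE_INPUTS,
) -> Optional[Path]:
    """Return the first candidate file that exists in the run folder, or None."""
    for relative_path in candidates:
        path = run_dir / relative_path
        if path.is_file():
            return path
    return None
```

Returning `None` rather than raising leaves it to the caller to decide how to report a run folder with no usable input.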