
Structural Filters

The structural filters stage applies medicinal chemistry rules and structural alert databases to remove molecules with undesirable substructures, excessive complexity, or known toxicophores. It runs in one pipeline position:

  • Post-descriptors: the standard position, running after descriptor-based filtering. Early normalization and cleanup happen in the Mol Prep stage.

Filter Reference

Database-Based Filters

| Filter | Config Key | What It Detects | Why It Matters |
| --- | --- | --- | --- |
| Common Structural Alerts | calculate_common_alerts | Matches against curated SMARTS databases (Dundee, BMS, Glaxo, PAINS, etc.) | Flags reactive, toxic, or assay-interfering substructures from 20+ published rule sets |
| NIBR | calculate_NIBR | Novartis In-house structural filters | Catches chemotypes with poor developability based on Novartis medicinal chemistry experience |
| Lilly Demerits | calculate_lilly | Eli Lilly Medchem Rules with demerit scoring | Assigns numeric demerit scores; molecules exceeding thresholds are rejected |

Rule-Based Filters

| Filter | Config Key | What It Detects | Why It Matters |
| --- | --- | --- | --- |
| Bredt Rule | calculate_bredt | Violations of Bredt's rule at bridgehead positions | Bridgehead double bonds in small bicyclic systems are synthetically impossible |
| Molecular Complexity | calculate_molcomplexity | Overly complex molecular graphs | Excessively complex molecules are difficult to synthesize and often have poor ADME |
| Molecular Graph Stats | calculate_molgraph_stats | Graph-theoretic severity metrics (severities 1 to 11) | Detects topological anomalies like unusual ring systems or chain branching |

Medchem Functional Filters

These five filters use the medchem.functional API:

| Filter | Config Key | What It Detects | Default | Key Parameters |
| --- | --- | --- | --- | --- |
| Protecting Groups | calculate_protecting_groups | Common protecting groups (Boc, Fmoc, Cbz, etc.) | On | |
| Ring Infraction | calculate_ring_infraction | Strained ring systems and infractions | On | ring_infraction_hetcycle_min_size: 4 |
| Stereo Centers | calculate_stereo_center | Excessive or undefined stereocenters | On | stereo_max_centers: 4, stereo_max_undefined: 2 |
| Halogenicity | calculate_halogenicity | Excessive halogen content | On | halogenicity_thresh_F: 6, halogenicity_thresh_Br: 3, halogenicity_thresh_Cl: 3 |
| Symmetry | calculate_symmetry | Highly symmetric molecules | Off | symmetry_threshold: 0.8 |

The symmetry filter is disabled by default because many approved drugs are symmetric (e.g., ibuprofen, metformin). Enable it only if your project specifically requires asymmetric scaffolds.

Pipeline Position

Structural filters run after the descriptor stage and write results to 03_structural_filters_post/.

Common Alerts Configuration

The common alerts filter is the most configurable. It uses a curated CSV database of SMARTS patterns organized by rule set.

HEDGEHOG ships three structural alert profiles:

  • config_structFilters_strict.yml - conservative, for high-confidence hygiene screening
  • config_structFilters_balanced.yml - practical mid-conservatism profile
  • config_structFilters_exploration.yml - least conservative, keeps more chemistry diversity

The default config_structFilters.yml uses the exploration profile.

    calculate_common_alerts: true
    common_alerts_auto_n_jobs: true
    common_alerts_small_input_threshold: 1000
    common_alerts_small_input_n_jobs: 1
    common_alerts_large_input_n_jobs: 12
    include_rulesets:
      - Dundee
      - BMS
      - Inpharmatica
      - LD50-Oral
      - Glaxo
      - PAINS
      - AlphaScreen-Hitters
      - Frequent-Hitter
      - Chelator
      - SureChEMBL
      - GST-Hitters
      - HIS-Hitters
      - LuciferaseInhibitor
    exclude_descriptions:
      Dundee:
        - Aliphatic long chain
        - isolated alkene
        - triple bond
      Inpharmatica:
        - Filter82_pyridinium
      LD50-Oral:
        - phenylpiperazine
      SureChEMBL:
        - aminothiazole
      HIS-Hitters:
        - Picolylamines_A

When common_alerts_auto_n_jobs: true, worker selection is size-aware:

  • < 1000 molecules: use 1 worker
  • 1000..9999 molecules: use 12 workers
  • >= 10000 molecules: use all available workers

Set common_alerts_auto_n_jobs: false to use regular stage/global n_jobs resolution instead.
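The size-aware selection described above can be sketched in a few lines of Python. This is an illustration of the documented behavior, not HEDGEHOG's actual implementation: the function name `select_common_alerts_n_jobs` is hypothetical, the mapping of `common_alerts_large_input_n_jobs` to the middle tier is inferred from the matching value of 12, and the 10000 cutoff is taken from the list above rather than from a documented config key.

```python
import os


def select_common_alerts_n_jobs(
    n_molecules: int,
    small_input_threshold: int = 1000,    # common_alerts_small_input_threshold
    small_input_n_jobs: int = 1,          # common_alerts_small_input_n_jobs
    large_input_n_jobs: int = 12,         # common_alerts_large_input_n_jobs (assumed to drive the middle tier)
    all_workers_threshold: int = 10_000,  # assumption: cutoff from the docs, not a known config key
) -> int:
    """Pick a worker count from the input size (hypothetical helper)."""
    if n_molecules < small_input_threshold:
        # Small inputs: process start-up overhead outweighs any parallel speed-up.
        return small_input_n_jobs
    if n_molecules < all_workers_threshold:
        return large_input_n_jobs
    # Very large inputs: use every available CPU; cpu_count() may return None.
    return os.cpu_count() or 1


if __name__ == "__main__":
    for size in (500, 5_000, 50_000):
        print(size, select_common_alerts_n_jobs(size))
```

Keeping small inputs on a single worker avoids paying process start-up costs for a job that finishes quickly anyway.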

The exclude_descriptions section lets you whitelist specific alerts within a rule set. For example, if your target requires an imidazole moiety, you can exclude the Toxicophore imidazole alert while keeping all other toxicophore rules active.
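As a rough sketch of how `include_rulesets` and `exclude_descriptions` might combine, the Python below keeps an alert only when its rule set is included and its description is not excluded for that rule set. The `Alert` record, the `active_alerts` helper, exact string matching on descriptions, and the two sample alerts (including their SMARTS) are all assumptions made for illustration; they are not entries from the shipped database or HEDGEHOG's internal code.

```python
from typing import Dict, Iterable, List, NamedTuple


class Alert(NamedTuple):
    """Hypothetical in-memory form of one row of the alerts CSV."""
    rule_set: str
    description: str
    smarts: str


def active_alerts(
    alerts: Iterable[Alert],
    include_rulesets: Iterable[str],
    exclude_descriptions: Dict[str, List[str]],
) -> List[Alert]:
    """Return alerts from included rule sets, minus per-rule-set exclusions."""
    included = set(include_rulesets)
    kept = []
    for alert in alerts:
        if alert.rule_set not in included:
            continue  # the whole rule set is switched off
        if alert.description in exclude_descriptions.get(alert.rule_set, ()):
            continue  # this specific alert is whitelisted (assumes exact match)
        kept.append(alert)
    return kept


# Illustrative data only; not taken from the real alert database.
sample = [
    Alert("Toxicophore", "imidazole", "c1cnc[nH]1"),
    Alert("Toxicophore", "nitro group", "[N+](=O)[O-]"),
    Alert("PAINS", "quinone", "O=C1C=CC(=O)C=C1"),
]

kept = active_alerts(
    sample,
    include_rulesets=["Toxicophore"],
    exclude_descriptions={"Toxicophore": ["imidazole"]},
)
print([a.description for a in kept])
```

Here the imidazole alert is dropped while the other Toxicophore rule stays active, and the PAINS alert is skipped because its rule set is not included.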

Use the strict profile when broad toxicology, sensitization, and reactivity rulesets should be active. Use the exploration profile when retaining chemical diversity matters more than conservative early rejection.

Configuration

Full configuration in config_structFilters.yml:

    run: true
    filter_data: true
    parse_input_n_jobs: -1
    write_per_filter_outputs: true
    generate_plots: true
    generate_failure_analysis: true
    combine_in_memory: true
    parallel_scheduler: processes

    # Database-based filters
    calculate_common_alerts: true
    calculate_NIBR: true
    calculate_lilly: true
    molgraph_scheduler: processes
    nibr_scheduler: processes
    lilly_scheduler: threads

    # Rule-based filters
    calculate_bredt: true
    calculate_molgraph_stats: true
    calculate_molcomplexity: true

    # Medchem functional filters
    calculate_protecting_groups: true
    calculate_ring_infraction: true
    ring_infraction_hetcycle_min_size: 4
    calculate_stereo_center: true
    stereo_max_centers: 4
    stereo_max_undefined: 2
    calculate_halogenicity: true
    halogenicity_thresh_F: 6
    halogenicity_thresh_Br: 3
    halogenicity_thresh_Cl: 3
    calculate_symmetry: false
    symmetry_threshold: 0.8

Output Files

| File | Description |
| --- | --- |
| {filter_name}/metrics.csv | Per-filter summary statistics |
| {filter_name}/extended.csv | Detailed per-molecule results for each filter |
| {filter_name}/filtered_molecules.csv | Molecules passing each individual filter |
| filtered_molecules.csv | Molecules passing all enabled filters |
| failed_molecules.csv | Molecules failing any filter |
| plots/molecule_counts_comparison.png | Bar chart comparing molecule counts across filters |
| plots/restriction_ratios_comparison.png | Restriction ratio comparison across filters |
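To check a finished run against the layout in the table above, a small script along these lines may help. It is a sketch under stated assumptions: the helper names are hypothetical, the filter directory names passed in (`common_alerts`, `NIBR`) are placeholders for whatever `{filter_name}` resolves to in your run, and per-filter files or plots are presumably absent when `write_per_filter_outputs` or `generate_plots` is disabled.

```python
from pathlib import Path
from typing import Iterable, List

# File names taken from the output table; the directory layout is assumed.
PER_FILTER_FILES = ("metrics.csv", "extended.csv", "filtered_molecules.csv")
STAGE_FILES = (
    "filtered_molecules.csv",
    "failed_molecules.csv",
    "plots/molecule_counts_comparison.png",
    "plots/restriction_ratios_comparison.png",
)


def expected_outputs(stage_dir: str, filter_names: Iterable[str]) -> List[Path]:
    """List every output path the table describes for the given filters."""
    root = Path(stage_dir)
    paths = [root / name / fname for name in filter_names for fname in PER_FILTER_FILES]
    paths += [root / fname for fname in STAGE_FILES]
    return paths


def missing_outputs(stage_dir: str, filter_names: Iterable[str]) -> List[Path]:
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_outputs(stage_dir, filter_names) if not p.exists()]


if __name__ == "__main__":
    # Placeholder run folder and filter names; adjust to your own run.
    for path in missing_outputs("03_structural_filters_post", ["common_alerts", "NIBR"]):
        print("missing:", path)
```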

Usage

    # Run structural filters as part of the full pipeline
    uv run hedgehog

    # Run structural filters stage only
    uv run hedgehog --stage struct_filters

    # Short alias
    uv run hedge --stage struct_filters

In single-stage mode (--stage struct_filters), HEDGEHOG reuses existing descriptors output when available. If descriptors output is missing, it falls back to MolPrep output (or sampled input) in the same run folder.
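The fallback order can be pictured as a first-match search over candidate inputs. The sketch below is only an illustration of that ordering: the helper `resolve_stage_input` and the file names in the usage example are hypothetical, since the actual paths HEDGEHOG checks are not listed here.

```python
from pathlib import Path
from typing import Optional, Sequence


def resolve_stage_input(candidates: Sequence[Path]) -> Optional[Path]:
    """Return the first candidate that exists as a file, in priority order."""
    for path in candidates:
        if path.is_file():
            return path
    return None  # nothing usable found in the run folder


if __name__ == "__main__":
    run_dir = Path("results/run_1")  # hypothetical run folder
    chosen = resolve_stage_input(
        [
            run_dir / "descriptors_output.csv",  # hypothetical: descriptors stage output
            run_dir / "molprep_output.csv",      # hypothetical: MolPrep stage output
            run_dir / "sampled_input.csv",       # hypothetical: sampled input
        ]
    )
    print(chosen)
```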
