Pipeline Overview

HEDGEHOG implements a sequential molecular filtering pipeline that progressively narrows a large set of generated molecules down to high-quality drug candidates. Each stage applies domain-specific filters, and only molecules that pass all enabled stages reach the final output.

Pipeline Stages

| #  | Stage | Purpose | Typical Retention |
| -- | ----- | ------- | ----------------- |
| 1  | Mol Prep (Datamol) | Standardize molecules and remove salts/fragments/charges/stereochemistry; canonicalize tautomers | 70-95% |
| 2  | Molecular Descriptors | Compute 22 physicochemical descriptors and filter by threshold ranges | 40-80% |
| 3  | Structural Filters | Apply medicinal chemistry alerts, complexity checks, and structural rules | 50-90% |
| 4  | Synthesis Analysis | Evaluate synthetic accessibility (SA Score, SYBA, RA Score) and retrosynthesis feasibility | 30-70% |
| 5  | Molecular Docking | Dock surviving molecules into the target protein binding site | ~100% (scoring, not filtering) |
| 6  | Docking Filters | Filter docked poses by search-box containment, clashes, interactions, and conformer deviation | 20-70% |
| 7' | Final Descriptors | Recompute descriptors on the final molecule set for reporting | 100% (no filtering) |

Retention rates are approximate and depend heavily on the input molecule set and configuration thresholds.
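To get a feel for how these ranges compound across the funnel, the sketch below multiplies the per-stage bounds from the table. It assumes the stages retain molecules independently, treats the "~100%" docking stage as exactly 1.0, and uses a hypothetical input size of 100,000; none of these numbers are measured results.

```python
from math import prod

# Typical per-stage retention ranges (low, high) copied from the table above.
# Assumption: docking (~100%) and final descriptors (100%) are taken as 1.0.
RETENTION = {
    "mol_prep": (0.70, 0.95),
    "descriptors": (0.40, 0.80),
    "struct_filters": (0.50, 0.90),
    "synthesis": (0.30, 0.70),
    "docking": (1.00, 1.00),
    "docking_filters": (0.20, 0.70),
    "final_descriptors": (1.00, 1.00),
}


def cumulative_retention(ranges):
    """Multiply per-stage bounds to get the overall (low, high) retention."""
    low = prod(lo for lo, _ in ranges.values())
    high = prod(hi for _, hi in ranges.values())
    return low, high


low, high = cumulative_retention(RETENTION)
n_input = 100_000  # hypothetical input size, for illustration only
print(f"overall retention: {low:.2%} to {high:.2%}")
print(f"survivors: {round(n_input * low):,} to {round(n_input * high):,}")
```

Under these assumptions the overall retention lands between roughly 0.8% and 34%, which is why the size of the final set can vary by more than an order of magnitude between runs.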

Data Flow

    input/sampled_molecules.csv
            |
            v
    [Stage 1: Mol Prep]
            |
            v
    [Stage 2: Descriptors]
            |  filtered_molecules.csv
            v
    [Stage 3: Structural Filters]
            |  filtered_molecules.csv
            v
    [Stage 4: Synthesis Analysis]
            |  filtered_molecules.csv
            v
    [Stage 5: Molecular Docking]
            |  gnina_out.sdf / smina_out.sdf / matcha_out.sdf
            v
    [Stage 6: Docking Filters]
            |  filtered_molecules.csv
            v
    [Stage 7': Final Descriptors]
            |
            v
    output/final_molecules.csv

Most stages write a stable filtered_molecules.csv, but input resolution is not a strict “previous directory only” rule. When a downstream stage runs, HEDGEHOG prefers the latest usable upstream artifact and can fall back to earlier filtered outputs in the same run folder. This is what enables targeted stage reruns.
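The fallback behaviour described above can be sketched as a backwards walk over the upstream stages. This is not HEDGEHOG's actual resolver: the function names, the definition of "usable" (a file with a header and at least one data row), and the exact path of the descriptors output are assumptions, with directory names taken from the Stage Dependencies table. It covers only CSV inputs, not the SDF poses that docking filters read.

```python
from pathlib import Path
from typing import Optional

STAGE_ORDER = [
    "mol_prep", "descriptors", "struct_filters", "synthesis",
    "docking", "docking_filters", "final_descriptors",
]

# Stages that write a filtered CSV, relative to the run folder.
# The descriptors path is inferred from "filtered/filtered_molecules.csv".
FILTERED_CSV = {
    "mol_prep": "stages/00_mol_prep/filtered_molecules.csv",
    "descriptors": "stages/01_descriptors_initial/filtered/filtered_molecules.csv",
    "struct_filters": "stages/03_structural_filters_post/filtered_molecules.csv",
    "synthesis": "stages/04_synthesis/filtered_molecules.csv",
    "docking_filters": "stages/06_docking_filters/filtered_molecules.csv",
}


def is_usable(path: Path) -> bool:
    """Assumed rule: the file exists and has a header plus one data row."""
    if not path.is_file():
        return False
    with path.open(encoding="utf-8") as handle:
        non_empty = [line for line in handle if line.strip()]
    return len(non_empty) >= 2


def resolve_input(run_dir: Path, stage: str) -> Optional[Path]:
    """Return the latest usable upstream CSV for `stage`, or None."""
    upstream = STAGE_ORDER[: STAGE_ORDER.index(stage)]
    for name in reversed(upstream):
        rel = FILTERED_CSV.get(name)
        if rel and is_usable(run_dir / rel):
            return run_dir / rel
    # Last resort: the sampled input copied into the run folder.
    sampled = run_dir / "input" / "sampled_molecules.csv"
    return sampled if is_usable(sampled) else None
```

With this shape, rerunning `synthesis` in a run folder where the structural-filter output is empty would pick up the descriptors or Mol Prep output instead of failing.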

Sequential Filtering Approach

The pipeline is designed as a funnel: cheap, fast filters run first (descriptors, structural alerts), and expensive operations (docking, retrosynthesis) run later on a smaller molecule set. This ordering keeps computational cost down because the costly stages only process molecules that have already passed the inexpensive checks.

Key design decisions:

  • Early exit and skips: Mol Prep stops downstream execution when it completes with zero molecules. Other stages may complete with empty CSVs or be reported as skipped when their required input is missing or empty.
  • Single-stage mode: any stage can be requested with --stage <name>, which is useful for debugging or re-running a single step. When you request any stage other than mol_prep, HEDGEHOG may still run mol_prep first if it is enabled in config_mol_prep.yml, so downstream stages operate on standardized molecules.
  • Provenance: all configuration files are snapshotted into a configs/ directory at the start of each run, and a RUN_INFO.md is generated at completion.
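The first two design decisions can be expressed as two small predicates. The sketch below is illustrative only: the function names are hypothetical, the stage names are taken from the Stage Dependencies table, and it assumes that a stage requested with --stage runs regardless of its own run flag, which the text above does not state.

```python
STAGE_ORDER = [
    "mol_prep", "descriptors", "struct_filters", "synthesis",
    "docking", "docking_filters", "final_descriptors",
]


def plan_single_stage(requested: str, mol_prep_enabled: bool) -> list:
    """Stages to execute for `--stage <requested>` (assumed behaviour)."""
    if requested not in STAGE_ORDER:
        raise ValueError(f"unknown stage: {requested!r}")
    plan = [requested]
    # mol_prep is prepended when enabled so that the requested stage
    # operates on standardized molecules.
    if requested != "mol_prep" and mol_prep_enabled:
        plan.insert(0, "mol_prep")
    return plan


def should_stop_pipeline(stage: str, n_molecules: int) -> bool:
    """Only an empty Mol Prep result halts downstream execution."""
    return stage == "mol_prep" and n_molecules == 0


print(plan_single_stage("docking", mol_prep_enabled=True))
print(should_stop_pipeline("synthesis", 0))
```

Here `plan_single_stage("docking", True)` yields `["mol_prep", "docking"]`, while an empty result from `synthesis` does not stop the run; later stages would simply see empty or missing input.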

Input Data Contract

Recommended molecule input is CSV/TSV with a smiles header:

    smiles,model_name
    CCO,baseline
    CCN,baseline
    c1ccccc1,baseline

Required:

  • smiles

Optional:

  • model_name or name
  • mol_idx

Generated:

  • mol_idx is assigned automatically if missing and is used to join descriptors, filters, docking scores, and report data.

Headerless .smi files are supported by extension for simple one-SMILES-per-line inputs, with an optional second whitespace token used as model_name. CSV/TSV with a smiles header remains the recommended production format.
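A loader that honours this contract might look like the sketch below. It uses only the standard library, and `load_molecules` is a hypothetical name, not part of HEDGEHOG. The 0-based row numbering for `mol_idx` and the choice to store it as a string are assumptions; the real assignment scheme may differ.

```python
import csv
from pathlib import Path


def load_molecules(path):
    """Return a list of row dicts that each carry 'smiles' and 'mol_idx'."""
    path = Path(path)
    suffix = path.suffix.lower()
    rows = []
    if suffix == ".smi":
        # Headerless: first token is the SMILES, an optional second
        # whitespace-separated token is used as model_name.
        for line in path.read_text(encoding="utf-8").splitlines():
            tokens = line.split()
            if not tokens:
                continue
            row = {"smiles": tokens[0]}
            if len(tokens) > 1:
                row["model_name"] = tokens[1]
            rows.append(row)
    else:
        delimiter = "\t" if suffix == ".tsv" else ","
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle, delimiter=delimiter)
            if "smiles" not in (reader.fieldnames or []):
                raise ValueError(f"{path} has no 'smiles' column")
            rows = [dict(record) for record in reader]
    # Assumption: a missing mol_idx is filled with the 0-based row position.
    for idx, row in enumerate(rows):
        if not row.get("mol_idx"):
            row["mol_idx"] = str(idx)
    return rows
```

Because `mol_idx` is the join key for descriptors, filters, docking scores, and report data, assigning it once at load time keeps every later table aligned on the same identifier.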

Stage Dependencies

| Stage | Needs | Reads From by Default | Writes |
| ----- | ----- | --------------------- | ------ |
| mol_prep | Raw molecule table | generated_mols_path / sampled input | stages/00_mol_prep/filtered_molecules.csv |
| descriptors | Molecule table | Mol Prep output when enabled, otherwise sampled input | stages/01_descriptors_initial/metrics/ and filtered/filtered_molecules.csv |
| struct_filters | Descriptor or molecule table | Descriptors output, then Mol Prep/sample fallback | stages/03_structural_filters_post/filtered_molecules.csv |
| synthesis | Filtered molecules | Structural filters, descriptors, or Mol Prep output | stages/04_synthesis/filtered_molecules.csv |
| docking | Molecules plus receptor/reference ligand | Latest usable filtered table | stages/05_docking/{tool}/{tool}_out.sdf |
| docking_filters | Docking poses | stages/05_docking/... | stages/06_docking_filters/filtered_molecules.csv |
| final_descriptors | Final survivor table | Docking filters or latest usable filtered table | stages/07_descriptors_final/... |

Stage Configuration

Each stage has its own YAML configuration file referenced from the master config.yml:

    config_mol_prep: src/hedgehog/configs/config_mol_prep.yml
    config_descriptors: src/hedgehog/configs/config_descriptors.yml
    config_structFilters: src/hedgehog/configs/config_structFilters.yml
    config_synthesis: src/hedgehog/configs/config_synthesis.yml
    config_docking: src/hedgehog/configs/config_docking.yml
    config_docking_filters: src/hedgehog/configs/config_docking_filters.yml

Every stage config includes a run: true/false flag that controls whether it executes. See the Configuration Reference for full details.

Output Structure

Each pipeline run writes into the configured output directory. By default, fresh runs use an auto-numbered run directory derived from results/run (for example results/run, results/run_1, results/run_2, …). --out can override the directory and --reuse can intentionally reuse an existing one.

    results/run_N/
    +-- input/
    |   +-- sampled_molecules.csv
    +-- stages/
    |   +-- 00_mol_prep/
    |   +-- 01_descriptors_initial/
    |   +-- 03_structural_filters_post/
    |   +-- 04_synthesis/
    |   +-- 05_docking/
    |   +-- 06_docking_filters/
    |   +-- 07_descriptors_final/
    +-- output/
    |   +-- final_molecules.csv
    +-- configs/
    |   +-- master_config_resolved.yml
    |   +-- config_*.yml
    +-- report.html
    +-- report_data.json
    +-- stage_filter_audit.ipynb
    +-- RUN_INFO.md
    +-- run_YYYYMMDD_HHMMSS.log
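The auto-numbering of fresh run directories (results/run, results/run_1, results/run_2, and so on) could work roughly as follows. This is a guess at the scheme based only on the example names above: `next_run_dir` is a hypothetical helper, and picking the first unused suffix is an assumption about how gaps are handled.

```python
from pathlib import Path


def next_run_dir(base="results/run") -> Path:
    """Return `base` if it is unused, otherwise the first free `base_N`."""
    base = Path(base)
    if not base.exists():
        return base
    n = 1
    candidate = base.with_name(f"{base.name}_{n}")
    while candidate.exists():
        n += 1
        candidate = base.with_name(f"{base.name}_{n}")
    return candidate
```

The helper only computes a path and never creates or deletes anything, so it matches the default fresh-run behaviour; reusing an existing directory would bypass it entirely.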

output/final_molecules.csv is the canonical final artifact. It preserves all columns from the latest filtered dataset and includes aggregated docking score columns (gnina_affinity, gnina_cnnscore, gnina_cnnaffinity, gnina_cnn_vs, smina_affinity, matcha_affinity) when docking outputs are available.

.RUN_INCOMPLETE is a transient marker file written at the start of a run. It is removed on successful completion. If it remains in the run folder, the pipeline was interrupted or failed and the run log should be reviewed.
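A quick check for interrupted runs can be built on the marker file alone. The sketch below is not part of HEDGEHOG; `run_status` is a hypothetical helper that relies only on the `.RUN_INCOMPLETE` marker and the `run_YYYYMMDD_HHMMSS.log` naming shown in the directory layout.

```python
from pathlib import Path


def run_status(run_dir):
    """Report whether a run was interrupted and which log to inspect."""
    run_dir = Path(run_dir)
    # Timestamped names sort chronologically, so the last one is the newest.
    logs = sorted(run_dir.glob("run_*.log"))
    return {
        "interrupted": (run_dir / ".RUN_INCOMPLETE").exists(),
        "latest_log": logs[-1] if logs else None,
    }
```

Calling `run_status("results/run_1")` on a folder that still contains the marker returns `interrupted=True` together with the newest log path, which is the file to review first.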
