# Getting Started
This guide covers installation, running your first pipeline, and understanding the output.
## Prerequisites
- Python 3.10+ — HEDGEHOG requires Python 3.10 or later.
- uv — used for dependency management and running the CLI. Install it from astral.sh/uv.
- git — to clone the repository.
For the retrosynthesis (AiZynthFinder) part of the synthesis stage, you need the upstream aizynthfinder package and its public data. Use the built-in CLI setup command to install the optional dependency into the project environment and download the data into modules/aizynthfinder/.
## Installation
Recommended: install from a source checkout.
```bash
# Clone the repository
git clone https://github.com/LigandPro/hedgehog.git
cd hedgehog

# Install HEDGEHOG and Python dependencies
uv sync
```

This is the supported path for the default sample configs, the setup helpers, and the TUI. The repository checkout includes the editable config files, bundled examples, and the modules/ workspace used by optional tool data installers.
Advanced: install from PyPI only if you plan to supply your own config and input paths instead of relying on the repository layout:
```bash
python -m pip install hedgehog
hedgehog --help
```

For full retrosynthesis runs, install AiZynthFinder after the base environment is working:
```bash
uv run hedgehog setup aizynthfinder
```

If you prefer on-demand installation of optional tools during a full pipeline run, use:
```bash
uv run hedgehog --auto-install
```

Legacy/manual fallback:
```bash
./modules/install_aizynthfinder.sh
```

## First Safe Run
From a source checkout, start with the descriptor/filter-only smoke run:
```bash
uv run hedgehog --stage descriptors --stage struct_filters --force-new
```

Or using the short alias:
```bash
uv run hedge --stage descriptors --stage struct_filters --force-new
```

This uses the default configuration at src/hedgehog/configs/config.yml and the test molecules in src/hedgehog/configs/examples/, but it avoids docking and retrosynthesis. Use this run to verify that the Python environment, bundled examples, descriptor calculation, and structural filters work.
## Full Pipeline Run
After the smoke run passes and optional tools are configured, run the full pipeline:
```bash
uv run hedgehog setup aizynthfinder
uv run hedgehog --auto-install
```

Full pipeline execution may require AiZynthFinder, GNINA/SMINA/Matcha, valid receptor structures, reference ligands, and enough CPU/GPU resources.
If you installed from PyPI instead of cloning the repository, do not rely on this default sample workflow. Pass your own --config and input paths, or use a source checkout.
Note: the default config.yml contains absolute paths for optional external preparation tools. If you do not have these tools, set ligand_preparation_tool / protein_preparation_tool to empty values (or disable the affected stages) before your first run.
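To disable those tools, a minimal config fragment might look like the following. The key names are taken from the note above; the surrounding structure of config.yml is not shown here and may differ, so treat this as an illustrative sketch rather than a complete file:

```yaml
# Illustrative fragment: leave the external preparation tools unset
# if they are not installed locally.
ligand_preparation_tool: ""
protein_preparation_tool: ""
```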
## Input Data Contract
The recommended input format is a CSV or TSV file with a smiles header:
```csv
smiles,model_name
CCO,baseline
CCN,baseline
c1ccccc1,baseline
```

For multi-model comparisons, keep one row per molecule/model pair:
```csv
smiles,model_name
CCO,model_a
CCO,model_b
CCN,model_a
CCN,model_b
```

- Required: `smiles`
- Optional: `model_name` (or `name`), `mol_idx`
- Generated: `mol_idx` is assigned automatically if missing.
Headerless .smi files are supported by extension for simple one-SMILES-per-line
inputs, with an optional second whitespace token used as model_name. CSV/TSV
with a smiles header remains the most explicit format for production runs.
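The contract above can be sketched in a few lines of Python. This is not HEDGEHOG's own loader, just a hypothetical `normalize_input` helper showing the rules: require a `smiles` header and generate `mol_idx` when it is missing (headerless .smi handling is omitted):

```python
import csv
import io

def normalize_input(text: str) -> list[dict]:
    """Validate CSV text against the input contract: a 'smiles'
    column is required, and 'mol_idx' is assigned when absent."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows or "smiles" not in rows[0]:
        raise ValueError("input must be CSV with a 'smiles' header")
    for idx, row in enumerate(rows):
        row.setdefault("mol_idx", idx)  # generated if missing
    return rows

sample = "smiles,model_name\nCCO,model_a\nCCO,model_b\n"
print(normalize_input(sample)[0])
# {'smiles': 'CCO', 'model_name': 'model_a', 'mol_idx': 0}
```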
## Expected Output
When you run the pipeline, you will see:
- Banner — the HEDGEHOG banner with version information.
- Mol Prep — Datamol-based standardization and strict filtering (salts/fragments cleanup, uncharging, tautomer canonicalization, stereo removal).
- Stage execution — each enabled stage runs in order, logging progress and molecule counts. Stages that are disabled in the configuration are skipped.
- Report generation — an HTML report, report data, and a stage-audit Jupyter notebook are generated automatically at the end of a successful run.
Typical console output looks like:
```
🦔 HEDGEHOG
Starting pipeline...
Stage 1: Mol Prep (Datamol) — 500 molecules processed, 451 passed
Stage 2: Molecular Descriptors — 451 molecules processed, 423 passed
Stage 3: Structural Filters — 423 molecules processed, 387 passed
Stage 4: Synthesis Analysis — 387 molecules processed, 312 passed
...
Pipeline completed: 7/7 stages successful
```

## Understanding Results
Results are saved to an auto-numbered directory under the configured output path (default: results/run_N/). Each run creates a new numbered folder unless you use the --reuse flag.
The output directory structure:
```
results/run_1/
├── input/
│   └── sampled_molecules.csv        # Input molecules (sampled)
├── stages/
│   ├── 00_mol_prep/                 # Datamol-based standardization + strict filtering
│   ├── 01_descriptors_initial/      # Physicochemical descriptors
│   ├── 03_structural_filters_post/  # Post-descriptors structural filters
│   ├── 04_synthesis/                # Retrosynthesis analysis
│   ├── 05_docking/                  # Molecular docking (SMINA/GNINA/Matcha)
│   ├── 06_docking_filters/          # Post-docking pose & interaction filters
│   └── 07_descriptors_final/       # Final descriptor recalculation
├── output/
│   └── final_molecules.csv          # Final filtered molecules
├── configs/                         # Configuration snapshots
│   ├── master_config_resolved.yml
│   └── config_*.yml
├── report.html                      # Interactive HTML report
├── report_data.json                 # Report data
├── stage_filter_audit.ipynb         # Jupyter notebook for stage-by-stage molecule audit
└── RUN_INFO.md                      # Run summary with MolEval metrics table
```

Each stage directory contains a filtered_molecules.csv file with the molecules that passed that stage, along with any stage-specific outputs (plots, intermediate files). The report.html file is the primary deliverable for a browser-based overview, while stage_filter_audit.ipynb lets you inspect passed or dropped molecules per stage in mols2grid and compare them against descriptor, synthesis, or docking-filter thresholds.
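Because every stage writes a filtered_molecules.csv, you can tally per-stage survival outside the HTML report. The `stage_counts` helper below is a hypothetical sketch, not part of HEDGEHOG's API; it assumes only the layout described above. The demo builds a throwaway run directory so the snippet is self-contained:

```python
import csv
import tempfile
from pathlib import Path

def stage_counts(run_dir: Path) -> dict[str, int]:
    """Count molecules that survived each stage by reading the
    filtered_molecules.csv expected in every stage directory."""
    counts = {}
    for stage_dir in sorted((run_dir / "stages").iterdir()):
        csv_path = stage_dir / "filtered_molecules.csv"
        if csv_path.is_file():
            with csv_path.open() as fh:
                counts[stage_dir.name] = sum(1 for _ in csv.DictReader(fh))
    return counts

# Demo on a temporary directory mimicking the layout above.
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp) / "run_1"
    for name, n in [("00_mol_prep", 3), ("01_descriptors_initial", 2)]:
        stage = run / "stages" / name
        stage.mkdir(parents=True)
        (stage / "filtered_molecules.csv").write_text("smiles\n" + "CCO\n" * n)
    print(stage_counts(run))
    # {'00_mol_prep': 3, '01_descriptors_initial': 2}
```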