Getting Started

This guide covers installation, running your first pipeline, and understanding the output.

Prerequisites

  • Python 3.10+ — HEDGEHOG requires Python 3.10 or later.
  • uv — used for dependency management and running the CLI. Install from astral.sh/uv.
  • git — to clone the repository.

For the retrosynthesis (AiZynthFinder) part of the synthesis stage, you need the upstream aizynthfinder package and its public data. Use the built-in CLI setup command to install the optional dependency into the project environment and download the data into modules/aizynthfinder/.

Installation

Recommended: install from a source checkout.

# Clone the repository
git clone https://github.com/LigandPro/hedgehog.git
cd hedgehog

# Install HEDGEHOG and Python dependencies
uv sync

This is the supported path for the default sample configs, the setup helpers, and the TUI. The repository checkout includes the editable config files, bundled examples, and the modules/ workspace used by optional tool data installers.

Advanced: install from PyPI only if you plan to supply your own config and input paths instead of relying on the repository layout:

python -m pip install hedgehog
hedgehog --help

For full retrosynthesis runs, install AiZynthFinder after the base environment is working:

uv run hedgehog setup aizynthfinder

If you prefer on-demand installation of optional tools during a full pipeline run, use:

uv run hedgehog --auto-install

Legacy/manual fallback:

./modules/install_aizynthfinder.sh

First Safe Run

From a source checkout, start with the descriptor/filter-only smoke run:

uv run hedgehog --stage descriptors --stage struct_filters --force-new

Or using the short alias:

uv run hedge --stage descriptors --stage struct_filters --force-new

This uses the default configuration at src/hedgehog/configs/config.yml and the test molecules in src/hedgehog/configs/examples/, but it avoids docking and retrosynthesis. Use this run to verify that the Python environment, bundled examples, descriptor calculation, and structural filters work.

Full Pipeline Run

After the smoke run passes and optional tools are configured, run the full pipeline:

uv run hedgehog setup aizynthfinder
uv run hedgehog --auto-install

Full pipeline execution may require AiZynthFinder, GNINA/SMINA/Matcha, valid receptor structures, reference ligands, and enough CPU/GPU resources.

If you installed from PyPI instead of cloning the repository, do not rely on this default sample workflow. Pass your own --config and input paths, or use a source checkout.

Note: the default config.yml contains absolute paths for optional external preparation tools. If you do not have these tools, set ligand_preparation_tool / protein_preparation_tool to empty values (or disable the affected stages) before your first run.
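
For orientation, the relevant excerpt of config.yml might look like the fragment below. The two key names come from the note above; the surrounding structure and the empty-string values are assumptions, so check your actual config file.

```yaml
# Hypothetical excerpt of config.yml: disable external preparation tools
# by leaving the tool paths empty (key names from the note above).
ligand_preparation_tool: ""
protein_preparation_tool: ""
```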

Input Data Contract

The recommended input format is a CSV or TSV file with a smiles header:

smiles,model_name
CCO,baseline
CCN,baseline
c1ccccc1,baseline

For multi-model comparisons, keep one row per molecule/model pair:

smiles,model_name
CCO,model_a
CCO,model_b
CCN,model_a
CCN,model_b

Required:

  • smiles

Optional:

  • model_name or name
  • mol_idx

Generated:

  • mol_idx is assigned automatically if missing.

Headerless .smi files (detected by file extension) are supported for simple one-SMILES-per-line inputs; an optional second whitespace-separated token on each line is used as model_name. CSV/TSV with a smiles header remains the most explicit format for production runs.
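
A minimal reader honoring this contract might look like the sketch below. This is illustrative only, not HEDGEHOG's actual loader; the function name and the list-of-dicts return shape are assumptions.

```python
# Illustrative reader for the input contract above (not HEDGEHOG's loader).
# Column names follow the contract; everything else is an assumption.
import csv
from pathlib import Path


def load_molecules(path):
    """Return a list of dicts with smiles, model_name, and mol_idx keys."""
    path = Path(path)
    rows = []
    if path.suffix == ".smi":
        # Headerless .smi: one SMILES per line, optional model_name token.
        for line in path.read_text().splitlines():
            tokens = line.split()
            if not tokens:
                continue
            rows.append({
                "smiles": tokens[0],
                "model_name": tokens[1] if len(tokens) > 1 else None,
                "mol_idx": None,
            })
    else:
        delimiter = "\t" if path.suffix == ".tsv" else ","
        with path.open(newline="") as fh:
            reader = csv.DictReader(fh, delimiter=delimiter)
            if reader.fieldnames is None or "smiles" not in reader.fieldnames:
                raise ValueError("input must have a 'smiles' header")
            for row in reader:
                rows.append({
                    "smiles": row["smiles"],
                    # model_name or name are both accepted per the contract.
                    "model_name": row.get("model_name") or row.get("name"),
                    "mol_idx": row.get("mol_idx"),
                })
    # mol_idx is assigned automatically when missing.
    for idx, row in enumerate(rows):
        if row["mol_idx"] in (None, ""):
            row["mol_idx"] = idx
    return rows
```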

Expected Output

When you run the pipeline, you will see:

  1. Banner — the HEDGEHOG banner with version information.
  2. Mol Prep — Datamol-based standardization and strict filtering (salts/fragments cleanup, uncharging, tautomer canonicalization, stereo removal).
  3. Stage execution — each enabled stage runs in order, logging progress and molecule counts. Stages that are disabled in the configuration are skipped.
  4. Report generation — an HTML report, report data, and a stage-audit Jupyter notebook are generated automatically at the end of a successful run.

Typical console output looks like:

🦔 HEDGEHOG
Starting pipeline...
Stage 1: Mol Prep (Datamol) — 500 molecules processed, 451 passed
Stage 2: Molecular Descriptors — 451 molecules processed, 423 passed
Stage 3: Structural Filters — 423 molecules processed, 387 passed
Stage 4: Synthesis Analysis — 387 molecules processed, 312 passed
...
Pipeline completed: 7/7 stages successful

Understanding Results

Results are saved to an auto-numbered directory under the configured output path (default: results/run_N/). Each run creates a new numbered folder unless you use the --reuse flag.
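
Since runs are auto-numbered, a small helper can locate the newest run_N directory. This is a sketch based only on the naming convention described above; the helper itself is not part of HEDGEHOG.

```python
# Sketch: find the newest auto-numbered run directory under results/.
# The run_N naming comes from the text; the function is an assumption.
from pathlib import Path


def latest_run(results_dir="results"):
    """Return the Path of the highest-numbered run_N directory, or None."""
    runs = []
    for p in Path(results_dir).glob("run_*"):
        suffix = p.name.split("_", 1)[1]
        if p.is_dir() and suffix.isdigit():
            runs.append((int(suffix), p))
    return max(runs)[0:2][1] if runs else None
```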

The output directory structure:

results/run_1/
├── input/
│   └── sampled_molecules.csv        # Input molecules (sampled)
├── stages/
│   ├── 00_mol_prep/                 # Datamol-based standardization + strict filtering
│   ├── 01_descriptors_initial/      # Physicochemical descriptors
│   ├── 03_structural_filters_post/  # Post-descriptors structural filters
│   ├── 04_synthesis/                # Retrosynthesis analysis
│   ├── 05_docking/                  # Molecular docking (SMINA/GNINA/Matcha)
│   ├── 06_docking_filters/          # Post-docking pose & interaction filters
│   └── 07_descriptors_final/        # Final descriptor recalculation
├── output/
│   └── final_molecules.csv          # Final filtered molecules
├── configs/                         # Configuration snapshots
│   ├── master_config_resolved.yml
│   └── config_*.yml
├── report.html                      # Interactive HTML report
├── report_data.json                 # Report data
├── stage_filter_audit.ipynb         # Jupyter notebook for stage-by-stage molecule audit
└── RUN_INFO.md                      # Run summary with MolEval metrics table

Each stage directory contains a filtered_molecules.csv file with the molecules that passed that stage, along with any stage-specific outputs (plots, intermediate files). The report.html file is the primary deliverable for a browser-based overview, while stage_filter_audit.ipynb lets you inspect passed or dropped molecules per stage in mols2grid and compare them against descriptor, synthesis, or docking-filter thresholds.
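
To see per-stage attrition programmatically, you can count the rows in each stage's filtered_molecules.csv. The sketch below assumes only the layout shown above; the function name and the dict summary shape are assumptions, and the HTML report remains the canonical overview.

```python
# Illustrative per-stage survivor counts built from the run layout above.
# The stages/ and filtered_molecules.csv names come from the text.
import csv
from pathlib import Path


def stage_counts(run_dir):
    """Map each stage directory name to its surviving molecule count."""
    counts = {}
    for stage_dir in sorted(Path(run_dir, "stages").iterdir()):
        csv_path = stage_dir / "filtered_molecules.csv"
        if csv_path.exists():
            with csv_path.open(newline="") as fh:
                counts[stage_dir.name] = sum(1 for _ in csv.DictReader(fh))
    return counts
```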
