Docking
The docking stage writes ligand SDFs and runs molecular docking simulations against a target protein structure. HEDGEHOG supports three docking engines (SMINA, GNINA, and Matcha). The default config keeps prepare_ligands: false, which uses input molecules directly where possible and preserves a near 1:1 mapping from input rows to docking ligands.
Docking Engines
| Feature | SMINA | GNINA | Matcha |
|---|---|---|---|
| Scoring | Empirical (Vina-based) | CNN-based + empirical | Learned pose generation + optional GNINA scoring |
| Speed | Fast | Slower (GPU-accelerated) | Slower (ML inference) |
| Output scores | minimizedAffinity | minimizedAffinity, CNNscore, CNNaffinity | minimizedAffinity when GNINA scoring is enabled |
| GPU support | No | Yes (CUDA) | Yes (recommended) |
| Output format | SDF | SDF | SDF |
| Best for | Large-scale virtual screening | Accurate pose prediction | ML-based pose generation with optional rescoring |
SMINA is a fork of AutoDock Vina with enhanced scoring and minimization. GNINA extends SMINA with convolutional neural network (CNN) scoring functions trained on protein-ligand complexes from the PDB.
Select the tool in the docking config:
tools: gnina  # Options: all, gnina, smina, matcha, or a comma-separated list
Use tools: all to run SMINA, GNINA, and Matcha together. For subsets, provide a comma-separated YAML value such as gnina,matcha or smina,gnina.
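As an illustration of how a tools value expands into a list of engines, here is a minimal sketch. The helper name resolve_tools, the tolerance for whitespace and upper case, and the error messages are assumptions made for this example; they are not taken from HEDGEHOG's own parser.

```python
KNOWN_ENGINES = ("smina", "gnina", "matcha")


def resolve_tools(value: str) -> list[str]:
    """Expand a `tools` config value into an ordered list of engine names.

    Hypothetical helper: HEDGEHOG's real parsing may differ in details.
    """
    names = [part.strip().lower() for part in value.split(",") if part.strip()]
    if not names:
        raise ValueError("tools must name at least one docking engine")
    if "all" in names:
        return list(KNOWN_ENGINES)
    unknown = [name for name in names if name not in KNOWN_ENGINES]
    if unknown:
        raise ValueError(f"unknown docking engine(s): {', '.join(unknown)}")
    # Drop duplicates while keeping the order given in the config.
    return list(dict.fromkeys(names))


print(resolve_tools("all"))           # ['smina', 'gnina', 'matcha']
print(resolve_tools("gnina,matcha"))  # ['gnina', 'matcha']
```

Rejecting unknown names early keeps a typo such as gnia from silently skipping an engine.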
Ligand Preparation
Docking requires 3D molecular structures in SDF format.
Default Path: Direct RDKit SDF Generation
With prepare_ligands: false (the shipped default), the docking stage uses built-in RDKit conversion:
- Parse SMILES with Chem.MolFromSmiles()
- Add explicit hydrogens with Chem.AddHs()
- Generate 3D coordinates with AllChem.EmbedMolecule() (ETKDG method)
- Optimize geometry with AllChem.UFFOptimizeMolecule() (UFF force field)
This produces reasonable starting geometries and keeps row counts predictable, but it does not handle tautomer enumeration or stereoisomer expansion.
External Ligand Preparation Tool (Optional)
For production-quality ligand preparation, configure an external tool path in config.yml:
ligand_preparation_tool: /path/to/ligand_prep_tool
An external preparation tool can provide:
- Tautomer enumeration
- Stereoisomer expansion
- Ionization state prediction at physiological pH
- Optimized 3D geometry with OPLS force field
The pipeline auto-detects the input format (CSV or SMI) and calls the tool accordingly.
Custom Handler Support
ligand_preparation_tool and protein_preparation_tool can point to any executable (absolute path or command available in PATH).
External handlers are optional:
- If no external tool is configured, HEDGEHOG uses built-in behavior.
- Ligands fall back to RDKit preparation.
- Protein input is used as-is when no protein handler is configured.
For custom handlers, ensure they run non-interactively and produce valid output files expected by the docking stage.
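Because a handler may be given either as an absolute path or as a command name on PATH, it can help to resolve it before a long run starts. The sketch below shows one way to do that with the standard library; resolve_handler is a hypothetical helper, not part of HEDGEHOG.

```python
import shutil
import sys
from typing import Optional


def resolve_handler(tool: Optional[str]) -> Optional[str]:
    """Return the executable path for a configured handler, or None if unset.

    Hypothetical pre-flight check; HEDGEHOG may validate handlers differently.
    """
    if not tool:
        return None  # nothing configured: the caller uses built-in behavior
    # shutil.which accepts both bare command names and explicit paths.
    resolved = shutil.which(tool)
    if resolved is None:
        raise FileNotFoundError(f"handler not found or not executable: {tool}")
    return resolved


print(resolve_handler(None))            # None
print(resolve_handler(sys.executable))  # path of the running Python interpreter
```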
External ligand preparation is controlled by prepare_ligands in config_docking.yml:
prepare_ligands: false
Set it to true only when you explicitly want to run the configured external ligand preparation path. One input molecule may produce multiple prepared ligands, which can change row counts and downstream mapping.
For pipeline input preprocessing before stages:
- CSV input always uses built-in RDKit preprocessing.
- SMI-like input may use ligand_preparation_tool.
- The produced CSV must contain a smiles column (case-insensitive).
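The case-insensitive smiles column requirement can be checked before the file is handed to the pipeline. This is a minimal sketch using only the standard library; find_smiles_column and the example CSV contents are illustrative assumptions, not HEDGEHOG code.

```python
import csv
import io
from typing import Optional


def find_smiles_column(csv_text: str) -> Optional[str]:
    """Return the header that matches 'smiles' ignoring case, or None."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    for name in header:
        if name.strip().lower() == "smiles":
            return name
    return None


example = "ID,SMILES\nmol_1,CCO\nmol_2,c1ccccc1\n"
print(find_smiles_column(example))         # SMILES
print(find_smiles_column("id,name\n1,x"))  # None
```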
Protein Preparation
The receptor PDB file is specified in the docking config:
receptor_pdb: src/hedgehog/configs/examples/7EW9_apo.pdb
For best results, prepare the receptor with a dedicated structure-preparation tool (or PDB2PQR) before docking. Key steps include:
- Removing water molecules and co-crystallized ligands
- Adding hydrogens at physiological pH
- Optimizing hydrogen-bond networks
- Minimizing heavy-atom positions (restrained)
Autobox Configuration
The docking search box defines the 3D region where the docking engine places ligand poses. HEDGEHOG determines the box from a reference ligand:
gnina_config:
  autobox_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf  # Reference ligand SDF
  autobox_add: 4  # Padding in Angstroms
The search box is computed as the bounding box of the reference ligand’s atoms, expanded by autobox_add Angstroms in each direction.
Important: The reference ligand SDF must be in the same coordinate frame as the receptor PDB. If using an apo (unbound) structure, you may need to superimpose the reference ligand from a holo (bound) structure onto the apo receptor coordinates.
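The box calculation described above can be sketched in a few lines. The function below takes plain coordinate tuples rather than reading an SDF file, and it reads "in each direction" as padding on both sides of every axis; the function name and the two-atom example are illustrative assumptions.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float, float]


def autobox(coords: Iterable[Point], autobox_add: float) -> Tuple[Point, Point]:
    """Return (center, size) of the padded bounding box around the atoms."""
    points = list(coords)
    if not points:
        raise ValueError("reference ligand has no atoms")
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    # Padding is applied on both sides of each axis,
    # so every edge grows by 2 * autobox_add.
    size = tuple((hi - lo) + 2 * autobox_add for lo, hi in zip(mins, maxs))
    return center, size


atoms = [(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)]
center, size = autobox(atoms, autobox_add=4)
print(center)  # (1.0, 2.0, 3.0)
print(size)    # (10.0, 12.0, 14.0)
```

Note that the box center depends only on the reference ligand, which is why a ligand in the wrong coordinate frame places the whole search region away from the binding site.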
Configuration
Full configuration in config_docking.yml:
run: true
tools: gnina # all, gnina, smina, matcha, or a comma-separated list
receptor_pdb: src/hedgehog/configs/examples/7EW9_apo.pdb
auto_run: true # Execute scripts after generation
run_in_background: false
prepare_ligands: false # Keep input-to-ligand mapping near 1:1
gnina_per_process_cpu: 8 # CPU threads per GNINA process
gnina_parallel_jobs_max: 6 # Cap auto parallel jobs
smina_config:
  bin: smina # Binary name or path
  autobox_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf
  autobox_add: 4
  cpu: 32
  seed: 42
  exhaustiveness: 8 # Search thoroughness
  num_modes: 1 # Max poses per molecule in the default config
gnina_config:
  bin: gnina
  autobox_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf
  autobox_add: 4
  cpu: 8
  seed: 42
  no_gpu: false # Use GPU when available
  num_modes: 1 # Max poses per molecule in the default config
matcha_config:
  checkout_dir: modules/matcha_remote # Managed git checkout created from GitHub PR head
  uv_bin: uv # Launcher used for `uv run --project`
  autobox_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf
  device: auto
  n_samples: 20
  scorer: gnina
  scorer_minimize: true
  physical_only: false
  keep_workdir: false
By default, Matcha is fetched from the remote LigandPro/Matcha repository and checked out at the head SHA of the latest open PR without approvals. The managed checkout is stored under modules/matcha_remote/.
For production runs, prefer a controlled Matcha checkout or an explicitly pinned setup path instead of relying on dynamically selected development code.
Set auto_run: false to generate docking scripts without executing them. This is useful for running docking on a cluster or GPU node separately.
Batch Docking Workflow
The docking stage processes molecules individually:
- Each molecule is written to a separate SDF file in _workdir/molecules/
- Per-molecule docking config files are generated in _workdir/configs/
- Batch shell scripts (run_gnina.sh, run_smina.sh, run_matcha.sh) are generated
- If auto_run: true, scripts are executed automatically
- Per-molecule results are collected in _workdir/gnina/, _workdir/smina/, or matcha/<run_name>/best_poses/
- Results are aggregated into gnina/gnina_out.sdf, smina/smina_out.sdf, or matcha/matcha_out.sdf
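SMINA and GNINA store scores such as minimizedAffinity as SD data fields in their SDF output, so the aggregated files can be inspected without a cheminformatics toolkit. The sketch below reads one tag per record from SDF text. The function name, the abbreviated example records (mol blocks omitted), and the assumption that each record's title line holds the molecule identifier are illustrative only.

```python
from typing import List, Tuple


def read_sdf_scores(sdf_text: str, tag: str = "minimizedAffinity") -> List[Tuple[str, float]]:
    """Return (record title, score) for every SDF record that carries `tag`."""
    scores = []
    record: List[str] = []
    for line in sdf_text.splitlines():
        if line.strip() != "$$$$":  # "$$$$" terminates each SDF record
            record.append(line)
            continue
        title = record[0].strip() if record else ""
        for i, text in enumerate(record[:-1]):
            # An SD data header looks like "> <minimizedAffinity>";
            # its value is on the following line.
            if text.startswith(">") and f"<{tag}>" in text:
                scores.append((title, float(record[i + 1])))
                break
        record = []
    return scores


# Abbreviated records: atom and bond blocks are left out for brevity.
example = "\n".join([
    "mol_1", "M  END", "> <minimizedAffinity>", "-7.4", "", "$$$$",
    "mol_2", "M  END", "> <minimizedAffinity>", "-8.1", "", "$$$$",
])
scores = read_sdf_scores(example)
print(scores)                           # [('mol_1', -7.4), ('mol_2', -8.1)]
print(min(scores, key=lambda s: s[1]))  # ('mol_2', -8.1)
```

Vina-style affinities are more favorable when more negative, which is why the example uses min to pick the top-scoring record.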
Output Structure
stages/05_docking/
+-- ligands.csv              Prepared ligands
+-- job_meta.json            Job metadata
+-- gnina/
|   +-- gnina_out.sdf        Aggregated GNINA results
+-- smina/
|   +-- smina_out.sdf        Aggregated SMINA results
+-- matcha/
|   +-- matcha_out.sdf       Aggregated Matcha results
+-- _workdir/
    +-- molecules/           Per-molecule SDF files
    +-- configs/             Per-molecule docking configs
    +-- gnina/               GNINA per-molecule results & logs
    +-- smina/               SMINA per-molecule results & logs
    +-- run_matcha.sh
    +-- run_gnina.sh
    +-- run_smina.sh
Usage
# Run docking as part of the full pipeline
uv run hedgehog
# Run docking stage only
uv run hedgehog --stage docking
# Short alias
uv run hedge --stage docking
To run the generated scripts manually (e.g., on a GPU node):
cd results/stages/05_docking/_workdir
./run_matcha.sh
./run_gnina.sh
./run_smina.sh