Docking

The docking stage writes ligand SDFs and runs molecular docking simulations against a target protein structure. HEDGEHOG supports three docking engines (SMINA, GNINA, and Matcha). The default config keeps prepare_ligands: false, which uses input molecules directly where possible and preserves a near 1:1 mapping from input rows to docking ligands.

Docking Engines

| Feature | SMINA | GNINA | Matcha |
| --- | --- | --- | --- |
| Scoring | Empirical (Vina-based) | CNN-based + empirical | Learned pose generation + optional GNINA scoring |
| Speed | Fast | Slower (GPU-accelerated) | Slower (ML inference) |
| Output scores | minimizedAffinity | minimizedAffinity, CNNscore, CNNaffinity | minimizedAffinity when GNINA scoring is enabled |
| GPU support | No | Yes (CUDA) | Yes (recommended) |
| Output format | SDF | SDF | SDF |
| Best for | Large-scale virtual screening | Accurate pose prediction | ML-based pose generation with optional rescoring |

SMINA is a fork of AutoDock Vina with enhanced scoring and minimization. GNINA extends SMINA with convolutional neural network (CNN) scoring functions trained on protein-ligand complexes from the PDB.

Select the tool in the docking config:

tools: gnina # Options: all, gnina, smina, matcha, or a comma-separated list

Use tools: all to run SMINA, GNINA, and Matcha together. For subsets, provide a comma-separated YAML value such as gnina,matcha or smina,gnina.
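
As an illustration of how such a value could be interpreted, the sketch below expands all and splits a comma-separated string into engine names. The helper name parse_tools, the engine order, and the error handling are assumptions made for this example; they are not taken from the HEDGEHOG source.

```python
KNOWN_TOOLS = ("smina", "gnina", "matcha")


def parse_tools(value: str) -> list[str]:
    """Expand a `tools` value into an ordered list of engine names (hypothetical helper)."""
    names = [part.strip().lower() for part in value.split(",") if part.strip()]
    if not names:
        raise ValueError("tools must name at least one docking engine")
    if "all" in names:
        return list(KNOWN_TOOLS)
    unknown = [name for name in names if name not in KNOWN_TOOLS]
    if unknown:
        raise ValueError(f"unknown docking tool(s): {', '.join(unknown)}")
    # Drop duplicates while keeping the order the user wrote.
    return list(dict.fromkeys(names))
```

For example, parse_tools("gnina, matcha") yields the two engine names in the order given, and an unrecognized name raises ValueError rather than being silently ignored.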

Ligand Preparation

Docking requires 3D molecular structures in SDF format.

Default Path: Direct RDKit SDF Generation

With prepare_ligands: false (the shipped default), the docking stage uses built-in RDKit conversion:

  1. Parse SMILES with Chem.MolFromSmiles()
  2. Add explicit hydrogens with Chem.AddHs()
  3. Generate 3D coordinates with AllChem.EmbedMolecule() (ETKDG method)
  4. Optimize geometry with AllChem.UFFOptimizeMolecule() (UFF force field)

This produces reasonable starting geometries and keeps row counts predictable, but it does not handle tautomer enumeration or stereoisomer expansion.
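
The four steps above can be sketched with RDKit as follows. This is a minimal illustration rather than HEDGEHOG's own code: the function names, the choice of the ETKDGv3 parameter set, the fixed seed, and the error handling are all assumptions.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def smiles_to_3d(smiles: str, seed: int = 42) -> Chem.Mol:
    """Build a 3D molecule with explicit hydrogens from SMILES (illustrative helper)."""
    mol = Chem.MolFromSmiles(smiles)            # 1. parse SMILES
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    mol = Chem.AddHs(mol)                       # 2. add explicit hydrogens
    params = AllChem.ETKDGv3()                  # 3. ETKDG embedding (v3 is an assumption)
    params.randomSeed = seed                    # fixed seed for reproducible coordinates
    if AllChem.EmbedMolecule(mol, params) == -1:
        raise RuntimeError(f"3D embedding failed for {smiles!r}")
    AllChem.UFFOptimizeMolecule(mol)            # 4. UFF geometry optimization
    return mol


def write_sdf(mol: Chem.Mol, path: str) -> None:
    """Write a single molecule to an SDF file."""
    writer = Chem.SDWriter(path)
    writer.write(mol)
    writer.close()
```

Each input SMILES produces exactly one ligand here, which is why this path keeps the input-to-ligand mapping predictable.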

External Ligand Preparation Tool (Optional)

For production-quality ligand preparation, configure an external tool path in config.yml:

ligand_preparation_tool: /path/to/ligand_prep_tool

An external preparation tool can provide:

  • Tautomer enumeration
  • Stereoisomer expansion
  • Ionization state prediction at physiological pH
  • Optimized 3D geometry with OPLS force field

The pipeline auto-detects the input format (CSV or SMI) and calls the tool accordingly.

Custom Handler Support

ligand_preparation_tool and protein_preparation_tool can point to any executable (absolute path or command available in PATH).

External handlers are optional:

  • If no external tool is configured, HEDGEHOG uses built-in behavior.
  • Ligands fall back to RDKit preparation.
  • Protein input is used as-is when no protein handler is configured.

For custom handlers, ensure they run non-interactively and produce valid output files expected by the docking stage.
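
A handler setting that accepts either an absolute path or a command on PATH can be resolved with the standard library. The sketch below is an assumption about how such a lookup might be written; the helper name resolve_handler is hypothetical, and raising an error for a configured but missing tool is a design choice of this example, not documented HEDGEHOG behavior.

```python
import shutil
from typing import Optional


def resolve_handler(tool: Optional[str]) -> Optional[str]:
    """Return the executable path for a configured handler, or None if none is configured."""
    if not tool:
        return None  # nothing configured: the caller uses built-in behavior
    resolved = shutil.which(tool)  # accepts a PATH command or an explicit path
    if resolved is None:
        raise FileNotFoundError(f"handler not found or not executable: {tool}")
    return resolved
```

Failing loudly on a misconfigured path avoids silently falling back to built-in preparation when the user expected the external tool to run.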

External ligand preparation is controlled by prepare_ligands in config_docking.yml:

prepare_ligands: false

Set it to true only when you explicitly want to run the configured external ligand preparation path. One input molecule may produce multiple prepared ligands, which can change row counts and downstream mapping.

For pipeline input preprocessing before stages:

  • CSV input always uses built-in RDKit preprocessing.
  • SMI-like input may use ligand_preparation_tool.
  • The produced CSV must contain a smiles column (case-insensitive).
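
When writing a custom handler, it can help to verify the produced CSV before handing it to the pipeline. The check below is a small sketch of the case-insensitive smiles column requirement; the function name find_smiles_column is hypothetical, and the exact matching rules HEDGEHOG applies (for example, whitespace handling) are assumed.

```python
import csv
from typing import Optional, TextIO


def find_smiles_column(handle: TextIO) -> Optional[str]:
    """Return the header name matching 'smiles' case-insensitively, or None if absent."""
    reader = csv.reader(handle)
    header = next(reader, [])  # an empty file yields an empty header
    for name in header:
        if name.strip().lower() == "smiles":
            return name
    return None
```

The function returns the header exactly as written in the file, so the caller can use it to index rows without renaming columns.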

Protein Preparation

The receptor PDB file is specified in the docking config:

receptor_pdb: src/hedgehog/configs/examples/7EW9_apo.pdb

For best results, prepare the receptor with a dedicated structure-preparation tool (or PDB2PQR) before docking. Key steps include:

  • Removing water molecules and co-crystallized ligands
  • Adding hydrogens at physiological pH
  • Optimizing hydrogen-bond networks
  • Minimizing heavy-atom positions (restrained)

Autobox Configuration

The docking search box defines the 3D region where the docking engine places ligand poses. HEDGEHOG determines the box from a reference ligand:

gnina_config:
  autobox_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf  # Reference ligand SDF
  autobox_add: 4  # Padding in Angstroms

The search box is computed as the bounding box of the reference ligand’s atoms, expanded by autobox_add Angstroms in each direction.

Important: The reference ligand SDF must be in the same coordinate frame as the receptor PDB. If using an apo (unbound) structure, you may need to superimpose the reference ligand from a holo (bound) structure onto the apo receptor coordinates.
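
The box computation described above can be written out as a short calculation. This sketch takes plain coordinate tuples instead of parsing an SDF, and the function name autobox is hypothetical; the docking engines perform the equivalent step internally.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float, float]


def autobox(coords: Iterable[Point], autobox_add: float = 4.0) -> Tuple[Point, Point]:
    """Return (center, size) of the padded bounding box around the given atoms."""
    points = list(coords)
    if not points:
        raise ValueError("reference ligand has no atoms")
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    # Padding is applied on both sides of each axis, so each edge grows by 2 * autobox_add.
    size = tuple((hi - lo) + 2 * autobox_add for lo, hi in zip(mins, maxs))
    return center, size
```

For a ligand spanning 2 x 4 x 6 Angstroms with autobox_add: 4, the resulting box measures 10 x 12 x 14 Angstroms and is centered on the ligand's bounding box.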

Configuration

Full configuration in config_docking.yml:

run: true
tools: gnina  # all, gnina, smina, matcha, or a comma-separated list
receptor_pdb: src/hedgehog/configs/examples/7EW9_apo.pdb
auto_run: true  # Execute scripts after generation
run_in_background: false
prepare_ligands: false  # Keep input-to-ligand mapping near 1:1
gnina_per_process_cpu: 8  # CPU threads per GNINA process
gnina_parallel_jobs_max: 6  # Cap auto parallel jobs

smina_config:
  bin: smina  # Binary name or path
  autobox_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf
  autobox_add: 4
  cpu: 32
  seed: 42
  exhaustiveness: 8  # Search thoroughness
  num_modes: 1  # Max poses per molecule in the default config

gnina_config:
  bin: gnina
  autobox_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf
  autobox_add: 4
  cpu: 8
  seed: 42
  no_gpu: false  # Use GPU when available
  num_modes: 1  # Max poses per molecule in the default config

matcha_config:
  checkout_dir: modules/matcha_remote  # Managed git checkout created from GitHub PR head
  uv_bin: uv  # Launcher used for `uv run --project`
  autobox_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf
  device: auto
  n_samples: 20
  scorer: gnina
  scorer_minimize: true
  physical_only: false
  keep_workdir: false
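
The comments on gnina_per_process_cpu and gnina_parallel_jobs_max suggest that the number of concurrent GNINA processes is derived from the available CPUs and then capped. How HEDGEHOG actually computes this is not stated here, so the following is only one plausible reading; the function name and the formula are assumptions.

```python
import os
from typing import Optional


def estimate_parallel_jobs(
    per_process_cpu: int,
    jobs_max: int,
    total_cpus: Optional[int] = None,
) -> int:
    """Guess how many GNINA processes fit on this machine (hypothetical formula)."""
    if per_process_cpu < 1 or jobs_max < 1:
        raise ValueError("per_process_cpu and jobs_max must be positive")
    if total_cpus is None:
        total_cpus = os.cpu_count() or 1
    # Fit as many processes as the CPU budget allows, but never fewer than 1
    # and never more than the configured cap.
    return max(1, min(jobs_max, total_cpus // per_process_cpu))
```

Under this reading, a 32-core machine with the defaults above would run 4 GNINA processes of 8 threads each, and the cap of 6 only takes effect from 48 cores upward.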

By default, Matcha is fetched from the remote LigandPro/Matcha repository and checked out at the head SHA of the latest open PR without approvals. The managed checkout is stored under modules/matcha_remote/.

For production runs, prefer a controlled Matcha checkout or an explicitly pinned setup path instead of relying on dynamically selected development code.

Set auto_run: false to generate docking scripts without executing them. This is useful for running docking on a cluster or GPU node separately.

Batch Docking Workflow

The docking stage processes molecules individually:

  1. Each molecule is written to a separate SDF file in _workdir/molecules/
  2. Per-molecule docking config files are generated in _workdir/configs/
  3. Batch shell scripts (run_gnina.sh, run_smina.sh, run_matcha.sh) are generated
  4. If auto_run: true, scripts are executed automatically
  5. Per-molecule results are collected in _workdir/gnina/, _workdir/smina/, or matcha/<run_name>/best_poses/
  6. Results are aggregated into gnina/gnina_out.sdf, smina/smina_out.sdf, or matcha/matcha_out.sdf
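
The final aggregation step amounts to concatenating per-molecule SDF files, since SDF records are self-delimiting (each ends with a $$$$ line). The sketch below shows that idea in isolation; the function name aggregate_sdf is hypothetical, and HEDGEHOG's own aggregation may do more (for example, filtering or annotating records).

```python
from pathlib import Path
from typing import Iterable


def aggregate_sdf(inputs: Iterable[Path], output: Path) -> int:
    """Concatenate SDF files into `output` and return the number of records written."""
    records = 0
    with output.open("w", encoding="utf-8") as out:
        for path in inputs:
            text = path.read_text(encoding="utf-8")
            if not text.strip():
                continue  # skip empty files, e.g. from a failed docking run
            if not text.endswith("\n"):
                text += "\n"  # keep the next record on its own line
            out.write(text)
            records += sum(1 for line in text.splitlines() if line.strip() == "$$$$")
    return records
```

When globbing a results directory for inputs, make sure the output file is not among the files being read.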

Output Structure

stages/05_docking/
+-- ligands.csv            Prepared ligands
+-- job_meta.json          Job metadata
+-- gnina/
|   +-- gnina_out.sdf      Aggregated GNINA results
+-- smina/
|   +-- smina_out.sdf      Aggregated SMINA results
+-- matcha/
|   +-- matcha_out.sdf     Aggregated Matcha results
+-- _workdir/
    +-- molecules/         Per-molecule SDF files
    +-- configs/           Per-molecule docking configs
    +-- gnina/             GNINA per-molecule results & logs
    +-- smina/             SMINA per-molecule results & logs
    +-- run_matcha.sh
    +-- run_gnina.sh
    +-- run_smina.sh
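
Scores such as minimizedAffinity are stored as data fields inside the aggregated SDF files. The following sketch reads those fields using only the standard library, assuming the usual SDF layout of a "> <name>" header line followed by the value and a blank line; the function name iter_sdf_fields is hypothetical, and a cheminformatics toolkit such as RDKit is the more robust choice for real analysis.

```python
import re
from typing import Dict, Iterator

_FIELD = re.compile(r"^>\s*<([^>]+)>")  # matches data headers such as "> <CNNscore>"


def iter_sdf_fields(text: str) -> Iterator[Dict[str, str]]:
    """Yield one {field_name: value} dict per SDF record in `text`."""
    fields: Dict[str, str] = {}
    current = None  # name of the data field whose value lines are being read
    for line in text.splitlines():
        if line.strip() == "$$$$":  # record terminator
            yield fields
            fields, current = {}, None
            continue
        match = _FIELD.match(line)
        if match:
            current = match.group(1)
            fields[current] = ""
        elif current is not None:
            if line.strip():
                fields[current] = (fields[current] + "\n" + line).lstrip("\n")
            else:
                current = None  # a blank line ends the field value
```

Values are returned as strings; convert with float() where a numeric score is expected, for example float(record["minimizedAffinity"]).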

Usage

# Run docking as part of the full pipeline
uv run hedgehog

# Run docking stage only
uv run hedgehog --stage docking

# Short alias
uv run hedge --stage docking

To run the generated scripts manually (e.g., on a GPU node):

cd results/stages/05_docking/_workdir
./run_matcha.sh
./run_gnina.sh
./run_smina.sh