Docking

The docking stage writes ligand SDFs and runs molecular docking simulations against a target protein structure. HEDGEHOG supports three docking engines (SMINA, GNINA, and Matcha). The default config keeps prepare_ligands: false, which uses input molecules directly where possible and preserves a near 1:1 mapping from input rows to docking ligands.

Docking Engines

| Feature | SMINA | GNINA | Matcha |
|---|---|---|---|
| Scoring | Empirical (Vina-based) | CNN-based + empirical | Learned pose generation + optional GNINA scoring |
| Speed | Fast | Slower (GPU-accelerated) | Slower (ML inference) |
| Output scores | `minimizedAffinity` | `minimizedAffinity`, `CNNscore`, `CNNaffinity` | `minimizedAffinity` when GNINA scoring is enabled |
| GPU support | No | Yes (CUDA) | Yes (recommended) |
| Output format | SDF | SDF | SDF |
| Best for | Large-scale virtual screening | Accurate pose prediction | ML-based pose generation with optional rescoring |

SMINA is a fork of AutoDock Vina with enhanced scoring and minimization. GNINA extends SMINA with convolutional neural network (CNN) scoring functions trained on protein-ligand complexes from the PDB.

Select the tool in the docking config:


tools: gnina    # Options: all, gnina, smina, matcha, or a comma-separated list

Use tools: all to run SMINA, GNINA, and Matcha together. For subsets, provide a comma-separated YAML value such as gnina,matcha or smina,gnina.
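
The accepted forms of the tools value can be illustrated with a small parser. This is a minimal sketch, not HEDGEHOG's own code: the function name parse_tools and the KNOWN_TOOLS constant are hypothetical, and the engine order returned for all is an assumption.

```python
# Hypothetical helper: expands a `tools` value into a list of engine names.
KNOWN_TOOLS = ("smina", "gnina", "matcha")  # assumed order for "all"


def parse_tools(value: str) -> list[str]:
    """Expand 'all', a single name, or a comma-separated list into engine names."""
    names = [part.strip().lower() for part in value.split(",") if part.strip()]
    if not names:
        raise ValueError("tools must name at least one docking engine")
    if "all" in names:
        return list(KNOWN_TOOLS)
    unknown = [name for name in names if name not in KNOWN_TOOLS]
    if unknown:
        raise ValueError(f"unknown docking tool(s): {', '.join(unknown)}")
    # Drop duplicates while keeping the order the user wrote.
    return list(dict.fromkeys(names))


print(parse_tools("all"))
print(parse_tools("gnina, matcha"))
```

Rejecting unknown names early keeps a typo such as gnnia from silently skipping an engine.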

Ligand Preparation

Docking requires 3D molecular structures in SDF format.

Default Path: Direct RDKit SDF Generation

With prepare_ligands: false (the shipped default), the docking stage uses built-in RDKit conversion:

Parse SMILES with Chem.MolFromSmiles()
Add explicit hydrogens with Chem.AddHs()
Generate 3D coordinates with AllChem.EmbedMolecule() (ETKDG method)
Optimize geometry with AllChem.UFFOptimizeMolecule() (UFF force field)

This produces reasonable starting geometries and keeps row counts predictable, but it does not handle tautomer enumeration or stereoisomer expansion.
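
The four RDKit steps above can be sketched as a single function. This is an illustration under stated assumptions rather than HEDGEHOG's implementation: the function name smiles_to_sdf_block is hypothetical, and the choice of ETKDGv3, the seed, and the iteration limit are assumptions (the source only says "ETKDG" and "UFF").

```python
def smiles_to_sdf_block(smiles: str, name: str, seed: int = 42) -> str:
    """Return one SDF record with 3D coordinates for a single SMILES string."""
    # Imported lazily so this sketch can be loaded where RDKit is not installed.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    # Explicit hydrogens are needed for a sensible 3D geometry.
    mol = Chem.AddHs(mol)

    # Assumption: ETKDGv3 with a fixed seed; the ETKDG variant is not specified.
    params = AllChem.ETKDGv3()
    params.randomSeed = seed
    if AllChem.EmbedMolecule(mol, params) == -1:
        raise RuntimeError(f"3D embedding failed for {smiles!r}")

    # Returns 0 when converged, 1 when more iterations are needed, and -1 when
    # the force field could not be set up; the embedded conformer is kept in
    # every case.
    AllChem.UFFOptimizeMolecule(mol, maxIters=500)

    mol.SetProp("_Name", name)  # written as the title line of the record
    return Chem.MolToMolBlock(mol) + "$$$$\n"
```

Calling smiles_to_sdf_block("CCO", "ethanol") returns one record per input molecule, which is what keeps the input-to-ligand mapping close to 1:1.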

External Ligand Preparation Tool (Optional)

For production-quality ligand preparation, configure an external tool path in config.yml:


ligand_preparation_tool: /path/to/ligand_prep_tool

An external preparation tool can provide:

Tautomer enumeration
Stereoisomer expansion
Ionization state prediction at physiological pH
Optimized 3D geometry with OPLS force field

The pipeline auto-detects the input format (CSV or SMI) and calls the tool accordingly.

Custom Handler Support

ligand_preparation_tool and protein_preparation_tool can point to any executable (absolute path or command available in PATH).

External handlers are optional:

If no external tool is configured, HEDGEHOG uses built-in behavior.
Ligands fall back to RDKit preparation.
Protein input is used as-is when no protein handler is configured.

For custom handlers, ensure they run non-interactively and produce valid output files expected by the docking stage.
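
One way to check that a configured handler is usable before a run is to resolve it the same way a shell would. The sketch below is illustrative only; resolve_handler is a hypothetical name, and raising on a missing tool (rather than silently falling back) is an assumed policy.

```python
import os
import shutil
from typing import Optional


def resolve_handler(tool: Optional[str]) -> Optional[str]:
    """Return an executable path for a configured handler, or None if unset."""
    if not tool:
        return None  # nothing configured: the caller uses built-in behavior
    # shutil.which accepts both bare command names (searched on PATH) and
    # explicit paths, and checks that the target is an executable file.
    resolved = shutil.which(os.path.expanduser(tool))
    if resolved is None:
        raise FileNotFoundError(
            f"handler is neither an executable file nor a PATH command: {tool}"
        )
    return resolved
```

A None result corresponds to the fallback cases listed above: RDKit preparation for ligands and as-is input for the protein.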

External ligand preparation is controlled by prepare_ligands in config_docking.yml:


prepare_ligands: false

Set it to true only when you explicitly want to run the configured external ligand preparation tool. In that mode, one input molecule may yield several prepared ligands (for example, multiple tautomers or stereoisomers), which changes row counts and complicates mapping results back to input rows.

Input preprocessing that runs before any pipeline stage follows these rules:

CSV input always uses built-in RDKit preprocessing.
SMI-like input may use ligand_preparation_tool.
The produced CSV must contain a smiles column (case-insensitive).
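
The case-insensitive column requirement can be checked with the standard library. This is a minimal sketch; find_smiles_column is a hypothetical helper, not part of HEDGEHOG.

```python
import csv
import io


def find_smiles_column(csv_text: str) -> str:
    """Return the header name that matches 'smiles', ignoring case."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        raise ValueError("CSV is empty")
    for column in header:
        if column.strip().lower() == "smiles":
            return column  # keep the original spelling for later lookups
    raise ValueError(f"no smiles column found in header: {header}")


print(find_smiles_column("mol_id,SMILES\nm1,CCO\n"))
```

Returning the original header spelling lets later code index rows by it without renaming the column.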

Protein Preparation

The receptor PDB file is specified in the docking config:


receptor_pdb: src/hedgehog/configs/examples/7EW9_apo.pdb

For best results, prepare the receptor with a dedicated structure-preparation tool (or PDB2PQR) before docking. Key steps include:

Removing water molecules and co-crystallized ligands
Adding hydrogens at physiological pH
Optimizing hydrogen-bond networks
Minimizing heavy-atom positions (restrained)
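
For orientation, the first step (removing waters and co-crystallized ligands) can be approximated with a plain-text filter over PDB records. This sketch is not a substitute for a structure-preparation tool: it performs no protonation or minimization, and it drops every HETATM record, which also removes metal ions, cofactors, and modified residues. The function name is hypothetical.

```python
WATER_NAMES = {"HOH", "WAT"}


def strip_waters_and_ligands(pdb_text: str) -> str:
    """Drop HETATM records, water residues, and CONECT records from PDB text."""
    kept = []
    dropped_last_atom = False
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            residue = line[17:20].strip()  # PDB columns 18-20 hold the residue name
            dropped_last_atom = record == "HETATM" or residue in WATER_NAMES
            if dropped_last_atom:
                continue
        elif record == "ANISOU" and dropped_last_atom:
            continue  # anisotropic record belonging to an atom that was removed
        elif record == "CONECT":
            continue  # bonds may reference removed atoms, so drop them all
        kept.append(line)
    return "\n".join(kept) + "\n"
```

Inspect the result before docking, since binding sites that depend on a metal ion or cofactor need those atoms restored.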

Autobox Configuration

The docking search box defines the 3D region where the docking engine places ligand poses. HEDGEHOG determines the box from a reference ligand:


gnina_config:
  autobox_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf   # Reference ligand SDF
  autobox_add: 4                                  # Padding in Angstroms

The search box is computed as the bounding box of the reference ligand’s atoms, expanded by autobox_add Angstroms in each direction.
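
The box arithmetic described above is easy to verify by hand. The following sketch computes the center and edge lengths from a list of atom coordinates; the function name autobox and its return shape are illustrative, and the docking engines compute this internally.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float, float]


def autobox(coords: Iterable[Point], autobox_add: float = 4.0) -> Tuple[Point, Point]:
    """Return (center, size) of the ligand bounding box padded on every side."""
    points = list(coords)
    if not points:
        raise ValueError("reference ligand has no atoms")
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    # Padding is applied in both directions along each axis, hence the factor 2.
    size = tuple((hi - lo) + 2 * autobox_add for lo, hi in zip(mins, maxs))
    return center, size


center, size = autobox([(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)], autobox_add=4.0)
print(center)  # (1.0, 2.0, 3.0)
print(size)    # (10.0, 12.0, 14.0)
```

With autobox_add: 4, each box edge is 8 Angstroms longer than the ligand's extent along that axis.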

Important: The reference ligand SDF must be in the same coordinate frame as the receptor PDB. If using an apo (unbound) structure, you may need to superimpose the reference ligand from a holo (bound) structure onto the apo receptor coordinates.

Configuration

Full configuration in config_docking.yml:


run: true
tools: gnina                              # all, gnina, smina, matcha, or a comma-separated list
receptor_pdb: src/hedgehog/configs/examples/7EW9_apo.pdb
auto_run: true                            # Execute scripts after generation
run_in_background: false
prepare_ligands: false                    # Keep input-to-ligand mapping near 1:1
gnina_per_process_cpu: 8                  # CPU threads per GNINA process
gnina_parallel_jobs_max: 6                # Cap auto parallel jobs
 
smina_config:
  bin: smina                              # Binary name or path
  autobox_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf
  autobox_add: 4
  cpu: 32
  seed: 42
  exhaustiveness: 8                       # Search thoroughness
  num_modes: 1                            # Max poses per molecule in the default config
 
gnina_config:
  bin: gnina
  autobox_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf
  autobox_add: 4
  cpu: 8
  seed: 42
  no_gpu: false                          # Use GPU when available
  num_modes: 1                           # Max poses per molecule in the default config
 
matcha_config:
  checkout_dir: modules/matcha_remote    # Managed git checkout created from GitHub PR head
  uv_bin: uv                             # Launcher used for `uv run --project`
  autobox_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf
  device: auto
  n_samples: 20
  scorer: gnina
  scorer_minimize: true
  physical_only: false
  keep_workdir: false

By default, Matcha is fetched from the remote LigandPro/Matcha repository and checked out at the head SHA of the latest open PR without approvals. The managed checkout is stored under modules/matcha_remote/.

For production runs, prefer a controlled Matcha checkout or an explicitly pinned setup path instead of relying on dynamically selected development code.

Set auto_run: false to generate docking scripts without executing them. This is useful for running docking on a cluster or GPU node separately.

Batch Docking Workflow

The docking stage processes molecules individually:

Each molecule is written to a separate SDF file in _workdir/molecules/
Per-molecule docking config files are generated in _workdir/configs/
Batch shell scripts (run_gnina.sh, run_smina.sh, run_matcha.sh) are generated
If auto_run: true, scripts are executed automatically
Per-molecule results are collected in _workdir/gnina/, _workdir/smina/, or matcha/<run_name>/best_poses/
Results are aggregated into gnina/gnina_out.sdf, smina/smina_out.sdf, or matcha/matcha_out.sdf
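
The final aggregation step amounts to concatenating per-molecule SDF files. The sketch below shows one way to do that with the standard library; it is an assumption about the general approach, not HEDGEHOG's actual aggregation code, and aggregate_sdf is a hypothetical name.

```python
from pathlib import Path


def aggregate_sdf(per_molecule_dir: Path, output_path: Path) -> int:
    """Concatenate every *.sdf file in a directory; return the record count."""
    # List inputs before creating the output, in case both share a directory.
    inputs = sorted(
        p for p in per_molecule_dir.glob("*.sdf")
        if p.resolve() != output_path.resolve()
    )
    output_path.parent.mkdir(parents=True, exist_ok=True)
    records = 0
    with output_path.open("w") as out:
        for sdf in inputs:
            text = sdf.read_text()
            if not text.strip():
                continue  # e.g. a docking run that produced no poses
            if not text.endswith("\n"):
                text += "\n"
            out.write(text)
            # Each SDF record ends with a line containing only "$$$$".
            records += sum(1 for line in text.splitlines() if line.strip() == "$$$$")
    return records
```

Comparing the returned record count with the number of input molecules is a quick way to spot ligands that failed to dock.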

Output Structure


stages/05_docking/
+-- ligands.csv                Prepared ligands
+-- job_meta.json              Job metadata
+-- gnina/
|   +-- gnina_out.sdf          Aggregated GNINA results
+-- smina/
|   +-- smina_out.sdf          Aggregated SMINA results
+-- matcha/
|   +-- matcha_out.sdf         Aggregated Matcha results
+-- _workdir/
    +-- molecules/             Per-molecule SDF files
    +-- configs/               Per-molecule docking configs
    +-- gnina/                 GNINA per-molecule results & logs
    +-- smina/                 SMINA per-molecule results & logs
    +-- run_matcha.sh
    +-- run_gnina.sh
    +-- run_smina.sh
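
Scores such as minimizedAffinity are stored as SDF data fields in the aggregated files, so they can be read back without a chemistry toolkit. The parser below is a minimal sketch that assumes the standard "> <tag>" data-field layout with the value on the following line; read_sdf_scores is a hypothetical helper.

```python
def read_sdf_scores(sdf_text: str, tag: str = "minimizedAffinity") -> list[tuple[str, float]]:
    """Return (molecule title, score) for each SDF record that carries the tag."""
    scores = []
    record: list[str] = []
    for line in sdf_text.splitlines():
        if line.strip() != "$$$$":
            record.append(line)
            continue
        # End of record: the first line is the title, and a data value sits on
        # the line right after its "> <tag>" header.
        name = record[0].strip() if record else ""
        for i, field in enumerate(record[:-1]):
            if field.startswith(">") and f"<{tag}>" in field:
                scores.append((name, float(record[i + 1])))
                break
        record = []
    return scores
```

For example, read_sdf_scores(Path("gnina/gnina_out.sdf").read_text(), tag="CNNscore") would list the CNN scores when GNINA wrote that field (Path here is pathlib.Path).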

Usage


# Run docking as part of the full pipeline
uv run hedgehog
 
# Run docking stage only
uv run hedgehog --stage docking
 
# Short alias
uv run hedge --stage docking

To run the generated scripts manually (e.g., on a GPU node):


cd results/stages/05_docking/_workdir
./run_matcha.sh
./run_gnina.sh
./run_smina.sh