Docking Filters

The docking filters stage applies a cascade of quality checks to docked poses, removing those with steric clashes, implausible geometries, or missing key interactions. It then deduplicates poses to produce a list of unique molecules for downstream processing.

Filter Cascade

Filters are applied in order from cheapest to most expensive. When the aggregation mode is `all` (the default), the search-box filter can short-circuit evaluation: poses that fail it skip all subsequent filters.

Filter 0: Search Box

The search-box filter runs first. It verifies that the docked pose lies within the configured docking search box, which catches poses that drifted outside the binding site during optimization.

Parameter	Default	Description
`enabled`	`true`	Enable/disable this filter
`max_outside_fraction`	`0.0`	Maximum fraction of atoms allowed outside the box (0.0 = all atoms must be inside)
`short_circuit`	`true`	Skip heavy filters for failed poses (only in `all` mode)

The search box is resolved from the docking configuration: either explicit center/size coordinates, or a box computed from `autobox_ligand` + `autobox_add` (the same reference ligand used for docking).
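The box resolution and the outside-fraction test described above can be sketched in plain Python. This is a minimal illustration, not the HEDGEHOG implementation: it assumes an axis-aligned box, assumes `autobox_add` pads the reference ligand's bounding box on every side, and counts every atom passed in. The function names and the padding value of 4.0 are placeholders chosen for the example.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def autobox(ref_coords: Sequence[Point], autobox_add: float) -> Tuple[Point, Point]:
    """Bounding box of a reference ligand, padded by autobox_add on every side (assumed)."""
    mins = [min(c[i] for c in ref_coords) for i in range(3)]
    maxs = [max(c[i] for c in ref_coords) for i in range(3)]
    center = tuple((lo + hi) / 2.0 for lo, hi in zip(mins, maxs))
    size = tuple((hi - lo) + 2.0 * autobox_add for lo, hi in zip(mins, maxs))
    return center, size


def fraction_outside_box(coords: Sequence[Point], center: Point, size: Point) -> float:
    """Fraction of atoms with at least one coordinate outside the box."""
    if not coords:
        raise ValueError("pose has no atoms")
    half = [s / 2.0 for s in size]
    outside = sum(
        1
        for atom in coords
        if any(abs(a - c) > h for a, c, h in zip(atom, center, half))
    )
    return outside / len(coords)


def passes_search_box(coords, center, size, max_outside_fraction=0.0) -> bool:
    """A pose passes when no more than the allowed fraction of atoms is outside."""
    return fraction_outside_box(coords, center, size) <= max_outside_fraction


# Hypothetical reference ligand and pose coordinates (Angstroms).
center, size = autobox([(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)], autobox_add=4.0)
pose = [(1.0, 2.0, 3.0), (0.0, 0.0, 0.0), (5.5, 2.0, 3.0), (7.0, 2.0, 3.0)]
print(center, size)                                # (1.0, 2.0, 3.0) (10.0, 12.0, 14.0)
print(fraction_outside_box(pose, center, size))    # 0.25
print(passes_search_box(pose, center, size))       # False with the default 0.0
```

With `max_outside_fraction: 0.0` a single stray atom is enough to fail the pose, which is why this check is cheap and suitable as the first step of the cascade.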

Filter 1: Pose Quality

The shipped default backend is `posebusters_fast`, a fast pose-quality check for protein-ligand clashes, volume overlap, and protein distance. The legacy optional backend is `posecheck`, which exposes strain-energy parameters.

Parameter	Default	Description
`enabled`	`true`	Enable/disable this filter
`backend`	`posebusters_fast`	Backend: `posebusters_fast` or legacy `posecheck`
`clash_cutoff`	`0.75`	Relative VDW distance cutoff for fast clash detection
`volume_clash_cutoff`	`0.075`	ShapeTverskyIndex overlap threshold for volume clash detection
`max_distance`	`5.0`	Upper limit, in Angstroms, on the minimum ligand-protein distance (the closest ligand-protein atom pair must be at least this close)
`max_clashes`	`2`	Legacy PoseCheck maximum allowed steric clashes
`max_strain_energy`	`50.0`	Legacy PoseCheck maximum ligand strain energy in kcal/mol
`strain_forcefield`	`UFF`	Legacy PoseCheck force field for strain calculation
`clash_tolerance`	`0.5`	Legacy PoseCheck VDW overlap tolerance

Clashes indicate that the ligand overlaps with protein atoms in a physically impossible way. High strain energy, which is evaluated only by the legacy `posecheck` backend, means the ligand is in an energetically unfavorable conformation.
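To make `clash_cutoff` and `max_distance` concrete, here is a rough sketch of a relative-distance clash test. It is not the `posebusters_fast` code: it assumes a pair of atoms clashes when their distance divided by the sum of their van der Waals radii falls below `clash_cutoff`, it uses a small hand-picked table of Bondi radii, it treats any clash as a failure, and it leaves out the volume-overlap check entirely. The function name and input layout are hypothetical.

```python
import math

# Bondi van der Waals radii in Angstroms (illustrative subset only).
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def pose_quality(ligand, protein, clash_cutoff=0.75, max_distance=5.0):
    """ligand / protein: lists of (element, (x, y, z)) tuples."""
    n_clashes = 0
    min_dist = math.inf
    for el_l, xyz_l in ligand:
        for el_p, xyz_p in protein:
            d = math.dist(xyz_l, xyz_p)
            min_dist = min(min_dist, d)
            # Relative distance: 1.0 means the two VDW spheres just touch.
            if d / (VDW_RADII[el_l] + VDW_RADII[el_p]) < clash_cutoff:
                n_clashes += 1
    return {
        "n_clashes": n_clashes,
        "min_distance": min_dist,
        # Assumption of this sketch: no clash allowed, and the ligand
        # must sit within max_distance of the protein.
        "passes": n_clashes == 0 and min_dist <= max_distance,
    }


ligand = [("C", (0.0, 0.0, 0.0))]
print(pose_quality(ligand, [("O", (2.0, 0.0, 0.0))]))  # one clash: 2.0 / 3.22 < 0.75
print(pose_quality(ligand, [("O", (3.0, 0.0, 0.0))]))  # clean contact, passes
print(pose_quality(ligand, [("O", (8.0, 0.0, 0.0))]))  # too far from the protein
```

The third case shows why `max_distance` exists: a pose with no clashes at all can still be wrong if the ligand has floated away from the receptor.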

Filter 2: Interaction Analysis (ProLIF)

Uses ProLIF to compute a protein-ligand interaction fingerprint and check for required/forbidden contacts.

Parameter	Default	Description
`enabled`	`true`	Enable/disable this filter
`min_hbonds`	`0`	Minimum hydrogen bonds required
`required_residues`	`['ASP12']`	Residue identifiers that must have at least one interaction
`forbidden_residues`	`[]`	Residues that must NOT have any interaction
`interaction_types`	`[HBDonor, HBAcceptor, Hydrophobic, VdWContact]`	Interaction types to detect
`reporting.enabled`	`true`	Generate interaction reporting artifacts
`similarity_threshold`	`0.0`	Tanimoto similarity to reference interaction fingerprint (0 = disabled)

This filter is especially useful when you know the binding mode should involve specific residues (e.g., a catalytic aspartate) or should avoid certain contacts (e.g., a cysteine that causes covalent binding).

Residue identifiers are matched against ProLIF interaction column labels. The bundled demo configuration uses `ASP12`. Depending on your prepared receptor and ProLIF naming, identifiers may look different, so inspect the generated interaction report before finalizing `required_residues`.
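The sketch below illustrates why the label format matters. It assumes, purely for illustration, that residues are compared by exact string match against the set of residue labels that show any interaction; the chain-suffixed labels such as `ASP12.A` and the function name are hypothetical, and the real matching logic may be more lenient.

```python
def check_interactions(contacted, required=(), forbidden=(), n_hbonds=0, min_hbonds=0):
    """Evaluate required/forbidden residue rules against contacted residue labels."""
    contacted = set(contacted)
    missing = [r for r in required if r not in contacted]
    hit = [r for r in forbidden if r in contacted]
    return {
        "missing_required": missing,
        "forbidden_hit": hit,
        "passes": not missing and not hit and n_hbonds >= min_hbonds,
    }


# Hypothetical labels as they might appear in an interaction report.
contacted = ["ASP12.A", "LYS16.A", "CYS45.A"]

# A label without the chain suffix does not match under exact comparison.
print(check_interactions(contacted, required=["ASP12"]))
# {'missing_required': ['ASP12'], 'forbidden_hit': [], 'passes': False}

# Using the label exactly as reported satisfies the requirement.
print(check_interactions(contacted, required=["ASP12.A"]))
# {'missing_required': [], 'forbidden_hit': [], 'passes': True}

# A forbidden contact fails the pose even when required residues are present.
print(check_interactions(contacted, required=["ASP12.A"], forbidden=["CYS45.A"]))
# {'missing_required': [], 'forbidden_hit': ['CYS45.A'], 'passes': False}
```

If every pose unexpectedly fails this filter, a label mismatch of this kind is a likely first suspect.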

Filter 3: Shepherd-Score (3D Shape Similarity)

Compares the 3D molecular shape of each pose to a reference ligand using Gaussian overlap Tanimoto.

Parameter	Default	Description
`enabled`	`false`	Disabled by default (requires reference ligand)
`backend`	`auto`	`auto` = worker -> in-process -> soft-skip, `worker` = worker only, `inprocess` = in-process only
`auto_install_worker`	`true`	If worker command is missing, try auto-installing `.venv-shepherd-worker`
`worker_python`	`null`	Optional interpreter for auto-install (`python3.12`, `python3.11`, `python3.10`)
`reference_ligand`	`null`	Path to reference ligand SDF
`min_shape_score`	`0.5`	Minimum shape Tanimoto score
`alpha`	`0.81`	Gaussian width parameter

This filter is disabled by default because it requires a known reference ligand for comparison. Enable it when you have a co-crystallized ligand or known active compound and want to ensure poses adopt a similar shape. For reproducible setup, install an isolated worker environment:


uv run hedgehog setup shepherd-worker --yes

If no Shepherd backend is available at runtime, HEDGEHOG soft-skips this filter (logs a warning and marks `pass_shepherd_score=true`).
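As a rough picture of what a Gaussian overlap Tanimoto measures, the following sketch places an identical Gaussian of width `alpha` on every atom and computes the overlap analytically. It is an assumption-laden illustration rather than the Shepherd-Score code: it uses first-order pairwise overlaps only, takes the coordinates as given (no alignment optimization), and the function names are invented for the example.

```python
import math


def gaussian_overlap(a, b, alpha=0.81):
    """Sum of pairwise overlaps of equal-width Gaussians centred on atoms of a and b."""
    prefactor = (math.pi / (2.0 * alpha)) ** 1.5
    return sum(
        prefactor * math.exp(-0.5 * alpha * math.dist(p, q) ** 2)
        for p in a
        for q in b
    )


def shape_tanimoto(a, b, alpha=0.81):
    """Tanimoto of overlap volumes: V_ab / (V_aa + V_bb - V_ab)."""
    v_ab = gaussian_overlap(a, b, alpha)
    v_aa = gaussian_overlap(a, a, alpha)
    v_bb = gaussian_overlap(b, b, alpha)
    return v_ab / (v_aa + v_bb - v_ab)


# Hypothetical three-atom reference and two candidate poses (Angstroms).
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in reference]
far_away = [(x + 100.0, y, z) for x, y, z in reference]

print(shape_tanimoto(reference, reference))            # identical shapes give 1.0
print(shape_tanimoto(reference, shifted) >= 0.5)       # compare against min_shape_score
print(shape_tanimoto(reference, far_away))             # no overlap gives 0.0
```

The score is bounded between 0 and 1, so `min_shape_score: 0.5` asks for a pose whose overlap with the reference is substantial but not necessarily near-identical.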

Filter 4: Conformer Deviation

Checks whether the docked pose is geometrically plausible by generating multiple low-energy conformers and measuring the RMSD between the docked pose and the closest conformer.

Parameter	Default	Description
`enabled`	`true`	Enable/disable this filter
`use_nvmolkit`	`true`	Try nvMolKit acceleration when available (falls back to RDKit if unavailable)
`num_conformers`	`50`	Number of conformers to generate
`conformer_method`	`ETKDGv3`	Conformer generation method (ETKDG, ETKDGv2, ETKDGv3)
`max_rmsd_to_conformer`	`3.0`	Maximum RMSD in Angstroms to closest conformer
`random_seed`	`42`	Seed for reproducible conformer generation
`include_hydrogens`	`false`	Include hydrogens in RMSD matching
`max_matches`	`10000`	Cap symmetry matching complexity
`early_stop_on_pass`	`true`	Stop comparison once any conformer passes
`optimize_conformers`	`false`	UFF optimization of generated conformers

A high minimum RMSD indicates that the docking engine placed the ligand in a conformation that is energetically unlikely for the molecule to adopt in solution. This catches docking artifacts where the scoring function found a favorable protein-ligand interaction at the cost of internal strain. For isolated setup of optional nvMolKit dependencies:


uv run hedgehog setup nvmolkit-worker
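The comparison loop behind `max_rmsd_to_conformer` and `early_stop_on_pass` can be sketched as follows. This is a simplified stand-in: it assumes the conformers are already generated, already superimposed on the pose, and listed in the same atom order, so it skips the alignment and symmetry matching that a real RDKit-based comparison performs. The function names are hypothetical.

```python
import math


def rmsd(a, b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if not a or len(a) != len(b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def min_rmsd_to_conformers(pose, conformers, max_rmsd=3.0, early_stop_on_pass=True):
    """Return (best RMSD seen, pass flag) for a pose against a list of conformers."""
    best = math.inf
    for conf in conformers:
        best = min(best, rmsd(pose, conf))
        # Once one conformer is close enough the pose passes, so the
        # remaining comparisons can be skipped.
        if early_stop_on_pass and best <= max_rmsd:
            break
    return best, best <= max_rmsd


# Hypothetical two-atom pose and three pre-aligned conformers (Angstroms).
pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
conformers = [
    [(0.0, 0.0, 4.0), (1.0, 0.0, 4.0)],  # RMSD 4.0
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],  # RMSD 1.0
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # RMSD 0.0
]
print(min_rmsd_to_conformers(pose, conformers))                            # (1.0, True)
print(min_rmsd_to_conformers(pose, conformers, early_stop_on_pass=False))  # (0.0, True)
```

In this sketch, early stopping changes the reported RMSD but never the pass/fail outcome: the value returned is the first one under the threshold, not necessarily the global minimum.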

Aggregation

The aggregation mode controls how per-filter results are combined:


aggregation:
  mode: "all"         # "all" = pass every filter, "any" = pass at least one
  save_metrics: true   # Save detailed per-pose metrics CSV
  save_failed: false   # Save molecules that failed filtering

`all` (default): a pose must pass every enabled filter. This is the conservative approach for drug discovery campaigns.
`any`: a pose passes if it passes at least one filter. Useful for exploratory analysis.
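The interplay between the two modes and the search-box short-circuit can be illustrated with a small driver loop. This is a sketch under assumptions: filters are modelled as an ordered list of `(name, callable)` pairs with the search-box filter first, and the function and filter names are placeholders rather than the pipeline's actual internals.

```python
def evaluate_pose(pose, filters, mode="all", short_circuit=True):
    """Run ordered filters on a pose and combine the per-filter flags."""
    if mode not in ("all", "any"):
        raise ValueError(f"unknown aggregation mode: {mode!r}")
    results = {}
    for index, (name, check) in enumerate(filters):
        results[name] = bool(check(pose))
        # Only the first (search-box) filter short-circuits, and only in
        # "all" mode, where one failure already decides the outcome.
        if index == 0 and mode == "all" and short_circuit and not results[name]:
            break
    flags = results.values()
    passed = all(flags) if mode == "all" else any(flags)
    return passed, results


calls = []  # records which expensive filters actually ran


def expensive_filter(pose):
    calls.append(pose)
    return True


filters = [
    ("search_box", lambda pose: False),   # pose drifted out of the box
    ("pose_quality", expensive_filter),
]

print(evaluate_pose("pose-1", filters, mode="all"))  # (False, {'search_box': False})
print(len(calls))                                    # 0: the expensive filter was skipped
print(evaluate_pose("pose-1", filters, mode="any"))
# (True, {'search_box': False, 'pose_quality': True})
```

In `any` mode every filter has to run, because a later pass can still rescue the pose; that is why the short-circuit applies only in `all` mode.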

Deduplication

Docking can produce multiple poses per molecule when `num_modes` is greater than 1 (the default config uses `num_modes: 1`). After filtering, the pipeline deduplicates to unique molecules:

All passing poses are saved to `filtered_poses.csv` (full pose-level detail)
Poses are sorted by `minimizedAffinity` (best affinity first)
For each unique `mol_idx`, only the best-scoring pose is kept
Deduplicated molecules are saved to `filtered_molecules.csv`

SMILES for the output are taken from the original `ligands.csv` (2D SMILES) rather than regenerated from 3D coordinates, which preserves the original stereochemistry encoding.
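The deduplication steps above amount to a sort followed by a keep-first pass. The sketch below shows that logic on plain dictionaries; it assumes that "best" means the lowest `minimizedAffinity`, and the `smiles` output key and the lookup table standing in for `ligands.csv` are hypothetical.

```python
def deduplicate(poses, smiles_by_mol_idx):
    """Keep the best-affinity pose per mol_idx and attach the original SMILES."""
    best = {}
    # Ascending sort puts the most negative (best) affinity first.
    for pose in sorted(poses, key=lambda p: p["minimizedAffinity"]):
        best.setdefault(pose["mol_idx"], pose)  # first seen per molecule wins
    # SMILES come from the input table, not from the 3D pose.
    return [dict(p, smiles=smiles_by_mol_idx[p["mol_idx"]]) for p in best.values()]


# Hypothetical passing poses and original ligand SMILES.
poses = [
    {"mol_idx": "m1", "minimizedAffinity": -7.2},
    {"mol_idx": "m2", "minimizedAffinity": -8.5},
    {"mol_idx": "m1", "minimizedAffinity": -9.1},
]
smiles = {"m1": "CCO", "m2": "c1ccccc1"}

for row in deduplicate(poses, smiles):
    print(row)
# {'mol_idx': 'm1', 'minimizedAffinity': -9.1, 'smiles': 'CCO'}
# {'mol_idx': 'm2', 'minimizedAffinity': -8.5, 'smiles': 'c1ccccc1'}
```

Because the result is built in sorted order, the output list is also ranked by affinity, best molecule first.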

Configuration

Full configuration in `config_docking_filters.yml`:


run: true
run_after_docking: true
 
input_sdf: null       # null = auto-detect from docking output
receptor_pdb: null    # null = use receptor from docking config
 
search_box:
  enabled: true
  max_outside_fraction: 0.0
  short_circuit: true
 
pose_quality:
  enabled: true
  backend: "posebusters_fast"
  clash_cutoff: 0.75
  volume_clash_cutoff: 0.075
  max_distance: 5.0
  max_clashes: 2
  max_strain_energy: 50.0
  strain_forcefield: "UFF"
  clash_tolerance: 0.5
 
interactions:
  enabled: true
  min_hbonds: 0
  required_residues: ['ASP12']
  forbidden_residues: []
  interaction_types:
    - HBDonor
    - HBAcceptor
    - Hydrophobic
    - VdWContact
  reporting:
    enabled: true
 
shepherd_score:
  enabled: false
  backend: "auto"
  auto_install_worker: true
  worker_python: null
  reference_ligand: null
  min_shape_score: 0.5
  alpha: 0.81
 
conformer_deviation:
  enabled: true
  use_nvmolkit: true
  num_conformers: 50
  conformer_method: "ETKDGv3"
  max_rmsd_to_conformer: 3.0
  random_seed: 42
  include_hydrogens: false
  max_matches: 10000
  early_stop_on_pass: true
  optimize_conformers: false
 
aggregation:
  mode: "all"
  save_metrics: true
  save_failed: false

Output Files

File	Description
`metrics.csv`	Per-pose filter metrics and pass/fail flags for every filter
`filtered_molecules.csv`	Unique molecules (best pose per molecule) passing all filters
`filtered_poses.csv`	All passing poses with full metrics (before deduplication)
`filtered_poses.sdf`	3D structures of all passing poses in SDF format

The pipeline-level output file `output/final_molecules.csv` keeps all upstream columns and adds aggregated docking scores per molecule:

gnina_affinity
gnina_cnnscore
gnina_cnnaffinity
gnina_cnn_vs
smina_affinity
matcha_affinity

For each tool, values are taken from the best pose per molecule (minimum affinity).

Usage


# Run docking filters as part of the full pipeline
uv run hedgehog
 
# Run docking filters stage only (requires docking output to exist)
uv run hedgehog --stage docking_filters
 
# Short alias
uv run hedge --stage docking_filters