Docking Filters
The docking filters stage applies a cascade of quality checks to docked poses, removing those with steric clashes, implausible geometries, or missing key interactions. It then deduplicates poses to produce a list of unique molecules for downstream processing.
Filter Cascade
Filters are applied in order from cheapest to most expensive. When the aggregation mode is all (default), the search-box filter can short-circuit evaluation: poses that fail it skip all subsequent filters.
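The short-circuit behaviour described above can be illustrated with a minimal sketch. It is not HEDGEHOG's implementation: the function name `run_cascade`, the filter callables, and the use of `None` to mark a skipped filter are all assumptions made for the example.

```python
from typing import Callable, Dict, Optional

Pose = dict  # hypothetical pose record
Filter = Callable[[Pose], bool]


def run_cascade(
    pose: Pose,
    search_box: Filter,
    later_filters: Dict[str, Filter],
    mode: str = "all",
    short_circuit: bool = True,
) -> Dict[str, Optional[bool]]:
    """Return per-filter flags; None marks a filter that was skipped (assumption)."""
    flags: Dict[str, Optional[bool]] = {"search_box": search_box(pose)}
    # Skipping is only safe in "all" mode, where one failure already decides the outcome.
    skip_rest = short_circuit and mode == "all" and not flags["search_box"]
    for name, check in later_filters.items():  # assumed ordered cheapest first
        flags[name] = None if skip_rest else check(pose)
    return flags
```

In "any" mode a pose that fails the search-box check could still pass through another filter, so every filter has to run.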
Filter 0: Search-Box Containment
Verifies that the docked pose lies within the configured docking search box. This catches poses that drifted outside the binding site during optimization.
| Parameter | Default | Description |
|---|---|---|
| enabled | true | Enable/disable this filter |
| max_outside_fraction | 0.0 | Maximum fraction of atoms allowed outside the box (0.0 = all atoms must be inside) |
| short_circuit | true | Skip heavy filters for failed poses (only in all mode) |
The search box is resolved from the docking configuration: either explicit center/size coordinates, or computed from autobox_ligand + autobox_add (the same reference ligand used for docking).
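As a rough sketch of the containment check, the code below derives a box from reference-ligand coordinates and measures the fraction of pose atoms outside it. The helper names are hypothetical, and the sketch assumes that autobox_add pads the reference ligand's bounding box on every side (the gnina-style convention) and that the box is axis-aligned.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def autobox(ref_coords: Sequence[Point], autobox_add: float) -> Tuple[Point, Point]:
    """Center and size of the reference ligand's bounding box, padded on each side."""
    mins = [min(c[i] for c in ref_coords) for i in range(3)]
    maxs = [max(c[i] for c in ref_coords) for i in range(3)]
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    size = tuple((hi - lo) + 2 * autobox_add for lo, hi in zip(mins, maxs))
    return center, size


def outside_fraction(coords: Sequence[Point], center: Point, size: Point) -> float:
    """Fraction of atoms lying outside the axis-aligned box."""
    outside = sum(
        1
        for atom in coords
        if any(abs(atom[i] - center[i]) > size[i] / 2 for i in range(3))
    )
    return outside / len(coords)


def passes_search_box(
    coords: Sequence[Point],
    center: Point,
    size: Point,
    max_outside_fraction: float = 0.0,
) -> bool:
    return outside_fraction(coords, center, size) <= max_outside_fraction
```

With the default max_outside_fraction of 0.0, a single atom outside the box is enough to fail the pose.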
Filter 1: Pose Quality
The shipped default backend is posebusters_fast, a fast pose-quality check for
protein-ligand clashes, volume overlap, and protein distance. The legacy
optional backend is posecheck, which exposes strain-energy parameters.
| Parameter | Default | Description |
|---|---|---|
| enabled | true | Enable/disable this filter |
| backend | posebusters_fast | Backend: posebusters_fast or legacy posecheck |
| clash_cutoff | 0.75 | Relative VDW distance cutoff for fast clash detection |
| volume_clash_cutoff | 0.075 | ShapeTverskyIndex overlap threshold for volume clash detection |
| max_distance | 5.0 | Maximum allowed closest ligand-protein distance in Angstroms |
| max_clashes | 2 | Legacy PoseCheck maximum allowed steric clashes |
| max_strain_energy | 50.0 | Legacy PoseCheck maximum ligand strain energy in kcal/mol |
| strain_forcefield | UFF | Legacy PoseCheck force field for strain calculation |
| clash_tolerance | 0.5 | Legacy PoseCheck VDW overlap tolerance |
Clashes indicate that the ligand overlaps with protein atoms in a physically impossible way. High strain energy means the ligand is in an energetically unfavorable conformation.
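To make the clash_cutoff and max_distance parameters concrete, here is a simplified sketch of a relative van der Waals clash test. It is not the posebusters_fast backend: the function name, the atom tuple layout, and the rule that any clash fails the pose are assumptions, and only a handful of Bondi radii are included.

```python
import math
from typing import Dict, Sequence, Tuple

Atom = Tuple[str, float, float, float]  # (element, x, y, z), hypothetical layout

# Bondi van der Waals radii in Angstroms for a few common elements.
VDW_RADII: Dict[str, float] = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def pose_quality(
    ligand: Sequence[Atom],
    protein: Sequence[Atom],
    clash_cutoff: float = 0.75,
    max_distance: float = 5.0,
) -> dict:
    """Count ligand-protein clashes and check the ligand is near the protein."""
    n_clashes = 0
    min_dist = math.inf
    for l_elem, *l_xyz in ligand:
        for p_elem, *p_xyz in protein:
            d = math.dist(l_xyz, p_xyz)
            min_dist = min(min_dist, d)
            # A pair clashes when it is closer than a fraction of the summed radii.
            if d < clash_cutoff * (VDW_RADII[l_elem] + VDW_RADII[p_elem]):
                n_clashes += 1
    passed = n_clashes == 0 and min_dist <= max_distance  # assumed pass rule
    return {"n_clashes": n_clashes, "min_distance": min_dist, "passed": passed}
```

The max_distance check works in the opposite direction from the clash check: it rejects poses whose closest protein atom is too far away, meaning the ligand is not in contact with the receptor at all.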
Filter 2: Interaction Analysis (ProLIF)
Uses ProLIF to compute a protein-ligand interaction fingerprint and check for required/forbidden contacts.
| Parameter | Default | Description |
|---|---|---|
| enabled | true | Enable/disable this filter |
| min_hbonds | 0 | Minimum hydrogen bonds required |
| required_residues | ['ASP12'] | Residue identifiers that must have at least one interaction |
| forbidden_residues | [] | Residues that must NOT have any interaction |
| interaction_types | [HBDonor, HBAcceptor, Hydrophobic, VdWContact] | Interaction types to detect |
| reporting.enabled | true | Generate interaction reporting artifacts |
| similarity_threshold | 0.0 | Tanimoto similarity to reference interaction fingerprint (0 = disabled) |
This filter is especially useful when you know the binding mode should involve specific residues (e.g., a catalytic aspartate) or should avoid certain contacts (e.g., a cysteine that causes covalent binding).
Residue identifiers are matched against ProLIF interaction column labels. The
bundled demo configuration uses ASP12. Depending on your prepared receptor and
ProLIF naming, identifiers may look different, so inspect the generated
interaction report before finalizing required_residues.
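The required/forbidden logic can be sketched without calling ProLIF, given a list of detected interactions. The input format (pairs of residue label and interaction type), the function name, and exact-match comparison of residue labels are assumptions for illustration; the real labels depend on your receptor preparation, as noted above.

```python
from typing import Iterable, Sequence, Tuple

Interaction = Tuple[str, str]  # (residue label, interaction type), hypothetical format

HBOND_TYPES = {"HBDonor", "HBAcceptor"}


def check_interactions(
    interactions: Iterable[Interaction],
    required_residues: Sequence[str] = (),
    forbidden_residues: Sequence[str] = (),
    min_hbonds: int = 0,
) -> dict:
    """Apply required/forbidden residue rules and a minimum hydrogen-bond count."""
    interactions = list(interactions)
    contacted = {residue for residue, _ in interactions}
    n_hbonds = sum(1 for _, kind in interactions if kind in HBOND_TYPES)
    missing = [r for r in required_residues if r not in contacted]
    hit_forbidden = [r for r in forbidden_residues if r in contacted]
    passed = not missing and not hit_forbidden and n_hbonds >= min_hbonds
    return {
        "n_hbonds": n_hbonds,
        "missing_required": missing,
        "hit_forbidden": hit_forbidden,
        "passed": passed,
    }
```

Returning the missing and forbidden residues alongside the flag makes it easier to see why a pose failed when tuning required_residues.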
Filter 3: Shepherd-Score (3D Shape Similarity)
Compares the 3D molecular shape of each pose to a reference ligand using Gaussian overlap Tanimoto.
| Parameter | Default | Description |
|---|---|---|
| enabled | false | Disabled by default (requires reference ligand) |
| backend | auto | auto = worker -> in-process -> soft-skip, worker = worker only, inprocess = in-process only |
| auto_install_worker | true | If worker command is missing, try auto-installing .venv-shepherd-worker |
| worker_python | null | Optional interpreter for auto-install (python3.12, python3.11, python3.10) |
| reference_ligand | null | Path to reference ligand SDF |
| min_shape_score | 0.5 | Minimum shape Tanimoto score |
| alpha | 0.81 | Gaussian width parameter |
This filter is disabled by default because it requires a known reference ligand for comparison. Enable it when you have a co-crystallized ligand or known active compound and want to ensure poses adopt a similar shape. For reproducible setup, install an isolated worker environment:
uv run hedgehog setup shepherd-worker --yes
If no Shepherd backend is available at runtime, HEDGEHOG soft-skips this filter (logs a warning and marks pass_shepherd_score=true).
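For intuition about the shape score itself, the sketch below computes a Gaussian overlap Tanimoto between two sets of atom coordinates. It is a simplification and not the Shepherd-Score implementation: it assumes identical atom-centered Gaussians of width alpha, first-order pairwise overlaps only, and poses that are already aligned. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def gaussian_overlap(a: Sequence[Point], b: Sequence[Point], alpha: float = 0.81) -> float:
    """Pairwise Gaussian overlap, up to a constant prefactor.

    Integrating the product of two Gaussians exp(-alpha * |r - c|^2) gives a term
    proportional to exp(-alpha / 2 * d^2); the prefactor cancels in the Tanimoto ratio.
    """
    return sum(math.exp(-0.5 * alpha * math.dist(p, q) ** 2) for p in a for q in b)


def shape_tanimoto(fit: Sequence[Point], ref: Sequence[Point], alpha: float = 0.81) -> float:
    """Tanimoto of Gaussian overlap volumes: V_AB / (V_AA + V_BB - V_AB)."""
    v_ab = gaussian_overlap(fit, ref, alpha)
    v_aa = gaussian_overlap(fit, fit, alpha)
    v_bb = gaussian_overlap(ref, ref, alpha)
    return v_ab / (v_aa + v_bb - v_ab)
```

Identical, perfectly superimposed shapes score 1.0 and the score falls toward 0 as the overlap shrinks; a pose would pass when the score is at least min_shape_score.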
Filter 4: Conformer Deviation
Checks if the docked pose is geometrically plausible by generating multiple low-energy conformers and measuring the RMSD between the docked pose and the closest conformer.
| Parameter | Default | Description |
|---|---|---|
| enabled | true | Enable/disable this filter |
| use_nvmolkit | true | Try nvMolKit acceleration when available (falls back to RDKit if unavailable) |
| num_conformers | 50 | Number of conformers to generate |
| conformer_method | ETKDGv3 | Conformer generation method (ETKDG, ETKDGv2, ETKDGv3) |
| max_rmsd_to_conformer | 3.0 | Maximum RMSD in Angstroms to closest conformer |
| random_seed | 42 | Seed for reproducible conformer generation |
| include_hydrogens | false | Include hydrogens in RMSD matching |
| max_matches | 10000 | Cap symmetry matching complexity |
| early_stop_on_pass | true | Stop comparison once any conformer passes |
| optimize_conformers | false | UFF optimization of generated conformers |
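The interplay between max_rmsd_to_conformer and early_stop_on_pass can be sketched as follows. This is an illustration only: it assumes the conformers are already generated, aligned to the pose, and listed in the same atom order, whereas the real check needs alignment and symmetry-aware atom matching. The function names are hypothetical.

```python
import math
from typing import Iterable, Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Plain RMSD between two coordinate sets with matching atom order."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def conformer_deviation(
    pose: Sequence[Point],
    conformers: Iterable[Sequence[Point]],
    max_rmsd_to_conformer: float = 3.0,
    early_stop_on_pass: bool = True,
) -> dict:
    """Track the smallest RMSD seen; optionally stop at the first passing conformer."""
    best = math.inf
    n_compared = 0
    for conformer in conformers:
        n_compared += 1
        best = min(best, rmsd(pose, conformer))
        if early_stop_on_pass and best <= max_rmsd_to_conformer:
            break
    return {
        "min_rmsd": best,
        "n_compared": n_compared,
        "passed": best <= max_rmsd_to_conformer,
    }
```

One consequence of stopping early in this sketch is that the reported minimum is the first value under the threshold, not necessarily the global minimum over all conformers; the pass/fail outcome is unaffected.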
A high minimum RMSD indicates that the docking engine placed the ligand in a conformation that is energetically unlikely for the molecule to adopt in solution. This catches docking artifacts where the scoring function found a favorable protein-ligand interaction at the cost of internal strain. For isolated setup of optional nvMolKit dependencies:
uv run hedgehog setup nvmolkit-worker
Aggregation
The aggregation mode controls how per-filter results are combined:
aggregation:
  mode: "all"          # "all" = pass every filter, "any" = pass at least one
  save_metrics: true   # Save detailed per-pose metrics CSV
  save_failed: false   # Save molecules that failed filtering
- all (default): a pose must pass every enabled filter. This is the conservative approach for drug discovery campaigns.
- any: a pose passes if it passes at least one filter. Useful for exploratory analysis.
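The two modes map directly onto Python's `all` and `any`. The sketch below assumes the per-filter results for enabled filters are available as a dict of booleans; apart from pass_shepherd_score, the flag names are hypothetical.

```python
from typing import Dict


def aggregate(flags: Dict[str, bool], mode: str = "all") -> bool:
    """Combine per-filter pass flags for enabled filters into one verdict."""
    if mode == "all":
        return all(flags.values())
    if mode == "any":
        return any(flags.values())
    raise ValueError(f"unknown aggregation mode: {mode!r}")


flags = {
    "pass_search_box": True,       # hypothetical flag name
    "pass_pose_quality": False,    # hypothetical flag name
    "pass_shepherd_score": True,   # soft-skipped filters are marked as passed
}
print(aggregate(flags, "all"), aggregate(flags, "any"))  # False True
```

Because a soft-skipped Shepherd-Score filter is recorded as passed, it never blocks a pose in "all" mode, but it would make every pose pass in "any" mode under this sketch.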
Deduplication
Docking can produce multiple poses per molecule when num_modes is greater than 1 (the default config uses num_modes: 1). After filtering, the pipeline deduplicates to unique molecules:
- All passing poses are saved to filtered_poses.csv (full pose-level detail)
- Poses are sorted by minimizedAffinity (best affinity first)
- For each unique mol_idx, only the best-scoring pose is kept
- Deduplicated molecules are saved to filtered_molecules.csv
SMILES for the output are taken from the original ligands.csv (2D SMILES) rather than regenerated from 3D coordinates, which preserves the original stereochemistry encoding.
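The deduplication steps above can be sketched with plain dictionaries. The column names mol_idx and minimizedAffinity come from the text; the function name, the "smiles" output key, and the lookup table standing in for ligands.csv are assumptions. Lower (more negative) affinity is treated as better.

```python
from typing import Dict, List


def deduplicate(poses: List[dict], smiles_by_mol_idx: Dict[str, str]) -> List[dict]:
    """Keep the best-affinity pose per molecule and attach the original SMILES."""
    best: Dict[str, dict] = {}
    # Ascending sort puts the most negative (best) affinity first.
    for pose in sorted(poses, key=lambda p: p["minimizedAffinity"]):
        best.setdefault(pose["mol_idx"], pose)  # first pose seen per molecule wins
    return [
        {**pose, "smiles": smiles_by_mol_idx[mol_idx]}
        for mol_idx, pose in best.items()
    ]


poses = [
    {"mol_idx": "m1", "minimizedAffinity": -7.1},
    {"mol_idx": "m1", "minimizedAffinity": -8.3},
    {"mol_idx": "m2", "minimizedAffinity": -6.0},
]
molecules = deduplicate(poses, {"m1": "CCO", "m2": "c1ccccc1"})
```

Here `molecules` contains one row per mol_idx, with m1 represented by its -8.3 pose.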
Configuration
Full configuration in config_docking_filters.yml:
run: true
run_after_docking: true
input_sdf: null      # null = auto-detect from docking output
receptor_pdb: null   # null = use receptor from docking config
search_box:
  enabled: true
  max_outside_fraction: 0.0
  short_circuit: true
pose_quality:
  enabled: true
  backend: "posebusters_fast"
  clash_cutoff: 0.75
  volume_clash_cutoff: 0.075
  max_distance: 5.0
  max_clashes: 2
  max_strain_energy: 50.0
  strain_forcefield: "UFF"
  clash_tolerance: 0.5
interactions:
  enabled: true
  min_hbonds: 0
  required_residues: ['ASP12']
  forbidden_residues: []
  interaction_types:
    - HBDonor
    - HBAcceptor
    - Hydrophobic
    - VdWContact
  reporting:
    enabled: true
shepherd_score:
  enabled: false
  backend: "auto"
  auto_install_worker: true
  worker_python: null
  reference_ligand: null
  min_shape_score: 0.5
  alpha: 0.81
conformer_deviation:
  enabled: true
  use_nvmolkit: true
  num_conformers: 50
  conformer_method: "ETKDGv3"
  max_rmsd_to_conformer: 3.0
  random_seed: 42
  include_hydrogens: false
  max_matches: 10000
  early_stop_on_pass: true
  optimize_conformers: false
aggregation:
  mode: "all"
  save_metrics: true
  save_failed: false
Output Files
| File | Description |
|---|---|
| metrics.csv | Per-pose filter metrics and pass/fail flags for every filter |
| filtered_molecules.csv | Unique molecules (best pose per molecule) passing all filters |
| filtered_poses.csv | All passing poses with full metrics (before deduplication) |
| filtered_poses.sdf | 3D structures of all passing poses in SDF format |
The pipeline-level output file output/final_molecules.csv now keeps all upstream columns and adds aggregated docking scores per molecule:
- gnina_affinity
- gnina_cnnscore
- gnina_cnnaffinity
- gnina_cnn_vs
- smina_affinity
- matcha_affinity
For each tool, values are taken from the best pose per molecule (minimum affinity).
Usage
# Run docking filters as part of the full pipeline
uv run hedgehog
# Run docking filters stage only (requires docking output to exist)
uv run hedgehog --stage docking_filters
# Short alias
uv run hedge --stage docking_filters