
Parameter Reference

Complete reference for every parameter in Hedgehog’s YAML configuration files. Each section shows the parameter table followed by the full default YAML.


config.yml

The main configuration file. Controls input/output paths, parallelism, and references to all stage-specific configs.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| generated_mols_path | string | src/hedgehog/configs/examples/moses_1000.csv | Path to the CSV file containing generated molecules (must have a SMILES column) |
| target_mols_path | string | src/hedgehog/configs/examples/target_mols.csv | Path to the CSV file containing target/reference molecules |
| folder_to_save | string | results/run | Output directory where all pipeline results are saved |
| n_jobs | int | -1 | Number of parallel workers for CPU-bound tasks (-1 = all available cores) |
| sample_size | int | 10000 | Number of molecules to sample from the input file (null = use all) |
| batch_size | int | 512 | Batch size for descriptor computation and other batched operations |
| save_sampled_mols | bool | true | Whether to save the sampled molecule subset to disk |
| large_dataset_mode | bool | false | Enable streaming chunked processing for very large pre-docking dataset statistics |
| large_dataset_chunk_rows | int | 250000 | Rows per processing chunk in large-dataset mode |
| large_dataset_single_csv_limit | int | 1000000 | Maximum row count for also materializing compatibility CSV files from shard outputs |
| large_dataset_output_format | string | csv.gz | Shard file format for large-dataset row-level intermediate tables |
| large_dataset_filter_data | bool | false | In large-dataset mode, whether filter pass/fail results should remove molecules from downstream outputs |
| large_dataset_enable_all_filters | bool | true | In large-dataset mode, enable configured descriptor/structural filters as calculations even when they do not filter outputs |
| pains_file_path | string | src/hedgehog/vendor/moleval/metrics/wehi_pains.csv | Path to the PAINS filter definitions file |
| mcf_file_path | string | src/hedgehog/vendor/moleval/metrics/mcf.csv | Path to the MCF (medicinal chemistry filters) definitions file |
| ligand_preparation_tool | string | (proprietary path) | Absolute path to an external ligand preparation binary |
| protein_preparation_tool | string | (proprietary path) | Absolute path to an external protein preparation binary |
| config_mol_prep | string | src/hedgehog/configs/config_mol_prep.yml | Path to the Mol Prep stage config |
| config_descriptors | string | src/hedgehog/configs/config_descriptors.yml | Path to the descriptors stage config |
| config_structFilters | string | src/hedgehog/configs/config_structFilters.yml | Path to the structural filters stage config |
| config_synthesis | string | src/hedgehog/configs/config_synthesis.yml | Path to the synthesis stage config |
| config_docking | string | src/hedgehog/configs/config_docking.yml | Path to the docking stage config |
| config_docking_filters | string | src/hedgehog/configs/config_docking_filters.yml | Path to the docking filters stage config |
| config_weighted_score | string | src/hedgehog/configs/config_weighted_score.yml | Path to the weighted model assessment config |
| config_moleval | string | src/hedgehog/configs/config_moleval.yml | Path to the MolEval reporting config |

Default YAML

generated_mols_path: src/hedgehog/configs/examples/moses_1000.csv
target_mols_path: src/hedgehog/configs/examples/target_mols.csv
folder_to_save: results/run
n_jobs: -1
sample_size: 10000
batch_size: 512
save_sampled_mols: true
large_dataset_mode: false
large_dataset_chunk_rows: 250000
large_dataset_single_csv_limit: 1000000
large_dataset_output_format: csv.gz
large_dataset_filter_data: false
large_dataset_enable_all_filters: true
pains_file_path: src/hedgehog/vendor/moleval/metrics/wehi_pains.csv
mcf_file_path: src/hedgehog/vendor/moleval/metrics/mcf.csv
ligand_preparation_tool: /opt/proprietary_tools/ligand_prep/bin/ligand_prep
protein_preparation_tool: /opt/proprietary_tools/protein_prep/bin/protein_prep
config_mol_prep: src/hedgehog/configs/config_mol_prep.yml
config_descriptors: src/hedgehog/configs/config_descriptors.yml
config_structFilters: src/hedgehog/configs/config_structFilters.yml
config_synthesis: src/hedgehog/configs/config_synthesis.yml
config_docking: src/hedgehog/configs/config_docking.yml
config_docking_filters: src/hedgehog/configs/config_docking_filters.yml
config_weighted_score: src/hedgehog/configs/config_weighted_score.yml
config_moleval: src/hedgehog/configs/config_moleval.yml

For laptops, shared servers, CI, or notebooks, prefer an explicit smaller n_jobs such as 4 or 8 instead of the all-cores default.
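
As a rough illustration of what the -1 convention implies, the sketch below maps an n_jobs value to a concrete worker count. The helper name resolve_n_jobs is hypothetical and is not part of Hedgehog; the cap at the detected core count is an assumption made for this example.

```python
import os


def resolve_n_jobs(n_jobs: int) -> int:
    """Hypothetical helper: map an n_jobs setting to a concrete worker count."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    if n_jobs <= 0:
        # -1 means "all available cores"; this sketch treats every
        # non-positive value the same way (assumption).
        return available
    # Assumption: asking for more workers than cores is capped at the core count.
    return min(n_jobs, available)


print("n_jobs=-1 ->", resolve_n_jobs(-1))
print("n_jobs=4  ->", resolve_n_jobs(4))
```

On a shared 64-core server, n_jobs: -1 would resolve to 64 workers under this reading, which is why an explicit value such as 4 or 8 is the safer choice there.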


config_mol_prep.yml

Standardizes molecules before any descriptor computation. This stage aims to produce “clean” molecules by:

  • removing salts/solvents and keeping the largest fragment
  • disconnecting metals
  • neutralizing charges
  • canonicalizing tautomers (standardize_smiles)
  • removing stereochemistry
  • applying strict filters (allowed atom whitelist, no radicals, no isotopes, single fragment, neutral molecules)

General Settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| run | bool | true | Enable or disable Mol Prep |
| n_jobs | int | -1 | Worker count for molecule preparation (-1 = all available cores) |
| filters.allowed_atoms | list[string] | [C, N, O, S, F, Cl, Br, I, P, H] | Allowed atom symbols |
| filters.require_neutral | bool | true | Reject molecules with any formal charge |
| filters.require_single_fragment | bool | true | Reject multi-fragment molecules |
| filters.reject_radicals | bool | true | Reject molecules with radical electrons |
| filters.reject_isotopes | bool | true | Reject isotopically labeled molecules |
| output.write_duplicates_removed | bool | true | Write duplicates_removed.csv when duplicates are dropped |

Default YAML

run: true
n_jobs: -1
columns:
  smiles: smiles
  model_name: model_name
  mol_idx: mol_idx
  smiles_raw: smiles_raw
steps:
  to_mol:
    ordered: true
    sanitize: false
    allow_cxsmiles: true
    strict_cxsmiles: true
    remove_hs: true
  fix_mol:
    enabled: true
    n_iter: 1
    remove_singleton: true
    largest_only: false
  sanitize_mol:
    enabled: true
  remove_salts_solvents:
    enabled: true
    defn_data: null
    defn_format: smarts
    dont_remove_everything: true
    sanitize: true
  keep_largest_fragment: true
  standardize_mol:
    enabled: true
    disconnect_metals: true
    normalize: true
    reionize: true
    uncharge: true
    stereo: true
  remove_stereochemistry: true
  standardize_smiles:
    enabled: true
filters:
  allowed_atoms: [C, N, O, S, F, Cl, Br, I, P, H]
  reject_radicals: true
  require_neutral: true
  reject_isotopes: true
  require_single_fragment: true
output:
  write_duplicates_removed: true

config_descriptors.yml

Controls molecular descriptor calculation, filtering borders, and plotting options.

General Settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| run | bool | true | Enable or disable the descriptors stage |
| n_jobs | int | -1 | Number of parallel workers for descriptor calculation (-1 = auto) |
| batch_size | int | 1000 | Batch size for descriptor computation |
| filter_data | bool | true | Whether to apply border-based filtering after descriptor calculation |
| preprocess.remove_charges | bool | false | (Deprecated) Descriptor-stage preprocessing; Mol Prep should be used instead |
| preprocess.remove_radicals | bool | false | (Deprecated) Descriptor-stage preprocessing; Mol Prep should be used instead |
| preprocess.remove_stereochemistry | bool | false | (Deprecated) Descriptor-stage preprocessing; Mol Prep should be used instead |

Structural Constraints (structural_constraints)

These constraints add topology-aware caps on typed atom classes, element counts, ring topology, and acyclic chain length. They are applied during the descriptors filtering stage in addition to generic descriptor borders.

All limits in this block are upper bounds. A molecule passes a given structural constraint if its computed count is less than or equal to the configured value.

| Parameter | Type | Default | What it counts | When useful | What the limit means |
| --- | --- | --- | --- | --- | --- |
| enabled | bool | true | Whether the full structural_constraints block is active | Use false when you want only generic descriptor borders and no topology-aware caps | true applies all limits below; false ignores the entire block |
| type_limits | dict[string, int] | (see YAML below) | Per-alias counts for specific typed atom classes (.=O, Car, Nd+, etc.) | Useful when broad descriptors are not selective enough and you need direct control over atom-level motifs | For each alias key, value k means alias_count <= k |
| element_limits.N | int | 6 | Total nitrogen atoms (n_N_atoms) | Useful to control basicity and nitrogen-driven polarity early | Molecules with n_N_atoms > N are filtered out |
| element_limits.O | int | 4 | Total oxygen atoms (n_O_atoms) | Useful to limit highly oxygenated structures and keep polarity in range | Molecules with n_O_atoms > O are filtered out |
| element_limits.S | int | 1 | Total sulfur atoms (n_S_atoms) | Useful when sulfur-containing motifs are allowed but should remain rare | Molecules with n_S_atoms > S are filtered out |
| max_n_or_o_atoms | int | 10 | Combined nitrogen and oxygen count (n_NO_atoms) | Useful as a single heteroatom cap for polar atom load | Molecules with n_NO_atoms above this value are filtered out |
| max_small_rings_3_4 | int | 0 | Number of 3- and 4-membered rings (n_small_rings_3_4) | Useful to suppress strained ring systems when they are not desired | 0 disallows all 3/4-membered rings; 1 allows up to one, etc. |
| max_acyclic_chain_length | int | 4 | Length (in heavy atoms) of the longest non-ring chain (max_acyclic_chain_length) | Useful to avoid long linear appendages and excessive flexibility | Molecules with a longer acyclic chain than this value are filtered out |

type_limits Alias Keys

These are the supported type_limits keys and how they are counted in the descriptors stage.

| Alias | What it counts | When useful |
| --- | --- | --- |
| .=O | sp2 oxygen atoms that behave as acceptors (carbonyl-like oxygens) | Limit dense carbonyl-like chemistry |
| C2r | Non-aromatic ring carbons with sp2 hybridization | Control unsaturated non-aromatic ring content |
| C3r | Ring carbons with sp3 hybridization | Control saturated ring carbon load |
| Car | Aromatic carbon atoms | Control aromatic density directly at atom level |
| Cs2 | Non-ring, non-aromatic sp2 carbons | Limit non-ring unsaturation |
| Cs3 | Non-ring sp3 carbons | Cap long aliphatic/saturated carbon content |
| Csp | Carbon atoms with sp hybridization | Limit linear/triple-bond carbon motifs |
| Nac | Neutral nitrogen atoms classified as acceptors | Control acceptor-type neutral nitrogens |
| Nd+ | Positively charged donor nitrogens with at least one hydrogen | Limit protonated donor nitrogens |
| Nd0 | Neutral donor nitrogens with at least one hydrogen | Control neutral donor nitrogen abundance |
| O_a | sp3 oxygen acceptors with no hydrogen | Control ether-like acceptor oxygens |
| O_d | sp3 oxygen donors with at least one hydrogen | Control hydroxyl-like donor oxygens |
| SO2 | Sulfur atoms with at least two double-bonded oxygens | Limit sulfonyl-like sulfur motifs |
| Sul | Sulfur atoms with total valence 2 | Limit low-valence sulfur motifs |
| Hal | Halogen atoms (F, Cl, Br, I) | Cap halogenation level |

For any alias key in type_limits, a limit of k means molecules pass only if the alias count is <= k.

How structural_constraints Interact with borders

The descriptors stage applies both layers together:

  • borders define generic descriptor ranges such as molWt, logP, TPSA, hbd, hba, n_rings, and fsp3.
  • structural_constraints are converted into additional upper-bound checks on derived descriptor columns.

This means:

  • element_limits.N, element_limits.O, and element_limits.S act on n_N_atoms, n_O_atoms, and n_S_atoms.
  • max_n_or_o_atoms acts on n_NO_atoms and complements the per-element caps.
  • max_small_rings_3_4 acts on n_small_rings_3_4.
  • max_acyclic_chain_length acts on max_acyclic_chain_length.
  • type_limits act on alias-specific columns such as Car, Nd0, O_a, or SO2.

Use borders to shape broad property space and structural_constraints to cap specific motifs that can still pass those broad ranges.
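
The two layers described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Hedgehog's actual implementation: the function names constraint_caps and pass_flags are hypothetical, the molecule is represented as a plain dict of already-computed descriptor values, and a column checked by both layers is flagged as passing only if both checks pass.

```python
from typing import Dict, Mapping

# Scalar caps and the derived descriptor column each one acts on
# (column names taken from the list above).
SCALAR_CAPS = {
    "max_n_or_o_atoms": "n_NO_atoms",
    "max_small_rings_3_4": "n_small_rings_3_4",
    "max_acyclic_chain_length": "max_acyclic_chain_length",
}


def constraint_caps(sc: Mapping) -> Dict[str, float]:
    """Flatten a structural_constraints block into {column: upper_bound}."""
    if not sc.get("enabled", True):
        return {}  # enabled: false ignores the entire block
    caps = dict(sc.get("type_limits", {}))  # alias columns such as Car, O_a
    for element, limit in sc.get("element_limits", {}).items():
        caps[f"n_{element}_atoms"] = limit  # N -> n_N_atoms, etc.
    for key, column in SCALAR_CAPS.items():
        if key in sc:
            caps[column] = sc[key]
    return caps


def pass_flags(row: Mapping[str, float], borders: Mapping, sc: Mapping) -> Dict[str, bool]:
    """Per-column pass flags from borders (min/max) plus structural caps (<=)."""
    flags: Dict[str, bool] = {}
    for key, bound in borders.items():
        column, _, side = key.rpartition("_")
        if side not in ("min", "max") or column not in row:
            continue  # skips non-range keys such as allowed_chars
        ok = row[column] >= bound if side == "min" else row[column] <= bound
        flags[column] = flags.get(column, True) and ok
    for column, cap in constraint_caps(sc).items():
        if column in row:
            flags[column] = flags.get(column, True) and row[column] <= cap
    return flags


borders = {"molWt_min": 200, "molWt_max": 550, "n_N_atoms_min": 0,
           "n_N_atoms_max": 12, "hba_min": 1, "hba_max": 9}
sc = {"enabled": True, "type_limits": {"Car": 12, "O_a": 4},
      "element_limits": {"N": 6}, "max_small_rings_3_4": 0}
row = {"molWt": 412.5, "n_N_atoms": 7, "hba": 6, "Car": 12, "O_a": 5,
       "n_small_rings_3_4": 0}

flags = pass_flags(row, borders, sc)
print(flags)
print("passes:", all(flags.values()))
```

In this example the molecule sits inside the n_N_atoms border (0 to 12) and has an acceptable hba, yet it fails because element_limits.N caps nitrogen at 6 and O_a exceeds its limit of 4.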

Interpreting Failures

When descriptor filtering is enabled, the stage writes both computed values and pass/fail flags.

  • filtered/descriptors_failed.csv contains failed molecules with their computed descriptor values, including structural constraint columns such as n_O_atoms, n_NO_atoms, n_small_rings_3_4, max_acyclic_chain_length, and all active alias columns.
  • filtered/pass_flags.csv contains boolean pass flags for each checked column.

This lets you distinguish cases such as:

  • acceptable hba but excessive O_a
  • acceptable n_N_atoms but excessive Nd0
  • acceptable total ring count but disallowed n_small_rings_3_4

Example Tuning Patterns

# Conservative profile: tighter motif control
structural_constraints:
  enabled: true
  type_limits:
    Car: 10
    Hal: 2
    Nd+: 0
    SO2: 0
  element_limits:
    N: 5
    O: 4
    S: 1
  max_n_or_o_atoms: 8
  max_small_rings_3_4: 0
  max_acyclic_chain_length: 3

# Broader exploration profile
structural_constraints:
  enabled: true
  type_limits:
    Car: 14
    Hal: 4
    Nd+: 1
    SO2: 1
  element_limits:
    N: 7
    O: 5
    S: 2
  max_n_or_o_atoms: 11
  max_small_rings_3_4: 1
  max_acyclic_chain_length: 5

Border Parameters (borders)

These define the acceptable range for each molecular descriptor. Molecules outside these ranges are filtered out when filter_data is true.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| allowed_chars | list[string] | [C, N, S, O, F, Cl, Br, I, P, H] | Allowed chemical elements in molecules |
| n_atoms_min / n_atoms_max | int | 10 / 100 | Total atom count range |
| n_heavy_atoms_min / n_heavy_atoms_max | int | 10 / 50 | Heavy (non-hydrogen) atom count range |
| n_het_atoms_min / n_het_atoms_max | int | 2 / 15 | Heteroatom count range |
| n_N_atoms_min / n_N_atoms_max | int | 0 / 12 | Nitrogen atom count range |
| fN_atoms_min / fN_atoms_max | float | 0 / 0.22 | Fraction of nitrogen atoms (among heavy atoms) range |
| fNS_atoms_min / fNS_atoms_max | float | 0 / 0.3 | Fraction of nitrogen and sulfur atoms (among heavy atoms) range |
| molWt_min / molWt_max | float | 200 / 550 | Molecular weight range (Da) |
| logP_min / logP_max | float | -0.4 / 5.6 | Crippen logP range |
| sw_min / sw_max | float | -20 / 1 | Sw (water solubility estimate) range |
| ring_size_min / ring_size_max | int | 3 / 12 | Individual ring size range |
| n_rings_min / n_rings_max | int | 0 / 6 | Total ring count range |
| n_aroma_rings_min / n_aroma_rings_max | int | 0 / 5 | Aromatic ring count range |
| n_fused_aromatic_rings_min / n_fused_aromatic_rings_max | int | 0 / 2 | Fused aromatic ring count range |
| n_rigid_bonds_min / n_rigid_bonds_max | int | 0 / 30 | Rigid bond count range |
| n_rot_bonds_min / n_rot_bonds_max | int | 0 / 8 | Rotatable bond count range |
| hbd_min / hbd_max | int | 0 / 4 | Hydrogen bond donor count range |
| hba_min / hba_max | int | 1 / 9 | Hydrogen bond acceptor count range |
| fsp3_min / fsp3_max | float | 0.15 / 0.8 | Fraction of sp3 carbons range |
| has_spider_side_chains_min / has_spider_side_chains_max | int | 0 / 0 | Spider side-chain flag range (0 rejects molecules with two or more long scaffold appendages) |
| fraction_ring_system_min / fraction_ring_system_max | float | 0.25 / 1 | Fraction of heavy atoms in the Murcko scaffold |
| mce18_min / mce18_max | float | 20 / 140 | MCE-18 complexity score range |
| tpsa_min / tpsa_max | float | 20 / 140 | Topological polar surface area range (Å²) |
| qed_min / qed_max | float | 0.3 / 1 | Quantitative estimate of drug-likeness range |

Plotting Settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| filtered_cols_to_plot | list[string] | (see YAML below) | Descriptor columns to include in filtered distribution plots |
| discrete_features_to_plot | list[string] | (see YAML below) | Columns treated as discrete (bar charts instead of KDE) |
| not_to_smooth_plot_by_sides | list[string] | (see YAML below) | Columns where KDE side-smoothing is disabled |
| renamer | dict[string, string] | (see YAML below) | Display names for descriptors in plot labels |

Default YAML

run: true
n_jobs: -1
batch_size: 1000
filter_data: true
preprocess:
  remove_charges: false
  remove_radicals: false
  remove_stereochemistry: false
structural_constraints:
  enabled: true
  type_limits:
    ".=O": 4
    C2r: 6
    C3r: 6
    Car: 12
    Cs2: 6
    Cs3: 8
    Csp: 2
    Nac: 3
    Nd+: 1
    Nd0: 2
    O_a: 4
    O_d: 2
    SO2: 1
    Sul: 1
    Hal: 3
  element_limits:
    N: 6
    O: 4
    S: 1
  max_n_or_o_atoms: 10
  max_small_rings_3_4: 0
  max_acyclic_chain_length: 4
borders:
  allowed_chars:
    - C
    - N
    - S
    - O
    - F
    - Cl
    - Br
    - I
    - P
    - H
  n_atoms_min: 10
  n_atoms_max: 100
  n_heavy_atoms_min: 10
  n_heavy_atoms_max: 50
  n_het_atoms_min: 2
  n_het_atoms_max: 15
  n_N_atoms_min: 0
  n_N_atoms_max: 12
  fN_atoms_min: 0
  fN_atoms_max: 0.22
  fNS_atoms_min: 0
  fNS_atoms_max: 0.3
  molWt_min: 200
  molWt_max: 550
  logP_min: -0.4
  logP_max: 5.6
  sw_min: -20
  sw_max: 1
  ring_size_min: 3
  ring_size_max: 12
  n_rings_min: 0
  n_rings_max: 6
  n_aroma_rings_min: 0
  n_aroma_rings_max: 5
  n_fused_aromatic_rings_min: 0
  n_fused_aromatic_rings_max: 2
  n_rigid_bonds_min: 0
  n_rigid_bonds_max: 30
  n_rot_bonds_min: 0
  n_rot_bonds_max: 8
  hbd_min: 0
  hbd_max: 4
  hba_min: 1
  hba_max: 9
  fsp3_min: 0.15
  fsp3_max: 0.8
  has_spider_side_chains_min: 0
  has_spider_side_chains_max: 0
  fraction_ring_system_min: 0.25
  fraction_ring_system_max: 1
  mce18_min: 20
  mce18_max: 140
  tpsa_min: 20
  tpsa_max: 140
  qed_min: 0.3
  qed_max: 1
filtered_cols_to_plot:
  - chars
  - n_atoms
  - n_heavy_atoms
  - n_het_atoms
  - n_N_atoms
  - n_O_atoms
  - n_S_atoms
  - n_NO_atoms
  - fN_atoms
  - fNS_atoms
  - n_small_rings_3_4
  - max_acyclic_chain_length
  - has_spider_side_chains
  - fraction_ring_system
  - ".=O"
  - C2r
  - C3r
  - Car
  - Cs2
  - Cs3
  - Csp
  - Nac
  - Nd+
  - Nd0
  - O_a
  - O_d
  - SO2
  - Sul
  - Hal
  - molWt
  - logP
  - sw
  - ring_size
  - n_rings
  - n_aroma_rings
  - n_fused_aromatic_rings
  - n_rigid_bonds
  - n_rot_bonds
  - hbd
  - hba
  - fsp3
  - mce18
  - tpsa
  - qed
discrete_features_to_plot:
  - chars
  - n_het_atoms
  - n_N_atoms
  - n_O_atoms
  - n_S_atoms
  - n_NO_atoms
  - ring_size
  - n_rings
  - n_aroma_rings
  - n_small_rings_3_4
  - max_acyclic_chain_length
  - has_spider_side_chains
  - ".=O"
  - C2r
  - C3r
  - Car
  - Cs2
  - Cs3
  - Csp
  - Nac
  - Nd+
  - Nd0
  - O_a
  - O_d
  - SO2
  - Sul
  - Hal
  - n_fused_aromatic_rings
  - n_rigid_bonds
  - n_rot_bonds
  - hbd
  - hba
not_to_smooth_plot_by_sides:
  - n_atoms
  - n_heavy_atoms
  - fN_atoms
  - fNS_atoms
  - molWt
  - fsp3
  - fraction_ring_system
  - tpsa
  - qed
renamer:
  chars: Chars in molecules
  n_atoms: Number of Atoms
  n_heavy_atoms: Number of Heavy Atoms
  n_het_atoms: Number of heteroatoms
  n_N_atoms: Number of Nitrogen Atoms
  n_O_atoms: Number of Oxygen Atoms
  n_S_atoms: Number of Sulfur Atoms
  n_NO_atoms: Number of Nitrogen or Oxygen Atoms
  fN_atoms: Fraction of Nitrogen Atoms
  fNS_atoms: Fraction of Nitrogen and Sulfur Atoms
  molWt: Molecular Weight
  logP: logP
  sw: Sw
  ring_size: Size of rings
  n_rings: Number of rings
  n_aroma_rings: Number of aromatic rings
  n_small_rings_3_4: Number of 3/4-membered rings
  max_acyclic_chain_length: Longest acyclic chain length
  has_spider_side_chains: Has spider side chains
  fraction_ring_system: Fraction of ring system atoms
  n_fused_aromatic_rings: Number of fused aromatic rings
  n_rigid_bonds: Number of rigid bonds
  n_rot_bonds: Number of rotatable bonds
  hbd: Hydrogen Bond Donors
  hba: Hydrogen Bond Acceptors
  fsp3: Fraction of SP3
  mce18: MCE-18 Complexity
  tpsa: TPSA
  qed: QED
  ".=O": Type Limit count for .=O
  C2r: Type Limit count for C2r
  C3r: Type Limit count for C3r
  Car: Type Limit count for Car
  Cs2: Type Limit count for Cs2
  Cs3: Type Limit count for Cs3
  Csp: Type Limit count for Csp
  Nac: Type Limit count for Nac
  Nd+: Type Limit count for Nd+
  Nd0: Type Limit count for Nd0
  O_a: Type Limit count for O_a
  O_d: Type Limit count for O_d
  SO2: Type Limit count for SO2
  Sul: Type Limit count for Sul
  Hal: Type Limit count for Hal

config_structFilters.yml

Controls structural alert screening and medicinal chemistry filters. Molecules flagged by enabled filters are removed from the pipeline.

General Settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| run | bool | true | Enable or disable the structural filters stage |
| filter_data | bool | true | Whether to actually remove flagged molecules from downstream stages |
| parse_input_n_jobs | int | -1 | Worker count for parsing input molecules |
| write_per_filter_outputs | bool | true | Write per-filter output folders and CSVs |
| generate_plots | bool | true | Generate structural filter plots |
| generate_failure_analysis | bool | true | Generate failure-analysis outputs |
| combine_in_memory | bool | true | Combine enabled filter results in memory before writing the final output |
| parallel_scheduler | string | processes | Default scheduler for parallel filter execution |

Structural Alerts

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| alerts_data_path | string | src/hedgehog/struct_filters/data/common_alerts_collection.csv | Path to CSV file containing structural alert SMARTS patterns |
| calculate_common_alerts | bool | true | Enable SMARTS-based structural alert screening |
| common_alerts_auto_n_jobs | bool | true | Enable size-aware worker selection for Common Alerts |
| common_alerts_small_input_threshold | int | 1000 | Molecule-count threshold for the small-input worker setting |
| common_alerts_small_input_n_jobs | int | 1 | Worker count when input size is below common_alerts_small_input_threshold |
| common_alerts_large_input_n_jobs | int | 12 | Worker count when input size is between common_alerts_small_input_threshold and 10000 |
| include_rulesets | list[string] | (see YAML below) | Alert rulesets to activate (e.g., Dundee, BMS, PAINS, Glaxo, etc.) |
| exclude_descriptions | dict[string, list[string]] | (see YAML below) | Per-ruleset list of alert descriptions to exclude (override false positives) |
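
The size-aware worker selection for Common Alerts can be pictured roughly as follows. This is a sketch, not Hedgehog's code: the function name pick_common_alerts_n_jobs is hypothetical, the treatment of the exact boundary values is an assumption, and the behavior above 10000 molecules (here: fall back to a caller-supplied default) is not specified in this reference and is assumed for illustration.

```python
from typing import Mapping


def pick_common_alerts_n_jobs(
    n_mols: int,
    cfg: Mapping,
    default_n_jobs: int = -1,   # assumption: used when auto selection does not apply
    large_upper: int = 10_000,  # upper edge of the "large input" band from the table
) -> int:
    """Hypothetical sketch of size-aware worker selection."""
    if not cfg.get("common_alerts_auto_n_jobs", True):
        return default_n_jobs
    threshold = cfg.get("common_alerts_small_input_threshold", 1000)
    if n_mols < threshold:
        # Small inputs: parallel start-up overhead usually outweighs the gain.
        return cfg.get("common_alerts_small_input_n_jobs", 1)
    if n_mols <= large_upper:
        return cfg.get("common_alerts_large_input_n_jobs", 12)
    # Assumption: beyond the documented band, defer to the default worker setting.
    return default_n_jobs


cfg = {
    "common_alerts_auto_n_jobs": True,
    "common_alerts_small_input_threshold": 1000,
    "common_alerts_small_input_n_jobs": 1,
    "common_alerts_large_input_n_jobs": 12,
}
for size in (500, 5000, 50000):
    print(size, "->", pick_common_alerts_n_jobs(size, cfg))
```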

Molecular Graph and Complexity Filters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| calculate_molgraph_stats | bool | true | Compute molecular graph statistics (connectivity, bridges, etc.) |
| calculate_molcomplexity | bool | true | Compute molecular complexity scores |
| calculate_NIBR | bool | true | Run Novartis In-silico ADME/Tox (NIBR) filter |
| molgraph_scheduler | string | processes | Parallelism for molecular graph calculations |
| nibr_scheduler | string | processes | Parallelism for NIBR: threads or processes |
| calculate_bredt | bool | true | Run Bredt’s rule violation check (strained bridgehead double bonds) |
| calculate_lilly | bool | true | Run Lilly medchem rules filter |
| lilly_scheduler | string | threads | Parallelism for Lilly: threads or processes |

Medchem Functional Filters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| calculate_protecting_groups | bool | true | Flag molecules containing protecting groups (Boc, Fmoc, Cbz, etc.) |
| calculate_ring_infraction | bool | true | Flag molecules with strained ring infractions |
| ring_infraction_hetcycle_min_size | int | 4 | Minimum heterocycle ring size before flagging as an infraction |
| calculate_stereo_center | bool | true | Flag molecules with excessive stereocenters |
| stereo_max_centers | int | 4 | Maximum allowed total stereocenters |
| stereo_max_undefined | int | 2 | Maximum allowed undefined stereocenters |
| calculate_halogenicity | bool | true | Flag molecules with excessive halogen counts |
| halogenicity_thresh_F | int | 6 | Maximum allowed fluorine atoms |
| halogenicity_thresh_Br | int | 3 | Maximum allowed bromine atoms |
| halogenicity_thresh_Cl | int | 3 | Maximum allowed chlorine atoms |
| calculate_symmetry | bool | false | Flag highly symmetric molecules (off by default; many drugs are symmetric) |
| symmetry_threshold | float | 0.8 | Symmetry score threshold above which a molecule is flagged |

Structural Filter Profiles

The default structural filter configuration is the exploration profile in config_structFilters.yml. Three named ready-to-use profile files are shipped alongside it:

  • config_structFilters_strict.yml - conservative profile for high-confidence hygiene screening
  • config_structFilters_balanced.yml - practical mid-conservatism profile
  • config_structFilters_exploration.yml - least conservative profile for retaining more chemistry diversity

Default YAML

# Structural filters config - exploration profile (default)
run: true
filter_data: true
parse_input_n_jobs: -1
write_per_filter_outputs: true
generate_plots: true
generate_failure_analysis: true
combine_in_memory: true
parallel_scheduler: processes
alerts_data_path: src/hedgehog/struct_filters/data/common_alerts_collection.csv
calculate_common_alerts: true
common_alerts_auto_n_jobs: true
common_alerts_small_input_threshold: 1000
common_alerts_small_input_n_jobs: 1
common_alerts_large_input_n_jobs: 12
include_rulesets:
  - Dundee
  - BMS
  - Inpharmatica
  - LD50-Oral
  - Glaxo
  - PAINS
  - AlphaScreen-Hitters
  - Frequent-Hitter
  - Chelator
  - SureChEMBL
  - GST-Hitters
  - HIS-Hitters
  - LuciferaseInhibitor
exclude_descriptions:
  Dundee:
    - Aliphatic long chain
    - isolated alkene
    - triple bond
  Inpharmatica:
    - Filter82_pyridinium
  LD50-Oral:
    - phenylpiperazine
  SureChEMBL:
    - aminothiazole
  HIS-Hitters:
    - Picolylamines_A
calculate_molgraph_stats: true
calculate_molcomplexity: true
calculate_NIBR: true
molgraph_scheduler: processes
nibr_scheduler: processes
calculate_bredt: true
calculate_lilly: true
lilly_scheduler: threads
calculate_protecting_groups: true
calculate_ring_infraction: true
ring_infraction_hetcycle_min_size: 4
calculate_stereo_center: true
stereo_max_centers: 4
stereo_max_undefined: 2
calculate_halogenicity: true
halogenicity_thresh_F: 6
halogenicity_thresh_Br: 3
halogenicity_thresh_Cl: 3
calculate_symmetry: false
symmetry_threshold: 0.8

config_synthesis.yml

Controls the retrosynthesis feasibility stage, including synthesizability score thresholds.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| run | bool | true | Enable or disable the synthesis stage |
| n_jobs | int | -1 | Worker count for synthesis scoring and retrosynthesis (-1/0 = auto/all available cores) |
| enabled_scores | list | sa, syba, rascore, sync, scscore, nonpher, fsscore, gasa | Synthesis score calculators to run. Optional scorers return NaN with warnings when their external dependencies are not configured |
| run_retrosynthesis | bool | true | Run AiZynthFinder retrosynthetic analysis |
| filter_solved_only | bool | true | Keep only molecules for which a retrosynthetic route was found |
| sa_score_min | float | 1 | Minimum synthetic accessibility score (Ertl) |
| sa_score_max | float | 4.5 | Maximum synthetic accessibility score (lower = easier to synthesize) |
| syba_score_min | float | 0 | Minimum SYBA score (Bayesian synthesizability) |
| syba_score_max | float | inf | Maximum SYBA score |
| ra_score_min | float | 0.5 | Minimum retrosynthetic accessibility score |
| ra_score_max | float | 1 | Maximum retrosynthetic accessibility score |
| sync_auto_install | bool | true | Download the SYNC checkpoint automatically when it is missing |
| sync_device | string | cpu | Torch device for SYNC inference |
| sync_conformer_seed | int | 61453 | RDKit ETKDG conformer seed for SYNC inputs |
| fsscore_python | string \| null | null | Python interpreter for isolated FSScore worker environment |
| fsscore_model_path | string \| null | null | Explicit FSScore checkpoint path (*.ckpt) |
| fsscore_repo_path | string \| null | null | Optional FSScore checkout path used to resolve models/pretrain_graph_GGLGGL_ep242_best_valloss.ckpt |
| fsscore_batch_size | int | 128 | Batch size passed to fsscore.score |
| fsscore_num_workers | int \| null | null | Optional dataloader worker count passed to fsscore.score |
| score_filters | object | {} | Optional min/max filters for additional score columns such as sync_score, sc_score, nonpher_complexity_score, fs_score, or gasa_score |
| gasa.command | string | null | Optional local command template for batch gasa scoring using {input} and {output} placeholders |
| gasa.executable | string | null | Optional local executable path/name used for gasa scoring (<exe> --smiles <SMILES>) |
| gasa.api_url | string | null | Optional local loopback HTTP endpoint for gasa scoring (POST {"smiles": ...}) |
| gasa.timeout_seconds | float | 30 | Timeout per gasa backend call |

Default YAML

run: true
n_jobs: -1
enabled_scores:
  - sa
  - syba
  - rascore
  - sync
  - scscore
  - nonpher
  - fsscore
  - gasa
run_retrosynthesis: true
filter_solved_only: true
sa_score_min: 1
sa_score_max: 4.5
syba_score_min: 0
syba_score_max: inf
ra_score_min: 0.5
ra_score_max: 1
sync_auto_install: true
sync_device: cpu
sync_conformer_seed: 61453
fsscore_python:
fsscore_model_path:
fsscore_repo_path:
fsscore_batch_size: 128
fsscore_num_workers:
score_filters:
  sync_score:
    min: 0.5
    max: 1
  sc_score:
    min:
    max:
  nonpher_complexity_score:
    min:
    max:
  fs_score:
    min:
    max:
  gasa_score:
    min:
    max:
gasa:
  command:
  executable:
  api_url:
  timeout_seconds: 30
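
One plausible reading of the score_filters block is that an empty min or max leaves that side unbounded. The sketch below illustrates that reading only; the function name passes_score_filters is hypothetical, and both the "empty means unbounded" rule and the handling of NaN scores (a NaN fails any active bound but passes when no bound is set) are assumptions, not confirmed Hedgehog behavior.

```python
from typing import Mapping, Optional

NAN = float("nan")


def passes_score_filters(
    scores: Mapping[str, float],
    score_filters: Mapping[str, Optional[Mapping[str, Optional[float]]]],
) -> bool:
    """Return True if every active min/max bound is satisfied."""
    for column, bounds in score_filters.items():
        bounds = bounds or {}
        lo, hi = bounds.get("min"), bounds.get("max")
        value = scores.get(column, NAN)
        # "not value >= lo" is also True for NaN, so NaN fails an active bound.
        if lo is not None and not value >= lo:
            return False
        if hi is not None and not value <= hi:
            return False
    return True


# Parsed form of the defaults above: empty YAML values load as None.
score_filters = {
    "sync_score": {"min": 0.5, "max": 1},
    "sc_score": {"min": None, "max": None},
}

print(passes_score_filters({"sync_score": 0.7, "sc_score": NAN}, score_filters))
print(passes_score_filters({"sync_score": 0.3, "sc_score": 2.1}, score_filters))
```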

Optional external scorers are configured outside the base dependency set:

  • Set HEDGEHOG_OPTIONAL_ENV_ROOT to a writable host-local directory (for example ~/work/hedgehog_optional_envs) so FSScore/GASA/Nonpher uv bootstraps stay isolated and portable across servers. Keep output folders in shared storage.
  • nonpher can run in-process or via an external worker (HEDGEHOG_NONPHER_PYTHON). If the runtime is unavailable, nonpher_complexity_score is reported as NaN. Validate the setup with uv run hedgehog setup nonpher-check, or with uv run hedgehog setup nonpher-check --python ~/work/hedgehog_optional_envs/nonpher/bin/python to check a specific interpreter.
  • With --auto-install / HEDGEHOG_AUTO_INSTALL=1, HEDGEHOG attempts a uv-only Nonpher bootstrap under $HEDGEHOG_OPTIONAL_ENV_ROOT/nonpher (or .venv-nonpher-worker), using pinned numpy<2 and rdkit-pypi plus git installs of nonpher and molpher-lib.
  • If uv-only bootstrap fails on native blockers (for example cannot find -lmolpher or other unresolved linker/system dependencies), HEDGEHOG logs the exact blocker and leaves Nonpher scores as NaN.
  • HEDGEHOG_NONPHER_PYTHON always takes precedence and can point to any validated isolated interpreter, including a prebuilt shared hybrid runtime when uv-only is blocked.
  • uv run hedgehog setup fsscore --yes clones the upstream FSScore checkout into modules/fsscore.
  • HEDGEHOG_FSSCORE_PYTHON points to the isolated FSScore runtime. With HEDGEHOG_AUTO_INSTALL=1, missing Python/model settings can be auto-wired via ensure_fsscore_runtime when no explicit HEDGEHOG_FSSCORE_COMMAND is set.
  • HEDGEHOG_FSSCORE_MODEL_PATH sets an explicit checkpoint path. Alternatively, set HEDGEHOG_FSSCORE_REPO_PATH and HEDGEHOG resolves the default checkpoint under models/.
  • HEDGEHOG_FSSCORE_COMMAND can provide a custom command template using {input}, {output}, {smiles_col}, {model_path}, {batch_size}, and {n_jobs} placeholders.
  • HEDGEHOG_GASA_COMMAND can provide a custom batch command template using {input}, {output}, {smiles_col}, {model_path}, {batch_size}, and {n_jobs} placeholders.
  • HEDGEHOG_GASA_EXECUTABLE points to a local executable; HEDGEHOG_GASA_API_URL points to a local loopback API endpoint. With HEDGEHOG_AUTO_INSTALL=1, a missing backend can be auto-populated through ensure_gasa_worker + hedgehog.workers.gasa_worker. If backend setup still fails, scores default to NaN with a warning.

config_docking.yml

Controls molecular docking using SMINA, GNINA, Matcha, or any explicit combination of them. Defines the receptor, search box, and engine-specific parameters.

General Settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| run | bool | true | Enable or disable the docking stage |
| tools | string | gnina | Docking engine selection: all, gnina, smina, matcha, or a comma-separated list such as gnina,matcha |
| receptor_pdb | string | src/hedgehog/configs/examples/7EW9_apo.pdb | Path to the receptor PDB file |
| auto_run | bool | true | Automatically start docking after ligand preparation |
| run_in_background | bool | false | Run docking as a background process |
| prepare_ligands | bool | false | Use external ligand preparation before docking. false keeps the input molecule mapping as close to 1:1 as possible; true may expand one input molecule into multiple prepared ligands |
| gnina_per_process_cpu | int | gnina_config.cpu | CPU threads per GNINA process in per-molecule mode |
| gnina_parallel_jobs_max | int | 6 | Upper bound for auto GNINA per-molecule job count |
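
To make the accepted forms of the tools value concrete, here is a small parsing sketch. It is illustrative only: parse_tools is a hypothetical name, and the case-insensitive matching, de-duplication, and the engine order returned for all are assumptions rather than documented behavior.

```python
from typing import List

KNOWN_TOOLS = ("gnina", "smina", "matcha")


def parse_tools(tools: str) -> List[str]:
    """Expand a tools setting into an ordered list of docking engines."""
    names = [part.strip().lower() for part in tools.split(",") if part.strip()]
    if names == ["all"]:
        return list(KNOWN_TOOLS)
    unknown = [name for name in names if name not in KNOWN_TOOLS]
    if not names or unknown:
        raise ValueError(f"unsupported tools value: {tools!r}")
    # Drop repeated entries while keeping the order the user wrote.
    return list(dict.fromkeys(names))


print(parse_tools("gnina"))
print(parse_tools("gnina,matcha"))
print(parse_tools("all"))
```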

SMINA Configuration (smina_config)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| bin | string | smina | Path or name of the SMINA binary (resolved via PATH if not absolute) |
| autobox_ligand | string | src/hedgehog/configs/examples/05C_from_7EW9.sdf | Reference ligand SDF for automatic search box definition |
| autobox_add | float | 4 | Padding (Angstroms) added to each side of the autobox |
| cpu | int | 32 | Number of CPU threads for docking |
| seed | int | 42 | Random seed for reproducibility |
| exhaustiveness | int | 8 | Search exhaustiveness (higher = more thorough, slower) |
| num_modes | int | 1 | Maximum number of binding modes to generate per ligand |

GNINA Configuration (gnina_config)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| bin | string | gnina | Path or name of the GNINA binary (resolved via PATH if not absolute) |
| autobox_ligand | string | src/hedgehog/configs/examples/05C_from_7EW9.sdf | Reference ligand SDF for automatic search box definition |
| autobox_add | float | 4 | Padding (Angstroms) added to each side of the autobox |
| cpu | int | 8 | Number of CPU threads for docking |
| seed | int | 42 | Random seed for reproducibility |
| no_gpu | bool | false | Disable GPU acceleration (false keeps GPU enabled when available) |
| num_modes | int | 1 | Maximum number of binding modes to generate per ligand |

Matcha Configuration (matcha_config)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| checkout_dir | string | modules/matcha_remote | Managed Matcha checkout directory populated from GitHub |
| uv_bin | string | uv | Launcher used to invoke Matcha |
| autobox_ligand | string | src/hedgehog/configs/examples/05C_from_7EW9.sdf | Optional Matcha autobox reference ligand |
| device | string | auto | Matcha device selection (auto, cpu, cuda, cuda:N, mps) |
| n_samples | int | 20 | Number of Matcha poses generated per ligand |
| scorer | string | gnina | Matcha scorer mode (gnina, custom, none) |
| scorer_minimize | bool | true | Minimize poses during Matcha GNINA scoring |
| physical_only | bool | false | Keep only physically valid poses in Matcha outputs |
| keep_workdir | bool | false | Preserve Matcha internal work directory after the run |

Default YAML

    run: true
    tools: gnina
    receptor_pdb: src/hedgehog/configs/examples/7EW9_apo.pdb
    auto_run: true
    run_in_background: false
    prepare_ligands: false
    gnina_per_process_cpu: 8
    gnina_parallel_jobs_max: 6
    smina_config:
      bin: smina
      autobox_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf
      autobox_add: 4
      cpu: 32
      seed: 42
      exhaustiveness: 8
      num_modes: 1
    gnina_config:
      bin: gnina
      autobox_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf
      autobox_add: 4
      cpu: 8
      seed: 42
      no_gpu: false
      num_modes: 1
    matcha_config:
      checkout_dir: modules/matcha_remote
      uv_bin: uv
      autobox_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf
      device: auto
      n_samples: 20
      scorer: gnina
      scorer_minimize: true
      physical_only: false
      keep_workdir: false

When prepare_ligands is true, one input molecule may produce several prepared ligands. This can change row counts and downstream mapping. Keep it false for the default 1:1-oriented docking path unless you explicitly need an external preparation workflow.
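
As a hedged illustration of the docking parameters above, the fragment below sketches how the stage might be adjusted for a machine without a usable GPU. It uses only keys documented in this section; the specific numbers (4 threads per process, 4 parallel jobs) are assumptions chosen for illustration, not recommendations.

```yaml
# Hypothetical CPU-only docking override (values are illustrative).
tools: gnina
prepare_ligands: false        # keep the 1:1-oriented molecule mapping
gnina_per_process_cpu: 4      # CPU threads per GNINA process in per-molecule mode
gnina_parallel_jobs_max: 4    # upper bound for the auto per-molecule job count
gnina_config:
  bin: gnina
  autobox_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf
  autobox_add: 4
  cpu: 8
  seed: 42
  no_gpu: true                # disable GPU acceleration
  num_modes: 1
```

This reference does not state whether a partial file is merged with the defaults, so the safer route is to change these keys in a full copy of the default YAML.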


config_docking_filters.yml

Post-docking filters that evaluate the quality of docked poses and remove poor candidates. Five independent filters can be combined with all (every filter must pass) or any (at least one must pass) aggregation.

General Settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| run | bool | true | Enable or disable the docking filters stage |
| run_after_docking | bool | true | Automatically run after the docking stage completes |
| input_sdf | string \| null | null | Path to input SDF; if null, uses docking output |
| receptor_pdb | string \| null | null | Path to receptor PDB; if null, uses docking config value |

Search Box (search_box)

Ensures the docked pose remains inside the configured docking search box.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | true | Enable this filter |
| max_outside_fraction | float | 0.0 | Maximum fraction of atoms allowed outside the box (0.0 = all must be inside) |
| short_circuit | bool | true | When aggregation.mode is all, skip expensive filters for poses that already failed |

Filter 1: Pose Quality (pose_quality)

Checks docked-pose quality. The default backend is posebusters_fast; the legacy optional backend is posecheck.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | true | Enable this filter |
| backend | string | posebusters_fast | Pose quality backend: posebusters_fast or legacy posecheck |
| clash_cutoff | float | 0.75 | Relative VDW distance cutoff for fast clash detection |
| volume_clash_cutoff | float | 0.075 | ShapeTverskyIndex overlap threshold for fast volume clash detection |
| max_distance | float | 5.0 | Maximum minimum ligand-protein distance in Angstroms |
| max_clashes | int | 2 | Legacy PoseCheck maximum allowed steric clashes |
| max_strain_energy | float | 50.0 | Legacy PoseCheck maximum ligand strain energy in kcal/mol |
| strain_forcefield | string | UFF | Legacy PoseCheck force field for strain calculation |
| clash_tolerance | float | 0.5 | Legacy PoseCheck VDW overlap tolerance in Angstroms |

Filter 2: Interactions (interactions)

Evaluates protein-ligand interactions using ProLIF.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | true | Enable this filter |
| reference_ligand | string \| null | null | Path to reference ligand SDF for interaction similarity |
| min_hbonds | int | 0 | Minimum number of hydrogen bonds required (0 = no requirement) |
| required_residues | list[string] | ['ASP12'] | Residue identifiers that must have at least one interaction |
| forbidden_residues | list[string] | [] | Residues that must NOT have any interaction |
| interaction_types | list[string] | [HBDonor, HBAcceptor, Hydrophobic, VdWContact] | Interaction types to evaluate |
| reporting.enabled | bool | true | Generate interaction reporting artifacts in the stage output |
| similarity_threshold | float | 0.0 | Minimum Tanimoto similarity to reference interactions (0 = disabled) |
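
To make the interaction parameters concrete, here is a sketch of a stricter interactions block. It requires at least one hydrogen bond and turns on similarity against a reference ligand. The thresholds (1 and 0.3) are illustrative assumptions. The reference ligand path is borrowed from the docking examples and may not suit your receptor.

```yaml
# Hypothetical stricter interaction filter (thresholds are illustrative).
interactions:
  enabled: true
  reference_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf
  min_hbonds: 1                 # require at least one hydrogen bond
  required_residues: ['ASP12']  # same identifier format as the default
  forbidden_residues: []
  interaction_types:
    - HBDonor
    - HBAcceptor
  reporting:
    enabled: true
  similarity_threshold: 0.3     # 0 would disable the similarity check
```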

Filter 3: Shepherd-Score (shepherd_score)

Compares 3D molecular shape to a reference ligand using Gaussian overlap. The default backend, auto, tries an isolated worker first, then an in-process import, and soft-skips the filter if neither backend is available.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Enable this filter (requires a reference ligand) |
| backend | string | auto | Backend mode: auto, worker, or inprocess |
| auto_install_worker | bool | true | If worker command is missing, attempt hedgehog setup shepherd-worker automatically |
| worker_python | string \| null | null | Optional Python interpreter passed to worker setup (e.g. python3.12) |
| reference_ligand | string \| null | null | Path to reference ligand SDF (required if enabled) |
| min_shape_score | float | 0.5 | Minimum Gaussian overlap Tanimoto score |
| alpha | float | 0.81 | Gaussian width parameter |
| align_before_scoring | bool | true | Align molecules before computing shape similarity |
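
Because this filter is disabled by default and needs a reference ligand, a minimal enabling block might look like the sketch below. The reference ligand path is reused from the docking examples as an assumption. Pinning backend to worker and worker_python to python3.12 is also only an illustration of the documented options.

```yaml
# Hypothetical block enabling the Shepherd-Score filter.
shepherd_score:
  enabled: true
  backend: "worker"             # skip the auto fallback chain; use the isolated worker
  auto_install_worker: true
  worker_python: "python3.12"   # interpreter passed to worker setup
  reference_ligand: src/hedgehog/configs/examples/05C_from_7EW9.sdf
  min_shape_score: 0.5
  alpha: 0.81
  align_before_scoring: true
```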

Filter 4: Conformer Deviation (conformer_deviation)

Checks whether the docked pose is geometrically plausible by comparing against generated conformers.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | true | Enable this filter |
| use_nvmolkit | bool | true | Try nvMolKit acceleration when available (fallback to RDKit if unavailable) |
| num_conformers | int | 50 | Number of reference conformers to generate |
| conformer_method | string | ETKDGv3 | Conformer generation method: ETKDG, ETKDGv2, ETKDGv3 |
| max_rmsd_to_conformer | float | 3.0 | Maximum RMSD (Angstroms) between docked pose and nearest conformer |
| random_seed | int | 42 | Random seed for conformer generation |
| include_hydrogens | bool | false | Include hydrogens in RMSD matching |
| max_matches | int | 10000 | Maximum symmetry matches considered by RMSD calculation |
| early_stop_on_pass | bool | true | Stop conformer comparison as soon as one conformer passes |
| optimize_conformers | bool | false | Apply UFF force field optimization to conformers (slow) |

Aggregation (aggregation)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| mode | string | all | all = molecule must pass every enabled filter; any = pass at least one |
| save_metrics | bool | true | Save detailed per-molecule metrics to a CSV file |
| save_failed | bool | false | Save molecules that failed filtering to a separate file |
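
As a small sketch of the aggregation options, the block below switches to the permissive any mode and keeps the rejected molecules for inspection. Only documented keys are used. Whether this combination suits a given campaign is a judgment call.

```yaml
# Hypothetical permissive aggregation: a molecule passes if at least one
# enabled filter passes, and rejected molecules are written to a separate file.
aggregation:
  mode: "any"
  save_metrics: true
  save_failed: true
```

Note that search_box.short_circuit is documented only for mode all, so it should not be expected to save work in this setup.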

Default YAML

    run: true
    run_after_docking: true
    input_sdf: null
    receptor_pdb: null
    search_box:
      enabled: true
      max_outside_fraction: 0.0
      short_circuit: true
    pose_quality:
      enabled: true
      backend: "posebusters_fast"
      clash_cutoff: 0.75
      volume_clash_cutoff: 0.075
      max_distance: 5.0
      max_clashes: 2
      max_strain_energy: 50.0
      strain_forcefield: "UFF"
      clash_tolerance: 0.5
    interactions:
      enabled: true
      reference_ligand: null
      min_hbonds: 0
      required_residues: ['ASP12']
      forbidden_residues: []
      interaction_types:
        - HBDonor
        - HBAcceptor
        - Hydrophobic
        - VdWContact
      reporting:
        enabled: true
      similarity_threshold: 0.0
    shepherd_score:
      enabled: false
      backend: "auto"
      auto_install_worker: true
      worker_python: null
      reference_ligand: null
      min_shape_score: 0.5
      alpha: 0.81
      align_before_scoring: true
    conformer_deviation:
      enabled: true
      use_nvmolkit: true
      num_conformers: 50
      conformer_method: "ETKDGv3"
      max_rmsd_to_conformer: 3.0
      random_seed: 42
      include_hydrogens: false
      max_matches: 10000
      early_stop_on_pass: true
      optimize_conformers: false
    aggregation:
      mode: "all"
      save_metrics: true
      save_failed: false

config_weighted_score.yml

Controls the post-run Generator Reality Assessment used by HTML reporting and RUN_INFO.md.

The scorecard is explainable and intended to rank generator behavior, not to estimate hit probability. It also reports a secondary Final Candidate Pool Quality score for the survivor set.

General Settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| run | bool | true | Enable or disable weighted model scoring output |
| version | string | v1 | Internal scorecard schema version |
| mode | string | generator_reality | Scoring mode label for the gate-aware generator score |
| target_final_count | int | 100 | Target final count retained for secondary candidate-pool yield scoring |
| target_final_retention | float | 0.10 | Target final retention rate for generator yield scoring |
| confidence.min_final_molecules_high | int | 100 | Minimum final molecules for high confidence |
| confidence.min_final_molecules_medium | int | 30 | Minimum final molecules for medium confidence |

Component Weights (weights)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| weights.yield | float | 0.30 | Weight for final retention against target |
| weights.physchem | float | 0.15 | Weight for descriptor all-pass gate survival |
| weights.structural | float | 0.25 | Weight for structural stage survival |
| weights.synthesis | float | 0.10 | Weight for synthesis component |
| weights.docking_pose | float | 0.15 | Weight for docking/pipeline pose component |
| weights.diversity | float | 0.05 | Weight for diversity metrics component |

Weights are normalized over all configured components before scoring. When one component is unavailable, it is simply excluded, and the effective average is recomputed from the remaining available components.
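
The renormalization described above can be worked through by hand on the default weights. The sketch below assumes that an unavailable component is dropped and the remaining weights are divided by their sum. That is one straightforward reading of this paragraph, not a confirmed description of the implementation.

```yaml
weights:
  yield: 0.30
  physchem: 0.15
  structural: 0.25
  synthesis: 0.10
  docking_pose: 0.15
  diversity: 0.05        # all six defaults sum to 1.00
# Assumed behavior if diversity is unavailable for a run:
# the remaining weights sum to 0.95, so each effective weight is w / 0.95.
#   yield         0.30 / 0.95 ≈ 0.316
#   physchem      0.15 / 0.95 ≈ 0.158
#   structural    0.25 / 0.95 ≈ 0.263
#   synthesis     0.10 / 0.95 ≈ 0.105
#   docking_pose  0.15 / 0.95 ≈ 0.158
```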

physchem is measured from stages/01_descriptors_initial/filtered/pass_flags.csv as an all-pass descriptor gate rate, so it reflects the early generated set rather than the final survivor pool. The mean flag pass rate is retained as evidence only. structural uses the stage survival rate from filtered plus failed molecules, with the weakest structural filter as supporting evidence. Final descriptor files are used only as a fallback for older or partial runs. synthesis and docking_pose similarly prefer full stage evaluation artifacts before filtered/final survivor files.

Secondary Candidate Pool Weights (candidate_pool_weights)

candidate_pool_weights controls the secondary Final Candidate Pool Quality score. This score keeps the older survivor-pool interpretation: final-count yield saturation, mean descriptor flag pass rate, mean structural flag pass rate, and the same synthesis/docking/diversity formulas. Its default values are listed in the Default YAML below.

Yield and Structural Settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| yield.mode | string | retention | Use final retention for the generator score; absolute restores count-saturation yield |
| yield.target_final_retention | float | 0.10 | Retention rate that maps to a full yield score |
| yield.count_weight | float | 0.70 | Count-saturation weight for secondary candidate-pool yield |
| yield.retention_weight | float | 0.30 | Log-retention weight for secondary candidate-pool yield |
| structural.stage_pass_weight | float | 0.80 | Weight for structural stage survival |
| structural.worst_filter_weight | float | 0.20 | Weight for the weakest structural filter pass rate |

Hard Caps (hard_caps)

Hard caps prevent a model from receiving a high generator score when an early AND-gate rejects most molecules.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| hard_caps.structural_stage_pass_rate_below | float | 0.20 | Trigger threshold for structural stage survival |
| hard_caps.structural_stage_pass_rate_cap | float | 60.0 | Maximum score after structural cap trigger |
| hard_caps.descriptor_all_pass_rate_below | float | 0.50 | Trigger threshold for descriptor all-pass survival |
| hard_caps.descriptor_all_pass_rate_cap | float | 70.0 | Maximum score after descriptor cap trigger |
| hard_caps.final_retention_rate_below | float | 0.05 | Trigger threshold for final retention |
| hard_caps.final_retention_rate_cap | float | 70.0 | Maximum score after retention cap trigger |
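
Each cap is a pair: a *_below trigger and a *_cap ceiling. The commented fragment below reads the defaults against a made-up run. The example rates (0.15 and 0.62) are hypothetical, and the semantics in the comments are an assumed reading of the table rather than a statement about the implementation.

```yaml
hard_caps:
  structural_stage_pass_rate_below: 0.20
  structural_stage_pass_rate_cap: 60.0
  descriptor_all_pass_rate_below: 0.50
  descriptor_all_pass_rate_cap: 70.0
# Hypothetical run, assumed semantics:
#   structural stage survival = 0.15 -> below 0.20, so the score is capped at 60.0
#   descriptor all-pass rate  = 0.62 -> not below 0.50, so this cap does not trigger
# The generator score therefore cannot exceed 60.0, whatever the weighted
# components would otherwise produce.
```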

Docking Thresholds (docking)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| docking.bad_affinity | float | -6.0 | Affinity at which docking contribution starts to approach zero |
| docking.good_affinity | float | -9.0 | Affinity at which docking affinity contribution reaches upper bound |
| docking.bad_cnnscore | float | 0.35 | GNINA CNN score lower bound |
| docking.good_cnnscore | float | 0.85 | GNINA CNN score upper bound |
| docking.bad_cnnaffinity | float | 4.5 | CnnAffinity lower bound |
| docking.good_cnnaffinity | float | 6.5 | CnnAffinity upper bound |

Increase strictness by moving bad_* upward and good_* downward, or relax by widening the interval.

Synthesis Thresholds (synthesis)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| synthesis.sa_min | float | 1.0 | Easier-to-synthesize SA floor |
| synthesis.sa_max | float | 4.5 | Harder-to-synthesize SA ceiling |
| synthesis.ra_min | float | 0.5 | Retrosynthetic accessibility minimum |
| synthesis.ra_max | float | 1.0 | Retrosynthetic accessibility maximum |
| synthesis.syba_midpoint | float | 0.0 | Sigmoid midpoint for SYBA |
| synthesis.syba_scale | float | 50.0 | Sigmoid width for SYBA |
| synthesis.target_search_time_sec | float | 30.0 | Reference retrosynthesis search time |
| synthesis.search_time_scale_sec | float | 20.0 | Search-time penalty scale |

Raise or lower these to bias toward faster/easier synthetic routes.

Default YAML

    run: true
    version: v1
    mode: generator_reality
    target_final_count: 100
    target_final_retention: 0.10
    weights:
      yield: 0.30
      physchem: 0.15
      structural: 0.25
      synthesis: 0.10
      docking_pose: 0.15
      diversity: 0.05
    candidate_pool_weights:
      yield: 0.10
      physchem: 0.15
      structural: 0.15
      synthesis: 0.20
      docking_pose: 0.30
      diversity: 0.10
    yield:
      mode: retention
      target_final_retention: 0.10
      count_weight: 0.70
      retention_weight: 0.30
    structural:
      stage_pass_weight: 0.80
      worst_filter_weight: 0.20
    docking:
      bad_affinity: -6.0
      good_affinity: -9.0
      bad_cnnscore: 0.35
      good_cnnscore: 0.85
      bad_cnnaffinity: 4.5
      good_cnnaffinity: 6.5
    synthesis:
      sa_min: 1.0
      sa_max: 4.5
      ra_min: 0.5
      ra_max: 1.0
      syba_midpoint: 0.0
      syba_scale: 50.0
      target_search_time_sec: 30.0
      search_time_scale_sec: 20.0
    confidence:
      min_final_molecules_high: 100
      min_final_molecules_medium: 30
    hard_caps:
      structural_stage_pass_rate_below: 0.20
      structural_stage_pass_rate_cap: 60.0
      descriptor_all_pass_rate_below: 0.50
      descriptor_all_pass_rate_cap: 70.0
      final_retention_rate_below: 0.05
      final_retention_rate_cap: 70.0

config_moleval.yml

Controls generative evaluation metrics computed during report generation. These metrics assess diversity, scaffold coverage, and basic filter pass rates across pipeline stages.

General Settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| run | bool | true | Enable or disable MolEval metric computation |
| n_jobs | int | -1 | Number of parallel workers for metric computation (-1 = all available cores) |
| device | string | cpu | Compute device: cpu or cuda:0 (for neural metrics) |
| max_molecules | int | 2000 | Subsample threshold for O(N^2) metrics; datasets larger than this are subsampled |
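
For illustration, the general settings might be adjusted as follows on a GPU machine that can afford a larger subsample. The values 4 and 5000 are assumptions. Because the affected metrics are O(N^2), raising max_molecules from 2000 to 5000 means roughly 6.25 times as many pairwise comparisons.

```yaml
# Hypothetical MolEval general settings (values are illustrative).
run: true
n_jobs: 4             # fixed worker count instead of -1 (all cores)
device: cuda:0        # used for neural metrics
max_molecules: 5000   # larger subsample threshold; pairwise cost grows quadratically
```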

Metric Groups

Each flag enables or disables a group of related metrics.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| validity | bool | false | Compute validity rate (disabled by default; always 1.0 after RDKit parsing) |
| uniqueness | bool | false | Compute uniqueness rate (disabled by default; always 1.0 after deduplication) |
| internal_diversity | bool | true | Compute IntDiv1 and IntDiv2 (intra-set Tanimoto diversity) |
| se_diversity | bool | true | Compute sphere-exclusion diversity (SEDiv) |
| scaffold_diversity | bool | true | Compute ScaffDiv and ScaffUniqueness (Murcko scaffold analysis) |
| functional_groups | bool | true | Compute functional group diversity ratio (FG) |
| ring_systems | bool | true | Compute ring system diversity ratio (RS) |
| filters | bool | true | Compute MCF + PAINS filter passage rate |
| mce18 | bool | true | Compute mean MCE-18 molecular complexity score |

Default YAML

    run: true
    n_jobs: -1
    device: cpu
    max_molecules: 2000
    validity: false
    uniqueness: false
    internal_diversity: true
    se_diversity: true
    scaffold_diversity: true
    functional_groups: true
    ring_systems: true
    filters: true
    mce18: true