Data Contract
This page documents the stable molecule-table contract used across pipeline stages.
Input Molecule Table
Recommended input is CSV or TSV with a smiles header:
smiles,model_name
CCO,model_a
CCN,model_a
c1ccccc1,model_b
Required:
smiles
Optional:
model_name or name
mol_idx
Generated:
mol_idx is assigned automatically if missing.
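As an illustration of the input contract, the following sketch reads a CSV or TSV molecule table and checks that the required smiles column is present. The function name read_molecule_table is hypothetical and not part of HEDGEHOG; it only shows one way a caller might validate a table before handing it to the pipeline.

```python
import csv
import io

REQUIRED_COLUMNS = ("smiles",)


def read_molecule_table(text, delimiter=","):
    """Parse a delimited molecule table and check the required column.

    Hypothetical helper for illustration; not a HEDGEHOG API.
    Pass delimiter="\t" for TSV input.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing required column(s): {missing}")
    return list(reader)


example = "smiles,model_name\nCCO,model_a\nCCN,model_a\nc1ccccc1,model_b\n"
rows = read_molecule_table(example)
```

Rows come back as dictionaries keyed by header name, so optional columns such as model_name are simply absent when the input does not provide them.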
Stable Molecule Identity
mol_idx is the stable molecule identifier. HEDGEHOG uses it to join stage
outputs, descriptor tables, docking scores, docking-filter results, and report
data.
If you provide mol_idx, keep it unique within the input set. If you omit it,
HEDGEHOG creates IDs in the form LP-0001-00001, scoped by model_name.
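A minimal sketch of how IDs of this shape could be generated follows. It assumes that the first numeric field is a per-model number (in order of first appearance) and the second is a counter within that model; the page only states the LP-0001-00001 form and the model_name scoping, so the exact numbering scheme here is an assumption, and assign_mol_idx is a hypothetical name.

```python
def assign_mol_idx(rows, prefix="LP"):
    """Fill in a missing mol_idx for each row, scoped by model_name.

    Assumed scheme (not confirmed by the source):
    <prefix>-<model number, 4 digits>-<molecule counter, 5 digits>.
    """
    model_numbers = {}  # model_name -> 1-based model number
    counters = {}       # model_name -> molecules numbered so far
    result = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not mutated
        if not row.get("mol_idx"):
            model = row.get("model_name", "")
            number = model_numbers.setdefault(model, len(model_numbers) + 1)
            counters[model] = counters.get(model, 0) + 1
            row["mol_idx"] = f"{prefix}-{number:04d}-{counters[model]:05d}"
        result.append(row)
    return result


def duplicate_ids(rows):
    """Return the mol_idx values that occur more than once."""
    seen, dupes = set(), set()
    for row in rows:
        idx = row["mol_idx"]
        if idx in seen:
            dupes.add(idx)
        seen.add(idx)
    return dupes


table = [
    {"smiles": "CCO", "model_name": "model_a"},
    {"smiles": "CCN", "model_name": "model_a"},
    {"smiles": "c1ccccc1", "model_name": "model_b"},
]
with_ids = assign_mol_idx(table)
```

A uniqueness check such as duplicate_ids is worth running on user-supplied mol_idx values, since every downstream join depends on them.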
Multi-Model Inputs
For model comparisons, use one row per generated molecule and model:
smiles,model_name
CCO,model_a
CCO,model_b
CCN,model_a
CCN,model_b
Deduplication is model-aware: the same SMILES can appear under different
model_name values.
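Model-aware deduplication can be pictured as keying each row on the pair (smiles, model_name) rather than on smiles alone. The sketch below compares raw SMILES strings; whether HEDGEHOG canonicalizes SMILES before comparing is not stated here, so treat that as an open assumption. The name dedupe_model_aware is hypothetical.

```python
def dedupe_model_aware(rows):
    """Drop repeated (smiles, model_name) pairs, keeping the first occurrence.

    Compares raw SMILES strings; no canonicalization is attempted.
    """
    seen = set()
    kept = []
    for row in rows:
        key = (row["smiles"], row.get("model_name", ""))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"smiles": "CCO", "model_name": "model_a"},
    {"smiles": "CCO", "model_name": "model_b"},  # same SMILES, other model: kept
    {"smiles": "CCO", "model_name": "model_a"},  # exact repeat: dropped
]
unique_rows = dedupe_model_aware(rows)
```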
Headerless SMI Files
Simple .smi, .ismi, .cmi, and .smiles files are parsed by extension:
CCO
CCN ligand_2
The first whitespace token is smiles; the optional second token becomes
model_name. CSV/TSV with a smiles header remains the clearest production
format because it makes identity and model labels explicit.
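The token rule for headerless files can be sketched as follows. The helper names are hypothetical, and the case-insensitive extension match and the skipping of blank lines are assumptions rather than documented behavior.

```python
import os

SMI_EXTENSIONS = {".smi", ".ismi", ".cmi", ".smiles"}


def is_headerless_smi(filename):
    """True if the extension marks a headerless SMILES file.

    Matching is case-insensitive here, which is an assumption.
    """
    return os.path.splitext(filename)[1].lower() in SMI_EXTENSIONS


def parse_smi_text(text):
    """Split each line on whitespace: token 1 is smiles, token 2 is model_name."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines (assumed behavior)
        row = {"smiles": tokens[0]}
        if len(tokens) > 1:
            row["model_name"] = tokens[1]
        rows.append(row)
    return rows


parsed = parse_smi_text("CCO\nCCN ligand_2\n")
```

Here the first row has no model_name key at all, while the second carries ligand_2 as its model_name, which shows why a headered CSV is less ambiguous for production runs.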
Stage Outputs
Most filtering stages write a stable filtered_molecules.csv:
smiles,model_name,mol_idx
CCO,model_a,LP-0001-00001
Stages may also write richer detail files such as descriptor metrics, pass flags, synthesis scores, docking SDFs, and per-pose docking-filter metrics.
Zero-Row Outputs
A stage can complete successfully with zero surviving molecules. In that case it should still write a header-only CSV with the expected columns. This lets downstream reporting distinguish a valid empty result from a missing or failed stage output.
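To make the distinction concrete, here is a small sketch of a writer that always emits the header and a reader that separates "empty but valid" from "missing". The function names and status strings are hypothetical, and the column list simply follows the filtered_molecules.csv example on this page.

```python
import csv
import os
import tempfile

COLUMNS = ["smiles", "model_name", "mol_idx"]


def write_stage_output(path, rows, columns=COLUMNS):
    """Write the header even when rows is empty."""
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)


def classify_stage_output(path):
    """Return 'missing', 'invalid', 'empty', or 'ok' for a stage output file."""
    if not os.path.exists(path):
        return "missing"
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader, None)
        if header is None:
            return "invalid"  # zero-byte file: no header was written
        has_rows = any(row for row in reader)
    return "ok" if has_rows else "empty"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "filtered_molecules.csv")
    status_missing = classify_stage_output(path)
    write_stage_output(path, [])
    status_empty = classify_stage_output(path)
    write_stage_output(
        path,
        [{"smiles": "CCO", "model_name": "model_a", "mol_idx": "LP-0001-00001"}],
    )
    status_ok = classify_stage_output(path)
```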
Sampling
sample_size controls how many molecules are sampled before the pipeline starts.
save_sampled_mols controls whether the sampled input table is written to
input/sampled_molecules.csv; it does not disable sampling.
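The interaction between the two settings can be sketched like this: sampling always happens, and save_sampled_mols only decides whether the result is persisted. The function sample_input, its seed and out_dir parameters, the use of random.sample, and the choice to keep all rows when sample_size is unset or exceeds the table size are illustrative assumptions, not HEDGEHOG's documented behavior.

```python
import csv
import os
import random
import tempfile


def sample_input(rows, sample_size, save_sampled_mols, out_dir, seed=0):
    """Sample rows, optionally writing them to input/sampled_molecules.csv."""
    if sample_size is None or sample_size >= len(rows):
        sampled = list(rows)  # assumed: nothing to cut down
    else:
        sampled = random.Random(seed).sample(rows, sample_size)
    if save_sampled_mols:
        path = os.path.join(out_dir, "input", "sampled_molecules.csv")
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", newline="") as fh:
            writer = csv.DictWriter(
                fh, fieldnames=["smiles", "model_name"], extrasaction="ignore"
            )
            writer.writeheader()
            writer.writerows(sampled)
    return sampled


rows = [{"smiles": s, "model_name": "model_a"} for s in ["CCO", "CCN", "CCC", "CCCl"]]
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "input", "sampled_molecules.csv")
    unsaved = sample_input(rows, 2, False, tmp)
    wrote_unsaved = os.path.exists(target)  # sampling ran, nothing written
    saved = sample_input(rows, 2, True, tmp)
    wrote_saved = os.path.exists(target)
```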