Descriptors
The descriptors stage computes physicochemical descriptors for each molecule using RDKit (including an optional MCE-18 complexity score) and can filter molecules whose descriptor values fall outside configurable threshold ranges.
This stage runs twice in the pipeline: once early (initial filtering) and once at the end (recompute descriptors on the surviving molecule set). Filtering behavior is controlled by filter_data in config_descriptors.yml and applies to both runs unless you disable it.
In addition to the generic descriptor borders, the stage can also enforce structural constraints such as typed atom class limits, element count limits, maximum 3/4-membered rings, and maximum acyclic chain length.
Use generic borders when you want to shape broad physicochemical space
(molWt, logP, TPSA, hbd, hba, fsp3, and related ranges). Use
structural_constraints when you need direct control over concrete structural
motifs that can be underconstrained by broad descriptors alone, such as aromatic
carbon load, neutral donor nitrogens, sulfonyl sulfur motifs, strained small
rings, or long non-ring appendages.
Descriptor Reference
The stage computes more columns than it filters by default. The tables below
separate configured default filters from computed-only columns so the docs match
config_descriptors.yml.
Molecular Properties
| Descriptor | Config Key | Description | Default Min | Default Max |
|---|---|---|---|---|
| Allowed Characters | allowed_chars | Set of allowed atom types in SMILES | C,N,S,O,F,Cl,Br,I,P,H | — |
| Number of Atoms | n_atoms | Total atom count including hydrogens | 10 | 100 |
| Heavy Atoms | n_heavy_atoms | Non-hydrogen atom count | 10 | 50 |
| Heteroatoms | n_het_atoms | Count of non-carbon, non-hydrogen atoms | 2 | 15 |
| Nitrogen Atoms | n_N_atoms | Count of nitrogen atoms | 0 | 12 |
| Nitrogen Fraction | fN_atoms | Fraction of heavy atoms that are nitrogen | 0 | 0.22 |
| Nitrogen+Sulfur Fraction | fNS_atoms | Fraction of heavy atoms that are nitrogen or sulfur | 0 | 0.3 |
Chemical Features
| Descriptor | Config Key | Description | Default Min | Default Max |
|---|---|---|---|---|
| Molecular Weight | molWt | Daltons | 200 | 550 |
| logP | logP | Octanol-water partition coefficient (Wildman-Crippen) | -0.4 | 5.6 |
| Sw | sw | Water solubility estimate | -20 | 1 |
Structural
| Descriptor | Config Key | Description | Default Min | Default Max |
|---|---|---|---|---|
| Ring Size | ring_size | List of all ring sizes (each must fall within min-max) | 3 | 12 |
| Number of Rings | n_rings | Total ring count | 0 | 6 |
| Aromatic Rings | n_aroma_rings | Count of aromatic rings | 0 | 5 |
| Small 3/4-Membered Rings | n_small_rings_3_4 | Count of 3- or 4-membered rings | 0 | via structural_constraints.max_small_rings_3_4 |
| Max Acyclic Chain Length | max_acyclic_chain_length | Length of the longest acyclic chain | 0 | via structural_constraints.max_acyclic_chain_length |
| Fused Aromatic Rings | n_fused_aromatic_rings | Count of fused aromatic ring systems | 0 | 2 |
| Rigid Bonds | n_rigid_bonds | Non-rotatable bonds | 0 | 30 |
| Rotatable Bonds | n_rot_bonds | Freely rotatable bonds | 0 | 8 |
| H-Bond Donors | hbd | Hydrogen bond donor count | 0 | 4 |
| H-Bond Acceptors | hba | Hydrogen bond acceptor count | 1 | 9 |
| Fraction sp3 | fsp3 | Fraction of sp3-hybridized carbons | 0.15 | 0.8 |
| Spider Side Chains | has_spider_side_chains | 1 when a molecule has at least two long scaffold appendages | 0 | 0 |
| Fraction Ring System | fraction_ring_system | Fraction of heavy atoms in the Murcko scaffold | 0.25 | 1 |
| TPSA | tpsa | Topological Polar Surface Area (Å²) | 20 | 140 |
Drug-likeness
| Descriptor | Config Key | Description | Default Min | Default Max |
|---|---|---|---|---|
| QED | qed | Quantitative Estimation of Drug-likeness (0-1) | 0.3 | 1 |
| MCE-18 | mce18 | Molecular complexity estimate (J. Med. Chem. 2019) | 20 | 140 |
Computed-Only Columns
Some descriptor columns may be computed for reporting or backward compatibility
but are not active filter keys in the shipped config_descriptors.yml.
| Column | Meaning |
|---|---|
| charged_mol / is_neutral | Boolean neutral-molecule flag; legacy charged_mol=true means neutral |
| clogP | Additional calculated logP-style descriptor |
Structural Constraints
| Constraint | Config Key | Description | Default |
|---|---|---|---|
| Typed Atom Class Limits | structural_constraints.type_limits | Maximum allowed counts for configured aliases such as .=O, Car, Nd+, O_a, SO2, Hal | See YAML |
| Element Limits | structural_constraints.element_limits | Maximum allowed counts for N, O, and S | N=6, O=4, S=1 |
| Max N or O Atoms | structural_constraints.max_n_or_o_atoms | Maximum combined count of nitrogen and oxygen atoms | 10 |
| Max 3/4-Atom Rings | structural_constraints.max_small_rings_3_4 | Maximum allowed count of 3- and 4-membered rings | 0 |
| Max Acyclic Chain Length | structural_constraints.max_acyclic_chain_length | Maximum allowed longest acyclic chain length | 4 |
All structural constraints behave as upper bounds. A molecule passes a constraint only if the computed count is less than or equal to the configured value.
Why These Constraints Exist
- type_limits control specific atom-level motifs that can remain hidden behind acceptable global descriptors. For example, two molecules can have similar TPSA and hba, but very different counts of aromatic carbons, sulfonyl sulfur atoms, or protonated donor nitrogens.
- element_limits and max_n_or_o_atoms put a direct cap on heteroatom load. This is useful when you want to constrain polarity and heteroatom density with a hard ceiling instead of relying only on aggregate properties.
- max_small_rings_3_4 controls strained ring systems. A value of 0 disallows all 3- and 4-membered rings; higher values allow a limited number of such motifs.
- max_acyclic_chain_length controls the longest simple chain in the non-ring heavy-atom graph. It is useful for reducing long linear appendages and excessive flexibility.
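To make the max_acyclic_chain_length idea concrete, here is a minimal sketch of a longest-chain calculation on a plain adjacency list. It is not the stage's actual implementation: the function name, the input format, and the choice to count atoms rather than bonds are all assumptions made for illustration.

```python
from collections import deque


def longest_acyclic_chain(adjacency, ring_atoms):
    """Count atoms on the longest path through non-ring heavy atoms.

    adjacency: dict of heavy-atom index -> list of bonded heavy-atom indices
    ring_atoms: set of atom indices that belong to any ring
    (Hypothetical helper; counting atoms, not bonds, is an assumption.)
    """
    chain_atoms = set(adjacency) - set(ring_atoms)

    def farthest(start):
        # Breadth-first search limited to non-ring atoms.
        # Returns the farthest atom, the path length in atoms, and the atoms seen.
        depth = {start: 1}
        queue = deque([start])
        last = start
        while queue:
            last = queue.popleft()
            for neighbor in adjacency[last]:
                if neighbor in chain_atoms and neighbor not in depth:
                    depth[neighbor] = depth[last] + 1
                    queue.append(neighbor)
        return last, depth[last], set(depth)

    best, visited = 0, set()
    for atom in chain_atoms:
        if atom in visited:
            continue
        # Two sweeps give the diameter of a tree. The non-ring subgraph is a
        # forest, because any cycle would make its atoms ring atoms.
        end, _, component = farthest(atom)
        _, length, _ = farthest(end)
        visited |= component
        best = max(best, length)
    return best


# Butylbenzene-like graph: atoms 0-5 form the ring, atoms 6-9 form the tail.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
         (0, 6), (6, 7), (7, 8), (8, 9)]
adjacency = {}
for a, b in bonds:
    adjacency.setdefault(a, []).append(b)
    adjacency.setdefault(b, []).append(a)

chain_length = longest_acyclic_chain(adjacency, ring_atoms={0, 1, 2, 3, 4, 5})
print(chain_length)  # 4
```

Under this atom-counting assumption, the example tail would pass the default cap of 4 and would fail a tighter cap of 3.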
Typed Atom Alias Glossary
The type_limits map works on the following aliases:
| Alias | Meaning | Typical reason to cap it |
|---|---|---|
| .=O | sp2 acceptor oxygen, typically carbonyl-like | Reduce dense carbonyl chemistry |
| C2r | Non-aromatic ring carbon with sp2 hybridization | Limit unsaturated non-aromatic ring content |
| C3r | Ring carbon with sp3 hybridization | Limit saturated ring carbon load |
| Car | Aromatic carbon atom | Control aromatic density directly |
| Cs2 | Non-ring, non-aromatic sp2 carbon | Limit non-ring unsaturation |
| Cs3 | Non-ring sp3 carbon | Limit long aliphatic appendages |
| Csp | Carbon with sp hybridization | Limit linear or triple-bond motifs |
| Nac | Neutral acceptor nitrogen | Control neutral acceptor-type nitrogens |
| Nd+ | Positively charged donor nitrogen with at least one hydrogen | Limit protonated donor nitrogens |
| Nd0 | Neutral donor nitrogen with at least one hydrogen | Control neutral donor nitrogen abundance |
| O_a | sp3 acceptor oxygen with no hydrogen | Control ether-like oxygen load |
| O_d | sp3 donor oxygen with at least one hydrogen | Control hydroxyl-like donor oxygen load |
| SO2 | Sulfur atom with at least two double-bonded oxygens | Limit sulfonyl-like sulfur motifs |
| Sul | Sulfur atom with total valence 2 | Limit low-valence sulfur motifs |
| Hal | Halogen atom (F, Cl, Br, I) | Cap halogenation level |
How to Choose type_limits
If you are unsure which typed limits to enable or tighten, use this practical sequence:
- Start from the shipped defaults in config_descriptors.yml. They already cover the full supported alias set and give you a balanced baseline for drug-like chemistry.
- Tighten only the motifs that matter for your project goals. Typical first candidates are:
  - Car when you want to reduce aromatic density
  - Cs3 when you want to limit long aliphatic appendages
  - Hal when halogenation is getting too aggressive
  - Nd+ and Nd0 when donor/basic nitrogens are overrepresented
  - SO2 when sulfonyl chemistry should stay rare
- Inspect filtered/descriptors_failed.csv and filtered/pass_flags.csv after a run. If many otherwise acceptable molecules fail on one alias, that alias is a good candidate to relax. If broad descriptors still allow chemotypes you do not want, tighten the relevant alias instead of globally narrowing molWt, TPSA, or hba.
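The inspection step can be sketched as a small tally of failing flags. This is an illustration only: the helper name is hypothetical, and the assumed layout of pass_flags.csv (one row per molecule, boolean columns ending in _pass) is inferred from the flag names mentioned in this page rather than taken from the code.

```python
import csv
import io
from collections import Counter


def count_failures(pass_flags_file):
    """Count, per *_pass column, how many rows carry a false flag."""
    failures = Counter()
    for row in csv.DictReader(pass_flags_file):
        for column, value in row.items():
            is_false = (value or "").strip().lower() in {"false", "0"}
            if column.endswith("_pass") and is_false:
                failures[column] += 1
    return failures


# Tiny in-memory stand-in for filtered/pass_flags.csv (layout is an assumption).
sample = io.StringIO(
    "mol_idx,hba_pass,Car_pass,Hal_pass\n"
    "0,True,False,True\n"
    "1,True,False,True\n"
    "2,True,True,False\n"
)
failures = count_failures(sample)
for column, n_failed in failures.most_common():
    print(column, n_failed)
```

In this sample Car_pass fails most often, which would make Car the first alias to review. To run it on real output, pass an open file handle instead of the in-memory sample.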
As a rule of thumb:
- Use borders for broad property space.
- Use type_limits for concrete motif selection.
- Avoid tightening many aliases at once unless you are intentionally creating a strict medicinal-chemistry profile.
The shipped defaults are:

structural_constraints:
  enabled: true
  type_limits:
    ".=O": 4
    C2r: 6
    C3r: 6
    Car: 12
    Cs2: 6
    Cs3: 8
    Csp: 2
    Nac: 3
    Nd+: 1
    Nd0: 2
    O_a: 4
    O_d: 2
    SO2: 1
    Sul: 1
    Hal: 3

How Structural Constraints Combine with Borders
structural_constraints do not replace borders; they add another filtering
layer.
- borders apply range-based filtering to descriptor columns such as molWt, logP, TPSA, hbd, hba, and n_rings.
- structural_constraints are translated internally into additional *_max checks on derived columns such as n_O_atoms, n_NO_atoms, Car, n_small_rings_3_4, and max_acyclic_chain_length.
- element_limits.N, element_limits.O, and element_limits.S map directly to n_N_atoms, n_O_atoms, and n_S_atoms.
- max_n_or_o_atoms is a separate cap on n_NO_atoms, so it complements per-element limits rather than replacing them.
- type_limits are orthogonal to aggregate atom counts: they constrain subtypes, not just total N, O, S, or carbon counts.
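The mapping described above can be pictured with a short sketch that flattens a structural_constraints block into upper-bound limits. The function name and the exact key format (`<column>_max`) are assumptions for illustration; only the column names come from this page, and the real translation inside the stage may differ.

```python
def flatten_structural_constraints(constraints):
    """Turn a structural_constraints mapping into flat '<column>_max' limits.

    Hypothetical helper: the '<column>_max' key naming is an assumption.
    """
    if not constraints.get("enabled", False):
        return {}
    limits = {}
    # Per-element caps map to n_N_atoms, n_O_atoms, and n_S_atoms.
    for element, cap in constraints.get("element_limits", {}).items():
        limits[f"n_{element}_atoms_max"] = cap
    # Typed aliases are assumed to use the alias itself as the column name.
    for alias, cap in constraints.get("type_limits", {}).items():
        limits[f"{alias}_max"] = cap
    # Scalar caps: config key -> derived column.
    scalar_columns = {
        "max_n_or_o_atoms": "n_NO_atoms",
        "max_small_rings_3_4": "n_small_rings_3_4",
        "max_acyclic_chain_length": "max_acyclic_chain_length",
    }
    for key, column in scalar_columns.items():
        if key in constraints:
            limits[f"{column}_max"] = constraints[key]
    return limits


config = {
    "enabled": True,
    "type_limits": {"Car": 12, "Hal": 3},
    "element_limits": {"N": 6, "O": 4, "S": 1},
    "max_n_or_o_atoms": 10,
    "max_small_rings_3_4": 0,
    "max_acyclic_chain_length": 4,
}
limits = flatten_structural_constraints(config)
print(limits["n_O_atoms_max"], limits["n_NO_atoms_max"], limits["Car_max"])  # 4 10 12
```

The sketch shows why the per-element and combined caps coexist: n_O_atoms and n_NO_atoms are separate columns with separate limits, so a molecule can satisfy one and violate the other.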
How Threshold Filtering Works
Each descriptor has a _min and _max threshold defined in config_descriptors.yml. A molecule passes the descriptor filter only if all of its descriptor values fall within their respective [min, max] ranges.

# Example: only keep molecules with MW 200-550 and logP -0.4 to 5.6
borders:
  molWt_min: 200
  molWt_max: 550
  logP_min: -0.4
  logP_max: 5.6

Molecules that fail any single descriptor threshold are written to failed_molecules.csv with detailed per-descriptor pass/fail flags in pass_flags.csv.
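The range check itself can be sketched in a few lines. This is a simplified illustration for scalar descriptors only, with a hypothetical function name; it assumes inclusive bounds, as the [min, max] notation above suggests, and does not cover list-valued columns such as ring_size.

```python
def check_borders(values, borders):
    """Return {descriptor: passed} for descriptors that have a min or max border."""
    flags = {}
    for key, bound in borders.items():
        # Split 'molWt_min' into the descriptor name and the bound kind.
        name, _, kind = key.rpartition("_")
        if kind not in {"min", "max"} or name not in values:
            continue
        ok = values[name] >= bound if kind == "min" else values[name] <= bound
        # A descriptor passes only if both of its bounds are satisfied.
        flags[name] = flags.get(name, True) and ok
    return flags


borders = {"molWt_min": 200, "molWt_max": 550, "logP_min": -0.4, "logP_max": 5.6}
molecule = {"molWt": 312.4, "logP": 6.1}

flags = check_borders(molecule, borders)
passed = all(flags.values())
print(flags, passed)  # {'molWt': True, 'logP': False} False
```

Here the molecule is within the molWt range but exceeds logP_max, so it fails the stage as a whole.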
Structural constraints are evaluated alongside the border filters. They write
their own count columns and pass flags, so failures can be traced in the same
pass_flags.csv and descriptors_failed.csv outputs.
Pass flags are emitted per checked column. In practice this means you can trace why a molecule failed by inspecting:
- the computed value columns in descriptors_failed.csv
- the corresponding boolean flags in pass_flags.csv
For example, a molecule can pass n_O_atoms_pass but fail n_NO_atoms_pass, or
pass hba_pass but fail Car_pass. The strict generative-design checks also
emit fraction_ring_system_pass and has_spider_side_chains_pass. This is
useful when generic descriptors are acceptable but a structural cap is
intentionally stricter.
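As a rough illustration of this tracing workflow, the sketch below pairs each failed flag of one molecule with its computed value. The helper name is hypothetical, and it assumes both files share a mol_idx column and that each flag column is named `<column>_pass`; check the actual headers before relying on it.

```python
import csv
import io


def explain_failure(mol_idx, flags_file, values_file):
    """Map each failed check of one molecule to its computed value."""
    flags = {row["mol_idx"]: row for row in csv.DictReader(flags_file)}
    values = {row["mol_idx"]: row for row in csv.DictReader(values_file)}
    flag_row, value_row = flags[str(mol_idx)], values[str(mol_idx)]
    failed = {}
    for column, flag in flag_row.items():
        if column.endswith("_pass") and flag.strip().lower() in {"false", "0"}:
            descriptor = column[: -len("_pass")]
            failed[descriptor] = value_row.get(descriptor)
    return failed


# In-memory stand-ins for pass_flags.csv and descriptors_failed.csv.
flags_csv = io.StringIO(
    "mol_idx,n_O_atoms_pass,n_NO_atoms_pass,hba_pass,Car_pass\n"
    "7,True,False,True,False\n"
)
values_csv = io.StringIO(
    "mol_idx,n_O_atoms,n_NO_atoms,hba,Car\n"
    "7,4,11,8,14\n"
)
report = explain_failure(7, flags_csv, values_csv)
print(report)  # {'n_NO_atoms': '11', 'Car': '14'}
```

With the default caps, this molecule is within the oxygen limit (4) and the hba range, yet exceeds max_n_or_o_atoms (10) and the Car limit (12).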
Configuration
The full configuration lives in config_descriptors.yml:

run: true            # Enable/disable this stage
batch_size: 1000     # Molecules processed per batch
filter_data: true    # Apply threshold filtering (false = compute only)
preprocess:
  # Descriptor-level preprocessing (fallback). Prefer MolPrep stage.
  remove_charges: false
  remove_radicals: false
  remove_stereochemistry: false
structural_constraints:
  enabled: true
  # See the configuration reference for the full nested schema and defaults
borders:
  # ... threshold values for each descriptor

Set filter_data: false to compute descriptors without filtering, which is useful for analysis-only runs.
Starter Profiles
These example snippets are not presets in the codebase; they are starting
points for tuning the structural_constraints block.

# Conservative: tighter aromaticity, no small strained rings, short non-ring tails
structural_constraints:
  enabled: true
  type_limits:
    Car: 10
    Cs3: 6
    Hal: 2
    Nd+: 0
    SO2: 0
  element_limits:
    N: 5
    O: 4
    S: 1
  max_n_or_o_atoms: 8
  max_small_rings_3_4: 0
  max_acyclic_chain_length: 3

# Exploration: allow broader chemistry while still bounding extremes
structural_constraints:
  enabled: true
  type_limits:
    Car: 14
    Cs3: 10
    Hal: 4
    Nd+: 1
    SO2: 1
  element_limits:
    N: 7
    O: 5
    S: 2
  max_n_or_o_atoms: 11
  max_small_rings_3_4: 1
  max_acyclic_chain_length: 5

Output Files
| File | Description |
|---|---|
| metrics/descriptors_all.csv | All computed descriptor values for every molecule |
| metrics/skipped_molecules.csv | Molecules skipped during preprocessing or SMILES parsing |
| filtered/filtered_molecules.csv | Molecules passing all descriptor thresholds |
| filtered/failed_molecules.csv | Molecules failing one or more thresholds |
| filtered/descriptors_passed.csv | Detailed descriptor values for passed molecules |
| filtered/descriptors_failed.csv | Detailed descriptor values for failed molecules |
| filtered/pass_flags.csv | Per-descriptor pass/fail boolean flags |
| plots/descriptors_distribution.png | Distribution plots for all descriptors |
failed_molecules.csv is a lightweight list of failed molecule identifiers (SMILES/model_name/mol_idx). descriptors_failed.csv includes the same set of failed molecules but with the computed descriptor values (useful for debugging which borders were violated).
Usage
# Run descriptors as part of the full pipeline
uv run hedgehog
# Run descriptors stage only
uv run hedgehog --stage descriptors
# Short alias
uv run hedge --stage descriptors