
Descriptors

The descriptors stage computes physicochemical descriptors for each molecule using RDKit (including an optional MCE-18 complexity score) and can filter molecules whose descriptor values fall outside configurable threshold ranges.

This stage runs twice in the pipeline: once early, for initial filtering, and once at the end, to recompute descriptors on the surviving molecule set. Filtering behavior is controlled by filter_data in config_descriptors.yml and applies to both runs unless you disable it.

In addition to the generic descriptor borders, the stage can enforce structural constraints: typed atom class limits, element count limits, a maximum number of 3- and 4-membered rings, and a maximum acyclic chain length.

Use generic borders when you want to shape broad physicochemical space (molWt, logP, TPSA, hbd, hba, fsp3, and related ranges). Use structural_constraints when you need direct control over concrete structural motifs that broad descriptors alone can leave underconstrained, such as aromatic carbon load, neutral donor nitrogens, sulfonyl sulfur motifs, strained small rings, or long non-ring appendages.

Descriptor Reference

The stage computes more columns than it filters by default. The tables below separate configured default filters from computed-only columns so the docs match config_descriptors.yml.

Molecular Properties

| Descriptor | Config Key | Description | Default Min | Default Max |
| --- | --- | --- | --- | --- |
| Allowed Characters | allowed_chars | Set of allowed atom types in SMILES | C,N,S,O,F,Cl,Br,I,P,H (allowed set, not a range) | |
| Number of Atoms | n_atoms | Total atom count including hydrogens | 10 | 100 |
| Heavy Atoms | n_heavy_atoms | Non-hydrogen atom count | 10 | 50 |
| Heteroatoms | n_het_atoms | Count of non-carbon, non-hydrogen atoms | 2 | 15 |
| Nitrogen Atoms | n_N_atoms | Count of nitrogen atoms | 0 | 12 |
| Nitrogen Fraction | fN_atoms | Fraction of heavy atoms that are nitrogen | 0 | 0.22 |
| Nitrogen+Sulfur Fraction | fNS_atoms | Fraction of heavy atoms that are nitrogen or sulfur | 0 | 0.3 |

Chemical Features

| Descriptor | Config Key | Description | Default Min | Default Max |
| --- | --- | --- | --- | --- |
| Molecular Weight | molWt | Daltons | 200 | 550 |
| logP | logP | Octanol-water partition coefficient (Wildman-Crippen) | -0.4 | 5.6 |
| Sw | sw | Water solubility estimate | -20 | 1 |

Structural

| Descriptor | Config Key | Description | Default Min | Default Max |
| --- | --- | --- | --- | --- |
| Ring Size | ring_size | List of all ring sizes (each must fall between min and max) | 3 | 12 |
| Number of Rings | n_rings | Total ring count | 0 | 6 |
| Aromatic Rings | n_aroma_rings | Count of aromatic rings | 0 | 5 |
| Small 3/4-Membered Rings | n_small_rings_3_4 | Count of 3- or 4-membered rings | 0 | via structural_constraints.max_small_rings_3_4 |
| Max Acyclic Chain Length | max_acyclic_chain_length | Length of the longest acyclic chain | 0 | via structural_constraints.max_acyclic_chain_length |
| Fused Aromatic Rings | n_fused_aromatic_rings | Count of fused aromatic ring systems | 0 | 2 |
| Rigid Bonds | n_rigid_bonds | Non-rotatable bonds | 0 | 30 |
| Rotatable Bonds | n_rot_bonds | Freely rotatable bonds | 0 | 8 |
| H-Bond Donors | hbd | Hydrogen bond donor count | 0 | 4 |
| H-Bond Acceptors | hba | Hydrogen bond acceptor count | 1 | 9 |
| Fraction sp3 | fsp3 | Fraction of sp3-hybridized carbons | 0.15 | 0.8 |
| Spider Side Chains | has_spider_side_chains | 1 when a molecule has at least two long scaffold appendages | 0 | 0 |
| Fraction Ring System | fraction_ring_system | Fraction of heavy atoms in the Murcko scaffold | 0.25 | 1 |
| TPSA | tpsa | Topological Polar Surface Area (Å²) | 20 | 140 |

Drug-likeness

| Descriptor | Config Key | Description | Default Min | Default Max |
| --- | --- | --- | --- | --- |
| QED | qed | Quantitative Estimation of Drug-likeness (range 0 to 1) | 0.3 | 1 |
| MCE-18 | mce18 | Molecular complexity estimate (J. Med. Chem. 2019) | 20 | 140 |

Computed-Only Columns

Some descriptor columns may be computed for reporting or backward compatibility but are not active filter keys in the shipped config_descriptors.yml.

| Column | Meaning |
| --- | --- |
| charged_mol / is_neutral | Boolean neutral-molecule flag; legacy charged_mol=true means neutral |
| clogP | Additional calculated logP-style descriptor |

Structural Constraints

| Constraint | Config Key | Description | Default |
| --- | --- | --- | --- |
| Typed Atom Class Limits | structural_constraints.type_limits | Maximum allowed counts for configured aliases such as .=O, Car, Nd+, O_a, SO2, Hal | See YAML |
| Element Limits | structural_constraints.element_limits | Maximum allowed counts for N, O, and S | N=6, O=4, S=1 |
| Max N or O Atoms | structural_constraints.max_n_or_o_atoms | Maximum combined count of nitrogen and oxygen atoms | 10 |
| Max 3/4-Atom Rings | structural_constraints.max_small_rings_3_4 | Maximum allowed count of 3- and 4-membered rings | 0 |
| Max Acyclic Chain Length | structural_constraints.max_acyclic_chain_length | Maximum allowed longest acyclic chain length | 4 |

All structural constraints behave as upper bounds. A molecule passes a constraint only if the computed count is less than or equal to the configured value.

Why These Constraints Exist

  • type_limits control specific atom-level motifs that can remain hidden behind acceptable global descriptors. For example, two molecules can have similar TPSA and hba, but very different counts of aromatic carbons, sulfonyl sulfur atoms, or protonated donor nitrogens.
  • element_limits and max_n_or_o_atoms put a direct cap on heteroatom load. This is useful when you want to constrain polarity and heteroatom density with a hard ceiling instead of relying only on aggregate properties.
  • max_small_rings_3_4 controls strained ring systems. A value of 0 disallows all 3- and 4-membered rings; higher values allow a limited number of such motifs.
  • max_acyclic_chain_length controls the longest simple chain in the non-ring heavy-atom graph. It is useful for reducing long linear appendages and excessive flexibility.
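To make the last point concrete, the sketch below computes a longest acyclic chain on a toy molecular graph. It is an illustration only, not the stage's implementation: the function name longest_acyclic_chain, the edge-list input, and the choice to count length in atoms rather than bonds are all assumptions. It relies on the fact that non-ring atoms form a forest (any cycle among them would make them ring atoms), so the longest simple path in each component can be found with two breadth-first searches.

```python
from collections import deque


def longest_acyclic_chain(n_atoms, bonds, ring_atoms):
    """Longest simple path, counted in atoms, through non-ring heavy atoms.

    Hypothetical helper: the real stage may count bonds or treat ring
    attachment points differently.
    """
    chain_atoms = set(range(n_atoms)) - set(ring_atoms)
    adjacency = {atom: [] for atom in chain_atoms}
    for a, b in bonds:
        # Keep only bonds whose two ends are both outside any ring.
        if a in chain_atoms and b in chain_atoms:
            adjacency[a].append(b)
            adjacency[b].append(a)

    def farthest(start):
        # BFS; returns the last atom reached, the number of atoms on the
        # path to it, and every atom in the same connected component.
        depth = {start: 1}
        queue = deque([start])
        last = start
        while queue:
            last = queue.popleft()
            for neighbor in adjacency[last]:
                if neighbor not in depth:
                    depth[neighbor] = depth[last] + 1
                    queue.append(neighbor)
        return last, depth[last], set(depth)

    best = 0
    visited = set()
    for atom in chain_atoms:
        if atom in visited:
            continue
        end, _, component = farthest(atom)   # first sweep finds one end
        _, length, _ = farthest(end)         # second sweep measures the path
        best = max(best, length)
        visited |= component
    return best


# n-butylbenzene as a toy graph: atoms 0-5 form the ring, atoms 6-9 the tail.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
tail = [(0, 6), (6, 7), (7, 8), (8, 9)]
chain_length = longest_acyclic_chain(10, ring + tail, ring_atoms={0, 1, 2, 3, 4, 5})
print(chain_length)  # 4
```

Under this atom-counting convention, the butyl tail sits exactly at the shipped default of 4, while a pentyl tail would exceed it.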

Typed Atom Alias Glossary

The type_limits map works on the following aliases:

| Alias | Meaning | Typical reason to cap it |
| --- | --- | --- |
| .=O | sp2 acceptor oxygen, typically carbonyl-like | Reduce dense carbonyl chemistry |
| C2r | Non-aromatic ring carbon with sp2 hybridization | Limit unsaturated non-aromatic ring content |
| C3r | Ring carbon with sp3 hybridization | Limit saturated ring carbon load |
| Car | Aromatic carbon atom | Control aromatic density directly |
| Cs2 | Non-ring, non-aromatic sp2 carbon | Limit non-ring unsaturation |
| Cs3 | Non-ring sp3 carbon | Limit long aliphatic appendages |
| Csp | Carbon with sp hybridization | Limit linear or triple-bond motifs |
| Nac | Neutral acceptor nitrogen | Control neutral acceptor-type nitrogens |
| Nd+ | Positively charged donor nitrogen with at least one hydrogen | Limit protonated donor nitrogens |
| Nd0 | Neutral donor nitrogen with at least one hydrogen | Control neutral donor nitrogen abundance |
| O_a | sp3 acceptor oxygen with no hydrogen | Control ether-like oxygen load |
| O_d | sp3 donor oxygen with at least one hydrogen | Control hydroxyl-like donor oxygen load |
| SO2 | Sulfur atom with at least two double-bonded oxygens | Limit sulfonyl-like sulfur motifs |
| Sul | Sulfur atom with total valence 2 | Limit low-valence sulfur motifs |
| Hal | Halogen atom (F, Cl, Br, I) | Cap halogenation level |

How to Choose type_limits

If you are unsure which typed limits to enable or tighten, use this practical sequence:

  1. Start from the shipped defaults in config_descriptors.yml. They already cover the full supported alias set and give you a balanced baseline for drug-like chemistry.
  2. Tighten only the motifs that matter for your project goals. Typical first candidates are:
    • Car when you want to reduce aromatic density
    • Cs3 when you want to limit long aliphatic appendages
    • Hal when halogenation is getting too aggressive
    • Nd+ and Nd0 when donor/basic nitrogens are overrepresented
    • SO2 when sulfonyl chemistry should stay rare
  3. Inspect filtered/descriptors_failed.csv and filtered/pass_flags.csv after a run. If many otherwise acceptable molecules fail on one alias, that alias is a good candidate to relax. If broad descriptors still allow chemotypes you do not want, tighten the relevant alias instead of globally narrowing molWt, TPSA, or hba.
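For step 3, one way to spot aliases that reject otherwise acceptable molecules is to count rows where a single alias flag is the only failure. The sketch below assumes each row is a mapping of column name to string, as csv.DictReader would yield from filtered/pass_flags.csv, that flag columns end in _pass, and that booleans are serialized as True/False. The helper name sole_failure_counts is hypothetical.

```python
from collections import Counter


def is_true(value):
    # Assumed serialization: "True"/"False" (or "1"/"0") strings.
    return str(value).strip().lower() in ("true", "1")


def sole_failure_counts(rows, aliases):
    """Count molecules whose only failing flag belongs to one of `aliases`."""
    alias_by_flag = {f"{alias}_pass": alias for alias in aliases}
    counts = Counter()
    for row in rows:
        failed = [col for col, val in row.items()
                  if col.endswith("_pass") and not is_true(val)]
        if len(failed) == 1 and failed[0] in alias_by_flag:
            counts[alias_by_flag[failed[0]]] += 1
    return counts


# Toy rows standing in for pass_flags.csv content.
rows = [
    {"mol_idx": "0", "molWt_pass": "True", "Car_pass": "False", "Hal_pass": "True"},
    {"mol_idx": "1", "molWt_pass": "True", "Car_pass": "False", "Hal_pass": "True"},
    {"mol_idx": "2", "molWt_pass": "False", "Car_pass": "False", "Hal_pass": "True"},
    {"mol_idx": "3", "molWt_pass": "True", "Car_pass": "True", "Hal_pass": "False"},
]
sole = sole_failure_counts(rows, aliases=["Car", "Hal"])
print(sole.most_common())  # [('Car', 2), ('Hal', 1)]
```

Molecule 2 is not counted because it also fails molWt, so relaxing Car alone would not rescue it.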

As a rule of thumb:

  • Use borders for broad property space.
  • Use type_limits for concrete motif selection.
  • Avoid tightening many aliases at once unless you are intentionally creating a strict medicinal-chemistry profile.

The shipped defaults are:

    structural_constraints:
      enabled: true
      type_limits:
        ".=O": 4
        C2r: 6
        C3r: 6
        Car: 12
        Cs2: 6
        Cs3: 8
        Csp: 2
        Nac: 3
        Nd+: 1
        Nd0: 2
        O_a: 4
        O_d: 2
        SO2: 1
        Sul: 1
        Hal: 3

How Structural Constraints Combine with Borders

structural_constraints do not replace borders; they add another filtering layer.

  • borders apply range-based filtering to descriptor columns such as molWt, logP, TPSA, hbd, hba, and n_rings.
  • structural_constraints are translated internally into additional *_max checks on derived columns such as n_O_atoms, n_NO_atoms, Car, n_small_rings_3_4, and max_acyclic_chain_length.
  • element_limits.N, element_limits.O, and element_limits.S map directly to n_N_atoms, n_O_atoms, and n_S_atoms.
  • max_n_or_o_atoms is a separate cap on n_NO_atoms, so it complements per-element limits rather than replacing them.
  • type_limits are orthogonal to aggregate atom counts: they constrain subtypes, not just total N, O, S, or carbon counts.
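The mapping described above can be sketched as a small translation step. This is an assumption-laden illustration rather than the stage's internal code: the function names and the exact <column>_max key format are hypothetical, and missing count columns are treated as zero. The limits used are the Conservative starter profile from this page.

```python
def constraints_to_max_checks(structural_constraints):
    """Flatten a structural_constraints mapping into {"<column>_max": limit}."""
    if not structural_constraints.get("enabled", False):
        return {}
    checks = {}
    # type_limits: each alias is its own derived column, e.g. Car -> Car_max.
    for alias, limit in structural_constraints.get("type_limits", {}).items():
        checks[f"{alias}_max"] = limit
    # element_limits: N, O, S map to n_N_atoms, n_O_atoms, n_S_atoms.
    for element, limit in structural_constraints.get("element_limits", {}).items():
        checks[f"n_{element}_atoms_max"] = limit
    # Scalar caps map to one derived column each.
    scalar_columns = {
        "max_n_or_o_atoms": "n_NO_atoms",
        "max_small_rings_3_4": "n_small_rings_3_4",
        "max_acyclic_chain_length": "max_acyclic_chain_length",
    }
    for key, column in scalar_columns.items():
        if key in structural_constraints:
            checks[f"{column}_max"] = structural_constraints[key]
    return checks


def violations(counts, checks):
    """Columns whose count exceeds its upper bound (missing counts act as 0)."""
    suffix = "_max"
    return [key[: -len(suffix)] for key, limit in checks.items()
            if counts.get(key[: -len(suffix)], 0) > limit]


conservative = {
    "enabled": True,
    "type_limits": {"Car": 10, "Cs3": 6, "Hal": 2, "Nd+": 0, "SO2": 0},
    "element_limits": {"N": 5, "O": 4, "S": 1},
    "max_n_or_o_atoms": 8,
    "max_small_rings_3_4": 0,
    "max_acyclic_chain_length": 3,
}
checks = constraints_to_max_checks(conservative)

# Five nitrogens and four oxygens satisfy both per-element caps,
# but their sum of 9 exceeds max_n_or_o_atoms = 8.
counts = {"n_N_atoms": 5, "n_O_atoms": 4, "n_NO_atoms": 9, "Car": 10}
print(violations(counts, checks))  # ['n_NO_atoms']
```

The example shows why max_n_or_o_atoms is a separate cap: a molecule can stay within every per-element limit and still fail the combined one.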

How Threshold Filtering Works

Each descriptor has a _min and _max threshold defined in config_descriptors.yml. A molecule passes the descriptor filter only if all of its descriptor values fall within their respective [min, max] ranges.

    # Example: only keep molecules with MW 200-550 and logP -0.4 to 5.6
    borders:
      molWt_min: 200
      molWt_max: 550
      logP_min: -0.4
      logP_max: 5.6

Molecules that fail any single descriptor threshold are written to failed_molecules.csv with detailed per-descriptor pass/fail flags in pass_flags.csv.
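The range check can be illustrated with a few lines of Python. This is a minimal sketch assuming borders is a flat mapping of <descriptor>_min and <descriptor>_max keys, as in the example above; the function name descriptor_pass_flags and the choice to skip descriptors with no computed value are assumptions, not the stage's actual behavior.

```python
def descriptor_pass_flags(values, borders):
    """Return {"<descriptor>_pass": bool} for every descriptor with a border."""
    flags = {}
    for key, limit in borders.items():
        name, _, bound = key.rpartition("_")
        if bound not in ("min", "max") or name not in values:
            continue  # not a border key, or descriptor not computed
        if bound == "min":
            ok = values[name] >= limit
        else:
            ok = values[name] <= limit
        flag = f"{name}_pass"
        # A descriptor passes only if both its min and max checks hold.
        flags[flag] = flags.get(flag, True) and ok
    return flags


borders = {"molWt_min": 200, "molWt_max": 550, "logP_min": -0.4, "logP_max": 5.6}
flags = descriptor_pass_flags({"molWt": 312.4, "logP": 6.1}, borders)
print(flags)                 # {'molWt_pass': True, 'logP_pass': False}
print(all(flags.values()))   # False: one failed descriptor rejects the molecule
```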

Structural constraints are evaluated alongside the border filters. They write their own count columns and pass flags, so failures can be traced in the same pass_flags.csv and descriptors_failed.csv outputs.

Pass flags are emitted per checked column. In practice this means you can trace why a molecule failed by inspecting:

  • the computed value columns in descriptors_failed.csv
  • the corresponding boolean flags in pass_flags.csv

For example, a molecule can pass n_O_atoms_pass but fail n_NO_atoms_pass, or pass hba_pass but fail Car_pass. The strict generative-design checks also emit fraction_ring_system_pass and has_spider_side_chains_pass. This is useful when generic descriptors are acceptable but a structural cap is intentionally stricter.
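A short script can turn the flag file into a per-molecule list of failed checks. The sketch below is hedged: it assumes pass_flags.csv has one identifier column (mol_idx is used here, but the real file may use another) plus boolean columns ending in _pass, serialized as True/False. The helper name failed_checks is hypothetical, and an in-memory sample stands in for the real file.

```python
import csv
import io


def failed_checks(handle, id_column="mol_idx"):
    """Map each molecule id to the list of checks it failed."""
    failures = {}
    for row in csv.DictReader(handle):
        failed = [
            col[: -len("_pass")]
            for col, val in row.items()
            if col.endswith("_pass")
            and (val or "").strip().lower() not in ("true", "1")
        ]
        if failed:
            failures[row[id_column]] = failed
    return failures


# Stand-in for open("filtered/pass_flags.csv", newline="").
sample = io.StringIO(
    "mol_idx,hba_pass,Car_pass,n_O_atoms_pass,n_NO_atoms_pass\n"
    "0,True,True,True,True\n"
    "1,True,False,True,False\n"
)
report = failed_checks(sample)
print(report)  # {'1': ['Car', 'n_NO_atoms']}
```

Here molecule 1 passes hba and n_O_atoms but is rejected by the Car and n_NO_atoms caps, which matches the pattern described above.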

Configuration

The full configuration lives in config_descriptors.yml:

    run: true              # Enable/disable this stage
    batch_size: 1000       # Molecules processed per batch
    filter_data: true      # Apply threshold filtering (false = compute only)
    preprocess:            # Descriptor-level preprocessing (fallback). Prefer MolPrep stage.
      remove_charges: false
      remove_radicals: false
      remove_stereochemistry: false
    structural_constraints:
      enabled: true
      # See the configuration reference for the full nested schema and defaults
    borders:
      # ... threshold values for each descriptor

Set filter_data: false to compute descriptors without filtering, which is useful for analysis-only runs.

Starter Profiles

These example snippets are not presets in the codebase; they are starting points for tuning the structural_constraints block.

    # Conservative: tighter aromaticity, no small strained rings, short non-ring tails
    structural_constraints:
      enabled: true
      type_limits:
        Car: 10
        Cs3: 6
        Hal: 2
        Nd+: 0
        SO2: 0
      element_limits:
        N: 5
        O: 4
        S: 1
      max_n_or_o_atoms: 8
      max_small_rings_3_4: 0
      max_acyclic_chain_length: 3

    # Exploration: allow broader chemistry while still bounding extremes
    structural_constraints:
      enabled: true
      type_limits:
        Car: 14
        Cs3: 10
        Hal: 4
        Nd+: 1
        SO2: 1
      element_limits:
        N: 7
        O: 5
        S: 2
      max_n_or_o_atoms: 11
      max_small_rings_3_4: 1
      max_acyclic_chain_length: 5

Output Files

| File | Description |
| --- | --- |
| metrics/descriptors_all.csv | All computed descriptor values for every molecule |
| metrics/skipped_molecules.csv | Molecules skipped during preprocessing or SMILES parsing |
| filtered/filtered_molecules.csv | Molecules passing all descriptor thresholds |
| filtered/failed_molecules.csv | Molecules failing one or more thresholds |
| filtered/descriptors_passed.csv | Detailed descriptor values for passed molecules |
| filtered/descriptors_failed.csv | Detailed descriptor values for failed molecules |
| filtered/pass_flags.csv | Per-descriptor pass/fail boolean flags |
| plots/descriptors_distribution.png | Distribution plots for all descriptors |

failed_molecules.csv is a lightweight list of failed molecule identifiers (SMILES/model_name/mol_idx). descriptors_failed.csv includes the same set of failed molecules but with the computed descriptor values (useful for debugging which borders were violated).

Usage

    # Run descriptors as part of the full pipeline
    uv run hedgehog

    # Run descriptors stage only
    uv run hedgehog --stage descriptors

    # Short alias
    uv run hedge --stage descriptors