When the experiment matters as much as the molecule: multi-modal property prediction

Quantitative structure-property relationship (QSPR) modelling has a quiet framing problem. The standard pipeline — featurise a molecule, regress against a target property — works beautifully when the property in question is genuinely a function of the molecule alone. Molecular weight is. Logarithm of the octanol-water partition coefficient very nearly is.

A lot of the chemical properties we actually care about predicting are not in that category. HPLC retention time is the canonical counter-example: the same compound on a different stationary phase, with a different mobile phase composition, at a different temperature, will retain for a different amount of time. Bulk-fitting retention times from a literature corpus without conditioning on the experimental setup that produced them is one of the most common ways to produce a QSPR model that doesn’t generalise.

mol-modalist is an open-source framework I built to address this directly: a multi-modal property predictor that takes the molecule, its tabular descriptors, and the experimental context as three separate input modalities, fuses them explicitly, and lets you compare a handful of graph backbones against each other under the same fusion. This note is the architectural argument.

Two classes of property

A useful binary distinction up front:

Molecule-only properties. Functions of the structure that don’t meaningfully depend on the experimental conditions you’d measure them under at typical lab scale. Molecular weight, topological polar surface area, theoretical $\log P$, atom counts. The standard single-input QSPR pipeline is correct for these.
Molecule × conditions properties. Functions of the structure and the experimental setup. HPLC retention time, GC retention index, solubility in a specified solvent system at a specified temperature, bioactivity against a specified target, formulation stability under stated storage conditions. The standard pipeline is wrong for these — or rather, it’s predicting a marginal over conditions that may or may not match the deployment setting.

The interesting modelling question is the second class. If your inputs don’t include the conditions, the best your model can do is predict, for each molecule, the mean over whatever conditions happen to appear in the training data, and any downstream prediction inherits the bias of whichever conditions dominate your dataset.
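A toy illustration of that bias, with invented numbers: one compound measured four times, three of them under one setup. A molecule-only model sees a single input with four targets, so under squared error its best prediction is the overall mean, which matches neither setup. The condition labels and retention times here are made up for the example.

```python
from collections import defaultdict

# Hypothetical retention times (minutes) for ONE compound under two setups.
# The values are invented purely to illustrate the argument.
measurements = [
    ("C18, 40% MeCN", 12.0),
    ("C18, 40% MeCN", 12.0),
    ("C18, 40% MeCN", 12.0),
    ("C18, 70% MeCN", 4.0),
]

# Molecule-only model: one input, four targets. The squared-error optimum
# is the mean over all measurements, weighted by how often each setup occurs.
marginal_prediction = sum(rt for _, rt in measurements) / len(measurements)

# Condition-aware model: free to fit one mean per setup.
by_condition = defaultdict(list)
for condition, rt in measurements:
    by_condition[condition].append(rt)
conditional_prediction = {c: sum(v) / len(v) for c, v in by_condition.items()}

print(marginal_prediction)  # 10.0
print(conditional_prediction)
```

The marginal prediction is off by 2 minutes for the dominant setup and by 6 minutes for the rare one, which is the situation a deployment on the rare setup would inherit.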

Architecture: three branches

The framework has three separately-featurised input branches, each with its own encoder:

Branch	Input	Encoder choices
Graph	Molecular structure as nodes + edges with features	GAT, Sparse Graph Transformer, Full Graph Transformer
Tabular	RDKit/Mordred-style descriptors per molecule	MLP
Context	Experimental conditions (column, mobile phase, T, pH …)	MLP

Each branch produces a fixed-dimensional embedding, and a fusion module combines them into a single representation for the property head. The split is deliberate: it lets the network learn that molecular structure dominates some target dimensions, that bulk descriptors carry information the graph branch is missing (charge state at a given pH, for instance), and that the context features modulate the contribution of the first two.

Choosing the graph backbone

Three options are wired up because each one earns its compute in different regimes:

GAT is the right default for small to medium drug-like molecules (~30–50 atoms). Local attention along bonds, cheap, well-understood, strong baseline. If you can’t beat GAT on your task, the rest of the framework is over-engineered.
Sparse Graph Transformer keeps the inductive bias of the bond graph but replaces local message passing with attention computed only along existing edges (plus a small set of virtual edges for long-range information). Useful when the molecule is big enough that pure GAT depth doesn’t reach across the structure.
Full Graph Transformer does global self-attention over all nodes. Best expressiveness, $\mathcal{O}(n^2)$ cost. The honest guidance is: try GAT first, escalate only when you have a concrete failure mode it can’t fix.
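To put rough numbers on the cost gap, here is a small counting sketch. It assumes that edge-restricted attention (GAT, or the sparse transformer before virtual edges) scores each bond in both directions plus a self-pair per atom, and that full attention scores every ordered atom pair. The function name and the self-pair convention are my assumptions for illustration, not the framework’s internals.

```python
def attention_pair_counts(num_atoms, bonds, num_virtual_edges=0):
    """Count the (query, key) pairs each attention pattern would score."""
    directed_bond_pairs = 2 * len(bonds)         # each bond is attended in both directions
    self_pairs = num_atoms                       # assumption: every atom also attends to itself
    virtual_pairs = 2 * num_virtual_edges        # virtual edges are treated like extra bonds
    edge_restricted = directed_bond_pairs + self_pairs + virtual_pairs
    full = num_atoms * num_atoms                 # every atom attends to every atom
    return {"edge_restricted": edge_restricted, "full": full}


# An unbranched 40-atom chain has 39 bonds.
chain_bonds = [(i, i + 1) for i in range(39)]
counts = attention_pair_counts(40, chain_bonds)
print(counts)  # {'edge_restricted': 118, 'full': 1600}
```

At drug-like sizes the gap is about an order of magnitude per layer; it widens linearly with atom count because bond counts grow roughly with $n$ while the full pattern grows with $n^2$.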

The framework’s benchmarking harness compares the three under identical fusion and training settings, which is the right comparison to make: most of the published claims about “graph transformers beating GNNs” can only be trusted once backbone width, depth, and the auxiliary modalities are controlled for.

Tabular descriptors aren’t redundant

A common objection to a tabular branch in the presence of a graph branch is that the descriptors are computable from the molecule, so the graph encoder should learn them itself. In principle, yes. In practice:

Some descriptors are nontrivial to learn from a graph (counts of rotatable bonds, predicted $\log P$ from a calibrated model, formal charge under a specified pH). The graph encoder can learn them, but it costs depth and parameters you might not have.
Tabular features give the optimiser a useful regularisation prior: a property that visibly correlates with $\log P$ will partly route through the tabular branch, leaving the graph branch free to learn the structural residual.
They make the model easier to debug. If removing the tabular branch collapses performance, the graph encoder isn’t picking up what it should.

The context branch is the point

For molecule × conditions properties, this is the branch that earns the multi-modal architecture. The HPLC dataset class makes the wiring explicit:

dataset = HPLCDataset(
    smiles_list=smiles,
    retention_times=rt_values,
    column_types=columns,
    mobile_phase_compositions=mobile_phases,
    flow_rates=flow_rates,
    temperatures=temps,
    ph_values=ph_vals,
)

The context features are a small, structured vector per measurement. Encoded with a modest MLP, they enter the fusion module as a third modality with the same dimensional footprint as the graph and tabular branches. The property head then sees a representation that knows what column the measurement was taken on.
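As a sketch of what such a vector can look like before it reaches the MLP: a one-hot for the column chemistry followed by a few roughly scaled numeric conditions. The `encode_context` helper, the column vocabulary, and the scaling constants are all hypothetical and are not taken from mol-modalist; the framework’s own encoding may differ in layout and width.

```python
# Hypothetical vocabulary of column chemistries, for illustration only.
COLUMN_TYPES = ["C18", "C8", "phenyl-hexyl", "HILIC"]


def encode_context(column_type, organic_fraction, flow_rate_ml_min, temperature_c, ph):
    """Turn one measurement's conditions into a flat numeric vector."""
    if column_type not in COLUMN_TYPES:
        raise ValueError(f"unknown column type: {column_type!r}")
    # Categorical condition: one slot per known column chemistry.
    one_hot = [1.0 if column_type == name else 0.0 for name in COLUMN_TYPES]
    # Numeric conditions, crudely centred and scaled (constants are arbitrary).
    numeric = [
        organic_fraction,                # organic modifier fraction, already in [0, 1]
        flow_rate_ml_min,                # mL/min, typically of order 1
        (temperature_c - 25.0) / 25.0,   # centred on 25 C
        (ph - 7.0) / 7.0,                # centred on neutral pH
    ]
    return one_hot + numeric


vector = encode_context("C18", 0.4, 1.0, 40.0, 3.0)
print(len(vector))  # 8
```

Raising on an unknown column type is a deliberate choice in this sketch: silently mapping a new chemistry to an all-zero one-hot would let the model extrapolate without anyone noticing.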

Fusion strategies

Three fusion modes are exposed because they correspond to different assumptions about how the modalities interact:

Concatenation. Concatenate the three embeddings; let an MLP figure out the interaction. The fewest assumptions, the most parameters, the hardest to interpret. Works well when you have enough data.
Attention-based. Compute learned attention weights over the three modalities for each prediction. Useful when the relative importance of modalities varies meaningfully across the dataset (e.g. some targets care a lot about context, others don’t), and gives you per-prediction attention weights as a debugging surface.
Gated fusion. A gating mechanism that scales each modality’s contribution before combining. Sits between concatenation and attention in flexibility, and tends to be the most robust on small-to-medium datasets where attention can overfit.
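A minimal NumPy sketch of one way to read the three modes, using random stand-ins for the branch embeddings and untrained random weights. It assumes all three embeddings share a width `d`; the function names and the exact parameterisation are mine, not the framework’s `MultiModalFusion` internals.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # shared embedding width, chosen only for illustration

# Rows stand in for the graph, tabular, and context embeddings of one measurement.
embeddings = rng.normal(size=(3, d))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    shifted = np.exp(x - x.max())  # subtract the max for numerical stability
    return shifted / shifted.sum()


def concat_fusion(embs, weight):
    """Concatenate all modalities, then apply one linear map back to width d."""
    return weight @ embs.reshape(-1)              # weight has shape (d, 3 * d)


def attention_fusion(embs, score_vector):
    """Score each modality, softmax the scores, return the weighted sum."""
    weights = softmax(embs @ score_vector)        # shape (3,), sums to 1
    return weights @ embs, weights


def gated_fusion(embs, gate_weights):
    """Scale each modality elementwise by its own sigmoid gate, then sum."""
    gates = sigmoid(np.einsum("mij,mj->mi", gate_weights, embs))  # (3, d), in (0, 1)
    return (gates * embs).sum(axis=0)


fused_concat = concat_fusion(embeddings, rng.normal(size=(d, 3 * d)))
fused_attention, modality_weights = attention_fusion(embeddings, rng.normal(size=d))
fused_gated = gated_fusion(embeddings, rng.normal(size=(3, d, d)))
```

The `modality_weights` returned by the attention variant are the per-prediction debugging surface mentioned above; the gated variant exposes a comparable signal through `gates`, at per-dimension rather than per-modality resolution.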

The honest answer to “which fusion should I use?” is “all three on your dataset, then pick by held-out validation”. The framework lets you swap them with one config line:

model = MultiModalFusion(
    graph_model=graph_model,
    graph_embedding_dim=512,
    tabular_dim=10,
    context_dim=5,
    fusion_strategy="attention",   # or "concat", "gated"
)

Limits worth naming

The framework addresses the “the conditions matter” problem at the input level. It does not magically solve the harder data problems sitting behind real chemical datasets:

Lab and instrument effects. Two labs reporting retention times for the same compound on nominally identical columns will disagree. Conditioning on the conditions doesn’t capture differences in pump pressure stability, dead volume, or sample preparation. A lab_id feature can help but isn’t a substitute for actually controlling for those effects.
Transfer between column chemistries. A model trained on C18 reversed-phase data will not extrapolate to HILIC or normal-phase separations just because the column type is one of the input features. Different physics, different relevant descriptors, often different molecular subspaces represented in training.
Conditional distribution, not point prediction. For molecule × conditions properties the relationship is genuinely noisy: identical conditions still produce a distribution of observed values across replicates. A model that emits a single point estimate is, at best, predicting the mean of that distribution. Calibrated uncertainty estimates are a separate problem from the multi-modal architecture and the right place to spend effort if you want to use these predictions for decision-making.
Scaffold-aware splits, again. All the warnings about random splits overstating generalisation apply here. The framework’s benchmarking harness uses scaffold splits by default; the honest numbers come from there.
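For readers who haven’t written one, a scaffold split is a group-aware split: every molecule sharing a scaffold lands on the same side. The sketch below assumes the scaffold keys have already been computed (for example as Bemis-Murcko scaffold SMILES) and assigns whole groups at random; it is a simple variant for illustration and not necessarily how the framework’s harness orders or balances groups.

```python
import random
from collections import defaultdict


def scaffold_split(scaffold_keys, test_fraction=0.2, seed=0):
    """Split sample indices so that no scaffold appears in both train and test."""
    groups = defaultdict(list)
    for index, key in enumerate(scaffold_keys):
        groups[key].append(index)

    # Shuffle whole groups, then fill the test set one group at a time.
    group_list = list(groups.values())
    random.Random(seed).shuffle(group_list)

    target = test_fraction * len(scaffold_keys)
    train, test = [], []
    for members in group_list:
        if len(test) < target:
            test.extend(members)   # the test set may overshoot the target slightly
        else:
            train.extend(members)
    return sorted(train), sorted(test)


# Hypothetical scaffold labels for ten molecules.
keys = ["benzene", "benzene", "pyridine", "indole", "indole",
        "indole", "none", "benzene", "pyridine", "none"]
train_idx, test_idx = scaffold_split(keys)
```

For molecule × conditions data the same grouping idea extends naturally to holding out whole labs or whole methods, which is the stricter test of the context branch.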

Where this is going

The interesting open questions for a multi-modal QSPR framework, in roughly the order they’re worth working on:

Calibrated uncertainty over (molecule, conditions). A Bayesian or ensemble layer on the property head that gives a credible interval, not a point. This is the single change that would make model outputs usable for method-development decisions rather than just for ranking.
Joint pretraining of the graph branch on a large unlabelled molecular corpus, then fine-tuning with the other two branches on labelled context-specific data. Most QSPR datasets are small; the graph backbone is the right place to absorb structural priors from a bigger corpus.
A genuinely cross-dataset HPLC benchmark. Most published retention-time datasets are single-lab and single-method. Combining them honestly — with a lab/instrument feature and proper cross-source evaluation — is more useful than any individual architectural change.
Per-modality ablation as a first-class output. Reporting attention weights or gating values per prediction, logged by default, so users can see which modality drove a given prediction without rebuilding the framework around it.
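As a starting point for the first item, the cheapest version of an ensemble layer is just a summary over independently trained members. The sketch below uses invented predictions and a normal-theory interval; `ensemble_summary` is a hypothetical helper, and nothing here is calibrated until the interval’s coverage has been checked on held-out data.

```python
import statistics


def ensemble_summary(member_predictions, z=1.96):
    """Mean, spread, and a rough normal-theory interval across ensemble members."""
    mean = statistics.fmean(member_predictions)
    spread = statistics.stdev(member_predictions)  # sample standard deviation
    return {
        "mean": mean,
        "std": spread,
        "interval": (mean - z * spread, mean + z * spread),
    }


# Hypothetical retention-time predictions (minutes) from five ensemble members
# for a single (molecule, conditions) pair.
summary = ensemble_summary([6.8, 7.1, 7.0, 7.4, 6.7])
print(summary)
```

Note that the spread across members reflects model disagreement only. Replicate-to-replicate noise in the measurement itself needs a separate term, for instance a predicted variance per member.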

Multi-modality is not architecturally exciting on its own — it’s three networks and a fusion layer — but it’s the modelling decision that turns “predicting HPLC retention time” from a vaguely-defined regression into a problem with a well-posed conditional. Most of the value is in taking the conditions seriously, not in the choice of fusion module.