The most transformative successes in artificial intelligence share a common pattern: they solved structural problems before tackling quantitative ones.

This document presents a thesis: retrosynthesis—the ability to design valid synthetic pathways to molecules—is the paramount structural challenge in machine learning for chemistry.

Two Classes of Problems

We distinguish between two fundamental classes of scientific problems to which machine learning is applied: quantitative and structural. Quantitative problems involve predicting scalar targets from limited labeled data (drug toxicity, binding affinity, solubility). Structural problems involve generating complex objects governed by an underlying grammar (language modeling, protein folding, image generation).

The most significant AI breakthroughs (GPT-4, AlphaFold) emerged from solving structural problems. Foundation models trained on structural tasks solve quantitative problems as emergent capabilities: GPT-4 achieves state-of-the-art sentiment analysis without being trained on sentiment labels; AlphaFold enables structure-based drug design as a side effect of mastering protein structure prediction.

Mastery of structure is a prerequisite for solving downstream quantitative tasks.

If this principle holds, chemistry's path to foundation models runs through its paramount structural challenge.

Retrosynthesis as the Structural Challenge

What is chemistry's equivalent to language modeling or protein structure prediction? We contend it is retrosynthesis: designing multi-step synthetic pathways to molecules of interest. It meets all the criteria: it requires generating complex objects (synthetic routes), adhering to a grammar (reaction rules, stereochemistry, chemical validity), and drawing on deep chemical understanding (mechanisms, functional group compatibility, protecting group strategies).

A model that masters multi-step retrosynthesis must internalize a rich representation of chemical space. It must learn which transformations are feasible, which sequences are efficient, and which structural motifs require protection or activation. This knowledge is transferable: a model capable of planning complex syntheses should excel at forward prediction, reaction condition optimization, and property prediction—just as GPT excels at sentiment analysis without explicit training.
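
To make "generating complex objects" concrete: a synthetic route is naturally a recursive tree, with molecules as nodes, precursor relationships as edges, and purchasable starting materials as leaves. The minimal sketch below is illustrative only; it is not the output format of any particular model.

from dataclasses import dataclass, field

@dataclass
class RouteNode:
    # One molecule in a retrosynthetic tree (illustrative schema).
    smiles: str                                                # molecule as SMILES
    children: list["RouteNode"] = field(default_factory=list)  # its precursors

    def is_leaf(self) -> bool:
        # Leaves are the starting materials the route terminates in.
        return not self.children

    def depth(self) -> int:
        # Longest sequence of reaction steps beneath this node.
        return 0 if self.is_leaf() else 1 + max(c.depth() for c in self.children)

# A one-step toy route: aspirin from salicylic acid and acetic anhydride.
route = RouteNode(
    smiles="CC(=O)Oc1ccccc1C(=O)O",
    children=[RouteNode(smiles="O=C(O)c1ccccc1O"),
              RouteNode(smiles="CC(=O)OC(C)=O")],
)
print(route.depth())  # -> 1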

Retrosynthesis may be chemistry's pre-training task—the structural problem whose mastery unlocks downstream capabilities.

Synthesis-Aware Virtual Screening

One critical downstream application of retrosynthetic planning is synthesis-aware virtual screening. Modern drug discovery generates millions of candidate molecules, then filters them by synthetic accessibility before forwarding them to medicinal chemists. This workflow demands the transferable chemical knowledge that retrosynthesis mastery provides.

Current practice relies on heuristics like SAScore, RAscore, or SCScore to rank these candidates. These methods are valuable high-throughput filters, but they rest on a precarious assumption: that synthetic accessibility is an intrinsic property of a molecule, akin to molecular weight or LogP.
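
For concreteness, this is how one such scalar is typically obtained, using the SA_Score module that ships in RDKit's Contrib directory (assuming an RDKit installation that includes Contrib). Whatever the heuristic's merits, note that the output is a single, context-free number.

import os
import sys

from rdkit import Chem
from rdkit.Chem import RDConfig

# sascorer lives in RDKit's Contrib directory, not the core package.
sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(sascorer.calculateScore(mol))  # scalar in [1, 10]; lower reads as "easier"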

We posit that synthetic accessibility is not a scalar property but a conditional state. It depends entirely on context: which starting materials are in your stockroom? Do you have access to high-pressure hydrogenation? What is your cost tolerance? A molecule that is "easy" (0.9) for a well-stocked pharma lab may be "impossible" (0.1) for a budget-constrained startup. By collapsing this complexity into a single number, we may be asking a fundamentally ill-posed question.

The path forward requires explicit route generation. To meaningfully assess feasibility, a model must first generate a concrete pathway, which can then be evaluated against local realities: step count, reagent costs, reaction reliability, and stock availability. Perhaps a sufficiently capable model could eventually predict difficulty scores directly (analogous to chain-of-thought reasoning, where the reasoning step is internal), but even this "fast" prediction requires the underlying capacity to articulate the route.
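
A minimal sketch of the resulting shape of the problem, with every name and threshold invented for illustration: feasibility becomes a judgment over a generated route plus a local context, not a property of the molecule alone.

# Illustrative sketch: the same route judged under two different local contexts.

def route_feasible(leaves: set[str], n_steps: int, route_cost: float,
                   stock: set[str], max_steps: int, budget: float) -> bool:
    if not leaves <= stock:        # every starting material must be available
        return False
    if n_steps > max_steps:        # step-count tolerance differs per lab
        return False
    return route_cost <= budget    # and so does cost tolerance

leaves = {"O=C(O)c1ccccc1O", "CC(=O)OC(C)=O"}       # the route's starting materials
print(route_feasible(leaves, 1, 12.0,
                     stock=leaves | {"CCO"}, max_steps=5, budget=100.0))  # True
print(route_feasible(leaves, 1, 12.0,
                     stock={"CCO"}, max_steps=5, budget=100.0))           # False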

A model cannot judge the difficulty of a journey it cannot first articulate.

This is why retrosynthetic capability is foundational: it transforms an ill-defined ranking problem into a transparent planning problem.

Why We Cannot Answer “Which Model Is Best”

If retrosynthesis is the strategic challenge for chemistry AI, a natural question follows: Which of the existing models should we use?

This question is currently unanswerable. Not because the models are similar, but because we lack the infrastructure to compare them. Three fundamental barriers prevent rigorous model comparison.

1. The Babel of Formats: Different tools output fundamentally incompatible data structures (bipartite graphs, precursor maps, nested dictionaries, node-edge lists, linear recipe strings). Comparative analysis requires writing bespoke parsers for every model.

2. Stock Set Chaos: The definition of a "solved" route depends on which molecules are considered available. Stock sets in common use vary by roughly three orders of magnitude:

Curated catalogs (Enamine, MolPort): ~300k compounds
eMolecules made-to-order virtual library: ~230M compounds
Ratio: ~1000×

A model reporting 99% solvability against a 230M made-to-order virtual library and another reporting 30% against a 300k off-the-shelf catalog are incomparable. The metric conflates model capability with stock definition.

3. Validity Blindness: The dominant metric (stock-termination rate, STR) validates only that a route's terminal nodes exist in commercial stock. It provides no guarantee that intermediate transformations are chemically feasible.

For detailed examples of chemically invalid “solved” routes, see Figure 2 in the paper. Interactive versions are available on SynthArena: USPTO-082, USPTO-114, USPTO-169, USPTO-93, USPTO-16, USPTO-181.
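
The second and third barriers are easy to state in code. The toy sketch below computes STR exactly as described above: it checks whether a route's leaves are in stock and nothing else, so the headline number swings with the stock definition, and the intermediate chemistry is never inspected.

# Toy illustration: identical routes, two stock definitions, two headlines.

def stock_terminated(leaves: set[str], stock: set[str]) -> bool:
    return leaves <= stock  # leaf availability is the only thing STR checks

routes = [{"A", "B"}, {"C"}, {"D", "E"}]   # each route reduced to its leaf set
curated = {"A", "B"}                        # stand-in for a ~300k catalog
virtual = {"A", "B", "C", "D", "E"}         # stand-in for a ~230M library

for stock in (curated, virtual):
    solved = sum(stock_terminated(leaves, stock) for leaves in routes)
    print(f"STR = {solved / len(routes):.0%}")  # 33% vs 100%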

The Infrastructure Pattern

Scientific breakthroughs rarely emerge in isolation. They are preceded (often by years) by the construction of shared evaluation infrastructure.

1994
CASP competition for protein structure prediction
→ AlphaFold (2018-2020)
2009
ImageNet provides standardized image classification benchmark
→ ResNet, VGG, Inception (2014-2016)
2018
GLUE/SuperGLUE standardize NLP evaluation tasks
→ BERT, GPT-2, GPT-3 (2018-2020)

Retrosynthesis is pre-ImageNet. The field lacks standardized output formats, consistent benchmark definitions, reproducible evaluation protocols, agreed-upon stock sets, and chemical validity checks. Infrastructure must precede breakthroughs.

RetroCast provides this foundation: a universal translation layer with adapters for 10+ models that cast all outputs into a canonical schema; curated, stratified benchmarks with two evaluation tracks (market-based and reference-based); multi-ground-truth evaluation that rewards valid sub-routes; and cryptographic provenance with SHA256 manifests for computational verifiability. SynthArena provides interactive visualization, stratified metrics with confidence intervals, and a living leaderboard.
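
The sketch below illustrates the adapter idea only; the function names and the toy nested-dict schema are invented here, and RetroCast's actual schema and adapter interfaces are specified in its documentation. The point is that incompatible model outputs can be cast losslessly into one downstream representation.

# Illustrative adapter pattern: two output formats, one canonical form.

def from_precursor_map(target: str, pmap: dict[str, list[str]]) -> dict:
    # Cast a {product: [precursors]} mapping into a canonical nested dict.
    return {"smiles": target,
            "children": [from_precursor_map(p, pmap) for p in pmap.get(target, [])]}

def from_edge_list(target: str, edges: list[tuple[str, str]]) -> dict:
    # Cast a (product, precursor) edge list into the same canonical form.
    pmap: dict[str, list[str]] = {}
    for product, precursor in edges:
        pmap.setdefault(product, []).append(precursor)
    return from_precursor_map(target, pmap)

assert from_precursor_map("T", {"T": ["A", "B"]}) == \
       from_edge_list("T", [("T", "A"), ("T", "B")])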

RetroCast Documentation · RetroCast Source Code · SynthArena Source Code

What Remains Unsolved

Developing metrics that can distinguish chemically sound novel routes from implausible artifacts is perhaps the most important open question in retrosynthesis evaluation. Accuracy of reference route reproduction penalizes discovery of novel pathways; forward model confidence scores and round-trip accuracy inherit predictor biases; expert human annotation does not scale to millions of predictions.
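
To see how a forward predictor's biases leak into round-trip accuracy, consider a toy round-trip check with a stand-in forward model: any proposed step outside the predictor's competence fails the check, whether or not the chemistry is sound.

# Toy round-trip check. A real forward model replaces the lookup table, and
# brings its training-set biases with it.

def forward_model(precursors: frozenset[str]) -> str:
    known = {frozenset({"A", "B"}): "T"}   # reactions the predictor "knows"
    return known.get(precursors, "?")

def round_trip_ok(product: str, precursors: set[str]) -> bool:
    return forward_model(frozenset(precursors)) == product

print(round_trip_ok("T", {"A", "B"}))  # True: the step is reproduced
print(round_trip_ok("T", {"A", "C"}))  # False: unknown to the predictor, which
                                       # may be a model gap, not bad chemistry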

We release the complete, standardized prediction database from 10+ models as a substrate for developing plausibility metrics.

Download the latest database:

curl -fsSL https://files.ischemist.com/syntharena/get-db.sh | bash -s

SynthArena could be extended into active, distributed error annotation: expert chemists flag implausible reaction steps, categorize invalidity types (mass balance, mechanism, stereochemistry), and build curated datasets of chemical bugs. This would transform evaluation from a top-down benchmark into a bottom-up, adversarial process. If there's interest from the community, we are prepared to build this out with authenticated expert access.

Evaluation as a Research Track

In the early stages of a scientific discipline, it is natural for researchers to define both the method and the metric. However, as retrosynthesis transitions from proof-of-concept to practical utility, this coupling becomes a bottleneck. Even assuming universal good faith, the continuous introduction of ad-hoc metrics makes it impossible to distinguish genuine architectural progress from variance in evaluation protocols.

Other fields addressed this through institutional separation. GLUE and SuperGLUE were developed by independent teams; CASP has operated as an independent biennial competition since 1994; ImageNet and COCO challenges were organized by academic labs separate from competing teams.

We propose that retrosynthesis—and computational chemistry more broadly—formally recognize evaluation as a distinct research track responsible for maintaining stable benchmarks, developing consensus metrics, and ensuring that progress claims are measured against community-vetted standards.

This means versioned, frozen evaluation sets; shared data splits with leakage detection; living leaderboards that provide standardized metrics for all submissions; and independent research on plausibility scoring, diversity measures, and cost estimation.
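
As one concrete instance of what leakage detection can look like (a sketch, not a prescribed protocol), canonicalizing every structure with RDKit and intersecting the splits catches molecules that differ only in how their SMILES are written.

from rdkit import Chem

def canonical(smiles_list: list[str]) -> set[str]:
    # Canonicalize so that notation differences cannot hide duplicates.
    mols = (Chem.MolFromSmiles(s) for s in smiles_list)
    return {Chem.MolToSmiles(m) for m in mols if m is not None}

train = ["C(C(=O)O)c1ccccc1", "CCO"]
test  = ["OCC", "c1ccccc1CC(O)=O"]   # same molecules, written differently

leaked = canonical(train) & canonical(test)
print(f"{len(leaked)} leaked structures: {leaked}")  # -> 2 leaked structures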

Progress begins with measurement.