How It Works

SynthArena uses a standardized evaluation framework that ensures fair, apples-to-apples comparisons across different retrosynthesis models.

The Adapter Layer

The Problem: Each retrosynthesis model represents synthesis routes differently. Some use string-based precursor maps, others recursive dictionaries or explicit graph structures. Without a common format, comparing these models fairly would require custom evaluation code for each format, a recipe for inconsistency, bugs, and incomparable metrics.

The Solution: RetroCast introduces an adapter layer that sits between the model and the evaluation pipeline:

Model Output (Native Format) → Adapter (Translation Layer) → Canonical Routes (Standard Schema) → Evaluation (Metrics & Analysis)

What This Means for You:

Model Developers: You don't need to change your model's output format. Just write an adapter (or use an existing one) to translate to the canonical schema; extra data can be preserved in metadata fields. A sketch of what this looks like follows below.

Users & Researchers: All models are evaluated using identical logic, so comparisons across the entire leaderboard are genuinely apples-to-apples, with no hidden biases from format differences.
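
As a concrete illustration, a minimal adapter for a model that emits recursive dictionaries might look like the sketch below. All names here (Route, NestedDictAdapter, the native "precursors" and "score" keys) are hypothetical stand-ins, not RetroCast's actual schema or adapter interface.

```python
# Hypothetical sketch: Route, NestedDictAdapter, and the native
# "precursors"/"score" keys are illustrative, not RetroCast's real API.
from dataclasses import dataclass, field


@dataclass
class Route:
    """Canonical route node: a molecule and the precursors that make it."""
    smiles: str
    children: list["Route"] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)  # model-specific extras survive here


class NestedDictAdapter:
    """Translates one model's recursive-dictionary output into canonical Routes."""

    def to_canonical(self, node: dict) -> Route:
        # Assumed native format: {"smiles": ..., "precursors": [...], "score": ...}
        return Route(
            smiles=node["smiles"],
            children=[self.to_canonical(p) for p in node.get("precursors", [])],
            metadata={"score": node.get("score")},
        )
```

Because every adapter targets the same canonical shape, the evaluation code downstream never needs to know which model produced a route.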

The Data Pipeline

SynthArena processes predictions through a structured, four-stage pipeline that ensures reproducibility and traceability.

Raw → Processed → Scored → Results

1. Raw Data: Model predictions in their native output format. This stage is immutable—files are never modified after generation.

data/2-raw/<model>/<benchmark>/predictions.json.gz

2. Processed Data: Standardized Route objects generated by running the adapter. This stage performs canonicalization, deduplication, and optional sampling.

data/3-processed/<model>/<benchmark>/routes.json.gz

SMILES canonicalization ensures that identical molecules compare equal

Duplicate routes are removed using cryptographic signatures (both steps are sketched after this list)

Optional top-k sampling limits the number of routes kept per target
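
A minimal sketch of those first two steps, reusing the hypothetical Route class from the adapter example and assuming RDKit for canonicalization; the exact signature scheme (here, SHA-256 over a canonical string form of the route tree) is an assumption, not necessarily the one SynthArena uses.

```python
# Illustrative only: the real signature scheme may differ.
import hashlib

from rdkit import Chem


def canonical_smiles(smiles: str) -> str:
    """RDKit canonical form, so identical molecules get identical strings."""
    return Chem.MolToSmiles(Chem.MolFromSmiles(smiles))


def route_signature(route: Route) -> str:
    """SHA-256 over a canonical serialization of the route tree."""
    def serialize(node: Route) -> str:
        children = sorted(serialize(c) for c in node.children)  # order-insensitive
        return canonical_smiles(node.smiles) + "[" + ",".join(children) + "]"
    return hashlib.sha256(serialize(route).encode()).hexdigest()


def dedupe(routes: list[Route]) -> list[Route]:
    """Keep only the first occurrence of each structurally identical route."""
    seen: set[str] = set()
    unique = []
    for route in routes:
        sig = route_signature(route)
        if sig not in seen:
            seen.add(sig)
            unique.append(route)
    return unique
```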

3. Scored Data: Routes annotated with evaluation metrics. Each route is checked against the stock set and ground truth (if available).

data/4-scored/<model>/<benchmark>/<stock>/scores.json.gz

is_solved: every leaf (starting material) is found in the stock set (sketched after this list)

matches_ground_truth: the route matches the reference route, when one exists

Structural properties (length, convergence) are computed
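
For example, the solvability check can be expressed as a walk over the route tree. This again reuses the hypothetical Route class and canonical_smiles helper from the sketches above, and assumes the stock set contains canonical SMILES.

```python
def leaves(route: Route) -> list[str]:
    """Collect the SMILES of all leaf nodes, i.e. the starting materials."""
    if not route.children:
        return [route.smiles]
    return [s for child in route.children for s in leaves(child)]


def is_solved(route: Route, stock: set[str]) -> bool:
    """A route is solved when every starting material is purchasable."""
    return all(canonical_smiles(s) in stock for s in leaves(route))
```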

4. Results: Aggregate statistics with confidence intervals. This is what you see on the leaderboard.

data/5-results/<benchmark>/<model>/report.md

Overall solvability with 95% CI (bootstrap; a sketch of the computation follows this list)

Top-K accuracy (K ∈ {1, 3, 5, 10, ...})

Stratified performance by route length and topology
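
One way to compute such an interval is a percentile bootstrap over per-target solved/unsolved outcomes, as sketched below; whether SynthArena uses exactly this variant, and this resample count, is an assumption.

```python
import random


def bootstrap_ci(solved: list[bool], n_resamples: int = 10_000,
                 alpha: float = 0.05) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for the overall solve rate."""
    n = len(solved)
    # Resample targets with replacement and record each resample's solve rate.
    rates = sorted(
        sum(random.choices(solved, k=n)) / n for _ in range(n_resamples)
    )
    return (rates[int(alpha / 2 * n_resamples)],
            rates[int((1 - alpha / 2) * n_resamples)])
```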

Reproducibility & Verification: Every file generated by the pipeline has a companion manifest that records SHA256 hashes of input/output files, command parameters, timestamp, and software version.
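
For illustration, a manifest writer could look like the following sketch; the field names and on-disk layout are assumptions based on the description above, not SynthArena's actual manifest format.

```python
# Hypothetical manifest writer: field names and layout are assumptions.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file's contents so any later modification is detectable."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_manifest(inputs: list[Path], output: Path,
                   params: dict, version: str) -> None:
    """Write <output>.manifest.json recording what is needed to verify the step."""
    manifest = {
        "inputs": {str(p): sha256_of(p) for p in inputs},
        "output": {str(output): sha256_of(output)},
        "parameters": params,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": version,
    }
    Path(str(output) + ".manifest.json").write_text(json.dumps(manifest, indent=2))
```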