The Evaluation Crisis
The Babel of Formats: AiZynthFinder outputs bipartite graphs; Retro* outputs precursor maps; DirectMultiStep outputs recursive dictionaries. Comparing them requires bespoke parsers for every model.
Inconsistent Stocks: Starting-material definitions span roughly three orders of magnitude, from curated catalogs of 300k molecules to speculative screening libraries of 230M+ compounds, so reported solvability scores are not directly comparable.
Solvability ≠ Validity: Routes marked as "solved" are validated only by endpoint availability (every starting material is in the stock), with no guarantee that the intermediate transformations are chemically feasible.
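To make the three problems above concrete, here is a minimal Python sketch. It assumes a hypothetical precursor-map shape (product SMILES mapped to a list of precursor SMILES) and a hypothetical nested `{"smiles", "children"}` tree; neither is the actual output format of any tool named above, and the molecules and stocks are toy values. It also assumes the map is acyclic. The sketch shows that the same route flips between "unsolved" and "solved" purely as a function of the stock, and that no reaction step is ever inspected.

```python
from typing import Dict, List, Set

# Hypothetical shape for illustration only: product SMILES -> precursor SMILES.
PrecursorMap = Dict[str, List[str]]


def to_tree(target: str, precursors: PrecursorMap) -> dict:
    """Expand a flat (acyclic) precursor map into a nested tree."""
    children = [to_tree(p, precursors) for p in precursors.get(target, [])]
    return {"smiles": target, "children": children}


def leaves(node: dict) -> List[str]:
    """Collect starting materials, i.e. nodes without children."""
    if not node["children"]:
        return [node["smiles"]]
    found: List[str] = []
    for child in node["children"]:
        found.extend(leaves(child))
    return found


def is_solved(node: dict, stock: Set[str]) -> bool:
    """Stock-based solvability: every leaf is in stock. Reactions are not checked."""
    return all(smiles in stock for smiles in leaves(node))


# Toy route: an amide from an acid and an amine; the acid comes from an ester.
route_map: PrecursorMap = {
    "CC(=O)NC": ["CC(=O)O", "CN"],
    "CC(=O)O": ["CC(=O)OC"],
}
tree = to_tree("CC(=O)NC", route_map)

small_stock = {"CN"}              # toy "curated catalog"
large_stock = {"CN", "CC(=O)OC"}  # toy "screening library"

print(leaves(tree))
print(is_solved(tree, small_stock))
print(is_solved(tree, large_stock))
```

The verdict changes with the stock alone, which is why scores reported against different stocks cannot be compared, and why a "solved" label says nothing about whether each step would work at the bench.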
The Solution
RetroCast: A universal translation layer providing adapters for 10+ models (AiZynthFinder, Retro*, ASKCOS, DirectMultiStep, and more), casting all outputs into a canonical schema with cryptographic manifests for reproducibility.
Curated Benchmarks: Stratified evaluation sets that correct PaRoutes' distribution skew. The mkt- series uses commercial stocks (Buyables) for practical utility; the ref- series uses standardized stocks for fair algorithmic comparison.
SynthArena: A platform for side-by-side route comparison with diff overlays, bootstrapped confidence intervals, and a living leaderboard, turning evaluation from a static exercise into an ongoing community process.
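As an illustration of what a bootstrapped confidence interval on a solvability score involves, the following Python sketch applies a plain percentile bootstrap to per-target 0/1 outcomes. The function name `bootstrap_ci`, its parameters, and the outcome data are made up for this example; the platform's actual resampling procedure may differ.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    outcomes: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for the mean of per-target 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample targets with replacement and recompute the solve rate.
        sample = rng.choices(outcomes, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper


# Made-up data: 70 of 100 targets solved.
solved = [1] * 70 + [0] * 30
point, lower, upper = bootstrap_ci(solved)
print(f"solve rate {point:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Reporting the interval alongside the point estimate makes it easier to see when two models' scores on the same benchmark are not distinguishable at the given number of targets.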
Platform Statistics
Models & Predictions
| Metric | Count |
|---|---|
Stock Molecules
| Stock | Molecules |
|---|---|
Benchmark Series
| Benchmark | Targets | Runs |
|---|---|---|