Designing a molecule from scratch is one of chemistry's hardest problems. It's not just about knowing what atoms to connect—it's about knowing the right order of reactions, when to protect sensitive parts of the molecule, and how to avoid dead ends that could ruin months of lab work. Traditionally, that knowledge lives in the heads of experienced chemists. Now, a team at EPFL wants to put it into a language model.
Researchers led by Philippe Schwaller published a paper this week in the journal Matter describing Synthegy, a framework that uses large language models (LLMs) as reasoning engines for chemical synthesis planning. The key insight is subtle but important: rather than asking AI to generate molecules, the team uses AI to evaluate synthesis routes that traditional software already produces.
Here's how it works: a chemist types in a goal in plain English—something like "form the pyrimidine ring in the early stages." Existing retrosynthesis software, which works by breaking target molecules into simpler pieces, then generates dozens or hundreds of possible synthesis routes. Synthegy converts each route into text and hands it to an LLM, which scores every route on how well it matches the chemist's instruction. The best ones float to the top, with written explanations of why.
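The scoring-and-ranking loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Synthegy's actual code: the names `ScoredRoute`, `rank_routes`, and `toy_judge` are hypothetical, the routes are invented, and a simple keyword heuristic stands in for the LLM call so the example runs without an API key.

```python
from dataclasses import dataclass
from typing import Callable

# A judge takes (instruction, route_text) and returns (score, rationale).
# In the real system an LLM plays this role; here it is an assumed interface.
Judge = Callable[[str, str], tuple[float, str]]


@dataclass
class ScoredRoute:
    route_text: str
    score: float
    rationale: str


def rank_routes(instruction: str, routes: list[str], judge: Judge) -> list[ScoredRoute]:
    """Score every route against the instruction and sort best-first."""
    scored = [ScoredRoute(route, *judge(instruction, route)) for route in routes]
    return sorted(scored, key=lambda s: s.score, reverse=True)


def toy_judge(instruction: str, route_text: str) -> tuple[float, str]:
    """Stand-in for the LLM: rewards routes that form the pyrimidine ring early.

    The instruction is accepted but ignored; the keyword is hard-coded.
    """
    steps = route_text.split(" -> ")
    for i, step in enumerate(steps):
        if "pyrimidine" in step:
            score = 10.0 * (1 - i / len(steps))
            return score, f"ring formed at step {i + 1} of {len(steps)}"
    return 0.0, "no pyrimidine ring formation found"


# Invented routes, written in the forward (synthesis) direction.
routes = [
    "amide coupling -> Suzuki coupling -> pyrimidine ring formation",
    "pyrimidine ring formation -> Suzuki coupling -> amide coupling",
    "amide coupling -> Suzuki coupling",
]

ranked = rank_routes("form the pyrimidine ring in the early stages", routes, toy_judge)
for item in ranked:
    print(f"{item.score:5.2f}  {item.route_text}  ({item.rationale})")
```

Swapping `toy_judge` for a function that prompts a real model would leave `rank_routes` unchanged, which is the point of keeping the judge behind a plain function signature.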
"When making tools for chemists, the user interface matters a lot, and previous tools relied on cumbersome filters and rules," said Andres M. Bran, lead author of the study, in a statement from EPFL.
The system was validated in a double-blind study involving 36 independent chemists who reviewed 368 route pairs. Their selections matched Synthegy's choices 71.2% of the time—a figure roughly in line with how often expert chemists agree with each other. Senior researchers, including professors and research scientists, agreed with Synthegy more often than PhD students, suggesting the system captures the same strategic intuitions that come with experience.
The researchers tested several LLMs, including GPT-4o, Claude, Gemini-2.5-pro, and DeepSeek-r1. Gemini-2.5-pro scored highest in the benchmark, while DeepSeek-r1 emerged as a strong open-source alternative that can run locally.
AI has been making inroads in drug discovery for years, but most approaches focus on narrowly trained models for specific tasks. Synthegy is designed to be modular—it can plug into any retrosynthesis engine on the backend and any capable LLM on the reasoning side.
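One way to picture that modularity is as two small interfaces, one for the route generator and one for the judge, with the planning step depending only on the interfaces. The sketch below is an assumption about how such a design could look, not the project's published API; every class and function name is hypothetical, and the two judges are trivial stand-ins for different LLMs.

```python
from typing import Protocol


class RouteGenerator(Protocol):
    """Anything that can propose candidate routes for a target molecule."""
    def generate(self, target: str) -> list[str]: ...


class RouteJudge(Protocol):
    """Anything that can score a route against a plain-language instruction."""
    def score(self, instruction: str, route: str) -> float: ...


class FixedRoutes:
    """Stand-in for a retrosynthesis engine: returns a canned list of routes."""
    def __init__(self, routes: list[str]) -> None:
        self.routes = routes

    def generate(self, target: str) -> list[str]:
        return list(self.routes)


class ShortestFirstJudge:
    """Stand-in judge A: prefers routes with fewer steps."""
    def score(self, instruction: str, route: str) -> float:
        return -float(route.count("->"))


class KeywordJudge:
    """Stand-in judge B: counts instruction words that appear in the route."""
    def score(self, instruction: str, route: str) -> float:
        text = route.lower()
        return float(sum(word in text for word in instruction.lower().split()))


def best_route(target: str, instruction: str,
               generator: RouteGenerator, judge: RouteJudge) -> str:
    """Pick the top route; depends only on the two interfaces above."""
    return max(generator.generate(target), key=lambda r: judge.score(instruction, r))


engine = FixedRoutes([
    "SM1 -> INT1 -> INT2 -> TARGET (Suzuki coupling last)",
    "SM2 -> TARGET (one-step amide coupling)",
])

pick_a = best_route("TARGET", "prefer suzuki", engine, ShortestFirstJudge())
pick_b = best_route("TARGET", "prefer suzuki", engine, KeywordJudge())
print(pick_a)
print(pick_b)
```

The same `best_route` call works with either judge, so replacing the engine or the model is a one-argument change.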
The framework also tackles a second problem: reaction mechanism elucidation. This addresses the question of how a chemical reaction proceeds, that is, which electron movements take place at each step. Synthegy breaks reactions into elementary moves and has the LLM assess each candidate step for chemical plausibility. On simple reactions such as nucleophilic substitutions, the best models achieved near-perfect accuracy.
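The step-by-step assessment can be illustrated with a toy search: at each state, list the candidate elementary moves, ask a scorer how plausible each one is, and follow the best. The sketch below makes several assumptions: the greedy strategy, the `CANDIDATE_MOVES` table, and the plausibility numbers are all invented for illustration, and a lookup table stands in for the LLM. The chemistry is a textbook nucleophilic substitution (hydroxide displacing bromide from bromomethane).

```python
from typing import Callable, Optional

# Hypothetical move table: state -> list of (move description, resulting state).
CANDIDATE_MOVES: dict[str, list[tuple[str, str]]] = {
    "CH3Br + OH-": [
        ("OH- lone pair attacks carbon as Br- leaves", "CH3OH + Br-"),
        ("OH- lone pair attacks bromine", "CH3- + HOBr"),
    ],
}

# Made-up plausibility values standing in for an LLM's judgement.
TOY_PLAUSIBILITY: dict[str, float] = {
    "OH- lone pair attacks carbon as Br- leaves": 0.95,
    "OH- lone pair attacks bromine": 0.05,
}

Scorer = Callable[[str, str], float]


def toy_scorer(state: str, move: str) -> float:
    """Stand-in for the LLM: look up a fixed plausibility for the move."""
    return TOY_PLAUSIBILITY.get(move, 0.0)


def elucidate(start: str, product: str, scorer: Scorer,
              max_steps: int = 10) -> Optional[list[str]]:
    """Greedily follow the most plausible move until the product is reached.

    Returns the list of chosen moves, or None if the product is not reached.
    """
    state, path = start, []
    for _ in range(max_steps):
        if state == product:
            return path
        options = CANDIDATE_MOVES.get(state, [])
        if not options:
            break
        move, next_state = max(options, key=lambda o: scorer(state, o[0]))
        path.append(move)
        state = next_state
    return path if state == product else None


mechanism = elucidate("CH3Br + OH-", "CH3OH + Br-", toy_scorer)
print(mechanism)
```

A greedy walk is the simplest possible strategy; a real system could just as well keep several candidate paths alive at once, which this sketch does not attempt.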
The potential use cases are broad. Drug discovery is the most obvious application, and the same approach applies anywhere chemists need to design new materials or optimize industrial reactions. One practical detail: evaluating 60 candidate routes with Synthegy takes roughly 12 minutes and costs approximately $2–$3 in API fees.
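Those figures translate into a rough per-route budget. The short calculation below uses only the numbers quoted above; the `estimate` helper and its assumption that time and cost scale linearly with the number of routes are illustrative, not something the paper states.

```python
# Figures quoted in the article.
ROUTES = 60
TOTAL_MINUTES = 12
COST_RANGE_USD = (2.0, 3.0)

# Per-route averages derived from the quoted totals.
seconds_per_route = TOTAL_MINUTES * 60 / ROUTES
cost_per_route = tuple(cost / ROUTES for cost in COST_RANGE_USD)


def estimate(n_routes: int) -> tuple[float, float, float]:
    """Return (minutes, low_usd, high_usd), ASSUMING linear scaling."""
    minutes = n_routes * seconds_per_route / 60
    low, high = (n_routes * c for c in cost_per_route)
    return minutes, low, high


print(f"{seconds_per_route:.0f} s per route")
print(f"${cost_per_route[0]:.3f} to ${cost_per_route[1]:.3f} per route")
print(estimate(600))
```

On these numbers each route takes about 12 seconds and costs between roughly 3 and 5 cents to evaluate.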
The paper acknowledges current limitations. LLMs sometimes misread the direction of a reaction in its text representation, leading to incorrect feasibility assessments. Smaller models perform no better than random guessing, and the models struggle to track routes longer than 20 steps coherently. The research team has made the code and benchmarks publicly available.
Why it matters
Synthegy's design separates the reasoning layer (the LLM) from the route-generation layer (existing retrosynthesis software), so improvements to either component can be adopted independently. That structural choice shapes how labs would integrate or update the tool over time.
The 71.2% agreement rate between Synthegy and human chemists is meaningful because it is benchmarked against inter-expert agreement, not a theoretical ideal. That gives researchers a concrete way to interpret the system's reliability relative to human review panels.
The publicly released code and benchmarks allow other research teams to test Synthegy against different retrosynthesis engines or LLMs, which is relevant for labs that operate under data-privacy constraints and may prefer a locally run open-source model such as DeepSeek-r1.