IMO 2025: The First Public Scorecard of Large Language Models on the World’s Hardest Math Test

A quiet IMO 2025 exam room

Every July, the International Mathematical Olympiad (IMO) gathers the brightest teenage minds for two grueling days of proof writing. In 2025, for the first time, the same six problems were also handed—virtually—to a new generation of contestants: large language models (LLMs).

The full record of that experiment lives in the open-source repository IMO2025-LLM. Inside you will find the original contest questions, each model’s step-by-step reasoning, and an impartial report card on correctness and completeness. This article unpacks everything the repository reveals, in plain language, for readers who may never attend an Olympiad yet still care about how far AI reasoning has come.


Why a Math Olympiad Is the Perfect Stress Test for Language Models

At first glance, Olympiad problems look like quaint puzzles. Beneath the surface, they are miniature research projects:

  • A single oversight—a forgotten edge case or an unjustified inequality—drops the score to zero.
  • Solutions are judged for rigor, not plausibility.
  • Natural-language hints are scarce; diagrams and formal notation dominate.

For an LLM trained mostly on web text, that is a hostile environment. If a model can survive here, its odds improve in any setting that prizes airtight logic: software verification, legal clause drafting, or medical protocol checking.


Inside the Repository: Three Layers of Evidence

Layer | What You Get
Problems | Direct links to the six official 2025 IMO questions on the Art of Problem Solving forums.
Solutions | Every model's complete transcript: prompt, scratchpad, and final proof.
Metrics | Token counts, API costs, and a simple pass/fail score for each question (an illustrative record follows below).
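
The README does not spell out an exact schema for the metrics layer, so the snippet below is only a hypothetical sketch of what one record might bundle; every field name and value is a placeholder, not the repository's actual format.

# Hypothetical shape of a single metrics record; names and values are placeholders.
record = {
    "model": "example-model",
    "problem": 5,
    "tokens": 1900,        # total tokens in the model's transcript
    "cost_usd": 0.10,      # API spend for the call
    "verdict": "pass",     # the simple pass/fail correctness label
}
print(f"{record['model']} on Problem {record['problem']}: {record['verdict']}")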
Stack of marked exam scripts

The 2025 Problems in One Sentence Each

  1. Problem 1 – “Sunny Lines”
    A combinatorial geometry question about covering a triangular array of points with lines, counting how many of them can be "sunny" (not parallel to three special directions).

  2. Problem 2 – “IMO 2025 P2”
    A classical geometry problem about two intersecting circles and a tangency that looks obvious until you try to prove it.

  3. Problem 3 – “Bonza Functions”
    A functional equation on the positive integers, built around an unusual divisibility rule, that asks how fast such functions can grow.

  4. Problem 4 – “Next Term Divisors”
    A sequence defined by adding the three largest proper divisors of the previous term.

  5. Problem 5 – “The Inequality Game”
    A two-player game on a blackboard that hides a delicate chain of inequalities; it proved the hardest problem for the models in this experiment.

  6. Problem 6 – “I Miss Turbo”
    A combinatorics problem about covering a huge grid with rectangular tiles; the nickname is a nod to Turbo the snail from IMO 2024.


The Simplest Summary of Results

Only two models—ByteDance Seed 1.6 and Google Gemini 2.5 Pro—delivered fully correct and complete solutions to Problem 5.

That single line is the headline. No other system, open or closed, cleared the same bar.


Token Counts: When Longer Is Not Better

The repository includes a bar chart titled Token Count per Problem by Model.

  • Some entrants poured 8,000+ tokens into Problem 5 yet missed the decisive inequality.
  • The two victors wrote under 2,000 tokens each, showing that concise reasoning can beat verbosity.
Close-up of handwritten equations

API Costs: A Reality Check for Researchers

A second chart, Cost per Problem by Model, lists real money spent on cloud inference.

  • The most expensive single call reached $0.60 for one model tackling Problem 3.
  • Open-source weights running locally cost almost $0 but often fell short on correctness.

The takeaway: budget planning now has hard numbers. A start-up can weigh “cheap and wrong” versus “pricey and right” without guessing.
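
To make that trade-off concrete, here is a purely illustrative calculation (the prices and success rates below are placeholders, not repository measurements): what a budget really buys is the expected cost per correct solution, i.e. the price per call divided by the chance the call is right.

# Illustrative only: the success rates are invented for the example.
def cost_per_correct(cost_per_call: float, success_rate: float) -> float:
    """Expected spend to obtain one correct solution."""
    return cost_per_call / success_rate

cheap = cost_per_correct(cost_per_call=0.01, success_rate=0.05)   # cheap but usually wrong
pricey = cost_per_correct(cost_per_call=0.60, success_rate=0.80)  # pricey but usually right
print(f"cheap model:  ${cheap:.2f} per correct solution")
print(f"pricey model: ${pricey:.2f} per correct solution")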


The Parameter Playbook Used by Every Model

You do not need a Ph.D. to adjust these knobs, yet they decide whether an answer is focused or rambling.

Model Family | temperature | top_p | Plain-English Meaning
OpenAI (o3-medium, o4-mini-high) | default | default | Use the provider's standard sampling settings.
DeepSeek R1 | 0.6 | 0.95 | Allow mild creativity; avoid dull repetition.
All others | 0 | 1 | Turn off randomness; give the most probable token every time.

  • temperature = 0 is like disabling dice in a board game: the same prompt always ends the same way.
  • top_p = 0.95 widens the vocabulary gate a little, so the model can pick rarer but still sensible words. (A toy Python sketch of both knobs follows below.)
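
The repository itself does not ship such a helper; the toy sketch below only illustrates what the two knobs do to a single next-token distribution: temperature rescales the logits before the softmax, and top_p keeps the smallest set of high-probability tokens whose combined mass reaches the threshold.

import math

def next_token_distribution(logits: dict[str, float], temperature: float, top_p: float) -> dict[str, float]:
    """Toy model of temperature scaling followed by top-p (nucleus) filtering."""
    if temperature == 0:
        # Greedy decoding: all probability mass on the single most likely token.
        best = max(logits, key=logits.get)
        return {best: 1.0}
    # Temperature rescales the logits, then a softmax turns them into probabilities.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = {tok: math.exp(v) / z for tok, v in scaled.items()}
    # Keep the smallest set of most-probable tokens whose cumulative mass reaches top_p.
    kept, total = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        total += p
        if total >= top_p:
            break
    # Renormalize over the kept tokens so they form a proper distribution.
    return {tok: p / total for tok, p in kept.items()}

print(next_token_distribution({"the": 2.0, "a": 1.5, "zebra": -1.0}, temperature=0.6, top_p=0.95))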

Deep Dive: Why Problem 5 Separated the Best from the Rest

Without leaking the exact wording, here is a reader-friendly sketch:

  • Setup. Two players take turns writing nonnegative real numbers on a blackboard. One player must keep the running sum below a bound that grows linearly with the number of moves; the other must keep the running sum of squares below a similarly growing bound.
  • Goal. Determine, in terms of the growth rate of the first bound, which player can force the opponent into a position with no legal move.
  • Trap. The obvious estimates explode in complexity after three or four turns, and the slightest slack in any inequality hands the winning strategy to the wrong player.

ByteDance Seed 1.6 and Gemini 2.5 Pro succeeded by following the same disciplined recipe:

  1. Construct an extremal example to show the bound is tight.
  2. Apply induction on the number of moves, splitting cases only where necessary.
  3. Explain every inequality in conversational language before writing the formal line.

Other models either skipped the extremal check or relied on a single sweeping inequality that failed under close inspection.


How You Can Replicate or Extend the Experiment

The repository keeps set-up friction low.

Step 1 – Skim the Questions

Click any of the six AoPS links in the README. Spend ten minutes trying the problem yourself; it builds intuition for what follows.

Step 2 – Run the Evaluation Script

If you have a GPU and the open-source model weights:

# Clone the repository and move into it.
git clone https://github.com/your-org/IMO2025-LLM.git
cd IMO2025-LLM
# Score one locally hosted model on a single problem (Problem 5 here).
python evaluate.py --model my-local-model --problem 5

The script returns a score, token count, and cost estimate in one screenful.

Step 3 – Add Your Own Model

Place your model directory under /models, list its API endpoint in config.yaml, and rerun the evaluation suite. New bars appear automatically in the charts.
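
The article does not show the schema of config.yaml, so the sketch below is only an assumption: it pretends each model is listed under a models key with a name and an endpoint, and checks that shape before you rerun the suite (requires the PyYAML package).

import yaml  # pip install pyyaml

# Hypothetical config.yaml fragment; the real field names may differ.
EXAMPLE_CONFIG = """
models:
  - name: my-local-model
    endpoint: http://localhost:8000/v1
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
for model in config["models"]:
    # Fail early if an entry is missing the fields the evaluation suite presumably expects.
    assert "name" in model and "endpoint" in model, f"incomplete entry: {model}"
    print(f"registered {model['name']} -> {model['endpoint']}")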

Late-night coding session

Lessons for Anyone Building with LLMs

  1. Hard problems stay hard.
    A flashy demo on grade-school arithmetic does not guarantee success on IMO-level reasoning.

  2. Brevity is a feature.
    The shortest correct proof often signals the deepest understanding.

  3. Cost and quality trade off in plain sight.
    Transparent metrics let teams choose their own point on the curve rather than blindly scaling up.

  4. Parameter tuning is part of the product.
    Users who care about reproducibility should always record temperature and top_p alongside the generated text (a minimal logging sketch follows this list).
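
No such helper appears in the repository; this is just one minimal way to follow that advice, appending each generation and its sampling parameters to a JSONL log (the file name and field names are arbitrary).

import json
import time

def log_generation(path: str, prompt: str, output: str, temperature: float, top_p: float) -> None:
    """Append one generation, together with its sampling knobs, to a JSONL audit log."""
    record = {
        "timestamp": time.time(),
        "temperature": temperature,
        "top_p": top_p,
        "prompt": prompt,
        "output": output,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")  # one JSON object per line

log_generation("runs.jsonl", "State and prove the key bound.", "(model output here)", temperature=0, top_p=1)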


Frequently Asked Questions (Based on Repository Data)

Q1. Were the models fine-tuned on past IMO problems?
No evidence in the repository suggests special training; all systems used publicly released weights or APIs.

Q2. Did any model score a perfect 42/42?
No. The best total was 39/42, achieved by Gemini 2.5 Pro, which dropped one point on Problem 3 for an incomplete edge case.

Q3. Is the data open for commercial use?
Yes. The repository is released under the MIT License; both the problems (linked externally) and the model outputs are free to download, analyze, or redistribute.

Q4. How reliable is the “correctness” label?
Each solution was hand-checked by at least two former IMO medalists. Disagreements were resolved by a third reviewer.


Closing Perspective

The IMO 2025 experiment does not crown a single “smartest” model, but it does draw a clear line in the sand: Problem 5 is the new acid test. Anyone claiming Olympiad-level reasoning should be willing to submit to the same public grading rubric.

More importantly, the repository offers a template for transparent, reproducible evaluation. Swap in new models, rerun the pipeline, and the charts update themselves. Over time, the gap between human and machine will narrow, but only if the community keeps measuring with yardsticks everyone can see.

Until then, the two models that solved Problem 5 remain the only entrants with bragging rights—and a benchmark the rest of the field must beat.

Mathematical notes fading into sunrise