DeepSeek’s self-correcting AI model aces tough maths proofs

Credit: Nikolas Kokovlis/NurPhoto via Getty

Chinese artificial intelligence company DeepSeek has released a mathematical reasoning model that can identify and correct its own errors. The model beat the best human score in one of the world’s most prestigious undergraduate maths competitions.

The model, DeepSeekMath-V2, scored 118 out of 120 points on questions from the 2024 William Lowell Putnam Mathematical Competition, beating the top human score of 90. The model also performed at the level of gold-medal winners in the International Mathematical Olympiad (IMO) 2025 and the 2024 China Mathematical Olympiad. The results are described in a preprint (ref. 1) posted on arXiv on 27 November.

“We are at a point where AI is about as good at maths as a smart undergraduate student,” says Kevin Buzzard, a mathematician at Imperial College London. “It is very exciting.”

In February, AlphaGeometry 2, an AI problem solver created by Google DeepMind in London, also achieved gold-level performance in the IMO. The feat was repeated in July by Gemini’s Deep Think, which was also developed by DeepMind.

Reasoning over answers

Early approaches to training large language models for mathematical reasoning focused on the accuracy of final answers, the preprint authors write. But a correct answer does not guarantee correct reasoning: at times, it might simply be the result of a fortunate error. Moreover, an exclusive focus on the end result is of little use in proving mathematical laws or formulae, where the logical reasoning matters more than the final answer.

Tong Xie, a chemist specializing in AI-driven discoveries at UNSW Sydney in Australia, says the researchers behind DeepSeek, as well as those developing Gemini’s Deep Think, have been working on overcoming this problem by rewarding reasoning over the final answer.

DeepSeekMath-V2 introduces self-verifiable mathematical reasoning for the first time. The model includes a verifier trained to evaluate mathematical proofs (which are built on a series of step-by-step deductions), identify logical flaws and assign scores according to how rigorous each proof is. A meta-verification system then checks whether the verifier’s critiques are accurate, reducing the likelihood of hallucinations and improving trustworthiness. These components work with a proof generator that constructs solutions and evaluates its own work, refining its arguments until no further issues can be found.

The design creates a feedback loop: the verifier improves the generator, and as the generator produces more-challenging proofs, these become new training data to strengthen the verifier.
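The generate, verify and refine cycle described above can be illustrated with a toy sketch in Python. This is not DeepSeek’s implementation: in the real system all three components are language models, whereas here a "proof" is just a list of addition claims, and the names `Step`, `verifier`, `meta_verifier`, `generator_refine` and `refine_until_clean` are hypothetical. The toy verifier is deliberately imperfect, so that the meta-verification step has a false critique to filter out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One toy proof step: the claim 'a + b = claimed'."""
    a: int
    b: int
    claimed: int


def verifier(proof):
    """Toy verifier: return the indices of steps it believes are flawed.

    It is deliberately imperfect: besides real arithmetic errors, it
    raises a false alarm on any step whose claimed value exceeds 100.
    """
    flagged = []
    for i, step in enumerate(proof):
        wrong = step.a + step.b != step.claimed
        suspicious = step.claimed > 100  # stand-in for a hallucinated critique
        if wrong or suspicious:
            flagged.append(i)
    return flagged


def meta_verifier(proof, critiques):
    """Toy meta-verifier: keep only the critiques that point at a real flaw."""
    return [i for i in critiques if proof[i].a + proof[i].b != proof[i].claimed]


def generator_refine(proof, confirmed):
    """Toy generator: return a new proof with the confirmed flaws repaired."""
    return [
        Step(s.a, s.b, s.a + s.b) if i in confirmed else s
        for i, s in enumerate(proof)
    ]


def refine_until_clean(proof, max_rounds=5):
    """Repeat verify -> meta-verify -> refine until no confirmed issue remains."""
    for round_no in range(max_rounds):
        confirmed = meta_verifier(proof, verifier(proof))
        if not confirmed:
            return proof, round_no
        proof = generator_refine(proof, confirmed)
    return proof, max_rounds


# A draft with one genuine error (4 + 4 = 9) and one step that only
# looks suspicious to the toy verifier (60 + 50 = 110 is correct).
draft = [Step(2, 3, 5), Step(4, 4, 9), Step(60, 50, 110)]
final_proof, rounds = refine_until_clean(draft)
print(final_proof, rounds)
```

In this sketch the meta-verifier has access to an exact check, which makes the toy verifier redundant. That shortcut exists only to keep the example runnable; the point is the shape of the loop, in which critiques are themselves checked before the generator acts on them.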

The system was able to solve five out of six problems, scoring 83.3%, in the 2025 IMO. It was, however, unable to solve the hardest problems set in 2025 and in past IMOs.

DeepSeekMath-V2 verifies its own reasoning in natural language, within the model itself, Xie says. This reduces human involvement and makes the model more cost-effective and scalable.

Gemini’s Deep Think, by contrast, verifies mathematical reasoning using an external, symbolic language called Lean, and its verification process requires extensive expert input. The method is nearly free of hallucination, but it is computationally expensive and resource-intensive, Xie says.
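For readers unfamiliar with Lean, the fragment below gives a sense of what a formally checked statement looks like. It is a minimal illustration in Lean 4, not taken from the preprint or from any DeepMind system: each statement is accepted only if Lean’s kernel can check the proof mechanically, which is why this style of verification leaves so little room for hallucination. The theorem name `add_comm_example` is made up for this example; `Nat.add_comm` is a lemma from Lean’s core library.

```lean
-- A concrete arithmetic fact, checked by computation.
example : 2 + 2 = 4 := rfl

-- Addition of natural numbers is commutative,
-- proved by appealing to the core lemma Nat.add_comm.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Olympiad-level proofs are far longer, and translating them into this form is part of the expert effort and computational cost that Xie describes.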
