Experts gave AI 10 math problems to solve in a week. OpenAI, researchers and amateurs alike gave it their best shot
By Joseph Howlett, edited by Claire Cameron

The verdict seems to be in: artificial intelligence is not about to replace mathematicians.
That is the immediate takeaway from the “First Proof” challenge, perhaps the most robust test yet of the ability of large language models (LLMs) to perform mathematical research. Set by 11 top mathematicians on February 5, the challenge’s results were released early on Valentine’s Day morning. It’s too early to say with certainty how many of the 10 math problems in the challenge were solved by AIs without human help. But one thing is clear: no LLM managed to solve them all.
The mathematicians behind First Proof posed 10 “lemmas” to the AIs, a mathematical term for minor theorems that point the way to a larger result. Such problems are the stock-in-trade of the working mathematician, the kind of mini-problems that might be assigned to a talented graduate student. The team aimed for problems that would require some originality to solve, not just a mix of standard techniques, according to Mohammed Abouzaid, a professor of mathematics at Stanford University and a member of the First Proof team.
The challenge, while exposing AI’s limitations, also reveals a burgeoning subculture within the mathematics community that is passionate about AI. Online discussion forums and social media accounts dedicated to mathematics have been flooded with purported proofs from top mathematicians and amateurs alike. And it showed how AI startups, including ChatGPT creator OpenAI, are taking on the challenge of teaching an LLM math.
“We did not expect such activity,” explains Abouzaid. “We didn’t expect AI companies to take this seriously and put so much work into it.”
The First Proof team revealed the solutions to the 10 challenges early Saturday and wrote about their own experiences trying to get LLMs to solve the problems. They found that AIs could produce plausible-looking proofs for every problem, but only two were correct: those for the ninth and tenth problems. And a nearly identical proof of the ninth problem turned out to already exist. The first problem was also “contaminated”: a sketch of a proof had been posted on the website of its author, team member and 2014 Fields Medal winner Martin Hairer, but LLMs still failed to fill in the gaps.
The style of proof the LLMs proposed was particularly surprising, Abouzaid says. “The correct solutions I have seen from AI systems have the flavor of 19th century mathematics,” he says. “But we are trying to build 21st century mathematics.”
Outside submissions didn’t seem to fare much better. Many appeared to involve varying degrees of human input, with several apparently the product of week-long dialogues steered by mathematicians. First Proof’s rules, however, prohibit human mathematical input or prompting.
“Once there are humans involved, how can we judge the extent to which there is human and AI?” says Lauren Williams, the Dwight Parker Robinson Professor of Mathematics at Harvard University and one of the mathematicians who created First Proof.
OpenAI released its work on Saturday: the result of a week-long sprint using its latest in-house AI models working with “expert feedback” from human mathematicians. The company’s chief scientist, Jakub Pachocki, said in a social media post that the team believes six of its ten solutions “have a good chance of being correct.” Mathematicians have already pointed out potential holes in at least one of those six.
Regardless of how much human assistance the AIs received, the vast majority of submissions appear to consist of very convincing nonsense. Even before the challenge ended, a number of purported solutions that initially seemed credible were being called into question by experts.
It will take experts days to properly review the submissions. And judging whether a proof is truly “original” is even harder than judging whether it is correct. “Nothing in mathematics is completely unprecedented,” says Daniel Litt, a mathematician at the University of Toronto who was not part of the First Proof team.
“We view this as an experiment. Our goal was to get feedback,” says Abouzaid. The team writes that it plans a second round with stricter controls and that more details will be released on March 14.
For some mathematicians who have followed advances in AI, the mixed results match their expectations. “I expected maybe two or three unambiguously correct solutions from publicly available models,” says Litt. “Ten would have been very surprising to me.”
Yet even getting a few valid solutions to research problems from an AI would probably have been impossible just a few months ago. “I’ve already heard from colleagues that they are in shock,” says Scott Armstrong, a mathematician at Sorbonne University in France. “These tools are going to change math, and it’s happening now.”
But for some who follow AI’s progress closely, the results fell short of a triumph.
“The models seem to have struggled,” says Kevin Barreto, an undergraduate at the University of Cambridge who was not part of the First Proof team and who recently used AI to solve one of the Erdős problems, a collection of challenges posed by the Hungarian mathematician Paul Erdős. “To be honest, yes, I’m a little disappointed.”