Kendra Pierre-Louis: For Scientific American’s Science Quickly, this is Kendra Pierre-Louis, filling in for Rachel Feltman.
In 1997, Deep Blue, a supercomputer built by IBM, did the unexpected: It defeated chess giant Garry Kasparov at his own game, leading to a flurry of headlines about whether Deep Blue was truly intelligent and whether computers could now outperform humans. The answer, at least at the time, was pretty much no.
But now it’s 2026 and we have a growing number of generative AI models that once again lead us to ask, “Can machines outperform us?” To explore this question further, a group of researchers this time is not turning to chess, but to mathematics.
To find out more, I spoke to Joe Howlett, a reporter here at SciAm covering mathematics. Thanks for joining us today, Joe.
Joe Howlett: Thank you for having me.
Pierre-Louis: So you wrote an article about the challenges of AI and mathematics. Before we get into the meat and potatoes of the piece, I have maybe a more basic question for you.
Howlett: Yeah.
Pierre-Louis: For those of us who may have peaked in high school algebra, when you talk about AI and math problems, what kinds of math problems are we actually talking about?
Howlett: That’s actually a lot of what this story is about: the kinds of questions that mathematicians ask and spend their time thinking about don’t really resemble the problems we work on for homework in math class.
Pierre-Louis: Mm-hmm.
Howlett: If you’ve taken a math class recently, you’re used to problems that have answers, right?
Pierre-Louis: Mm-hmm.
Howlett: And the answer is, like, a number…
Pierre-Louis: Yeah.
Howlett: Or something like that. And you turn in your homework, and the teacher can check that number [Laughs] – is it the right number or the wrong number – and give you a grade.
But what mathematical researchers do is try to prove that statements about the mathematical universe are true or false. So what does that mean? For example, you know triangles, squares and basic shapes, but there are…
Pierre-Louis: I graduated from kindergarten, yes. [Laughs.]
Howlett: [Laughs.] That’s right, exactly. That’s about all I did too.
There are far more complex shapes that exist in many dimensions and have strange curvatures that you can’t even imagine in your mind. But mathematicians are able to say things about them, right? Using equations and proofs, they are able to learn more about these objects that we cannot actually see or imagine.
Pierre-Louis: So now that we know what math research looks like – [in one of your pieces] you note that LLMs have had some victories in mathematics, such as Google’s Gemini Deep Think achieving a gold-medal-level score at the International Mathematical Olympiad and AI solving several “Erdős problems.” Why isn’t that enough to demonstrate AI’s mathematical prowess?
Howlett: Yeah, I mean, the problem with most of these benchmarks, as they’re called – and for a lot of reasons AI companies have focused on math as, like, the next way to prove…
Pierre-Louis: Mm-hmm.
Howlett: That LLMs can think, or as a step toward intelligence. But most of these examples, as you said, have more in common with the kind of test questions and homework problems we just talked about. They don’t really resemble…
Pierre-Louis: Mm-hmm.
Howlett: Research mathematics, yeah, which is more about proving statements about that world and exploring it, asking interesting questions.
In a way, all of these accomplishments are very impressive. [Laughs.] It’s crazy that a computer can win gold in math IMO…
Pierre-Louis: Mm-hmm.
Howlett: But that doesn’t say much about whether and to what extent a computer can advance mathematics, alone or even with the help of a human.
Pierre-Louis: A bit like the difference between a very good calculator and a mathematician.
Howlett: Exactly! Yeah. Throughout the history of mathematics, new tools have been invented time and again that have been useful to mathematicians and made things go faster. And one of the big questions here [is]: Is this just another one of those tools, or will this fundamentally revolutionize the way math is done, to a degree we’ve never seen before? It’s a little too early to tell.
Pierre-Louis: And one of the ways people seem to be trying to figure out whether AI is just a giant calculator or whether it can really advance math is this First Proof challenge, organized by a group of 11 mathematicians. Can you explain what the challenge was?
Howlett: Yeah, so these mathematicians – who are luminaries in their respective fields, and they cover a wide range of mathematical subfields – wanted to fix this situation where we don’t really have a good idea of how effective AI is at posing and solving real research math problems.
All of them have had this anecdotal experience where LLMs have gotten much better in just the last few months at interrogating mathematical questions much like a mathematician would and at coming up with proofs and proof methods that seem to hold true in certain situations. But then they also hallucinate a lot and come up with a lot of very confident nonsense.
So these mathematicians – who, by the way, don’t work for AI companies…
Pierre-Louis: Mm-hmm.
Howlett: They decided to get together and pose real research questions from their own mathematical work. Each of them has papers that contain proofs, and each took a small piece of one. The way mathematicians build proofs is to break them into smaller statements, right? So if you wanted to prove that seven is greater than three, you could first prove that seven is greater than five and then prove that five is greater than three. That’s sort of how mathematicians work, and these little intermediate proofs are called “lemmas.”
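[Editor’s note: the lemma decomposition Howlett describes can be written out as a tiny formal proof. The following is an illustrative sketch in Lean 4, assuming only the core library’s `Nat.lt_trans`; it is not taken from the First Proof challenge itself.]

```lean
-- Two small lemmas, each proved on its own…
theorem seven_gt_five : 7 > 5 := by decide
theorem five_gt_three : 5 > 3 := by decide

-- …then combined into the main statement via transitivity of <.
-- (7 > 3 unfolds to 3 < 7, so we chain 3 < 5 and 5 < 7.)
theorem seven_gt_three : 7 > 3 :=
  Nat.lt_trans five_gt_three seven_gt_five
```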
What these mathematicians did was each take a lemma they had proved as part of a larger proof in an upcoming paper and pose it as a problem for an LLM – all before posting that paper anywhere online, so it wouldn’t be in the LLMs’ training data, right?
Pierre-Louis: Mm-hmm.
Howlett: Because any math problem I might ask an LLM has probably already been asked, and an answer probably already exists on the Internet. So these are truly cutting-edge research questions, and if an LLM could solve them, it would be able to contribute substantially to the practice of mathematics.
Pierre-Louis: So what were the first results of this challenge?
Howlett: Yeah, so in this first round different AI companies tried their hand at the problems, using their best models and the many mathematicians they have on staff – and we can’t really see the process they used. In some cases we can’t see their full transcripts with the chatbots.
Pierre-Louis: Mm-hmm.
Howlett: We don’t know to what extent they consulted human mathematicians.
And as First Proof team member Lauren Williams told me, once humans are involved in the process, it becomes very difficult to tell what the humans are doing and what the AI is doing. So originally the team really wanted it to boil down to: you ask an AI the question and see if it answers it.
So before the challenge they did just that with publicly available chatbots, and the chatbots were able to answer two of the 10 questions – which is impressive but also shows, to some extent, that this is a genuinely difficult challenge we’re giving the AIs.
And then there’s this little corner of the Internet that only I pay attention to that went really crazy trying to solve these problems. There’s a growing online community of mathematicians and math enthusiasts – people who may not be research mathematicians – trying to use LLMs to do pure mathematics. That community really tried its hand at these problems and produced a lot of proofs, posted on social media and Discord servers.
The First Proof team posted the questions, uploaded the answers in encrypted form and told the community they would decrypt them in a week. So the world had a week to try to answer as many questions as possible. And this online community went wild trying, and produced lots of proofs. From my reporting, a lot of them were clearly nonsense – the mathematicians I talked to said, “Yeah, most of these proofs are garbage.” But some of them were promising.
So OpenAI initially claimed to have solutions to six of the problems. Soon after, a mathematician found a flaw in one of them, which left five. The others seem to have held up, so OpenAI appears to have gotten five right with its undisclosed process. Google Gemini also published its results: it got six out of 10, and some of those were different from OpenAI’s.
And a couple of the questions – questions nine and 10 – were also answered by the active online community and some research mathematicians who were trying their hand with AI. So other people produced those answers, too.
There are a few things that struck me about these results. One is the huge gap between what people can do with publicly available models and the internal efforts of these giant companies, right? Getting one or two right versus six is a big difference.
The other thing is that people aren’t using just one LLM; they use what’s called a “scaffold.” So they’ll have one LLM, and then a bunch of other LLMs will systematically interrogate its answers and go back and forth with it, right? That’s allowed – there’s no human in the loop – but it’s a group of AIs all talking to one another. And it seems to be a way to boost these LLMs’ performance: they’re much better at sorting out some of the nonsense and producing real proofs.
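[Editor’s note: the multi-LLM “scaffold” setup described above can be sketched in code. The Python below is a toy illustration of a propose/critique/revise loop; the model functions are simple rule-based stand-ins, since the actual systems and their APIs are not public.]

```python
# Toy sketch of a "scaffold": one model proposes an answer, other models
# critique it and a reviser patches it, looping with no human involved.
# All three functions are hypothetical stand-ins, not real LLM calls.

def propose(question: str) -> str:
    """Stand-in for the 'prover' model: returns a candidate answer."""
    return f"candidate proof for: {question}"

def critique(answer: str) -> list[str]:
    """Stand-in for the 'critic' models: returns a list of objections."""
    return [] if "revised" in answer else ["step 2 is unjustified"]

def revise(answer: str, objections: list[str]) -> str:
    """Stand-in for the reviser: patches the answer given objections."""
    return answer + " (revised to address: " + "; ".join(objections) + ")"

def scaffold(question: str, max_rounds: int = 5) -> str:
    """Propose once, then alternate critique and revision until the
    critics raise no objections (or the round budget runs out)."""
    answer = propose(question)
    for _ in range(max_rounds):
        objections = critique(answer)
        if not objections:
            break  # the critics are satisfied
        answer = revise(answer, objections)
    return answer

print(scaffold("is 7 > 3?"))
```

The point of the loop structure is the one Howlett describes: the extra models act as filters, so confident nonsense from the first model gets challenged before it is returned as a final answer.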
Pierre-Louis: There was a quote in [one of the pieces] that I found interesting: it said that when the LLMs got answers right, they were using almost 19th-century-style mathematics. And I was wondering about that quote and, like, what 19th-century mathematics means.
Howlett: Yes, this is a really important point. AI seems, at least for now, to do math a little differently and in a way that’s a little less impressive to at least some mathematicians. In many cases, the AI will produce a proof that reaches the same conclusion as the mathematician’s proof…
Pierre-Louis: Mm-hmm.
Howlett: It was deciphered that Friday, but it does it in a much more roundabout, roundabout way and with a lot of brute force, in a way that is not as aesthetically pleasing to mathematicians.
Sometimes mathematicians, when they describe what they do, sound more like artists than scientists, right? They really like to have what they call a “beautiful” proof – something where, when you read it, you really understand why the statement at the end has to be true.
Pierre-Louis: Mm-hmm.
Howlett: And AI tends to produce these proofs where every step makes sense and you get to the end and you see the statement, so you believe it, but you don’t see the whole picture. And maybe the AI never saw the whole picture.
Pierre-Louis: Where do you think it goes from here?
Howlett: One of the researchers, Mohammed Abouzaid, made that point about 19th-century mathematics because when mathematicians prove something, they often do it by coming up with a new mathematical concept that distills the truth and is easier to work with than anything that existed before.
Pierre-Louis: Mm-hmm.
Howlett: So some abstract object, like a tesseract. AIs don’t seem to prefer doing that. They’re very happy to work with existing tools and just put them together in new, MacGyver-y ways, but it’s not clear whether that will lead to new discoveries. Often the tools mathematicians invent on the way to a proof give them a deeper understanding of the mathematical universe and lead to more results. So, at this point at least, it’s unclear whether AI is capable of that kind of creative mathematical style.
But there are counterexamples: there’s at least one proof, on one of the servers where people are discussing these results, that several mathematicians looked at and called not only correct but genuinely beautiful – it accomplished the proof in a way they never would have thought of.
So it’s not clear that this is always the case for AI. Maybe it just needs to keep getting better.
Pierre-Louis: It’s interesting and a little scary, I think. [Laughs.]
Howlett: [Laughs.] The next round will tell us a lot more. The First Proof team is working with AI companies to establish controls on how they answer the questions.
Pierre-Louis: Mm-hmm.
Howlett: So whatever answers we get, we won’t have to take them with such a grain of salt. That will really tell us where the models stand and whether these internal systems are actually that much better than the publicly available ones. And because we now have this system of iterated rounds, we can watch LLMs evolve over time.
So where does this leave us? I don’t know. Some mathematicians will tell you that math will never be the same, that AI will solve some of the biggest problems in math in the next few years. And some mathematicians I spoke to were even convinced…
Pierre-Louis: Mm-hmm.
Howlett: By this first round of First Proof that that timeline is moving faster than they previously thought.
Pierre-Louis: What I hear is that [The] Terminator was a documentary.
Howlett: [Laughs.] Yeah, about the future, I guess. Yeah.
Pierre-Louis: [Laughs.]
Howlett: Many mathematicians will also tell you that AI will never be able to do what humans do in mathematics – direct curiosity in new directions – and that the best it can be is a tool that mathematicians use, just like a calculator.
I struggle not to be disappointed when I imagine a future in which AI solves the big math problems. Like, isn’t part of the excitement that humans are the problem solvers? But several mathematicians pushed back on that idea for me.
Pierre-Louis: Mm-hmm.
Howlett: They’ll say no, they just want to know things about the mathematical universe. They don’t care whether an AI tells them or they figure it out themselves.
One mathematician used this thought experiment from a [Jorge Luis] Borges story, “The Library of Babel.” He said: imagine a world in which we had access to every mathematical truth – a giant library containing every proof you could possibly have. His point was that any mathematician he knows would be thrilled to be in that library and would immediately get to work trying to figure things out. The work wouldn’t go away; this could be an exciting time for mathematicians.
For me, it’s hard to imagine a future in which I don’t have the human side of the story. Certainly, when I report on a great mathematical proof…
Pierre-Louis: Mm-hmm.
Howlett: It will be less exciting if I don’t hear about the person who was stuck at their desk late at night, struggling with a problem, banging their head against the wall until they found that moment of insight. And also the collaboration – the stories of mathematicians coming together at conferences and having that key conversation over coffee that leads to a fundamental breakthrough. So I hope humans stay in the loop. [Laughs.]
Pierre-Louis: Me too, for what it’s worth.
Howlett: [Laughs.]
Pierre-Louis: Thank you very much for taking the time to speak with us today.
Howlett: Thank you so much for having me, Kendra.
Pierre-Louis: That’s all for today! See you Friday, when we explore the science of pain.
Science Quickly is produced by me, Kendra Pierre-Louis, along with Fonda Mwangi, Sushmita Pathak and Jeff DelViscio. This episode was edited by Alex Sugiura. Shayna Posses and Aaron Shattuck fact-check our show. Our theme music was composed by Dominic Smith. Subscribe to Scientific American for more up-to-date and in-depth science news.
For Scientific American, this is Kendra Pierre-Louis. See you next time!