As AI continues to improve, mathematicians struggle to predict their own future


In artificial intelligence companies’ ongoing campaign to conquer pure mathematics, a new round is beginning.

The team behind First Proof, an effort to assess how well large language models (LLMs) can contribute to research-level mathematics, has announced its next round. For this second round, which it plans to roll out over the coming months, the team is demanding access and transparency from any AI company that wishes to participate.

This occurs against a backdrop of radical change in mathematics research. In just the last few months, the best publicly available models have begun to generate valid proofs of minor theorems that are actually useful to working mathematicians. For some experts, the first round of First Proof was a pivotal moment in this ongoing story.




“We were very impressed with the performance of the AI models,” says Lauren Williams, a Harvard University mathematician and member of the First Proof team. “The problems we proposed are really at the forefront of what AI models, perhaps in collaboration with experts, can solve.”

First Proof was born from its 11-person team’s eye-opening, if sometimes frustrating, experiences with AI. No pre-existing benchmark seemed sufficient to test LLMs as a mathematician’s assistant. In principle, an LLM could save time by proving smaller “lemmas” – intermediate propositions on a mathematician’s path to developing larger, more interesting theorems. In practice, however, these AI assists have tended to go awry.

So for their initial “experimental” test, the First Proof team chose 10 lemmas from papers members had written but not yet published, then set a one-week deadline for AI companies (and anyone else) to try to prove these propositions using their favorite models.

Groups from OpenAI and Google have published their LLMs’ answers to all the problems. Five of the OpenAI model’s proofs appeared correct, and Google DeepMind’s agent Aletheia seems to have gotten six (though experts are not unanimous on the validity of every one of these proofs). Comparing the two models’ performance, Williams was surprised to find that each solved several problems the other could not. “It’s interesting to see that their abilities are different,” she says.

“The performance was better than I expected,” says Daniel Litt, a mathematician at the University of Toronto who is not directly involved in the First Proof effort. In total, no fewer than eight out of ten problems appear to have been at least partially solved by AI. “It’s clear that capabilities have improved very quickly,” says Litt.

A future unclear but full of hope

Litt isn’t afraid of AI’s growing mathematical prowess. “I don’t expect that in five years I’ll be useless,” he says. “In fact, I expect to do the best work I’ve ever done because I’ll have these incredible tools.” Indeed, the results of First Proof inspired him to write an essay, which has circulated widely among mathematicians in recent weeks. It presents a speculative, optimistic view of the field’s AI-infused future.

For the sake of argument, Litt imagines a hypothetical library generated by superintelligent AIs and containing all possible proofs in the mathematical universe. A mere human mathematician wandering among its innumerable shelves could browse every volume but could not create any new proofs.

But that doesn’t mean mathematicians would be paralyzed by boredom, Litt says. Far from it. “They would be incredibly excited and get to work right away,” he wrote in the essay. The mathematical universe is so vast, he says, that the joy lies in exploring it, whether reading and digesting a proof or writing a new one. “My job wouldn’t even change at all,” he says. “The job now is to try to figure things out.”

Even if all mathematicians agreed with Litt’s decidedly utopian vision of this thought experiment, the current situation falls far short of this lofty ideal, as evidenced by the first round of First Proof. “Together, the models solved maybe eight of the problems,” he says. “But they also produced thousands and thousands of pages of garbage.”

It turns out that current AIs are often wrong but convincing. They will cite a result from the literature but claim it is stronger than it is. Or they’ll bury a crucial error deep in a tedious calculation, where it’s easy to miss. “Students make mistakes, but they’re definitely not trying to make mistakes,” Litt says. “Models aren’t very honest.”

This qualitative difference in the types of errors LLMs produce can make their responses very difficult to evaluate. “One of the things we learned from this first round is how difficult it can be to verify the accuracy of the results,” says Mohammed Abouzaid, a member of the First Proof team and a mathematician at Stanford University. “You almost want to say, ‘No human who knows what all these words mean would make this mistake!’”

For the second round, the team plans to assign the evaluation of each submission to mathematicians hired as anonymous referees, funded by a combination of grants and donations from AI companies. But with no sign of the models’ mathematical push slowing, a deluge of subtly false proofs written by LLMs could soon overwhelm the available human expertise. “People need to start thinking about this,” Litt says. “Our institutions and the profession are not adapting to what is coming.”

An unexplained gap

The first round seemingly revealed a glaring gap between public and private efforts. This would seem to challenge the idea that AI’s appropriation of human skills would also democratize them, for example by expanding the number of people able to contribute meaningfully to the progress of mathematics.

In the team’s internal testing before releasing the first round’s 10 lemmas, even the best publicly available models were able to prove only two. During the week-long testing period, various groups of amateur and professional mathematicians attempted to do better by building “scaffolds,” collaborative networks of LLMs that talk to one another to catch errors. But all these efforts solved only one more problem.

Several factors could explain why Google and OpenAI managed to solve (at least partially) eight problems compared with the public’s three. The companies could be using improved, unreleased versions of their LLMs or more robust internal scaffolding. Or the answers could rely on undisclosed contributions from human mathematicians. (The Google team published an explanation of its methodology, saying the approach involved “absolutely no human intervention,” the kind of claim that First Proof’s new requirements would let it verify in the second round.)

That’s what the second round is supposed to solve, Williams says. “This was an experiment,” she says, “to get community feedback to determine how to run a more formal cycle.”

In addition to more robust human judgment, this round will require participants to package models so that the First Proof team can prompt them directly. “If it’s not a public model, then we have to run it,” says Abouzaid, “because otherwise it’s not clear what we’re testing.”

It remains to be seen whether OpenAI and Google will comply, or whether the many other LLM companies and math-focused AI start-ups that were conspicuously absent from the first round will join.

In the months to come, First Proof and other AI benchmarks could help predict the still-unclear fate of mathematics, a small niche of the scientific world toward which some of the deepest pockets on Earth are suddenly turning.

“One of our main motivations is to be able to tell young people what the field will look like in a few years,” explains Abouzaid. “And that requires understanding what these systems are actually capable of.”
