Teaching AI advanced mathematical reasoning

November 3, 2022

The scientific community has long regarded mathematical theorem proving as a crucial step in building intelligent machines. Demonstrating that a particular conjecture is true or false requires using symbolic reasoning and navigating an infinite number of possible choices — skills that stretch the abilities of even the most advanced AI systems today.

We are excited to share a significant advance in the field of AI and mathematics. Meta AI has built a neural theorem prover that has solved 10 International Math Olympiad (IMO) problems — 5x more than any previous AI system. Our AI model also improves upon the current state of the art by 20 percent on miniF2F, a widely used mathematics benchmark, and by 10 percent on the Metamath benchmark.

Our method, HyperTree Proof Search (HTPS), is trained on a dataset of successful mathematical proofs and then learns to generalize to new, very different kinds of problems. It was able to deduce a correct proof for an IMO problem that involved some arithmetic reduction to a finite number of cases.

We’ve detailed our work in a new research paper on HTPS to be presented at NeurIPS 2022, and we’ve made our model available through the Lean Visual Studio Code (VSCode) plugin, which will allow other researchers to explore the capabilities of our AI model within the popular Lean environment. We hope that others can build on our work as we have built on previous research and that collectively we can continue to make rapid progress in this exciting field.

Something Went Wrong
We're having trouble playing this video.

The challenge of International Math Olympiad problems

Experts have long believed that creating a system that competes at the highest level at the IMO is a grand challenge for the field of AI. The IMO is the world’s premier high school mathematics competition. Since 1959, students have competed to solve challenging problems in algebra, combinatorics, number theory, and geometry. To do well, competitors need both creativity and advanced reasoning skills. Some questions are so difficult that the majority of students score zero points on them.

In many ways, theorem proving is more challenging than building AI to play board games like chess. When trying to prove a theorem, the action space of possible moves is not just large but infinite. And in chess or Go, exploring a chain of possible moves can be useful even if that decision tree ultimately doesn’t lead to the best possible move. In theorem proving, a dead end is just a dead end, and the computational effort the solver used is simply wasted effort.

We have been able to solve the following problem:

Let a and b be natural numbers both prime with 7, and such that 7 is also prime with a + b, if we assume that 77 divides (a + b)7 - a7 - b7, we need to show that a + b is at least 19.

The proof involves using the binomial formula and then checking different cases. Although it is clear from a human perspective that this approach will work in this particular case, solving a problem like this by exhaustive checking is a rarely used, rarely effective strategy.

The model proceeds by contraposition and simplifies the equation:

contrapose h₄,
simp only [nat.dvd_iff_mod_eq_zero, nat.add_zero] at *,
norm_num [nat.mod_eq_of_lt, mul_comm, nat.add_mod] at h₄,

And then checks the different cases.

This problem shown is not the most difficult that IMO competitors face; in fact, it might be one of the most approachable. But what makes a problem like this particularly difficult from an AI perspective is that it requires a somewhat unusual approach compared to proofs found in standard training data, such as the formal mathematical corpus Mathlib. And because the problem space is infinite, a brute force search is not feasible. The solver – whether person or machine – must rely on the kind of creative reasoning that has been hard for AI to “grasp.”

This is in large part because previous approaches have usually relied only on language models. While these systems can produce very impressive results, they can sometimes lack the capacity to explore different approaches. This skill is critical to solve challenging mathematical problems that may require a bit of creativity.

Getting closer to human reasoning

Mathematical reasoning is difficult to codify and even more difficult to quantify. Current approaches in AI have focused on building machines that can solve a problem “at once” by generating a complete solution to the problem in a single step. But this is not how people tackle these challenges. We use intuition, break a complex problem into component parts, and look for ways to make incremental progress.

To mimic a more human approach, we needed the neural theorem prover to associate a particular “state” with our current (incomplete) understanding of the problem. We first started with a reinforcement learning approach that was tightly coupled with existing proving assistants such as Lean.

Using proving assistants makes this approach possible as they implement a step-by-step reasoning mechanism. We can then interpret the “current state” of the (incomplete) proof as a node in a graph and each new step as an edge. Using this approach, we can leverage techniques that have proved tremendously efficient for two-player games such as Go or chess.

We then needed a way to evaluate the quality of a proof state — similar to how a chess-playing AI one needs to evaluate a particular position in the game. To do this, we used an approach inspired by a Monte Carlo tree search (MCTS) in which the model cycles between two tasks: 1) the prior estimation of reasonable arguments to use in a given proof state, and 2) the outcome of the proof after a certain number of arguments are given.

The HTPS is a variation on the standard MCTS approach, where, to explore a graph, one leverages prior knowledge about the graph to choose a set of leaves to expand, then refines the initial knowledge via a backup correction. The graph is progressively explored, and knowledge about the graph structure gets refined with iterations.

This makes it possible to use an online-training procedure that considerably enhances the performance of the initial pretrained model on a specific type of problem, namely those similar to the ones used in the IMO.

The result is that our approach was able to solve 10 unseen IMO problems and achieves 67 percent accuracy on the Minif2f validation set accuracy — a full 20 percent better than the current published state of the art.

Real world impact, from software verification to aerospace

Building machines capable of solving advanced math problems will lead to real-world impact, notably in the field of software verification. Today, many companies (including Meta) use formal proofs to verify software. In fact, the tools and formal systems used to verify software and to prove theorems are the same. The main difference lies in what type of data the models are trained on: either a dataset of functions or mathematical theorems. Beyond software verification, there are a number of industrial applications, especially as complexity continues to increase and automation permeates critical tasks. Some examples include cryptography and aerospace, where operating conditions can vary and testing and simulation are critical.

To learn more about how AI is disrupting mathematics, please join us at the Formal Languages, AI and Mathematics (FLAIM) workshop in Paris on November 3–4, where there will be speakers from Meta AI, DeepMind, OpenAI, and many others. You can also read our paper on HTPS, which will be presented at NeurIPS 2022, and interact with our model through the Lean Visual Studio Code (VSCode) plugin.

Suggestions from the plugin to prove a basic lemma about group theory: Here, the suggestion rw inv_mul_self at key (in red) is a correct step in the proof (in yellow).

We’d like to acknowledge the contributions of a broad team to this work: Guillaume Lample, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Thibaut Lavril, Aurelien Rodriguez, Hugo Touvron, Fabian Gloeckle, Albert Jiang, Yongyi Hu, Amaury Hayat (CERMICS École des Ponts ParisTech), Gabriel Ebner (Vrije Universiteit Amsterdam), Alexander Miller.

Written By

Aurelien Rodriguez

Software Engineering Manager

Joe Spisak

Product Director

Faisal Azhar

Technical Program Manager