An Introduction to Deep Learning for Automatic Theorem Proving

Pure mathematics is finally approaching the day when computers can massively boost its productivity. In recent years, the field of automatic theorem proving (ATP) has produced a remarkable body of work through the introduction of deep learning and reinforcement learning, especially with the rapid progress of large language models (LLMs). Although we are still some way from having computers prove big propositions like the Riemann hypothesis on their own, we can already talk concretely about how automatic theorem proving will help human mathematicians do their job. This article is written as a tribute to @Peng Keyao's influential "Introduction to Computer-Aided Proof". The formal mathematical languages introduced there are also the theoretical background of this article: modern ATP systems are built on top of formal languages, surfacing as tactics in Lean or hammers in Isabelle, so readers who have not read that article are advised to start there.

This article focuses on current progress in automatic theorem proving with deep learning and reinforcement learning (abbreviated DL+ATP below, though note that RL is just as essential). DL+ATP is developing rapidly, and I don't want this article to go out of date, so I will not cover too much technical detail; a separate, continuously updated review will track the field. The purpose of this article is to present ideas and implications, clarify misconceptions, and answer doubts, with the aim of introducing this new, vibrant, and significant field to enthusiasts and researchers in mathematics, computing, and beyond.

This article is organized around the following questions (though the answers are intertwined, so the table of contents is not arranged by them):

  1. Why theorems can be proved automatically;

  2. How DL+ATP works;

  3. What achievements have been made in the field of automatic proof;

  4. What mathematicians hope to gain from automatic proof;

  5. What effects automatic theorem proving will have;

  6. How beginners can get started with DL+ATP.

Here is a taste of the results so far, as of this article's writing on July 1, 2023. The most representative and influential DL+ATP work, HyperTree Proof Search for Neural Theorem Proving (2022.5.23; see the Meta blog and the paper, abbreviated HTPS below), automatically solved 10 International Mathematical Olympiad (IMO) problems. One of them:

(imo_1964_p1_2) 7 does not divide $2^n+1$ for any natural number $n$.

The formal statement of this problem in Lean, and the automatic proof found by the AI, are as follows:

import data.nat.prime
theorem imo_1964_p1_2 (n : ℕ) :
  ¬ 7 ∣ (2 ^ n + 1) :=
begin
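  -- reduce divisibility to remainders: the goal becomes ¬ (2 ^ n + 1) % 7 = 0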
  rw nat.dvd_iff_mod_eq_zero,
  rewrite [nat.add_mod, nat.mod_eq_of_lt],
  obviously,
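  -- remaining goal: 2 ^ n % 7 + 1 < 7, proved below by strong induction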
  apply nat.strong_induction_on n,
  induction n,
  {
    intros n IH,
    cases n,
    norm_num,
    cases n,
    norm_num,
    rw [nat.succ_eq_add_one, pow_succ],
    rw [nat.succ_eq_add_one, pow_succ],
    induction n,
    norm_num,
    rw [nat.succ_eq_add_one, pow_succ],
    norm_num [nat.mul_mod, ←mul_assoc],
    contrapose! IH,
    refine ⟨n_n, nat.lt_succ_iff.mpr _, IH⟩,
    exact nat.le_succ_of_le (nat.le_succ _),
  },
  exact n_ih,
end

This automatically generated proof (after a lot of simplification) translates into natural language roughly as follows:

The original problem is equivalent to: the remainder of $2^n+1$ on division by 7 is not equal to 0. Since (2 ^ n + 1) % 7 = (2 ^ n % 7 + 1 % 7) % 7, it suffices that 2 ^ n % 7 + 1 is not equal to 0, which is obvious, and that 2 ^ n % 7 + 1 is less than 7, so that taking the outer remainder changes nothing. The latter proposition is proved by strong induction:

∀ (n : ℕ), (∀ (m : ℕ), m < n → 2 ^ m % 7 + 1 < 7) → 2 ^ n % 7 + 1 < 7

1) ∀ (m : ℕ), m < 0 → 2 ^ m % 7 + 1 < 7 ⊢ 2 ^ 0 % 7 + 1 < 7 : obviously.

2) ∀ (m : ℕ), m < n + 1 → 2 ^ m % 7 + 1 < 7 ⊢ 2 ^ (n + 1) % 7 + 1 < 7, again by induction:

   1.0) ∀ (m : ℕ), m < 1 → 2 ^ m % 7 + 1 < 7 ⊢ 2 ^ 1 % 7 + 1 < 7 : obviously.

   1.1) ∀ (m : ℕ), m < n + 1 + 1 → 2 ^ m % 7 + 1 < 7 ⊢ 2 ^ (n + 1 + 1) % 7 + 1 < 7, again by induction:

        1.1.0) ∀ (m : ℕ), m < 1 + 1 → 2 ^ m % 7 + 1 < 7 ⊢ 2 ^ (1 + 1) % 7 + 1 < 7 : obviously.

        1.1.1) ∀ (m : ℕ), m < n + 1 + 1 + 1 → 2 ^ m % 7 + 1 < 7 ⊢ 2 ^ (n + 1 + 1 + 1) % 7 + 1 < 7. At this point 2 ^ (n + 1 + 1 + 1) % 7 + 1 = 8 * 2 ^ n % 7 + 1 = (7 + 1) * 2 ^ n % 7 + 1 = 2 ^ n % 7 + 1, and the claim follows from the strong induction hypothesis: obviously.

If we look closely, we find that this extremely brute-force proof, with its strongly mechanical flavor, in fact realizes in disguise the "first observe that..." flash of insight a human proof would open with (mainly in the last step). It should be pointed out that this proof was not generated by a fixed symbolic algorithm such as SAT solving or Wu's method; it was genuinely searched out in the proof tree by a process resembling how human mathematicians think (the algorithmic ideas are introduced briefly below). However, although the principle of the algorithm is close to human thinking, the resulting proof really does not read like human writing: it is tedious and inelegant, and mathematicians may feel their jobs are still secure after reading it. But the result should also make us reflect: we may indeed have arrived at the moment when AI theorem proving is worth discussing. The "observation" needed for this problem is admittedly simple, something a lightly trained student can master, but is the supposedly more profound "art" of theorem proving really an unattainable light of human wisdom that AI must never touch?

Note: You may have heard of LeanDojo (a work much hyped in the AI media), but the paper does not show it to be more powerful than HyperTree. We will come back to it later.

How Automatic Proving Is Possible

Automatic theorem proving is one of the earliest and most traditional problems of artificial intelligence. As far back as 1954, Martin Davis wrote the first theorem-proving program, which proved that "the sum of two even numbers is even"; the Logic Theorist program presented by Allen Newell and Herbert Simon at the legendary 1956 Dartmouth Conference proved more than half of the theorems in Russell and Whitehead's Principia Mathematica, some even more concisely than the authors' own proofs. Thus the symbolist school was founded. The influence of these early works on computer science is everywhere: formal verification, expert systems, and knowledge graphs are their descendants, and the study of SAT even opened up the discipline of computational complexity. The pinnacle of the symbolist line is the Vampire prover, which is still updated today and continues to play a role in areas such as formal verification. However, measured against the original goal of automatically proving mathematical theorems, the symbolist route is inevitably limited by combinatorial explosion, and so for a long time it could not be applied in the practice of mathematicians. In the statistical learning era, support vector machines and similar models were used to construct heuristics for proof search, with poor results. In recent years, with deep learning brought into automatic theorem proving, we finally have hope of developing heuristics good enough to overcome the combinatorial explosion, and even of achieving theorem-proving algorithms close to the way humans think.

Before we look at how DL+ATP is possible, let's look at how proving is structured as a computational task. There is no need to engage with the sequent calculus of proof theory, and we can ignore concrete calculus rules such as resolution; at the most abstract level of task structure, a proof can be viewed as a search tree. By analogy with the game of Go, where each move offers a number of points to play and some paths eventually lead to victory, theorem proving can be seen as follows: in each proposition state (in proof theory and in ITP systems, a state is a context plus a goal), a number of theorems or proof techniques can be applied to the current state, turning it into a next state, and eventually some paths lead to a completed proof. We will use an example to show how search trees and the simplest heuristics work: the proof that sqrt(2) is irrational. Readers familiar with how ITP systems like Lean work may find this example easier to follow.

Given this proposition, we have several options. For example, we can try induction, only to find that the proposition contains nothing to induct on, so induction cannot proceed. So maybe we try proof by contradiction? We find that contradiction does change the state of the problem: after unfolding the definition (that is, rewriting by the definition in Lean) we get a new proposition, "there exist coprime m, n with sqrt(2) = m/n", and the goal becomes deriving a contradiction. Then there are many choices: we could apply "add 1 to both sides of the equation", or "multiply both sides by 2", or "square both sides". If we do not know the answer in advance, there are in fact infinitely many choices, that is, the search tree is infinitely wide. (In Lean, there are many applicable tactics and many theorems we could rw with.) Here we already see the "combinatorial explosion": an automatic proving algorithm cannot try every theorem, so heuristics are needed to help select the next branch to search. For example, one can define a "proposition complexity" by the number of symbols the proposition contains, and try to keep this complexity from rising during the search. The algorithm may then give up on "add 1 to both sides", because it makes the proposition more complicated. A less naive, more targeted heuristic might be "familiarity with symbols": when the algorithm finds it knows very little about sqrt, that only a few theorems deal with it while many more deal with integers and rationals, it will tend to look for a way to eliminate sqrt, in this case "square both sides". Thus (skipping a step) we turn the statement into "there exist coprime m, n with 2 = m²/n²". The algorithm may then decide it is not very familiar with fractions either, so it multiplies both sides by n². At that point the heuristic may stall, and the algorithm will consider other search techniques.
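
To make this concrete, here is a minimal Python sketch of best-first proof search with the naive symbol-counting heuristic just described. It assumes a hypothetical apply_tactic(state, tactic) callback, which you can think of as asking an ITP such as Lean to run one tactic and return the pretty-printed new state (or None on failure), and a finite candidate list of tactics; a real prover faces an unbounded action space, which is exactly why better heuristics are needed.

import heapq

def complexity(state: str) -> int:
    # Naive heuristic: the number of symbols in the proposition. Search
    # prefers branches that do not make the proposition longer, so a move
    # like "add 1 to both sides" sinks to the back of the queue.
    return len(state.split())

def best_first_search(initial_state, apply_tactic, tactics, max_steps=10000):
    # Priority queue ordered by the heuristic; each entry carries a proof
    # state and the tactic sequence that produced it.
    counter = 0  # tie-breaker so heapq never compares states directly
    queue = [(complexity(initial_state), counter, initial_state, [])]
    seen = {initial_state}
    for _ in range(max_steps):
        if not queue:
            return None  # this part of the search space is exhausted
        _, _, state, proof = heapq.heappop(queue)
        if state == "no goals":  # conventional "proof finished" marker
            return proof
        for tac in tactics:
            nxt = apply_tactic(state, tac)  # None if the tactic fails
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                counter += 1
                heapq.heappush(queue, (complexity(nxt), counter, nxt, proof + [tac]))
    return None  # gave up: the combinatorial explosion won this time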

Due to Gödel's incompleteness theorems, the fact that Tarski's first-order theory of the reals is decidable but only by super-exponential algorithms, and the NP-completeness of SAT, it can be argued that there is no "ultimate heuristic" that can guide all proofs; Wu's method, which completely solves elementary geometry problems, covers only a special case. On general problems, a heuristic can only be applied to a very narrow set of problem patterns. The theorem provers of the symbolist era were a grand competition of heuristics: whoever designed heuristics more delicately and coordinated multiple heuristics better achieved better performance. But mathematical problems are so varied that, even after decades of hard work, these provers are stymied by high-school mathematics, let alone keeping up with modern mathematics.

In fact, other areas of AI ran into similar situations. Image recognition, for example, also relied on humans manually designing features, and performance on the ImageNet dataset long hovered around 50%. Then deep learning burst onto the scene (with AlexNet in 2012) and instantly changed the face of image recognition; now deep learning has begun to emerge in theorem proving as well.

Why is all this possible? You may feel that symbolic heuristic design matches some human ideas in some cases, but it fails to capture the most crucial part of human mathematical practice: humans go through a considerable amount of mathematical training and accumulate a great deal of experience, and sometimes this experience is hard to articulate: "I don't know how I came up with it, I just thought of it." Deep learning is precisely what lets an algorithm accumulate experience: after it has done many, many problems, when it encounters a new one it can recall whether it has seen a similar "problem type" and find an approach resembling what worked in training.

We can use Monte Carlo tree search to describe a statistically enhanced theorem-proving model. Monte Carlo tree search assigns a probability to each branch, say 0.9 to "proof by contradiction", 0.09 to "induction", and 0.01 to some other branch, and prioritizes the search by probability, which avoids traversing all branches. In the symbolist era, the combined action of multiple heuristics in effect also used this model, but under a fixed combination of symbolic heuristics the probabilities were usually fixed. Now we can adjust these probabilities through training (this is essentially a reinforcement learning technique). For example, after searching a branch with probability 0.9, the algorithm finds that the branch's descendants cannot complete the proof, which indicates that it overestimated the method's chance of success, so it appropriately reduces that probability to avoid repeating the mistake (it only reduces it rather than killing it completely: the branch's descendants may not have been fully traversed, so it is not entirely certain that the branch is hopeless). The HTPS algorithm introduced at the beginning of this article takes this idea a step further. It observes that people fall into relatively fixed problem-solving routines for certain problem types, that is, a series of theorem-application steps may recur, with only a few parameters changing with the circumstances. It therefore packages these routines into trees that become the alternatives at each search step, removing the need to search one by one for the next theorem to apply. This "tree on a tree" idea is the origin of the name "hypertree".
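
Here is a minimal sketch of this probability-feedback idea, not of HTPS itself. Everything in it is illustrative: the priors would come from a learned policy in a real system, and the 0.5 discount factor is an arbitrary stand-in for a proper value update.

import random

class Branch:
    def __init__(self, tactic, prior):
        self.tactic = tactic
        self.prior = prior      # initial probability, e.g. 0.9 for contradiction
        self.failures = 0       # how often this subtree failed to close the goal

    def score(self):
        # Discount the prior by observed failures, but never kill the branch:
        # its subtree was not fully traversed, so a proof may still be there.
        return self.prior * (0.5 ** self.failures)

def select(branches):
    # Sample in proportion to the discounted scores, so a 0.9 branch is
    # tried far more often than a 0.01 one without traversing everything.
    total = sum(b.score() for b in branches)
    r = random.uniform(0.0, total)
    for b in branches:
        r -= b.score()
        if r <= 0:
            return b
    return branches[-1]

# One simulated round of feedback:
branches = [Branch("by_contradiction", 0.9),
            Branch("induction", 0.09),
            Branch("simp", 0.01)]
chosen = select(branches)
# ... descend into `chosen`, fail to finish the proof there, then:
chosen.failures += 1   # appropriately reduce, never eliminate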

The probabilistic feedback mechanism of Monte Carlo tree search alone is not enough, because the tree is too large (infinite) and most proofs cannot be found this way; deep neural networks are needed to accumulate experience from training and to supply good enough initial probabilities for the Monte Carlo tree model. What deep learning does best is capture features, such as the features of cats and dogs in images, or the features English and Chinese have in common. At the most abstract level: analytic theorems are more likely to be proved analytically, and algebraic theorems are more likely to yield to algebraic tools; even when an algebraic theorem really is proved analytically, we can sense some "analytic flavor" in the proof. Such features are in effect restrictions on the direction of search. Deep neural networks can "accumulate experience" by discovering features of propositions or of proof processes in the data, and adopting the corresponding proof moves when they encounter propositions with similar features.

Features in different task data each have their own character, and we extract them better by designing networks with correspondingly different structures; for example, we use convolutional neural networks to capture features in images. So what kind of deep neural network should we design to extract the features in a proof? Some earlier work treated proofs as sequences or graphs and accordingly used RNNs or GNNs to process them, with little success. The recent success of large language models has led us to treat proofs as a language and use language models to capture their features. The GPT-f family of work, represented by HTPS, uses a language model similar to GPT-2, while the more recent LeanDojo uses a T5 model. The details of how these language models work, and how different language models differ at proving theorems, are beyond the scope of this article; readers can look them up on their own, and they will be covered in the follow-up technical review.
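
For a concrete picture of "proof state in, tactic candidates out", here is roughly how one would query LeanDojo's published tactic generator (a ByT5 model, the non-retrieval variant of its ReProver) through the Hugging Face transformers library. The checkpoint name is the one released with LeanDojo to the best of my knowledge; treat it and the exact I/O format as assumptions to verify.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Checkpoint published with LeanDojo (an assumption: verify the name on the
# Hugging Face hub before relying on it).
NAME = "kaiyuy/leandojo-lean3-tacgen-byt5-small"
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(NAME)

def suggest_tactics(goal: str, k: int = 8) -> list[str]:
    # The goal, pretty-printed by Lean, goes in as plain text; beam search
    # returns k candidate next tactics. Each candidate must then be run in
    # the proof assistant, which is what filters out the hallucinations.
    inputs = tokenizer(goal, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64,
                         num_beams=k, num_return_sequences=k)
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in out]

print(suggest_tactics("n : ℕ ⊢ ¬7 ∣ 2 ^ n + 1"))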

But we can also see that training LLMs on theorem proving is not yet good enough. Part of LeanDojo's work shows that GPT-4 cannot prove complex mathematical theorems such as Stirling's formula, even with ample prompting. The reason may be that theorem-proving data is too scarce and the most suitable way of training LLMs for it has not been found, so the capabilities of LLMs are not fully exploited; or it may be that the task of theorem proving has a richer, finer structure than ordinary language, and the brute-force aesthetic of LLMs is not enough to conquer such an arduous task. But while the task is daunting, it is not as mysterious as it was a few decades ago; a wealth of knowledge about the mechanisms of proof, and even of reasoning, has accumulated, though much work remains to be done.

This section concludes with a response to some common questions:

Because of Gödel's incompleteness theorem (or for some other reason), a machine can never prove mathematical theorems; people are not bound by Gödel's theorem, so theorem proving can only be done by people.

Response: Gödel actually held this view himself. But the view implies that a human is an oracle Turing machine, that is, a Turing machine that can run for an infinite amount of time and send the result of that infinite computation back through time to the present moment. By the Church-Turing thesis, such an oracle is not a computational model that can be implemented in our universe. For the task of proving theorems, it would mean that one can go through all the possibilities and find the right theorems to apply without spending any time. Human "inspiration" seems to have a similar quality, but that would mean one could find a proof without any mathematical training, as if a fairy revealed it in a dream. I prefer to believe that inspiration comes from some concrete computational mechanism, even if that mechanism is currently described only incompletely by theories such as deep learning and reinforcement learning.

But people can write down axiom systems such as ZFC, can recognize that these systems are incomplete and then add axioms, and invented the Turing machine while knowing that the Turing machine cannot solve the halting problem. People can lay the foundations for mathematical proof and for Turing machines, while the machine itself has no way to lay foundations for itself.

Response: This describes the ability to work at the metalinguistic level, that is, to study the properties of another language by means of language, and only by means of language. But a metalanguage is also a language, subject to the rules of computation, and writing down an axiomatic system or posing the halting problem is no different from other Turing-computable activities. Besides, people cannot solve the halting problem either. There is no reason to think a Turing machine cannot observe itself; just as modern computers can check their own memory for errors, it is simply engineering.

Present and future of DL+ATP

In December 2021, a paper titled "Advancing mathematics by guiding human intuition with AI" was published in Nature. It tells how DeepMind researchers used traditional techniques such as clustering to distill patterns in knot theory and representation theory, helping human mathematicians prove two new theorems. In June 2023, Terence Tao described how language models such as GPT-4 were helping him in his mathematical research, for example by quickly extracting the content of other people's papers. AI is popping up in and around mathematics, helping mathematicians with all sorts of work, but AI writing its own mathematical proofs still seems a long way off. On the one hand, mathematics itself is very complex; on the other hand, the field is limited by lack of data. Humans have written many papers and textbooks, accumulating a large mathematical corpus, and LLMs do train on mathematical text from sources such as Wikipedia. If you ask ChatGPT about mathematics, it can produce a lot of valuable information; but interestingly, when you ask it to prove a theorem it has never seen, it outputs sentences that look like mathematics yet are full of errors, as if a writer who knows no mathematics were inventing lines for a mathematician character. Deep learning is good at capturing features, but it cannot manipulate propositions and verify their correctness, so it can only "look like" mathematics rather than actually do mathematics. In fact, no machine can verify the correctness of natural language mathematical proofs, and sometimes not even humans can, since every now and then a paper appears that almost no one can check. At present, only formal languages such as Metamath, Isabelle, Coq, and Lean can serve as the basic data for AI to understand mathematics, because they can be compiled and run to verify correctness.

In 1973, Andrzej Trybulec developed Mizar, the first formal language designed specifically for mathematics, together with an associated mathematics library, in order to verify the correctness of his doctoral thesis (although fifty years on, that ideal has still not been realized). The library is still maintained and has accumulated more than 10,000 definitions and 65,000 theorems, and ATP algorithms are still tested on Mizar today. In recent years, type-theoretic languages such as Coq and Lean have built even larger theorem libraries and grown more and more popular, and benchmark problem sets such as MiniF2F ("F2F" is short for "formal to formal") have been established to measure the performance of ATP algorithms. MiniF2F contains 488 problems described in natural language and in four formal languages (Coq, Lean, Isabelle, and HOL Light, though the HOL Light portion covers only 330 problems), collected from various mathematical competitions, including many problems from the famous IMO, such as the one shown at the top of this article. Researchers have also launched the IMO Grand Challenge, hoping to take AI to an IMO gold medal.

(There is a gap to fill here: Mizar is very readable, very close to natural language, and there is even a recent rewrite in Rust; I do not know why it has not caught on the way Coq and Lean have. If anyone meets someone from the Mizar group, please ask.) The original motivation for mathematicians to develop formal languages was to check proofs programmatically and mechanically, but perhaps the most profound change this tool brought to the study of mathematics is that it cleanly divides mathematics into two parts: the form, that is, the rigorous statements defining theorems and the calculus of proofs; and the ideas, that is, why a definition or theorem is written the way it is, and why a particular proposition is invoked in a proof. Ideas are the most mysterious and confusing things in mathematics. Excellent ideas are often labeled genius, even regarded as miracles, which ordinary people can only admire, let alone study and reproduce by scientific means; category theory may be one such attempt, one that belongs to mathematicians themselves. Now we have the tools of cognitive science and the theories of artificial intelligence, and as the first section showed, the laws of mathematical ideas are being revealed step by step. To demonstrate the dichotomy of form and idea more clearly, I have been contemplating an experimental project: writing a complete mathematics textbook in a formal language. Everything that can be written formally gets formalized; whatever remains, which can only be written in natural language, is precisely the idea part. Those natural-language remnants would be a rich source of material for the study of ideas.

The most pressing problem facing DL+ATP research is that mathematical propositions expressed in formal languages are still a drop in the ocean compared with the entire mathematical edifice of humankind. Fortunately, the development of LLM technology has produced the tools of "pre-training" and "fine-tuning", which can circumvent the lack of data to some extent. The idea is that many abilities and much knowledge reflected in data are universal, such as reasoning ability, while the abilities a particular task requires may be specialized, such as the concrete syntax of a formal language. If an LLM acquires general knowledge of language before learning the domain-specific task, it does much better than the traditional approach of training on the specific task directly from random initialization. For ATP specifically: natural-language mathematics and formal-language mathematics express the same mathematical meaning despite their differences in rigor and grammar, so an LLM pre-trained on a large amount of human mathematical language, having first learned some mathematics, and then fine-tuned on the scarce formal languages, should perform much better than one trained on formal-language data alone. The classic DL+ATP work GPT-f pre-trains a GPT-3-style model on the natural language dataset CommonCrawl, 23 GB of GitHub code, 10 GB of arXiv Math papers, and 2 GB of Math StackExchange posts. The motivation for training on code is that code, much like proof, also reflects human reasoning. However, HTPS, the follow-up improvement of GPT-f, found it better to pre-train only on arXiv; who can say which configuration optimizes performance in engineering. HTPS achieved a pass rate of 58.6% on the validation set and 41.0% on the test set of the Lean portion of MiniF2F, the best result for pure proof search relying only on formal language. More information on evaluation datasets and algorithm performance can be found on the Papers with Code page, though that page is incomplete and has not been maintained for a long time.
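
To make "fine-tuning on formal language" concrete: GPT-f serializes each (proof state, next tactic) pair extracted from a formal proof library into one training string in a GOAL ... PROOFSTEP format. The sketch below follows the format described in the GPT-f paper but simplifies the details; the two example pairs are taken from the proof at the top of this article.

def to_training_example(goal: str, tactic: str) -> str:
    # One causal-LM training string per proof step; at proving time the
    # model is prompted with "GOAL <state> PROOFSTEP" and asked to complete
    # the tactic, which the prover then executes and checks.
    return f"GOAL {goal} PROOFSTEP {tactic}"

examples = [
    to_training_example("n : ℕ ⊢ ¬7 ∣ 2 ^ n + 1",
                        "rw nat.dvd_iff_mod_eq_zero"),
    to_training_example("n : ℕ ⊢ ¬(2 ^ n + 1) % 7 = 0",
                        "rewrite [nat.add_mod, nat.mod_eq_of_lt]"),
]
print("\n".join(examples))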

Note 1: One drawback of fine-tuning is that it makes the LLM focus only on the domain-specific task and forget general tasks, so one cannot simply fine-tune a tool like ChatGPT on mathematics to give it strong mathematical capabilities, as that would cost it its other capabilities. Mathematical datasets were already part of ChatGPT's pre-training, although that pre-training data did not give it enough mathematical power. Even though LLMs can learn mathematics in pre-training beyond the available theorem-proving data, the scarcity of theorem-proving data remains a major obstacle to advancing the field. Readers who have used a formal language will know first-hand that proving mathematics in a formal language is far more laborious than writing a proof in natural language: many details we usually consider perfectly ordinary and natural must be spelled out, so building formal theorem libraries takes enormous human labor. If writing formal language is such a hassle, why not let AI do it? The first attempt was made in 2018, followed by a report by Christian Szegedy in 2020, which formally posed the problem of autoformalization: automatically translating natural-language mathematical propositions and proofs into formal language, like machine translation. The best performance on autoformalization to date is also achieved by LLMs: a May 2022 test reported that the best code language model, Codex, could correctly translate 38 out of 150 problems, drawn from the MATH dataset (which contains 12,500 math problems stated in natural language), from natural language into Isabelle. One can imagine that an algorithm which translates well between two different mathematical languages must engage with the nature of mathematical semantics, so the study of autoformalization will also feed back into the study of proving.
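
The setup of such experiments is essentially few-shot prompting. Here is a minimal sketch in which the prompt is made up by me and complete is a generic text-completion callback standing in for Codex or any other LLM API; the actual experiment targeted Isabelle, while the example below reuses the Lean statement from this article.

# `complete` is any text-completion function you have access to; the prompt
# below is illustrative, not the one used in the paper.
FEW_SHOT_PROMPT = """Translate each natural language statement into Lean.

Natural language: 7 does not divide 2^n + 1 for any natural number n.
Lean: theorem imo_1964_p1_2 (n : ℕ) : ¬ 7 ∣ (2 ^ n + 1)

Natural language: {statement}
Lean:"""

def autoformalize(statement: str, complete) -> str:
    # The model continues the pattern and emits a candidate formal statement;
    # whether it compiles is then checked by the proof assistant itself.
    return complete(FEW_SHOT_PROMPT.format(statement=statement)).strip()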

In concrete implementation, the two problems are also complementary. Natural-language proofs are full of leaps, and these omissions, natural for humans, cannot be handled in a formal system, so ATP can be used to fill them in automatically. In the other direction, there is a clever practical idea: since formal-language data is scarce, the search process of ATP can be guided by natural-language propositions. The representative work in this direction is the 2022 Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs (DSP), which uses Codex on Isabelle to map natural-language proofs into formal-proof "sketches", that is, formal proofs that still contain leaps. The gaps are then filled automatically with Isabelle's symbolic prover Sledgehammer (Isabelle's symbolic automation is much stronger than Lean's). The natural-language proofs can be written by a human expert, which achieves 39.3% test-set accuracy, or produced by Minerva, a large model for natural-language mathematics, which achieves 38.9%. With Minerva, both the natural-language draft and the formal proof are produced by the algorithm itself, which can also be viewed as the algorithm doing the whole ATP task. The best subsequent result from this line is Decomposing the Enigma: Subgoal-based Demonstration Learning for Formal Theorem Proving, an improvement of DSP that achieves 45.5% test-set accuracy on Isabelle, which is also the best result on the MiniF2F dataset. However, it uses a different formal language from HTPS and relies on natural language, so the two are not directly comparable.
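
Schematically, the DSP pipeline looks like the sketch below. The three callbacks are hypothetical stand-ins for the informal proof source (a human expert or Minerva), the sketching LLM (Codex in the paper), and Isabelle's Sledgehammer; the real system is of course much more involved.

def draft_sketch_prove(statement_nl, draft, sketch, hammer):
    # 1. Draft: obtain an informal, natural-language proof (human or LLM).
    informal_proof = draft(statement_nl)
    # 2. Sketch: map it to a formal skeleton whose unjustified leaps are
    #    left open as subgoals.
    formal_sketch, open_subgoals = sketch(statement_nl, informal_proof)
    # 3. Prove: close each leap with the symbolic automatic prover.
    fillers = [hammer(goal) for goal in open_subgoals]
    if all(fillers):
        return formal_sketch, fillers  # a complete, checkable formal proof
    return None                        # some leap was too big for the hammer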

Note: One possible challenge to DSP is that Minerva's training data contains mathematical text collected from web pages and arXiv, which does not necessarily exclude the MiniF2F problems and so is not "clean" enough. A large language model can, however, be regarded as a compression of its data: in general it cannot "recite" a solution, and its output can be seen as a freshly generated one. Then again, Microsoft Copilot has been caught reproducing irrelevant program comments verbatim, and the LLM's internal handling of its data remains hard to understand. These results are far from ideal, but they have already turned into ATP tools that mathematicians may not realize they can use. In 2022, LEAN-GPTF was open for public testing for a while as a proof tactic inside Lean, able to prove some simple theorems; it is now closed. HTPS has been integrated into Lean's VS Code extension: while writing a proof you can click a button to get a dozen AI-recommended next steps, most of them wrong, but some right. VS Code's Copilot programming assistant can occasionally translate the natural-language mathematical statements you write into Coq or Lean, and ChatGPT can occasionally write correct formal-language proofs. The current situation looks bleak, but remember that this direction has only been developing for the last two or three years.

If you are willing to be patient for a few more years, what will DL+ATP bring to the world? The most immediate effect is to accelerate the development of formal mathematics. "Introduction to Computer-Aided Proof" shows us a reliable and fair picture of mathematical research, and ATP can greatly reduce the cost of realizing that vision. We could feed a human mathematics paper, or a student's answer to an exam question, into an autoformalization algorithm and have it generate code that verifies correctness. Conversely, code could be translated into human language to make it easier to read, or concise ideas could be extracted from complex proofs to aid understanding. We may not be able to make ATP systems prove big theorems for a long time, but they are very likely to assist humans in thinking about mathematics. One important possibility is assisted retrieval of mathematical content: an idea comes to mind, and I wonder whether anyone has had a similar idea; perhaps there is a precise theorem expressing it that traditional keyword-matching search could never find, but an AI could understand the idea at an abstract semantic level and locate the relevant material.
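
A toy sketch of such idea-level retrieval: embed every statement in the library once with any off-the-shelf sentence-embedding model (the embed callback below is assumed, not a specific library), embed the query, and rank by cosine similarity instead of keyword overlap.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_search(query, library, embed, top_k=5):
    # `library` is a list of theorem statements; in practice their
    # embeddings would be precomputed and indexed, not recomputed per query.
    q = embed(query)
    scored = sorted(((cosine(q, embed(s)), s) for s in library), reverse=True)
    return scored[:top_k]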

Finally, a little science-fiction fantasy. If one day a mathematical AI can prove the Riemann hypothesis, it is not hard to imagine AI also solving scientific problems such as room-temperature superconductivity and controlled nuclear fusion, eliminating all kinds of diseases, and even solving all the problems of social production. Mathematics represents the peak of human abstract reasoning, and when AI also possesses abstract reasoning, it may become full general artificial intelligence. What will society look like then?

Learning Map

If you are an enthusiast who wants to learn more about this field, with the goal of broadly reading its leading papers, then you will need an introduction to deep learning and a preliminary understanding of LLMs, at least familiarity with the principles of the encoder-decoder architecture. It is even more important to have experience with at least one formal language, and I recommend Lean here, because modern DL+ATP work increasingly uses Lean.

If you want to pursue research in this field... consider this a warning to turn back! You are dealing with the most complex and comprehensive field in AI, and the second most distant from money, after pure theory. (If any friend disagrees, please contact me to help ATP improve this ranking; thanks in advance x)

DSP author Albert Jiang points out that DL4Math people actually have plenty of money: HTPS authors Guillaume, Tim, Thibault, and Marie-Anne went to Mistral AI; Google's Yuhuai Wu and Christian went to xAI; Jesse founded Morph; Stanislas founded Dust; Markus is starting a company. But I refuse to admit it until someone gives me money (x). First, you need more comprehensive knowledge of deep learning and reinforcement learning. LLMs go without saying, as the main paradigm today is LLM-based, but CV knowledge also helps, because geometric propositions, and the geometric intuition behind some propositions, must also be handled; graph-based approaches have potential, so an understanding of GNN methods can also help; the latest work even uses the popular diffusion methods. Besides these modern techniques, as we saw above, DL and RL still develop within the framework built in the pre-deep-learning era, so it is important to understand the traditional methods. Outside AI, you will be working on top of type-theoretic languages, so you will need some "common sense": mathematical logic and the various type theories, functional programming, category theory, and so on. Subproblems with a strong symbolic tradition emphasize this knowledge more, while work like HTPS uses it as background knowledge. Overall, the road ahead for ATP is not clear, and the field's accumulated knowledge is not deep enough for a quick start; researchers need a fairly wide range of knowledge to find the next breakthrough. It may not be directly applicable, but it can provide insight, or at least keep you from making simple mistakes.

The main event in the DL+ATP field is AITP, the Artificial Intelligence and Theorem Proving conference; CICM, the Conference on Intelligent Computer Mathematics, is also worth noting. In addition, the MATH-AI workshop has been held twice, at ICLR 2021 and NeurIPS 2022, and the third edition will be held at NeurIPS 2023 at the end of this year (2023.12.15/16); it deserves your attention. At NeurIPS, Albert Jiang, Kaiyu Yang, and Emily First will also give a tutorial on machine learning for theorem proving. Welcome to join! Beyond conferences, if you are in the Lean Zulip community, check out the stream Machine Learning for Theorem Proving.

Finally, a list of papers for further study.

A survey of deep learning for mathematical reasoning (2022). A survey is always the best place to start.

Holophrasm: a neural automated theorem prover for higher-order logic (2016). DL+ATP started with quite advanced ideas, even before the transformer existed. Unfortunately, the author has since retired from the field.

Generative language modeling for automated theorem proving (2020). The famous GPT-f: OpenAI enters the arena with deep pockets.

Proof artifact co-training for theorem proving with language models (2021). Another classic, PACT, which uses a new training method and is stronger than GPT-f; it is the main baseline for subsequent work.

Formal mathematics statement curriculum learning (2022). The highlight is the use of curriculum learning; stronger than PACT.

Hypertree proof search for neural theorem proving (2022). The strongest: HTPS, with GPT-f's scale plus Holophrasm's search algorithm. OpenAI's big money pivoted to ChatGPT, and Meta took over the line from OpenAI, but it is not open source.

DT-Solver: Automated Theorem Proving with Dynamic-Tree Sampling Guided by Proof-level Value Function (2023). The latest result, from Sun Yat-sen University, Peking University, and Huawei. Slightly better than PACT.

LeanDojo: Theorem Proving with Retrieval-Augmented Language Models (2023). I did not want to list it, but this paper has been all over the public channels for a while. Its main contribution is in fact the dataset; the model proposed in the paper is simple in structure, and the ReProver model feels like a companion validation of the dataset.

There are also Skip-tree Training, TacticToe, LISA, and other equally wonderful works.

The above are the most basic ATP works using pure formal language; the works below involve more natural-language components.

Thor: Wielding hammers to integrate language models and automated theorem provers (2022). This is Thor, which goes beyond PACT by having the language model work together with the proof assistant's built-in automatic provers (the hammers).

Solving quantitative reasoning problems with language models (2022). This is Minerva: natural-language mathematics, and even physics and chemistry.

Draft, sketch, and prove: Guiding formal theorem provers with informal proofs (2022). This is DSP, the benchmark work on guiding formal proofs with natural-language proofs, going beyond Thor.

Decomposing the Enigma: Subgoal-based Demonstration Learning for Formal Theorem Proving (2023). This paper brings in the subgoal method from reinforcement learning and achieves the best result on MiniF2F, using diffusion on a graph to optimize the in-context learning examples.

Autoformalization with large language models (2022). This paper is enough for the pure autoformalization direction; for earlier research, see its Related Work section.

Evaluating Language Models for Mathematics through Interactions (2023). A detailed survey of the mathematical abilities of ChatGPT-class large language models.

The list above is long; interested readers can follow the pointers in these papers. There is surely important work I cannot recall right now; please add it in the comments section.