<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://luisscoccola.github.io/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://luisscoccola.github.io/blog/" rel="alternate" type="text/html" /><updated>2026-02-10T21:35:35+00:00</updated><id>https://luisscoccola.github.io/blog/feed.xml</id><title type="html">Luis’ blog</title><entry><title type="html">Artificial Mathematical Intelligence in 2025</title><link href="https://luisscoccola.github.io/blog/artificial-mathematical-intelligence/" rel="alternate" type="text/html" title="Artificial Mathematical Intelligence in 2025" /><published>2025-12-01T00:00:00+00:00</published><updated>2025-12-01T00:00:00+00:00</updated><id>https://luisscoccola.github.io/blog/artificial-mathematical-intelligence</id><content type="html" xml:base="https://luisscoccola.github.io/blog/artificial-mathematical-intelligence/"><![CDATA[<p>Artificial Intelligence can be applied to many math-related tasks, including
formal and informal theorem proving,
formalization and informalization,
and math discovery.
In order to make long term progress, it is necessary to have a good understanding of each task, of their relationships, and of what is currently within reach.</p>

<p>I believe that math formalization will play a central role in Artificial Mathematical Intelligence.
If you are not familiar with formalization, you can check my previous post on <a href="/blog/formalization-of-math/">Why math should be formalized</a>.</p>

<p><strong>Acknowledgements.</strong>
I took the term “Artificial Mathematical Intelligence” from a <a href="https://emilyriehl.github.io/files/testing-ai.pdf">talk by Emily Riehl</a>, which I recommend.
I thank <a href="https://leni.sh/">Leni Aniva</a> and <a href="https://justinasher.me/">Justin Asher</a> for conversations about these topics.</p>

<p><br /></p>

<div style="width: fit-content;
  border: 2px solid #ccc;
  background-color: #f0f7ff;
  padding: 1rem 2em 0.3rem 1rem;
  border-radius: 5px;">
  
<p><strong>Contents</strong></p>
<ul>
  <li><a href="#what-is-artificial-mathematical-intelligence">What is Artificial Mathematical Intelligence</a></li>
  <li><a href="#the-semantic-gap">The semantic gap</a></li>
  <li><a href="#theorem-proving">Theorem Proving</a>
    <ul>
      <li><a href="#automated-formal-theorem-proving">Automated Formal Theorem Proving</a>
        <ul>
          <li><a href="#formal-methods-AFTP">Classical approaches and Formal Methods</a></li>
          <li><a href="#current-approaches-AFTP">Current approaches</a></li>
          <li><a href="#training-AFTP">Training data: math as a game, zero human knowledge, self-play</a></li>
          <li><a href="#LLMs-AFTP">Do we need language models?</a></li>
          <li><a href="#SOTA-AFTP">What can current agents prove?</a></li>
        </ul>
      </li>
      <li><a href="#automated-informal-theorem-proving">Automated Informal Theorem Proving</a></li>
    </ul>
  </li>
  <li><a href="#autoformalization">Autoformalization</a></li>
</ul>

</div>
<p><br /></p>

<h1 id="what-is-artificial-mathematical-intelligence">What is Artificial Mathematical Intelligence</h1>

<p>Mathematicians perform many tasks, such as coming up with new statements and theories, proving results, typing results and proofs in LaTeX, and, more and more, formalizing statements and results in a proof assistant.
Any of these tasks can potentially be automated, and this automation is what I refer to as Artificial Mathematical Intelligence (AMI).
In this post, I will focus on the following three tasks, but I say a bit about other tasks below.</p>

<ul>
  <li>
    <p><strong>Informal Theorem Proving:</strong>
Given a mathematical statement (either formal or informal), provide an informal proof of it.</p>
  </li>
  <li>
    <p><strong>Formal Theorem Proving:</strong>
Given a mathematical statement formalized in a proof assistant, provide a formal proof which is accepted by the proof assistant.</p>
  </li>
  <li>
    <p><strong>Formalization:</strong>
Given a mathematical theory described in informal language, formalize this theory in a proof assistant in a way that is accepted by the proof assistant.
By <em>mathematical theory</em>, I mean any collection of definitions, result statements (ie the statement of a theorem, proposition, or lemma), and proofs.</p>
  </li>
</ul>
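
<p>As a minimal illustration of Formal Theorem Proving, here is a toy statement and proof in Lean 4 (the theorem name is mine; <code class="language-plaintext highlighter-rouge">Nat.add_comm</code> is part of Lean’s standard library). The proof counts as correct exactly when the proof assistant’s kernel accepts it:</p>

```lean
-- A formalized statement together with a formal proof.
-- The proof assistant checks every step mechanically.
theorem add_comm_toy (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```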

<p>When performed automatically using, for example, a combination of Machine Learning and Formal Methods, I’ll refer to these tasks as <em>Automated Informal Theorem Proving</em>, <em>Automated Formal Theorem Proving</em>, and <em>Autoformalization</em>, respectively.
There are important connections between these three tasks that are leveraged in practice, and I’ll discuss this below.</p>


<figure id="artificial-mathematical-intelligence-tasks" style="text-align:center;">
  <img src="/blog/assets/img/artificial-mathematical-intelligence-tasks.svg" alt="artificial-mathematical-intelligence-tasks" style="width:100%" />
  <figcaption>
    <p>The three Artificial Mathematical Intelligence tasks considered in this post, and some of the main relationships between them.</p>

  </figcaption>
</figure>

<p><strong>Other tasks.</strong>
There are many other important tasks.
These include specific tasks such as
solving exercises (similar to theorem proving, but might just involve a computation and not a proof),
performing calculations or simplifications (such as simplifying a complicated integral),
finding counterexamples or interesting examples,
as well as more complex tasks such as
mathematical discovery, that is,
coming up with new theories, conjectures, constructions, or proof strategies, formal or informal.
For more on this, see for example <a href="https://doi.org/10.1090/bull/1839">[Bengio, Malkin. Bull. AMS]</a>,
<a href="https://emilyriehl.github.io/files/testing-ai.pdf">[Riehl. Talk]</a>,
<a href="https://arxiv.org/abs/2412.15184">[Frieder et al.]</a>,
and <a href="https://arxiv.org/abs/2506.13131">[Novikov et al.]</a>.
Another important task is informalization: given a formalized mathematical theory, provide an informal version of it.
This task will become especially relevant with formalization at scale, since it will aid in understanding formalized proofs that were not produced by humans.</p>

<p>Before getting into more details, we need to discuss a fundamental problem that needs to be addressed when developing AMI.</p>

<h1 id="the-semantic-gap">The semantic gap</h1>

<p>A <strong>semantic gap</strong> is a disconnect between two semantic frameworks.
For instance, a semantic gap manifests whenever any two agents (human or AI) communicate in such a way that there exists the possibility that one agent is misinterpreting the intention of the other<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">1</a></sup>.
A down-to-earth example of a semantic gap is when your colleague in a different timezone misinterprets the time for your meeting.</p>

<p>The semantic gap affects two of the main mathematical tasks that I described earlier:
Autoformalization and Automated Informal Theorem Proving.
How the semantic gap affects Autoformalization is straightforward: the formalized output may or may not represent the intention of the agent (human or AI) that produced the informal input to the Autoformalizer.
For example, the mathematical term “prime”
in the figure below might mean “prime integer” or “prime element” of some other ring:</p>

<figure style="text-align:center;">
  <img src="/blog/assets/img/autoformalization-semantic-gap.svg" alt="autoformalization-semantic-gap" style="width:100%" />
</figure>

<p>The semantic gap affects Automated Informal Theorem Proving in two ways:
one, the Automated Informal Theorem Prover could misinterpret the input; and two, the output of the prover could be misinterpreted by another agent.
This makes evaluating an Automated Informal Theorem Prover especially delicate.</p>

<figure style="text-align:center;">
  <img src="/blog/assets/img/informal-theorem-proving-semantic-gap.svg" alt="informal-theorem-proving-semantic-gap" style="width:100%" />
</figure>

<p>The semantic gap does not affect Automated Formal Theorem Proving, and in a sense, this is exactly the point of formalization!
It is important to recognize this, since it can be leveraged in practice:
In Machine Learning terms, an agent can be trained to produce formal proofs using <em>verifiable rewards</em>, since each proof step can be objectively checked by the proof assistant.</p>
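
<p>In code, such a verifiable reward is conceptually very simple. The sketch below is a hypothetical illustration, not taken from any of the cited systems: <code class="language-plaintext highlighter-rouge">run_checker</code> is a stub standing in for an actual call to a proof assistant.</p>

```python
# Minimal sketch of a verifiable reward for formal theorem proving.
# All names here are illustrative; a real `run_checker` would invoke a
# proof assistant (e.g. the Lean compiler) and inspect its exit status.

def run_checker(statement: str, proof: str) -> bool:
    # Stub: treat any proof that is not just `sorry` as accepted.
    return proof.strip() != "sorry"

def verifiable_reward(statement: str, proof: str) -> float:
    # Reward 1.0 iff the proof assistant accepts the proof.
    # Unlike human or LLM judgment, this signal has no semantic gap:
    # acceptance by the kernel is an objective, checkable fact.
    return 1.0 if run_checker(statement, proof) else 0.0
```

<p>An RL training loop can then optimize a proof-generating policy directly against this objective signal.</p>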

<figure style="text-align:center;">
  <img src="/blog/assets/img/artificial-mathematical-intelligence-no-semantic-gap.svg" alt="artificial-mathematical-intelligence-no-semantic-gap" style="width:100%" />
</figure>

<p>Unfortunately, we can’t only work with formal math (at least for now), so in many applications we need to mitigate the semantic gap.
I’ll discuss this more when describing each of the main tasks below.</p>

<h1 id="theorem-proving">Theorem Proving</h1>

<p>Let us start with Formal and Informal Theorem Proving, and, more generally, by being more specific about what the goal is.
There are at least two tasks that one can refer to as theorem proving.</p>

<ul>
  <li>
    <p>In <strong>Novel Theorem Proving</strong> the goal is to provide a proof for a conjecture or for a new result that hasn’t been considered before.
Novel Theorem Proving can be seen as the holy grail of AMI, since it could involve providing proofs for long-standing mathematical conjectures.</p>
  </li>
  <li>
    <p>A simpler task is <strong>Known Theorem Proving</strong>, where the goal is to provide a proof for a result for which a proof strategy is known and believed to work.
Known Theorem Proving can be useful, for example, for understanding complicated or imprecise proofs, for filling in details, and for making sure that there are no gaps in an argument.</p>
  </li>
</ul>

<p>For a 2024 survey on ML methods for Theorem Proving, you can check <a href="https://arxiv.org/abs/2404.09939">[Li et al.]</a>.</p>

<h2 id="automated-formal-theorem-proving">Automated Formal Theorem Proving</h2>

<p><span id="formal-methods-AFTP">
<strong>Classical approaches and Formal Methods.</strong>
Classical approaches to AFTP rely on <em>Formal Methods</em>, a set of techniques and tools used to mathematically specify, design, and verify complex systems (typically software).
For example, modern <strong>proof assistants</strong> allow for specification via formal mathematical definitions, design via programs, and automated machine-checked verification of formal proofs.
Before the current wave of AI integration, automation was based on <em>search</em>, achieving dramatic efficiency gains through key algorithmic breakthroughs:
</span></p>

<ul>
  <li>The <em>Resolution Principle</em> and the <em>Unification algorithm</em><sup><a href="https://dl.acm.org/doi/10.1145/321250.321253">[Robinson. JACM]</a></sup>, which reduce complex logical inference to a single, efficient rule of refutation.</li>
  <li><em>Conflict-Driven Clause Learning</em><sup><a href="https://dl.acm.org/doi/10.1145/378239.379017">[Moskewicz et al. DAC]</a></sup>, a SAT solving method that prunes vast regions of the search space.</li>
  <li>The <em>Superposition Calculus</em><sup><a href="https://doi.org/10.1093/logcom/4.3.217">[Bachmair, Ganzinger. J. Log. Comput.]</a></sup>, a powerful framework for equational reasoning.</li>
</ul>

<p>These techniques are integrated into specialized tools such as <em>Satisfiability Modulo Theory</em> solvers (<strong>SMT solvers</strong>), which represent the state-of-the-art of search-based formal deduction.
On top of these, <em>hammers</em> act as a translator between these automated tools and proof assistants.
Hammers are examples of <strong>tactics</strong>, which are proof assistant functions that a user can call to automate part of the formalization process.
For more about the subject, you can check, for instance,
the textbook <a href="https://dl.acm.org/doi/10.5555/1391237">[Kroening, Strichman. TTCS]</a> on formal reasoning, and the papers <a href="https://dl.acm.org/doi/10.1145/3573105.3575671">[Limperg, From. CPP]</a> and <a href="https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ITP.2025.14">[Norman, Avigad. ITP]</a> on SOTA search-based tactics for AFTP in Type Theory.</p>
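
<p>To give a concrete feel for tactic-based automation, here are two one-line Lean 4 proofs that delegate all the work to built-in decision procedures (<code class="language-plaintext highlighter-rouge">simp</code> for rewriting, <code class="language-plaintext highlighter-rouge">omega</code> for linear integer arithmetic):</p>

```lean
-- Each proof is found automatically by a search-based tactic.
example (n : Nat) : n + 0 = n := by simp
example (a b : Int) (h : a < b) : a + 1 ≤ b := by omega
```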

<p><span id="current-approaches-AFTP">
<strong>Current approaches.</strong>
Recently, significant progress has been made
in AFTP using Machine Learning.
While specific implementations are often complicated in their details, they usually rely on some of the following techniques:
</span></p>

<ul>
  <li>
    <p><strong>Agents based on Large Language Models</strong> (LLMs)<strong>.</strong>
This approach builds an AI agent consisting of an LLM that interacts with a proof assistant essentially like a human does, by directly producing code and using the proof assistant as a tool (eg, using <a href="https://github.com/oOo0oOo/lean-lsp-mcp">[Dressler. Git]</a>).
By treating formalization as a coding task, one can reuse infrastructure developed for AI coding agents.
There are at this point many implementations that use this approach.
For example <a href="https://arxiv.org/abs/2508.03613">[Goedel-Prover-V2]</a> fine-tunes a Qwen3 model,
while <a href="https://arxiv.org/abs/2507.15225">[Delta Prover]</a> is based on an agentic framework around an off-the-shelf LLM with no further training.</p>
  </li>
  <li>
    <p><strong>LLMs for informal reasoning.</strong>
Here, the informal reasoning capabilities of LLMs are leveraged in several ways, including
first producing an informal proof and then attempting to formalize it<sup><a href="https://iclr.cc/virtual/2023/poster/11536">[Jiang et al. ICLR]</a></sup>, generating possible useful lemmas and intermediate steps, guiding a formal theorem prover<sup><a href="https://arxiv.org/abs/2507.23726">[Seed-Prover]</a><a href="https://arxiv.org/abs/2510.01346v2">[Harmonic Team]</a><a href="https://arxiv.org/abs/2509.22819">[Varambally et al.]</a></sup>, or interleaving formal and informal reasoning<sup><a href="https://arxiv.org/abs/2504.11354">[Kimina-Prover Preview]</a></sup>.</p>
  </li>
</ul>

<figure style="text-align:center;" id="aristotle-pipeline">
  <img src="/blog/assets/img/aristotle-pipeline.png" alt="aristotle-pipeline" style="display:block; margin:0 auto; max-width:60%; height:auto;" />
  <figcaption>
    <p>Figure from <a href="https://arxiv.org/abs/2510.01346v2">[Harmonic Team]</a>.
    A high-level illustration of the approach to AFTP implemented in the Aristotle agent.
    Notice in particular that it leverages AITP to generate possibly useful lemmas.</p>


  </figcaption>
</figure>

<ul>
  <li>
    <p><strong>Reinforcement Learning</strong> (RL)<strong>.</strong>
This approach interprets theorem proving as an iterated decision problem, and uses techniques from RL to train or fine-tune an agent that, at each step, proposes a set of candidate tactics to try next, and tries to predict the expected return from applying these tactics.
For more details on this point of view, you can check AlphaProof’s paper <a href="https://www.nature.com/articles/s41586-025-09833-y">[Hubert et al. Nature]</a> and <a href="https://arxiv.org/abs/2504.11354">[Kimina-Prover Preview]</a>.</p>
  </li>
  <li>
    <p><strong>Monte Carlo Tree Search</strong> (MCTS)<strong>.</strong>
This is a stochastic search algorithm that incrementally builds and explores a search tree using randomized simulations to estimate promising paths, and it is key in the success of ML models for games such as chess and Go.
The use of MCTS for theorem proving was proposed in <a href="https://link.springer.com/chapter/10.1007/978-3-319-63046-5_34">[Färber et al. CADE]</a>, and SOTA approaches use MCTS with various modifications; see, eg, <a href="https://openreview.net/forum?id=I4YAIwrsXa">[DeepSeek-Prover-V1.5. ICLR]</a> and <a href="https://arxiv.org/abs/2510.01346v2">[Harmonic Team]</a>.
It is important to recognize that search in the context of theorem proving has several subtleties, such as <em>metavariable coupling</em>; you can read more about this in <a href="https://doi.org/10.1007/978-3-031-90643-5_6">[Aniva et al. TACAS]</a>.</p>
  </li>
  <li>
    <p><strong>Test Time Training.</strong>
This is also described in <a href="https://www.nature.com/articles/s41586-025-09833-y">[Hubert et al. Nature]</a>, and the idea is roughly the following: given a target statement, produce a set of variations and fine-tune the model using RL on those variations.
The fine-tuned model is then used to attempt to prove the original statement.</p>
  </li>
  <li>
    <p><strong>Domain-specific solvers.</strong>
The typical examples are geometry solvers such as AlphaGeometry<sup><a href="https://www.nature.com/articles/s41586-023-06747-5">[Trinh et al. Nature]</a></sup>, which is a specialized theorem prover for Euclidean plane geometry.</p>
  </li>
</ul>
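
<p>To make the MCTS idea concrete, here is a minimal, self-contained sketch. Everything in it is illustrative: the “proof state” is just a number, the two “tactics” subtract 1 or 2, and the goal is to reach 0. A real prover would instead ask the proof assistant which goals each tactic produces, and the rollout policy would typically be a learned model rather than random play.</p>

```python
import math
import random

# Toy MCTS for "proof search": states are numbers, tactics subtract 1 or 2,
# and a state counts as "proved" when it reaches 0. Purely illustrative.

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}   # tactic -> child Node
        self.visits, self.value = 0, 0.0

def ucb(child, parent_visits, c=1.4):
    # Upper Confidence Bound: balances exploitation and exploration.
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(
        math.log(parent_visits) / child.visits)

def rollout(state, rng):
    # Random simulation from `state`; reward 1.0 if the goal is closed.
    for _ in range(10):
        if state == 0:
            return 1.0
        state = max(0, state - rng.choice([1, 2]))
    return 1.0 if state == 0 else 0.0

def mcts(root_state, iters=200, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # Selection: descend along UCB-maximal children while fully expanded.
        while node.state > 0 and len(node.children) == 2:
            node = max(node.children.values(),
                       key=lambda ch: ucb(ch, node.visits))
        # Expansion: try one untried tactic.
        if node.state > 0:
            tactic = 1 if 1 not in node.children else 2
            node.children[tactic] = Node(max(0, node.state - tactic), node)
            node = node.children[tactic]
        # Simulation and backpropagation.
        reward = rollout(node.state, rng)
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Recommend the most-visited tactic at the root.
    return max(root.children, key=lambda t: root.children[t].visits)
```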

<p>In all of the approaches above, Formal Methods are incorporated quite indirectly, by allowing the model to invoke proof assistant tactics that implement the methods outlined earlier, such as Lean’s SMT solver tactic <code class="language-plaintext highlighter-rouge">smt</code>.</p>

<p><span id="training-AFTP">
<strong>Training data: math as a game, zero human knowledge, self-play.</strong>
A big difficulty when training AFTP models is gathering enough good quality data, and this might be the main current bottleneck.
Formal math libraries such as 
Lean’s <a href="https://github.com/leanprover-community/mathlib4">mathlib</a>, while extremely impressive and useful,
are orders of magnitude smaller than what a large scale model needs (at least with the current models and training techniques).
Another issue is data contamination: approaches relying on LLMs are often tested on data the LLM was (potentially) trained on, which can make evaluation difficult.
</span></p>

<p>In practice, datasets are produced in large part using Autoformalization (hence the arrow from Autoformalization to AFTP in the <a href="#artificial-mathematical-intelligence-tasks">figure</a> at the beginning of the post).
This comes with several issues: for one, Autoformalization is imperfect; moreover, it is not known whether algorithms trained exclusively on human data will generalize.
It is also unclear if human knowledge is useful for Formal Theorem Proving, or if there exist learning algorithms that can find better, more efficient proofs without human knowledge.
This would be similar to how algorithms such as AlphaZero are able to learn how to play board games at a superhuman level without any human input.</p>

<p>Still in the context of board games,
it is also interesting to recall that
lack of data was overcome using self-play<sup><a href="https://ieeexplore.ieee.org/document/5392560">[Samuel. IBM J. Res. Dev.]</a></sup>: such games are symmetric, so a single model can be trained by playing against itself.
Unfortunately, theorem proving has no such simple symmetry: coming up with a mathematical statement seems, a priori, like a very different task from proving a statement.
Nevertheless, it is plausible that a useful analogue of self-play for theorem proving exists, and this could have a big impact.
Self-play in the context of AMI is explored in <a href="https://arxiv.org/pdf/2502.00212v4">[Dong, Ma]</a>.
It is also possible that true superhuman AFTP agents can only be trained at the same time as superhuman math discovery agents.</p>

<p><span id="LLMs-AFTP">
<strong>Do we need language models and informal reasoning?</strong>
The use of LLMs for informal reasoning works well in practice for simple proofs 
(hence the arrow from AITP to AFTP in the <a href="#artificial-mathematical-intelligence-tasks">figure</a>)
but it is not clear that such approaches will scale to truly difficult proofs.
Going back to the comparison between math and games,
note that it is actually not straightforward to fine-tune a SOTA LLM to play chess or Go to the same level as, say, AlphaZero<sup><a href="https://proceedings.mlr.press/v267/schultz25a.html">[Schultz et al. ICML]</a><a href="https://neurips.cc/virtual/2025/loc/san-diego/poster/117166">[Ma et al. NeurIPS]</a></sup>.
One could read this as saying that those games are fundamentally not about language, so an LLM may not be the best tool for the job.
The same could be true for math, which I would guess is also not fundamentally about language.
Personally, I think that at a certain point informal reasoning will become a bottleneck when it comes to AFTP, but it is not clear at what point this will happen, and it is likely that a lot of progress can be made with LLMs.
</span></p>

<p><span id="SOTA-AFTP">
<strong>What can current agents prove?</strong>
So far, and to the best of my knowledge,
AFTP agents have performed well in limited contexts, such as olympiad-type problems.
Their performance in this context is impressive:
for example, Aristotle<sup><a href="https://arxiv.org/abs/2510.01346v2">[Harmonic Team]</a></sup>
achieved a gold-medal-equivalent performance on the 2025 International Mathematical Olympiad
by providing correct <em>formal</em> solutions to five out of six problems.
For an up-to-date comparison of different models, you can check <a href="https://trishullab.github.io/PutnamBench/leaderboard.html">[PutnamBench Leaderboard]</a>, which includes the performance of SOTA models on an olympiad-type dataset.
As an example of Novel Theorem Proving, Aristotle <a href="https://www.erdosproblems.com/forum/thread/124#post-1892">provided a proof</a> for an “easy version” of a conjecture by Erdős, although this is still regarded as “olympiad-style” math.
To the best of my knowledge, we have not yet reached the point when an AI agent proves or disproves a long-standing mathematical conjecture.
</span></p>

<h2 id="automated-informal-theorem-proving">Automated Informal Theorem Proving</h2>

<p>State-of-the-art Automated Informal Theorem Provers are based on LLMs.
Most commonly, the starting point is a general purpose LLM that has been trained on large portions of the internet.
The LLM is then fine-tuned on informal mathematical corpora such as textbooks, mathematical articles, and specialized datasets.
The main technique that enables these models to solve mathematical problems, and to reason more generally, is <strong>chain-of-thought</strong> (CoT)<sup><a href="https://proceedings.neurips.cc/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html">[Wei et al. NeurIPS]</a></sup>, where the model is trained to generate intermediate steps before a final answer.
CoT can be used for self-reflection, self-verification, and correction.</p>

<p>Examples of agents that rely on CoT to solve mathematical problems and provide mathematical proofs include <a href="https://arxiv.org/abs/2511.22570v1">[DeepSeekMath-V2]</a>, and many of the general purpose models such as ChatGPT, Gemini, and Claude.
The <a href="https://epoch.ai/frontiermath">FrontierMath benchmark project</a> keeps track of the performance of many of these models on a series of math problems.
Some care has to be taken when interpreting these results: first, only final answers to problems are evaluated, so this may not reflect the performance of the models when producing proofs; second, to avoid data contamination, the problems are not public, so we don’t exactly know what is being evaluated.
The potential issues of evaluating models solely based on a final answer are considered in <a href="https://openreview.net/forum?id=3v650rMO5U">[Petrov et al. ICML AI4MATH workshop]</a>.</p>

<figure style="text-align:center;" id="deepseek-rl-response-length">
  <img src="/blog/assets/img/deepseek-rl-response-length.png" alt="deepseek-rl-response-length" style="display:block; margin:0 auto; max-width:70%; height:auto;" />
  <figcaption>
<p>Figure from <a href="https://arxiv.org/abs/2501.12948">[DeepSeek-AI]</a>, whose caption reads as follows: <em>The average response length of DeepSeek-R1-Zero on the training set during the RL process. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time</em>.
    The paper demonstrates that emergent CoT-like behavior can arise in RL-trained reasoning agents, where more complex tasks require longer reasoning traces.</p>

  </figcaption>
</figure>

<p>AITP can be useful for guiding AFTP, informalization, and explanation of known proofs,
and it <a href="#LLMs-AFTP">may or may not be the case</a> that AITP is necessary for AFTP.
Regardless of this, I believe that AITP by itself cannot scale to Novel Theorem Proving, due to the <a href="#the-semantic-gap">semantic gap</a> and the
<em>human-AI asymmetry of informal math</em>, that is, the fact that both a human and an AI can produce informal math, but only a human expert can verify it (which I observe in <a href="/blog/formalization-of-math/">Why math should be formalized</a>).
This asymmetry puts a hard limit on how much AITP-produced math can reliably be verified.</p>

<h1 id="autoformalization">Autoformalization</h1>

<p>There are lots of proposed approaches to Autoformalization (for early references, see, eg, <a href="https://dl.acm.org/doi/abs/10.1145/3372885.3373827">[Wang et al. CPP]</a><a href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/d0c6bc641a56bebee9d985b937307367-Abstract-Conference.html">[Wu et al. NeurIPS]</a><a href="https://link.springer.com/chapter/10.1007/978-3-030-53518-6_1">[Szegedy. CICM]</a>).
I will not say too much about this since, in my experience, this problem is significantly easier to address than AFTP and AITP.</p>

<p>I find that current SOTA LLMs are actually pretty decent at Autoformalization, provided that the right references are included in the context, and that the LLM is incorporated into an agent that can use the proof assistant’s output as feedback to iteratively fix the formalization until it compiles.
Two other techniques that can make a significant difference are using informal math-guided AFTPs for formalizing proofs, like for example Aristotle<sup><a href="https://arxiv.org/abs/2510.01346v2">[Harmonic Team]</a></sup> (hence the arrow from AFTP to Autoformalization in the <a href="#artificial-mathematical-intelligence-tasks">figure</a>),
and using <em>blueprints</em><sup><a href="https://github.com/PatrickMassot/leanblueprint">[Lean blueprints]</a></sup>, that is, first organizing the mathematical theory into definitions and statements with dependencies, and then formalizing these one by one.</p>

<p>The main thing one needs to watch out for is the <a href="#the-semantic-gap">semantic gap</a>, which can be addressed to a reasonable extent by using an LLM-as-a-judge<sup><a href="https://proceedings.neurips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html">[Zheng et al. NeurIPS]</a></sup>, in this case by adding one or more agents to the pipeline whose job is to identify and point out to the formalizer potential semantic gaps.
In order to use these ideas to reach a satisfactory pipeline, one or more mathematicians need to be consulted, since, as I mentioned <a href="#automated-informal-theorem-proving">previously</a>, only a mathematician (in fact, an expert in the field) can function as a reliable safeguard against the semantic gap.</p>
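
<p>A minimal version of such a pipeline can be sketched as follows. All three components are stubs with hypothetical names, not taken from any cited system: in practice <code class="language-plaintext highlighter-rouge">llm_formalize</code> and <code class="language-plaintext highlighter-rouge">llm_judge</code> would be LLM calls, and <code class="language-plaintext highlighter-rouge">compile_lean</code> would invoke the proof assistant.</p>

```python
# Sketch of an autoformalization loop with compiler feedback and an
# LLM-as-a-judge guarding against the semantic gap. All names are
# illustrative stubs, not a real implementation.

def compile_lean(code):
    # Stub: a real version would run the Lean compiler and return
    # (success, error messages).
    ok = "sorry" not in code
    return ok, "" if ok else "proof contains sorry"

def llm_formalize(informal, feedback):
    # Stub for an LLM call that drafts (or repairs) a formalization,
    # taking the compiler's previous error messages as feedback.
    return "theorem t : 1 + 1 = 2 := rfl"

def llm_judge(informal, formal):
    # Stub for an LLM judge checking that the formal statement matches
    # the informal intent; a mathematician should audit this step.
    return True

def autoformalize(informal, max_iters=5):
    feedback = ""
    for _ in range(max_iters):
        formal = llm_formalize(informal, feedback)
        ok, feedback = compile_lean(formal)
        if ok and llm_judge(informal, formal):
            return formal
    return None  # no accepted, faithful formalization found
```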

<p>I believe that, in the short term, more and more math will be formalized using technologies that we already have.
Large scale formalized libraries will most likely enable the training of AI agents that can really make progress in Formal Novel Theorem Proving.</p>

<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:2" role="doc-endnote">
      <p>In the context of AMI with black-box models, I have also seen this concept referred to as the <em>intention problem</em><sup><a href="https://leni.sh/post/250720-research-matp/">[Aniva. Blogpost]</a></sup>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[Artificial Intelligence can be applied to many math-related tasks, including formal and informal theorem proving, formalization and informalization, and math discovery. In order to make long term progress, it is necessary to have a good understanding of each task, of their relationships, and of what is currently within reach.]]></summary></entry><entry><title type="html">Why math should be formalized - and resources to get started</title><link href="https://luisscoccola.github.io/blog/formalization-of-math/" rel="alternate" type="text/html" title="Why math should be formalized - and resources to get started" /><published>2025-09-28T00:00:00+00:00</published><updated>2025-09-28T00:00:00+00:00</updated><id>https://luisscoccola.github.io/blog/formalization-of-math</id><content type="html" xml:base="https://luisscoccola.github.io/blog/formalization-of-math/"><![CDATA[<p>Proof assistants enable us to <em>formalize mathematics</em>: to express arbitrary mathematical statements and proofs without ambiguity, and to automatically verify their correctness.
Here I give some arguments for why math formalization might soon become inevitable.
I also include some resources on formalization for the interested reader.</p>

<p>In <a href="/blog/artificial-mathematical-intelligence/">a future post</a> I cover how Artificial Intelligence will help us scale up math formalization (though here I do cover how it will probably force us to).</p>

<p><br /></p>

<div style="width: fit-content;
  border: 2px solid #ccc;
  background-color: #f0f7ff;
  padding: 1rem 2em 0.3rem 1rem;
  border-radius: 5px;">
  
<p><strong>Contents</strong></p>
<ul>
  <li><a href="#tldr">TL;DR</a></li>
  <li><a href="#too-much-human-made-math">Too much human-made math</a></li>
  <li><a href="#too-much-ai-generated-math">Too much AI-generated math</a></li>
  <li><a href="#will-formalization-be-enough">Will formalization be enough?</a></li>
  <li><a href="#resources-to-get-started">Resources to get started</a></li>
</ul>

</div>
<p><br /></p>

<style>

blockquote {
  border-left: 4px solid #4a90e2;
  padding: 1em;
  margin: 1.5em 0;
  background: #f7faff;
  font-style: italic;
  color: #000 !important;
  font-size: inherit !important;   /* same size as body */
  font-family: Arial, Helvetica, sans-serif;
  letter-spacing: normal;
}

blockquote p {
  margin: 0;
  color: #000 !important;
  font-size: inherit !important;   /* same size as body */
  font-family: Arial, Helvetica, sans-serif;
  letter-spacing: normal;
}

</style>

<h1 id="tldr">TL;DR</h1>

<p>I believe that most of the new, main mathematical claims in the nearby future will have to be formalized.
I make three main arguments, which, in my opinion, go in increasing level of strength:</p>
<ol>
  <li>There already exists a lot of math (perhaps too much).</li>
  <li>Humans make mistakes.</li>
  <li>There will soon be an explosion of AI-generated math.</li>
</ol>

<h1 id="too-much-human-made-math">Too much human-made math</h1>

<figure style="text-align:center;">
  <img src="/blog/assets/img/arxiv-data-map.png" style="display:block; margin:0 auto; max-width:100%; height:auto; border: 2px solid black;" />
  <figcaption>
    <p>A portion of the awesome <a href="https://lmcinnes.github.io/datamapplot_examples/arXiv/">arXiv Data Map</a> interactive visualization.</p>

  </figcaption>
</figure>

<p>The first claim is simple: there already exists a lot of math.
There already are instances (and there will be many more) in which a single person cannot verify the validity of a statement, since it depends on more results than what they can possibly verify.
The typical example is the classification of finite simple groups, which, although believed to be correct,
spans over 10,000 pages, scattered across hundreds of papers by many authors,
and uses many deep sub-theories, which no single human masters.
Moreover, some parts are only really understood by a few experts, which is a somewhat fragile situation.
For more about the story of the classification of finite simple groups, and how the pieces fit together, you can check <a href="https://www.ams.org/notices/200407/fea-aschbacher.pdf">[Aschbacher. Notices AMS]</a> and <a href="https://www.ams.org/journals/bull/2001-38-03/S0273-0979-01-00909-0/S0273-0979-01-00909-0.pdf">[Solomon. Bulletin AMS]</a>.
But at the end of the day, each piece has been peer-reviewed by experts, and gaps that have been found have been corrected.</p>

<figure style="text-align:center;">
  <img src="/blog/assets/img/arxiv-submissions-per-year.png" style="display:block; margin:0 auto; max-width:50%; height:auto;" />
  <figcaption>
    <p>Number of math papers posted on arXiv per year (1993-2021). Image from <a href="https://info.arxiv.org/help/stats/2021_by_area/index.html">arXiv submission rate statistics</a>.</p>

  </figcaption>
</figure>

<p>Gaps… This takes us to the second point.
When doing informal mathematics,
humans can make mistakes that are very hard to spot.
Indeed, there are many past and current instances of important mathematical claims whose validity was or is contested by experts.
I will mention some well-known instances, but I want to emphasize that <ins>the point here is not about who’s wrong or right, but rather about the fact that, even though mathematics deals with absolute truths<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, in practice experts can disagree about the validity of important mathematical claims</ins>.</p>

<p><strong>The abc conjecture.</strong>
This is a conjecture in number theory from the 1980s; the name comes from the fact that it involves three positive, relatively prime integers $a$, $b$, and $c$ satisfying $a+b = c$.
The mathematical statement is not super technical, but it does not matter for this discussion.
A positive answer to the conjecture implies a positive answer to several other conjectures in mathematics<sup><a href="https://link.springer.com/chapter/10.1007/978-3-0348-0859-0_13">[Waldschmidt]</a></sup>, making the problem important.
Since it was originally stated, there have been several claimed proofs of the conjecture that have not been accepted by the mainstream math community.
The most well-known instance is the one in the $\sim 600$ page series of papers <a href="https://ems.press/journals/prims/issues/1507">[Mochizuki. PRIMS]</a>,
whose validity has been the object of much debate, in particular 
by experts in the area who have published explanations for why they believe that the proof is incorrect<sup><a href="https://www.math.uni-bonn.de/people/scholze/WhyABCisStillaConjecture.pdf">[Scholze, Stix]</a><a href="https://zbmath.org/1465.14002">[Scholze]</a></sup>.</p>

<p><strong>An example from Machine Learning.</strong>
The Adam optimizer<sup><a href="https://arxiv.org/abs/1412.6980">[Kingma, Ba. ICLR]</a></sup> is a staple of modern Machine Learning, the go-to optimizer for many standard architectures.
The paper that introduced it claimed a proof that the method converges when optimizing a convex function.
A few years later, it was shown<sup><a href="https://openreview.net/forum?id=ryQu7f-RZ">[Reddi et al. ICLR]</a></sup> that the original proof had a gap, and that there exist convex functions for which Adam does not converge
(although that did not stop anyone from using Adam).</p>

<p><strong>A version of the homotopy hypothesis.</strong>
This one’s a bit more abstract.
It concerns a result, claimed in <a href="https://eudml.org/doc/91469">[Kapranov, Voevodsky. Cah. Topol. Géom. Différ. Catég.]</a>, published in 1991, which (very) roughly says that every space can be constructed using a special kind of recipe.
In 1998, <a href="https://arxiv.org/abs/math/9810059">[Simpson]</a> was made public, which contained a counterexample to the above claim.
Voevodsky, one of the authors of the first article (and a Fields medalist), writes in
<a href="https://www.ias.edu/ideas/2014/voevodsky-origins">[Voevodsky. Institute Letter]</a>:</p>
<blockquote>
  <p>Kapranov and I had considered a similar critique ourselves and had convinced each other that it did not apply. I was sure that we were right until the fall of 2013 (!!).</p>
</blockquote>

<p>The letter describes how
Voevodsky became interested in the foundations of mathematics and proof assistants, and I can't recommend it enough.
It also contains other interesting stories about incorrect claims, such as the following (which concerns a different mistake):</p>

<blockquote>
  <p>This story got me scared. Starting from 1993, multiple groups of mathematicians studied my paper at seminars and used it in their work and none of them noticed the mistake. And it clearly was not an accident. A technical argument by a trusted author, which is hard to check and looks similar to arguments known to be correct, is hardly ever checked in detail.</p>
</blockquote>

<p>Voevodsky went on to develop <em>Univalent Foundations</em>, an approach to the foundations of mathematics based on Type Theory and Homotopy Theory, which I hope to cover in another post.</p>

<p><strong>Other examples.</strong>
There exist more examples of incorrect or contradictory claims in the literature;
Kevin Buzzard’s article
<a href="https://xenaproject.wordpress.com/2021/01/21/formalising-mathematics-an-introduction/">Formalising mathematics: an introduction</a>
mentions an instance of two published papers containing contradictory results, both published in the Annals of Mathematics, arguably the most prestigious mathematical journal!
The following related Math Overflow threads contain some examples as well:</p>
<ul>
  <li><a href="https://mathoverflow.net/q/35468/39910">Widely accepted mathematical results that were later shown to be wrong?</a></li>
  <li><a href="https://mathoverflow.net/questions/357317/results-that-are-widely-accepted-but-no-proof-has-appeared">Results that are widely accepted but no proof has appeared</a></li>
  <li><a href="https://mathoverflow.net/q/338607/39910">Why doesn’t mathematics collapse even though humans quite often make mistakes in their proofs?</a></li>
</ul>

<p>The top answer in the last thread, by Shulman, makes a good point:</p>
<blockquote>
  <p>I think another reason that mathematics doesn’t collapse is that the fundamental content of mathematics is <em>ideas</em> and <em>understanding</em>, not only proofs. […] usually when a human mathematician proves a theorem, they do it by achieving some new understanding or idea, and usually that idea is “correct” even if the first proof given involving it is not.</p>
</blockquote>

<p>Tao makes similar points in the blogpost <a href="https://terrytao.wordpress.com/career-advice/theres-more-to-mathematics-than-rigour-and-proofs/">There’s more to mathematics than rigour and proofs</a>.
However, the part that I left out from Shulman’s comment reads as follows:</p>
<blockquote>
  <p>If mathematics were done by computers that mindlessly searched for theorems and proof but sometimes made mistakes in their proofs, then I expect that it would collapse.</p>
</blockquote>

<p>And this takes me to the third point.</p>

<h1 id="too-much-ai-generated-math">Too much AI-generated math</h1>

<p>Because of generative Artificial Intelligence, it will soon be possible to produce a huge amount of plausible-looking mathematical claims and proofs; far more than humans can verify.
Will the <a href="https://arxiv.org/">arXiv</a> get flooded with AI-generated math-looking content?
Will math journals be flooded with AI-generated submissions?
I think that this is entirely possible,
which is a problem due to the <em>human-AI asymmetry of informal math</em>:</p>
<p style="text-align: center;"><ins>
Both a human and an AI can produce informal math, but only a human expert can verify it.
</ins></p>
<p>And I do not expect this to change easily.</p>

<p>If one cannot verify the math itself, one could try to filter submissions by, say, the authors’ or institutions’ trustworthiness, or by using AI as a preliminary judge; but these approaches would likely be deeply unfair, and so imperfect as to not really address the problem.
Ethical questions of this kind are considered in Emily Riehl’s
<span id="emily-talk">
talk
</span>
<a href="https://emilyriehl.github.io/files/testing-ai.pdf">Testing Artificial Mathematical Intelligence</a><sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>, which I highly recommend.</p>

<p>It is not clear if there are good solutions.
What is clear is that formal mathematics does not have this problem, since it can be checked automatically by a deterministic program.</p>
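<p>As a toy illustration (theorem names are mine), here is what this automatic checking looks like in Lean 4: if the file compiles, the kernel has verified the proof.</p>

```lean
-- If this file compiles, Lean's kernel has mechanically verified the
-- proof below; no human referee is needed to check its correctness.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- By contrast, no proof term for a false claim such as
--   theorem bad (a : Nat) : a + 1 = a
-- will ever be accepted by the kernel.
```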

<h1 id="will-formalization-be-enough">Will formalization be enough?</h1>

<p>Suppose that we do get the mainstream mathematical community to back up any important claims with a formalized proof.
Will this overcome the issues presented so far?
It seems to me that the points concerning errors or gaps (human or AI) will be addressed, but I see at least two remaining hard problems.</p>

<p>One is that a human expert would still have to be involved in <em>interpreting</em> the results that have been formally proven.
A formal proof could look as if it proves a breakthrough result, but after unfolding the definitions it may become apparent that what is actually proven is much weaker.
Relatedly, the talk by Emily Riehl that I referenced <a href="#emily-talk">above</a> proposes the following bar by which to judge AI-generated mathematics, which highlights the fact that human experts are still needed even for formalized results:</p>
<blockquote>
  <p>Any artificially generated mathematical text will not be considered as a proof unless:</p>
  <ul>
    <li>It has been communicated in both a natural language text paired with a computer formalization of all definitions, theorems, and proofs.</li>
    <li>The formalization has been accepted by the proof assistant and human expert referees have vetted both the formalization and the paired text.</li>
  </ul>
</blockquote>
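<p>Here is a contrived Lean 4 sketch of the first problem (the definitions are mine, and deliberately silly): the proof below is accepted by the kernel, yet unfolding the definition shows that nothing of substance has been proven.</p>

```lean
-- The statement of the theorem below sounds substantial...
def Solvable (n : Nat) : Prop := True  -- ...but the definition is vacuous.

-- Lean accepts this "proof"; only a human (or further tooling) who
-- unfolds `Solvable` notices that the theorem has no content.
theorem every_number_is_solvable (n : Nat) : Solvable n := True.intro
```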

<p>The second problem is that we could still end up in a situation in which we are flooded by correct, formalized claims, and where it is not clear how to sort through all of these claims.
In this case, there will have to be a radical reorganization of how the mathematical community produces and publishes math.
This quickly leads us to the broader question of how the advancement of AI will force us to reconsider how many aspects of society work, an important example being education.
There’s a lot of work to be done here, and it is outside the scope of this post.
I just want to acknowledge the fact that the solutions we arrive at concerning a flood of AI-generated content will most likely go well beyond math.</p>

<h1 id="resources-to-get-started">Resources to get started</h1>

<p>Here are some resources in case you are interested in math formalization.
I’ll try to keep this concise, and I will not mention AI-related options for now: there are enough basics to learn as it is.</p>

<p>The first step is to choose a proof assistant, and there are many choices here.
In fact, the real first step is to choose a foundation for mathematics!
Many of the modern proof assistants are based on Dependent Type Theory (more specifically, the <a href="https://rocq-prover.org/doc/V8.11.1/refman/language/cic.html">Calculus of Inductive Constructions</a>), but not all of them.</p>

<p>At the moment, the <em>de facto</em> default choice of proof assistant to do math formalization is <a href="https://lean-lang.org/">Lean 4</a>, which is based on Dependent Type Theory;
but if you want to learn about other choices, a good starting point is the series of talks <a href="https://math.andrej.com/category/every-proof-assistant/">Every proof assistant</a> and a <a href="https://youtu.be/7oBkEbKJvnE?si=grCBEH3MNiouUESG">recent talk by Jon Sterling</a> on developing modern proof assistants.
I say “<em>de facto</em> default choice” because Lean seems to have the largest active community, especially when it comes to classically trained mathematicians who want to prove standard math results and don’t necessarily care about experimenting with different foundations or logical frameworks.
Because of this, I will focus on Lean 4, but know that a lot transfers to other proof assistants based on Dependent Type Theory such as <a href="https://agda.readthedocs.io/en/latest/index.html">Agda</a> or <a href="https://rocq-prover.org/">Rocq</a>.</p>

<figure style="text-align:center;">
  <img src="/blog/assets/img/lean-euclid-thm.png" style="display:block; margin:0 auto; max-width:80%; height:auto; border: 2px solid black;" />
  <figcaption>
    <p>Lean 4 proof of Euclid’s theorem on the infinitude of primes, in <a href="https://github.com/leanprover-community/mathlib4/blob/41c8cd5fe18250b622395f10bf74193f72b6863c/Mathlib/Data/Nat/Prime/Infinite.lean#L27-L38">Mathlib/Data/Nat/Prime/Infinite.lean</a>.</p>

  </figcaption>
</figure>

<p>In terms of tutorials, there are two excellent free online books: <a href="https://leanprover-community.github.io/mathematics_in_lean/index.html">Mathematics in Lean</a> and <a href="https://leanprover.github.io/theorem_proving_in_lean4/">Theorem Proving in Lean 4</a>.
If you are wondering which one to start with, here’s an excerpt from the former one (clarification mine):</p>
<blockquote>
  <p><em>Theorem Proving in Lean</em> is for people who prefer to read a user manual cover to cover before using a new dishwasher. If you are the kind of person who prefers to hit the <em>start</em> button and figure out how to activate the potscrubber feature later, it makes more sense to start here [Mathematics in Lean] and refer back to <em>Theorem Proving in Lean</em> as necessary.</p>
</blockquote>

<p>There’s also <a href="https://adam.math.hhu.de/#/g/leanprover-community/nng4">The Natural Number Game</a>, which is a game-like introduction to mathematical proofs in Lean.
If you want to follow a course, I strongly recommend Stanford’s <a href="https://perfect-math-class.leni.sh/">CS 99: Functional Programming and Theorem Proving in Lean 4</a>, but keep in mind that formalization of mathematics is not the only focus of the course.</p>

<p>Lean 4 has an impressive mathematical library, the <a href="https://github.com/leanprover-community/mathlib4?tab=readme-ov-file">mathlib</a> library, which includes a lot of fundamental and advanced mathematics.
There exist search tools such as
<a href="https://huggingface.co/spaces/delta-lab-ai/Lean-Finder">Lean finder</a> and
<a href="https://leansearch.net/">LeanSearch</a>
that allow for natural language, semantic search in the library, which can be extremely useful.</p>

<p>Other great resources include
the <a href="https://lean-lang.org/doc/reference/latest/">Lean reference manual</a>,
a <a href="https://lean-lang.org/learn/">list of resources</a> in the official Lean website,
the <a href="https://leanprover-community.github.io/index.html">Lean Community website</a>,
and
the <a href="https://leanprover-community.github.io/blog/">Official Lean blog</a>.</p>

<p>Finally, in terms of getting help, there’s the highly active <a href="https://leanprover.zulipchat.com/">Lean Zulip channel</a>, and keep in mind that Lean is a programming language, so tools like coding agents and general purpose LLMs can help with debugging, can provide examples, and can aid in digesting concepts.</p>

<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>But note that math as a human discipline deals with much more than mathematical statements. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>The recording is on <a href="https://www.youtube.com/live/svF-1ekPqjM?si=YhaL2iYW0pjWMo0c">YouTube</a>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[Proof assistants enable us to formalize mathematics: to express arbitrary mathematical statements and proofs without ambiguity, and to automatically verify their correctness. Here I give some arguments for why math formalization might soon become inevitable. I also include some resources on formalization for the interested reader.]]></summary></entry><entry><title type="html">Cover Learning for Topological Inference</title><link href="https://luisscoccola.github.io/blog/cover-learning/" rel="alternate" type="text/html" title="Cover Learning for Topological Inference" /><published>2025-04-01T00:00:00+00:00</published><updated>2025-04-01T00:00:00+00:00</updated><id>https://luisscoccola.github.io/blog/cover-learning</id><content type="html" xml:base="https://luisscoccola.github.io/blog/cover-learning/"><![CDATA[<p>The standard approach to Topological Inference is based on geometric complexes.
Most commonly, geometric complexes scale cubically (and often worse) in the number of data points, which poses a big problem.
Here I describe an alternative approach to Topological Inference.</p>

<p>For motivation and an intro to simplicial complexes, you can check
my previous post on 
<a href="/blog/topological-inference/">Topological Inference and Unsupervised Learning</a>.
The approach described in this post appears in our recent paper <a href="https://proceedings.mlr.press/v267/scoccola25a.html">[Scoccola et al. ICML]</a>;
for more details and references, it’s best to check there.</p>

<p><br /></p>

<div style="width: fit-content;
  border: 2px solid #ccc;
  background-color: #f0f7ff;
  padding: 1rem 2em 0.3rem 1rem;
  border-radius: 5px;">
  
<p><strong>Contents</strong></p>
<ul>
  <li><a href="#geometric-complexes">Geometric Complexes</a></li>
  <li><a href="#covers-and-nerve">Covers and Nerve</a></li>
  <li><a href="#cover-learning">Cover Learning</a></li>
</ul>

</div>
<p><br /></p>

<h1 id="geometric-complexes">Geometric Complexes</h1>

<p>Standard examples of geometric complexes include the Rips complex, the Čech complex, and the Alpha complex.
All of these are filtered (or weighted) simplicial complexes defined given an input metric space (assumed to be a subset of $\mathbb{R}^n$ in the latter two).</p>

<p><strong>Why geometric complexes?</strong>
Two main results make them particularly appealing from a theoretical point of view.
First, the geometric complexes of metric spaces that are similar (in the Gromov-Hausdorff sense) are necessarily similar<sup><a href="https://link.springer.com/article/10.1007/s10711-013-9937-z">[Chazal et al. Geom. Dedicata]</a></sup>.
Second, the geometric complex of a sufficiently well-behaved metric space (such as a compact manifold) encodes the topology of the space itself<sup><a href="https://www.degruyterbrill.com/document/doi/10.1515/9781400882588-013/html">[Hausmann. Prospects in Topology]</a></sup>.
These two results imply a consistency result for geometric complexes: When applied to a sufficiently good sample of a sufficiently well-behaved space, they encode the topology of the space, and can thus be used for Topological Inference<sup><a href="https://doi.org/10.1017/9781108297806">[Boissonnat et al. Cambridge Texts Appl. Math.]</a></sup>.</p>

<p><strong>Complexity and sparsification.</strong>
Unfortunately, geometric complexes are huge.
For example, to compute the homology of the Rips complex of a finite metric space $X$ up to dimension $m$, one needs to construct a simplicial complex with $\Theta(|X|^{m+2})$ simplices, and then perform Gaussian elimination on a matrix of that size!</p>
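<p>To get a feel for these sizes (the numbers below are illustrative choices of mine), here is the simplex count for a modest point cloud:</p>

```python
from math import comb

# Simplices of dimension <= m + 1 (i.e., with <= m + 2 vertices) in the
# full Rips complex on a point cloud, as needed for homology up to
# dimension m.
def rips_size(num_points, m):
    return sum(comb(num_points, d + 1) for d in range(m + 2))

# Already for |X| = 2000 points and homology in dimension m = 1,
# the complex has over a billion simplices.
print(rips_size(2000, 1))
```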

<p>Because of this, many sparsification techniques for geometric complexes (and simplicial complexes in general) have been proposed.
While these do improve computation time and memory to an extent,
scaling remains well above quadratic, and it is rare to be able to compute the homology of a high-dimensional point cloud with more than a few thousand points.</p>

<p><strong>The problem.</strong>
I believe that the root of the problem is that geometric complexes always take the data points as vertices,
and this will remain an issue for any simplicial complex construction that does so.

<h1 id="covers-and-nerve">Covers and Nerve</h1>

<p>Luckily, there is a great (and classical) source of simplicial complexes that does not use points as vertices: the nerve complex of a cover.</p>

<p>A <strong>cover</strong> of a set $X$ is a set of subsets of $X$ whose union is $X$.
The <strong>nerve</strong> of such a cover is a simplicial complex with vertices the subsets in the cover, and with higher dimensional simplices given by the non-empty intersections:</p>

<p style="text-align:center;">
  <img src="/blog/assets/img/cover-nerve.svg" alt="cover-nerve" style="width:100%" />
</p>
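<p>The nerve is straightforward to compute once a cover is given. Here is a minimal sketch (function name and input format are mine), with cover elements given as sets of point indices:</p>

```python
from itertools import combinations

def nerve(cover, max_dim=2):
    """Nerve of a cover (a list of sets) up to dimension max_dim.

    A tuple of cover-element indices is a simplex exactly when the
    corresponding cover elements have non-empty common intersection.
    """
    simplices = []
    for d in range(max_dim + 1):
        for idxs in combinations(range(len(cover)), d + 1):
            if set.intersection(*(cover[i] for i in idxs)):
                simplices.append(idxs)
    return simplices

# Three cover elements in a "chain": U0 meets U1 and U1 meets U2,
# so the nerve is a path with three vertices and two edges.
print(nerve([{0, 1}, {1, 2}, {2, 3}]))
# → [(0,), (1,), (2,), (0, 1), (1, 2)]
```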

<h1 id="cover-learning">Cover Learning</h1>

<p>In the context of Topological Inference, the nerve construction allows us to reduce the problem of constructing a simplicial complex to that of constructing a cover of the data.
If we construct a cover with $k$ elements, the output simplicial complex will have $k$ vertices, and if we have control over $k$, we can make it much smaller than the number of data points.
How to learn covers is an interesting problem in its own right, and is related to soft clustering.</p>

<p>In our <a href="https://proceedings.mlr.press/v267/scoccola25a.html">paper</a>, we approach it from the perspective of geometric optimization, by minimizing a certain loss function for covers that we design.
Here’s an example from the paper:</p>

<p style="text-align:center;">
  <img src="/blog/assets/img/cover-nerve-human.svg" alt="cover-nerve-human" style="width:60%" />
</p>

<p>Pretty cool tricks go into our approach, but what I want to emphasize here is not our particular method, but rather that:</p>
<p style="text-align: center;"><ins>Covers should be a central tool in Topological Inference</ins>.</p>

<p>I’ll conclude with some open questions:</p>
<ul>
  <li>Are there practical cover learning algorithms that are consistent even if one fixes the number of cover elements $k$?
That is, I want a practical cover learning algorithm whose output nerve is topologically correct in the limit of infinitely many sample points, and is restricted to have no more than $k$ vertices (with $k$ depending on the space the data is being sampled from).
This would stand in stark contrast to the consistency of geometric complexes, whose size is unbounded as the sample size grows.</li>
  <li>Are there approaches to cover learning that are simpler and more robust than the one we propose?</li>
  <li>What is the relationship between fuzzy covers and fuzzy clusterings? Are any of the standard fuzzy clustering algorithms topologically consistent?
What a fuzzy cover is, and why one would consider it, is explained in Section 3 of our <a href="https://proceedings.mlr.press/v267/scoccola25a.html">paper</a>.</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[The standard approach to Topological Inference is based on geometric complexes. Most commonly, geometric complexes scale cubically (and often worse) in the number of data points, which poses a big problem. Here I describe an alternative approach to Topological Inference.]]></summary></entry><entry><title type="html">Topological Inference and Unsupervised Learning</title><link href="https://luisscoccola.github.io/blog/topological-inference/" rel="alternate" type="text/html" title="Topological Inference and Unsupervised Learning" /><published>2025-02-01T00:00:00+00:00</published><updated>2025-02-01T00:00:00+00:00</updated><id>https://luisscoccola.github.io/blog/topological-inference</id><content type="html" xml:base="https://luisscoccola.github.io/blog/topological-inference/"><![CDATA[<p>Fundamental problems in classical Unsupervised Learning, such as Clustering and Dimensionality Reduction, can be fruitfully interpreted from the point of view of Topological Inference.
This is an introduction to this point of view.</p>

<p><br /></p>

<div style="width: fit-content;
  border: 2px solid #ccc;
  background-color: #f0f7ff;
  padding: 1rem 2em 0.3rem 1rem;
  border-radius: 5px;">
  
<p><strong>Contents</strong></p>
<ul>
  <li><a href="#topology">Topology</a></li>
  <li><a href="#topological-inference-and-persistent-homology">Topological Inference and Persistent Homology</a></li>
  <li><a href="#beyond-ph-topology-in-unsupervised-learning">Beyond Persistent Homology: Topological Inference in Unsupervised Learning</a>
    <ul>
      <li><a href="#topology-and-clustering">Topology and Clustering</a></li>
      <li><a href="#topology-and-dimensionality-reduction">Topology and Dimensionality Reduction</a></li>
    </ul>
  </li>
</ul>

</div>
<p><br /></p>

<h1 id="topology">Topology</h1>

<p><strong>Topology</strong> is the branch of mathematics that studies the properties of spaces that remain unchanged under continuous deformations.
For example, the number of connected components of a planar graph, such as $G_1$ below, is independent of how the graph is drawn: it does not depend, for example, on how long the edges are.
Mathematicians say that the number of connected components is a <strong>topological invariant</strong> of the graph.
On the other hand, the distance between a pair of vertices is not a topological invariant: it can change if the graph is drawn differently.</p>

<p style="text-align:center;">
  <img src="/blog/assets/img/graphs.svg" alt="graphs" style="width:100%" />
</p>

<p>The number of connected components of a topological space $X$ is one of the simplest topological invariants, and it is called the <strong>zeroth Betti number</strong> of the space, and denoted $\beta_0(X)$. 
The <strong>first Betti number</strong>, denoted $\beta_1(X)$, counts the number of “one-dimensional holes”.
For example, the space $S^1$ given by the circumference of a circle has a single one-dimensional hole $\beta_1(S^1) = 1$ and a single connected component $\beta_0(S^1) = 1$.</p>

<p style="text-align:center;">
  <img src="/blog/assets/img/spaces.svg" alt="spaces" style="width:100%" />
</p>

<p>Holes come in all dimensions: for instance, a two-dimensional sphere $S^2$, such as a basketball, has a two-dimensional hole $\beta_2(S^2) = 1$ (its “inner void”),
no one-dimensional holes $\beta_1(S^2)=0$,
and a single connected component $\beta_0(S^2) = 1$.
Another cool topological space is the surface of a donut (usually called a <em>torus</em>) denoted as $T$ in the picture above.</p>

<p>Topological spaces are continuous entities with potentially infinitely many points.
A <strong>simplicial complex</strong> is a special kind of topological space that can be described using a finite amount of data.
A simplicial complex is essentially an undirected hypergraph: like a graph, it can have vertices (called $0$-simplices) and edges (called $1$-simplices), but it can also have $n$-simplices for $n \geq 2$, with the requirement that every nonempty subset of a simplex is again a simplex.
For example, a $2$-simplex is a filled triangle, and a $3$-simplex is a filled tetrahedron.
Here are the three spaces above represented as simplicial complexes:</p>

<p style="text-align:center;">
  <img src="/blog/assets/img/simplicial-complexes.svg" alt="simplicial-complexes" style="width:100%" />
</p>
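<p>For complexes of dimension at most one (that is, graphs), the Betti numbers discussed above are easy to compute directly. Here is a small sketch (names are mine), using union-find for the components and the Euler-characteristic identity $\beta_1 = |E| - |V| + \beta_0$, which holds for graphs:</p>

```python
def betti_graph(num_vertices, edges):
    """Betti numbers of a graph (a 1-dimensional simplicial complex):
    beta_0 = number of connected components (via union-find),
    beta_1 = #edges - #vertices + beta_0 (number of independent cycles).
    """
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    beta0 = len({find(v) for v in range(num_vertices)})
    beta1 = len(edges) - num_vertices + beta0
    return beta0, beta1

# A triangle boundary (a "combinatorial circle"): one component, one loop.
print(betti_graph(3, [(0, 1), (1, 2), (2, 0)]))  # → (1, 1)
```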

<h1 id="topological-inference-and-persistent-homology">Topological Inference and Persistent Homology</h1>

<p>Topological Inference is about estimating topological properties of spaces from incomplete information.
A classical example is <a href="https://link.springer.com/article/10.1007/s00454-008-9053-2">[Niyogi et al. DGC]</a>, which provides a statistically consistent algorithm for estimating the Betti numbers of a manifold from a finite sample.
Topological Inference relates to the following keywords:</p>

<p><strong>Computational Topology</strong>: the algorithmic computation of topological invariants.</p>

<p><strong>Topological Data Analysis</strong>: theory and algorithms related to the usage of topological methods in data analysis, part of the broader fields of Geometric Data Science and Geometric Machine Learning.</p>

<p><strong>Persistent Homology</strong> (PH): a particular construction with applications in Topological Inference and abstract Mathematics.
Roughly speaking, PH provides a generalization of the concept of Betti number (sometimes called <strong>persistent Betti number</strong> or <strong>barcode</strong>) that applies to families of topological spaces, rather than to single topological spaces.</p>

<p>Here’s the typical toy example showcasing PH as a finite sample estimator for the Betti numbers of the circle $S^1$:</p>

<p style="text-align:center;">
  <figure id="PH">
    <img src="/blog/assets/img/PH.svg" alt="PH" style="width:100%" />
  </figure>
</p>

<p>Persistent Homology has had several successes in the analysis of data coming from scientific applications.
Cool examples include
PH as a feature for the classification of cells in subcellular spatial transcriptomics<sup><a href="https://www.nature.com/articles/s41586-024-07563-1">[Benjamin et al. Nature]</a></sup>,
and
PH as a means to detect and quantitatively describe center vortices in $SU(2)$ lattice gauge theory in a gauge-invariant way<sup><a href="https://journals.aps.org/prd/abstract/10.1103/PhysRevD.107.034501">[Sale et al. Phys. Rev. D]</a></sup>.
More examples can be found at the <a href="https://donut.topology.rocks/">DONUT database</a>.</p>

<h1 id="beyond-ph-topology-in-unsupervised-learning">Beyond PH: Topology in Unsupervised Learning</h1>

<p>Unsupervised Learning is the branch of Machine Learning concerned with unlabeled data, but exactly what this means will depend on whom you ask.
For example, should self-supervision, as done in the pretraining phase of large language models, be regarded as Unsupervised Learning?
For concreteness, let’s focus on a class of unsupervised learning techniques that is best represented by the following two classical problems:</p>

<ul>
  <li>
    <p>The <strong>clustering problem</strong>: the problem of grouping data points of an unlabeled dataset into meaningful clusters.</p>
  </li>
  <li>
    <p>The <strong>dimensionality reduction problem</strong>: the problem of representing an unlabeled dataset as a subset of a well-understood metric space, such as a Euclidean space, in a geometrically meaningful way.</p>
  </li>
</ul>

<p>The mathematical models underlying many of the approaches to these problems have a nice topological interpretation, as I now describe.
I thank <a href="https://github.com/lmcinnes">Leland McInnes</a> for various conversations on these topics.</p>

<h2 id="topology-and-clustering">Topology and Clustering</h2>

<p>Many clustering algorithms are in essence Topological Inference algorithms, designed to estimate the connected components of high-density regions of the sample distribution.
This is done by estimating connectivity structure, typically with a graph, and then taking connected components.</p>

<p><strong>A motivating example.</strong>
As a starting point, let’s consider the most standard clustering algorithm: $k$-means.
Informally, $k$-means assumes that the input data is sampled from a set of “blob-shaped” clusters, and seeks to assign each data point to its corresponding blob.
More formally, the algorithm is an instance of the expectation-maximization procedure applied to a Gaussian mixture model with equal isotropic covariances, equal priors, and hard assignments.
Mathematically, the model is the limit of the mixture $\frac{1}{k} \sum_{i=1}^k \mathcal{N}(\mu_i, \sigma^2\, \mathrm{I})$ as $\sigma^2 \to 0$.
The model gives a precise mathematical meaning to the clusters that one seeks: these are represented by the means $\mu_i$, which correspond to the modes of the distribution.
Thus, after estimating the means $\mu_i$, there is no need to estimate any connectivity structure directly, since points can be clustered by mapping them to their closest mode.</p>
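<p>To make the hard-assignment EM interpretation concrete, here is a minimal NumPy sketch of Lloyd’s algorithm, the standard way of fitting $k$-means. It is an illustration, not a production implementation: there is no empty-cluster handling and no careful initialization.</p>

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm: hard-assignment EM for the isotropic Gaussian
    mixture model in the limit where the variance goes to zero."""
    rng = np.random.default_rng(seed)
    # Initialize the means at k distinct data points.
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # E-step (hard assignment): send each point to its closest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: each mean becomes the centroid of its cluster.
        means = np.stack([X[labels == i].mean(axis=0) for i in range(k)])
    return means, labels
```

<p>After convergence, points are clustered by their closest mean, exactly as in the mode-mapping interpretation above.</p>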

<p>I like this interpretation of $k$-means because it naturally leads to density-based clustering, which can be roughly thought of as a non-parametric version of $k$-means.</p>

<p><strong>Density-based clustering.</strong>
In the simplest incarnation of density-based clustering, the data is assumed to be sampled from a distribution with probability density function $f : \mathbb{R}^n \to \mathbb{R}$, and the clusters are declared to be the “high-density regions” with respect to $f$.
This can be made precise in a few ways.
The simplest is to fix a threshold $\lambda \in \mathbb{R}$, and define the <strong>clusters at level</strong> $\lambda$ to be the connected components of the $\lambda$-<strong>superlevel set</strong> $\{x \in \mathbb{R}^n : f(x) \geq \lambda\}$, like so:</p>

<p style="text-align:center;">
  <img src="/blog/assets/img/density-based-clustering.svg" alt="density-based-clustering" style="width:60%" />
</p>

<p>A standard density-based clustering algorithm based on this principle is DBSCAN <sup><a href="https://dl.acm.org/doi/10.5555/3001460.3001507">[Ester et al. KDD]</a></sup>, which, as observed<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> in <a href="https://ieeexplore.ieee.org/document/8215642">[McInnes, Healy. ICDMW]</a>, can be recast as the combination of two main tools:</p>
<ol>
  <li>A kernel density estimator, from Statistics.</li>
  <li>A Rips complex, from Topological Inference.</li>
</ol>

<p>The <strong>Rips complex</strong> at scale $\varepsilon$ has as vertices the data points, and puts an edge between each pair of data points at distance at most $\varepsilon$.
The Rips complex already showed up <a href="#PH">earlier</a> in the figure illustrating PH.
Here’s the Rips complex of a tiny point cloud at a fixed scale $\varepsilon$:</p>

<p style="text-align:center;">
  <img src="/blog/assets/img/rips.svg" alt="rips" style="width:60%" />
</p>
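<p>The 1-skeleton of the Rips complex at scale $\varepsilon$ is just an adjacency matrix, so its connected components can be read off directly. A minimal sketch, using only NumPy and SciPy (quadratic in the number of points, for illustration only):</p>

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def rips_components(X, eps):
    """Connected components of the 1-skeleton of the Rips complex at
    scale eps: an edge joins each pair of points at distance <= eps."""
    adjacency = squareform(pdist(X)) <= eps
    np.fill_diagonal(adjacency, False)
    n_components, labels = connected_components(csr_matrix(adjacency), directed=False)
    return n_components, labels
```

<p>As $\varepsilon$ grows, edges are added and components merge, which is exactly the filtration underlying PH in degree zero.</p>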

<p>DBSCAN has two parameters, often denoted $\varepsilon$ and $k$.
The parameter $k$ plays the role of the threshold $\lambda$, above.
The parameter $\varepsilon$ is more interesting, and has two uses: as the width parameter for the kernel density estimator, and as the scale parameter for the Rips complex.
So $\varepsilon$ is used for both Statistical Inference (density properties of the data) and Topological Inference (connectivity properties of the data).</p>
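<p>This double use of $\varepsilon$ can be sketched in a few lines. The following is a toy DBSCAN*-style procedure, not the actual DBSCAN implementation: a uniform-kernel density estimate selects the high-density points, and the Rips graph on those points provides the clusters.</p>

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def dbscan_star(X, eps, k):
    """Toy DBSCAN*-style clustering: a uniform-kernel density estimate
    selects the high-density ("core") points, and the clusters are the
    connected components of the Rips graph on those points."""
    dist = squareform(pdist(X))
    neighbors = dist <= eps
    # Statistical use of eps: a point is "core" if at least k points
    # (itself included) lie within distance eps of it.
    core = neighbors.sum(axis=1) >= k
    # Topological use of eps: Rips graph at scale eps on the core points.
    adjacency = neighbors[np.ix_(core, core)]
    _, components = connected_components(csr_matrix(adjacency), directed=False)
    labels = np.full(len(X), -1)  # non-core points are labeled as noise
    labels[core] = components
    return labels
```
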

<p><strong>Hierarchical clustering.</strong>
If one does not want to, or does not know how to choose the threshold $\lambda$, one enters the realm of hierarchical clustering.
Going back to the probability density function $f : \mathbb{R}^n \to \mathbb{R}$, if one lets $\lambda$ vary from larger to smaller, the superlevel sets of $f$
are nested, meaning that the $\lambda_1$-superlevel set is included in the $\lambda_2$-superlevel set whenever $\lambda_1 \geq \lambda_2$.
By considering the connected components of all the superlevel sets, one obtains a hierarchical clustering, sometimes known as a <strong>cluster tree</strong>, which keeps track of how the connected components of the superlevel sets of $f$ appear and merge.
The cluster tree can be summarized using PH, allowing one to quantify the prominence of modes and high density regions, and to do cluster inference robustly.</p>

<p style="text-align:center;">
  <img src="/blog/assets/img/cluster-tree.svg" alt="cluster-tree" style="width:60%" />
</p>
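<p>On a density sampled on a one-dimensional grid, the components of a superlevel set are simply the maximal runs of grid points above the threshold, so the levels of the cluster tree can be computed directly. A toy sketch:</p>

```python
import numpy as np

def clusters_at_level(f, lam):
    """Number of connected components of the superlevel set
    {x : f(x) >= lam} of a density sampled on a 1D grid: the maximal
    runs of consecutive grid points where f >= lam."""
    above = f >= lam
    # A new component starts wherever `above` switches from False to True.
    starts = above & ~np.concatenate(([False], above[:-1]))
    return int(starts.sum())

# A bimodal density on a grid: two bumps of different heights.
x = np.linspace(-4, 4, 401)
f = np.exp(-(x + 2) ** 2) + 0.6 * np.exp(-(x - 2) ** 2)
```

<p>Sweeping $\lambda$ from high to low traces the cluster tree of this density: first only the taller mode is present, then both, and finally they merge into a single component.</p>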

<p>Similarly, one can fix one of the two parameters of DBSCAN and let the other one vary, to obtain a hierarchical clustering algorithm that consistently estimates the cluster tree<sup><a href="https://papers.nips.cc/paper_files/paper/2010/hash/b534ba68236ba543ae44b22bd110a1d6-Abstract.html">[Chaudhuri, Dasgupta. NeurIPS]</a><a href="https://dl.acm.org/doi/10.1145/2535927">[Chazal et al. JACM]</a></sup>.
This is, in essence, a version of the HDBSCAN algorithm, in case you have heard of it.</p>
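<p>To give a flavor of how such a hierarchy is computed, here is a sketch of the density-aware metric at the heart of HDBSCAN-style methods: for fixed $k$, one defines the core distance of a point as the distance to its $k$-th nearest neighbor, forms the mutual reachability distance, and runs single linkage on it, so that varying the scale parameter sweeps out the merge hierarchy. This is an illustration of the idea, not the actual HDBSCAN implementation.</p>

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

def mutual_reachability_linkage(X, k):
    """Single-linkage hierarchy on the mutual reachability distance,
    a density-aware metric used by HDBSCAN-style methods."""
    dist = squareform(pdist(X))
    # Core distance: distance to the k-th nearest neighbor (a density proxy).
    core = np.sort(dist, axis=1)[:, k]
    # Mutual reachability: max of d(a, b) and the two core distances.
    mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)
    # Varying the scale parameter now sweeps out the merge hierarchy.
    return linkage(squareform(mreach, checks=False), method="single")
```
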

<p>Even more, one can let both parameters of DBSCAN vary and obtain a two-parameter hierarchical clustering algorithm, which is also consistent and contains many standard hierarchical clusterings as “one-dimensional slices”.
This is introduced in <a href="https://jmlr.org/papers/v25/21-1185.html">[Rolle, Scoccola. JMLR]</a> and is implemented in our Python package <a href="https://github.com/LuisScoccola/persistable">persistable-clustering</a>.
It makes use of a density-sensitive generalization of the Rips complex from Topological Inference, called <strong>degree-Rips</strong>.</p>

<p style="text-align:center;">
  <img src="/blog/assets/img/degree-rips.svg" alt="degree-rips" style="width:80%" />
</p>

<h2 id="topology-and-dimensionality-reduction">Topology and Dimensionality Reduction</h2>

<p>Many non-linear dimensionality reduction algorithms also start with a Topological Inference phase, where a graph or simplicial complex is constructed, which is then used as a proxy for the space the data is assumed to be sampled from.</p>

<p><strong>A motivating example.</strong>
A classical example is Laplacian Eigenmaps (LE)<sup><a href="https://ieeexplore.ieee.org/document/6789755">[Belkin, Niyogi. Neural Comput.]</a></sup>, which starts by constructing a graph estimating the connectivity of the data.
This graph is then weighted using distance information, since the next step is to use it as a proxy for a manifold to estimate the Laplacian eigenfunctions, which are then used to find a low-dimensional embedding.
The weighting is, strictly speaking, Geometric Inference, since its goal is to estimate geometric properties.
It is only natural that Topological and Geometric Inference often show up together.</p>
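<p>The two phases — connectivity estimation, then geometric re-weighting and spectral embedding — can be sketched in a few lines. This is a toy version of LE (dense linear algebra, one particular choice of graph and normalization), not the reference implementation:</p>

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform

def laplacian_eigenmaps(X, eps, t, dim=1):
    """Toy Laplacian Eigenmaps: a connectivity graph at scale eps
    (Topological Inference), re-weighted by a heat kernel with width t
    (Geometric Inference), embedded using Laplacian eigenvectors."""
    dist = squareform(pdist(X))
    adjacency = (dist <= eps) & (dist > 0)
    W = np.where(adjacency, np.exp(-dist ** 2 / t), 0.0)
    # Graph Laplacian L = D - W; the eigenvectors of smallest eigenvalue
    # give the embedding (the constant eigenvector is skipped).
    L = np.diag(W.sum(axis=1)) - W
    _, eigenvectors = eigh(L)
    return eigenvectors[:, 1 : 1 + dim]
```

<p>On points sampled along a curve, the first non-constant eigenvector recovers the ordering along the curve.</p>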

<p><strong>Uniform Manifold Approximation and Projection</strong> (UMAP)<strong>.</strong>
A more modern non-linear dimensionality reduction algorithm is UMAP<sup><a href="https://arxiv.org/abs/1802.03426">[McInnes et al.]</a></sup>, which has become the go-to algorithm for lots of applications.
UMAP shares some similarities with LE, and in fact relies on LE for initialization,
but the theoretical justification for UMAP is more topological in nature, and uses weighted simplicial complexes in a crucial way.
In essence, the data is modelled as a probabilistic simplicial complex for which a low-dimensional embedding is found using stochastic gradient descent and several clever computational shortcuts.
I am skipping over lots of very interesting details here; besides the original paper, I recommend the <a href="https://umap-learn.readthedocs.io/en/latest/index.html">documentation</a> for the official <a href="https://github.com/lmcinnes/umap">implementation</a>, and other online resources, such as the interactive blog post <a href="https://pair-code.github.io/understanding-umap/">[Coenen, Pearce]</a>, from where I took the following illustrative example:</p>

<p style="text-align:center;">
  <img src="/blog/assets/img/coenen-pearce.png" alt="coenen-pearce" style="width:90%" />
</p>

<p>In practice, UMAP is robust to many types of noise and data imperfections, and it is an excellent starting point in many scenarios, including exploratory data analysis.
However, as opposed to the clustering techniques that I described <a href="#topology-and-clustering">earlier</a>, it does not come with theoretical guarantees (at least for the moment), and interpretability may not be straightforward.</p>

<p>Next, I’ll describe a topological dimensionality reduction procedure that is limited in scope and less robust, but that, thanks to this, admits theoretical guarantees and higher interpretability.</p>

<p><strong>The circular coordinates algorithm.</strong>
The motivating question is the following:
Given a space $X$ with a one-dimensional hole, does there exist a function $f : X \to S^1$ from the space to the circle that parametrizes this hole?</p>

<p>Such a function $f$ would provide us with a topologically faithful representation of $X$, at least when it comes to preserving its circularity.
For example, here are two functions from a “double-torus” to the circle, which parametrize two of its one-dimensional holes:</p>

<p style="text-align:center;">
  <img src="/blog/assets/img/circular-coordinates.png" alt="circular-coordinates" style="width:100%" />
</p>

<p>A standard theorem from topology (sometimes called the <em>representability theorem for cohomology</em>) guarantees that such a function $f$ always exists; moreover, if $X$ is a Riemannian manifold, then there is a canonical choice of such a function: the one with minimal Dirichlet energy.
The exact meaning of this is not crucial; what matters is that, given a finite sample of a space with a one-dimensional hole, a bit of Topological and Geometric Inference gives us back a map from the sample to the circle $S^1$.
This procedure is known as the <strong>circular coordinates algorithm</strong><sup><a href="https://dl.acm.org/doi/10.1145/1542362.1542406">[de Silva, Vejdemo-Johansson. SoCG]</a></sup>, and is implemented in our Python package <a href="https://github.com/scikit-tda/DREiMac">DREiMac</a>,
which was used, for instance, in Neuroscience<sup><a href="https://www.nature.com/articles/s41586-023-06031-6">[Schneider et al. Nature]</a></sup>.
Another interesting source of circularity in data is periodicity and quasiperiodicity in time series<sup><a href="https://link.springer.com/article/10.1186/s12859-015-0645-6">[Perea et al. BMC Bioinformatics]</a></sup>.</p>
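<p>To give a flavor of the algorithm, here is a sketch of its harmonic-smoothing step on a toy example where the graph and the integer cocycle are written down by hand (in the actual algorithm, the graph comes from the data and the cocycle is found using persistent cohomology): one minimizes the Dirichlet energy $\|z + d_0 x\|^2$ over real-valued $x$, and reduces mod $1$ to get the circle-valued map.</p>

```python
import numpy as np
from scipy.sparse.linalg import lsqr

# n sample points forming a cycle graph (a minimal "circular" dataset).
n = 20
edges = [(i, (i + 1) % n) for i in range(n)]

# An integer 1-cocycle representing the hole: 1 on the edge that wraps
# around, 0 elsewhere.
z = np.zeros(len(edges))
z[-1] = 1.0

# Coboundary matrix d0, with (d0 x) on edge (i, j) equal to x_j - x_i.
d0 = np.zeros((len(edges), n))
for e, (i, j) in enumerate(edges):
    d0[e, i], d0[e, j] = -1.0, 1.0

# Harmonic smoothing: minimize the Dirichlet energy ||z + d0 x||^2.
x = lsqr(d0, -z)[0]
theta = x % 1.0  # the circle-valued coordinate, with values in [0, 1)
```

<p>The harmonic representative $z + d_0 x$ spreads the winding evenly over the cycle, so $\theta$ advances by $1/n$ from each point to the next: the evenly spaced circular coordinate is recovered.</p>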

<p><strong>The toroidal coordinates algorithm.</strong>
Data can have more than one one-dimensional hole.
For instance, <a href="https://www.nature.com/articles/s41586-021-04268-7">[Gardner et al. Nature]</a>
shows that the population activity of grid cells (part of the neural system concerned with an individual’s position) resides on a topological torus.
When dealing with a sample from a space $X$ with more than one one-dimensional hole, that is, with $\beta_1(X) = n \geq 2$, the original circular coordinates algorithm often outputs circle-valued maps that are “geometrically correlated”, and thus harder to interpret.
In <a href="https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SoCG.2023.57">[Scoccola et al. SoCG]</a>, we make formal the notion of geometric correlation for circle-valued maps using the Dirichlet form, which endows the set of circle-valued maps with the structure of a lattice:</p>

<p style="text-align:center;">
  <img src="/blog/assets/img/dirichlet-form.png" alt="dirichlet-form" style="width:60%" />
</p>

<p>In that paper, we describe the <strong>toroidal coordinates algorithm</strong>, which enhances the circular coordinates algorithm by “decorrelating” the circle-valued maps.
Informally, this decorrelation is an analogue of PCA for maps into a product of circles (rather than maps into a product of real lines).
The difficulty is that a product of circles is not a vector space, so the projection matrix is constrained to have integer entries.
We addressed this using <em>lattice reduction</em>, a method from Computational Number Theory:</p>

<p style="text-align:center;">
  <img src="/blog/assets/img/lattice-reduction.png" alt="lattice-reduction" style="width:100%" />
</p>

<p>The toroidal coordinates algorithm is also implemented in <a href="https://github.com/scikit-tda/DREiMac">DREiMac</a>.
There also exist other topological dimensionality reduction algorithms, which parametrize topological features other than circularity, such as the one in <a href="https://link.springer.com/article/10.1007/s41468-023-00141-w">[Schonsheck, Schonsheck. JACT]</a>.</p>

<p><strong>What’s next?</strong>
In the world of Dimensionality Reduction, the gap between theoretical guarantees and practical methods is much wider than in Clustering, where, as described <a href="#topology-and-clustering">earlier</a>, there exist practical methods with strong topology recovery guarantees.
Non-linear dimensionality reduction methods with theoretical guarantees have limited scope and tend not to be very robust, while general purpose methods that are efficient and robust tend to be justified by heuristics.
This is not surprising: Dimensionality Reduction is more complicated than Clustering, and higher-dimensional topological invariants are more complicated than connected components.</p>

<p>The practical Dimensionality Reduction method with
perhaps the strongest available guarantees is Laplacian Eigenmaps, for which geometric consistency has been established in <a href="https://proceedings.neurips.cc/paper/2006/hash/5848ad959570f87753a60ce8be1567f3-Abstract.html">[Belkin, Niyogi. NeurIPS]</a>.
To the best of my knowledge, the topological properties of the LE embedding are not well understood beyond connected components.</p>

<p>What would be really nice to see is any of the following:</p>
<ul>
  <li>Methods that lie in-between UMAP and the topological parametrization methods, and which enjoy the good properties of both.
This could be, for example, making UMAP more interpretable or making the topological parametrization methods more widely applicable and robust.</li>
  <li>Theoretical guarantees for at least some parts of the UMAP pipeline or a suitably modified pipeline.</li>
  <li>In particular, I would like to see topological consistency results for the nearest neighbor graph construction.
While a lot is known about the Rips complex, its sibling, the nearest neighbor graph (ubiquitous in non-linear dimensionality reduction methods), has been mostly neglected when it comes to Topological Inference.
The only related work that I know of is <a href="https://www.aimsciences.org/article/doi/10.3934/fods.2019001">[Berry, Sauer. FoDS]</a>, which establishes geometric consistency.</li>
</ul>
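<p>For concreteness, here is how the symmetrized nearest neighbor graph is typically built, sketched with SciPy’s k-d tree. Unlike the Rips graph at a fixed scale $\varepsilon$, its connectivity adapts to the local sampling density, which is part of what makes its topological analysis harder.</p>

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import cKDTree

def knn_graph_components(X, k):
    """Connected components of the symmetrized k-nearest-neighbor graph."""
    tree = cKDTree(X)
    # Each point is returned as its own 0-th neighbor; skip it.
    _, idx = tree.query(X, k=k + 1)
    n = len(X)
    adjacency = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), k)
    adjacency[rows, idx[:, 1:].ravel()] = True
    adjacency |= adjacency.T  # symmetrize the directed kNN relation
    return connected_components(csr_matrix(adjacency), directed=False)[0]
```
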

<p>If you know of interesting work in these directions, feel free to contact me!</p>

<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>To be fully precise, it is the algorithm DBSCAN* that exactly has that form. DBSCAN* is a minor modification of DBSCAN introduced in <a href="https://link.springer.com/chapter/10.1007/978-3-642-37456-2_14">[Campello et al. KDD]</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[Fundamental problems in classical Unsupervised Learning, such as Clustering and Dimensionality Reduction, can be fruitfully interpreted from the point of view of Topological Inference. This is an introduction to this point of view.]]></summary></entry></feed>