In my previous post, we explored the symbolic deduction engine DD+AR. It excels at logical, step-by-step reasoning over the information it already has. But as any geometry student knows, some proofs hinge on those clever extra lines or points you add – auxiliary constructions. These are tough for deduction engines alone.
So what are auxiliary constructions?
Auxiliary Constructions
Sometimes, directly proving a theorem is tricky. Imagine trying to build a complex bridge – sometimes, adding temporary support structures makes it much easier. Auxiliary constructions act as these ‘support structures’ in theorem proving.
These are additional elements you introduce temporarily into your proof. They could be:
- Extra Lines: Drawing a new line within a geometric figure can reveal hidden relationships.
- Extra Points: Adding a point might create helpful triangles or other shapes.
- Extra Variables: In algebraic proofs, a new variable could simplify complex equations.
Why Use Them?
- Breaking It Down: A complex theorem might become simpler by proving smaller ones based on your auxiliary constructions.
- Revealing Hidden Patterns: New lines, points, etc., can highlight relationships within the problem that weren’t obvious before.
- Providing Stepping Stones: Sometimes, the auxiliary construction provides the crucial link connecting your starting point to the theorem you want to prove.
The Problem:
- Our current theorem prover (DD+AR) is good at logical deduction – making step-by-step inferences from existing information.
- However, complex proofs (like those seen in Olympiads) often require adding new elements (‘auxiliary constructions’) that aren’t immediately obvious. This is hard for deduction-focused systems.
This is where AlphaGeometry gets innovative. It employs a large language model (LLM) that learns to generate helpful auxiliary constructions. In the last post, a synthetic training example consisted of (premise, conclusion, proof); the missing piece was teaching the model to generate new proof terms of its own.
Generating Auxiliary Construction Dataset
To focus on auxiliary constructions, they fine-tuned the model on a specialised dataset. First, they ran a premise through the DD+AR engine to get a deduction closure (the graph of everything deducible from that premise). Then, a traceback algorithm found the dependency subgraph for each conclusion – this maps all the logical steps and premises needed to reach that conclusion. The key insight was to compare the objects used in that dependency subgraph with the objects the conclusion statement actually mentions. Anything the proof needs but the conclusion never defines is a ‘gap’ the model can learn to fill with auxiliary constructions.
Let’s look at the example from the paper itself. Below is one of the sampled premises, along with the graph used to reach the conclusion.

These are (roughly) the steps used to generate one training example:
- They sample a random theorem premise.
- Use DD+AR to obtain its deduction closure.
- This returns a DAG. For each node, they trace back to the leaves, i.e. the premises that node depends upon (its dependency subgraph).
- The conclusion HA ⊥ BC returns the green subgraph.
- In the above example, the points E and D are part of the proof despite being irrelevant to the construction of HA and BC. These points are therefore what the model learns to produce as auxiliary constructions.
The auxiliary constructions sampled this way were moved into the proof, and a single training example was created with the structure (P, N, G(N)) → <premise><conclusion><proof>. A rough sketch of the whole procedure is shown below.
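To make that ‘gap’ idea concrete, here is a small, self-contained Python sketch. The tiny deduction DAG, the statement strings and the point sets in it are all invented for illustration – this is not the paper’s code, just the set-difference idea it describes.

```python
# Toy sketch: trace a conclusion back through a deduction DAG, then treat any
# point the proof uses but the conclusion never mentions as an auxiliary
# construction. Everything below is illustrative, not the paper's pipeline.

# Each statement maps to the statements it was deduced from; premises map to [].
deduction_dag = {
    "HA perp BC": ["HD perp BC", "A, H, D collinear"],
    "HD perp BC": ["D is the foot of H on BC"],
    "A, H, D collinear": ["E is the midpoint of AH", "D is the foot of H on BC"],
    "D is the foot of H on BC": [],
    "E is the midpoint of AH": [],
}

# Which points each statement mentions (again, purely illustrative).
points_in = {
    "HA perp BC": {"H", "A", "B", "C"},
    "HD perp BC": {"H", "D", "B", "C"},
    "A, H, D collinear": {"A", "H", "D"},
    "D is the foot of H on BC": {"D", "H", "B", "C"},
    "E is the midpoint of AH": {"E", "A", "H"},
}

def traceback(conclusion):
    """Collect the dependency subgraph: every statement the conclusion rests on."""
    needed, stack = set(), [conclusion]
    while stack:
        stmt = stack.pop()
        if stmt not in needed:
            needed.add(stmt)
            stack.extend(deduction_dag[stmt])
    return needed

conclusion = "HA perp BC"
subgraph = traceback(conclusion)

proof_points = set().union(*(points_in[s] for s in subgraph))
conclusion_points = points_in[conclusion]

# Points the proof needs but the conclusion never defines become the
# auxiliary constructions the language model must learn to generate.
auxiliary_points = proof_points - conclusion_points
print(sorted(auxiliary_points))  # ['D', 'E'] in this toy example
```

In the real pipeline, the statements defining D and E would then be moved out of the premise and into the proof, giving the <premise><conclusion><proof> example described above.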
Training the Model
The model was trained on a massive dataset of 100 million proofs, including simple ones solvable by straightforward deduction using the DD + AR engine. This builds a broad foundation of proof understanding.
Then, the model was further trained on a smaller subset (9 million proofs) specifically focused on problems requiring auxiliary constructions. This helps the model specialize in creating those critical ‘missing’ steps in proofs.
Model Architecture
- Transformer-Based: At its core, they use a transformer model, a powerful language processing architecture.
- Size: 12 layers, 151 million parameters (excluding the embedding layers)
- Tokenizer: SentencePiece in ‘word’ mode, with a vocabulary size of 757 (these numbers are collected in the sketch below).
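For reference, here is an illustrative config object collecting those numbers. The class and field names are my own, not the paper’s or Meliad’s API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlphaGeometryLMConfig:
    num_layers: int = 12              # transformer depth
    vocab_size: int = 757             # SentencePiece tokenizer, 'word' mode
    approx_params: int = 151_000_000  # as reported, excluding embedding layers

config = AlphaGeometryLMConfig()
print(config)
```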
Training Setup
- Hardware: TPUv3 hardware for acceleration.
- Batch Size: 16 per TPU core.
- Learning Rate: Cosine schedule, decaying from 0.01 to 0.001 (sketched after this list).
- Sequence Packing: Multiple proofs are packed into each training sequence, since more than 90% of proofs are under 200 tokens.
- Pretraining: Initial training on a massive dataset for 10,000,000 steps.
- Fine-tuning: Further training on auxiliary construction data for 1,000,000 steps at a constant 0.001 learning rate.
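As a small illustration of the learning-rate schedule above, here is the textbook cosine decay from 0.01 down to 0.001 over the 10,000,000 pretraining steps – just the standard formula, not code from the paper.

```python
import math

def cosine_lr(step, total_steps=10_000_000, lr_max=0.01, lr_min=0.001):
    """Standard cosine decay from lr_max down to lr_min over total_steps."""
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0))           # 0.01 at the start of pretraining
print(cosine_lr(5_000_000))   # 0.0055 halfway through
print(cosine_lr(10_000_000))  # 0.001 at the end; fine-tuning then holds 0.001
```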
Key Points
- Meliad Library: They leverage an existing library for transformer training. This library is specifically designed for recursive and compositional structures, i.e. trees.
- No Hyperparameter Tuning: Emphasis is placed on the training data itself, not intricate tweaking of settings.
Inference
This is how inference works in AlphaGeometry:
The Problem: The system is given a geometry theorem to prove.
Creative Thinker (Language Model):
- It reads the problem and suggests an auxiliary construction.
- It gets better as the proof progresses: the constructions made so far and the problem statement are fed into the transformer as context.
Logical Expert (Symbolic Deduction Engine):
- It takes the new construction and tries to make deductions.
- With new information, it might get closer to proving the theorem.
Repeat Until Solved (or Time’s Up): Both the symbolic engine and the transformer take turns in this setup. The transformer generates one extra sentence at each turn.
At each turn the language model actually returns K different sequences, describing K alternative auxiliary constructions, and a beam search is performed over these options.
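Putting the loop together, here is a high-level Python sketch of the alternation. The `language_model` and `ddar` objects and their `propose`/`deduce`/`proves` methods are placeholders of my own, not the paper’s actual API.

```python
def prove(problem, language_model, ddar, max_turns=16):
    """Alternate between symbolic deduction and LM-proposed constructions."""
    constructions = []                           # auxiliary constructions so far
    state = ddar.deduce(problem, constructions)  # DD+AR on the raw premises first

    for _ in range(max_turns):
        if state.proves(problem.conclusion):
            return state.proof()                 # solved with what we have

        # The problem statement plus all constructions made so far are fed to
        # the transformer as context; it emits one new construction per turn.
        new_construction = language_model.propose(problem, constructions)
        constructions.append(new_construction)

        # Re-run DD+AR with the enlarged premise set.
        state = ddar.deduce(problem, constructions)

    return None                                  # out of turns / time budget
```

In the actual system the model proposes K candidates per turn rather than one, which is where the beam search described next comes in.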
How It Works
- Multiple Ideas: The language model gives several different construction ideas (K of them).
- Beam Search Begins: Instead of blindly trying them all, beam search maintains a sort of ‘leaderboard’ of the top K most promising proof paths.
- Ranking Paths: Each possible path (beam) has a score. This might be based on factors like:
  - how well the construction fits past successful proofs, and
  - whether the deduction engine makes progress when the construction is added.
- Unfortunately, nothing more is discussed about this value function.
- Expanding the Search: The top-ranked beams are kept and extended with further variations, while lower-scoring options are abandoned (a rough sketch of this search follows below).
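Here is a hedged sketch of what that search might look like. The components (`language_model.propose_k`, `ddar.deduce`) are again placeholders, and the score is taken to be the LM’s own log-probability for each construction, since the paper does not spell out the value function.

```python
import heapq

def beam_search_prove(problem, language_model, ddar,
                      beam_size=512, K=32, max_depth=16):
    """Keep the top `beam_size` partial proofs; expand each with K LM proposals."""
    beams = [(0.0, [])]  # (cumulative score, constructions chosen so far)

    for _ in range(max_depth):
        candidates = []
        for score_so_far, constructions in beams:
            # Ask the LM for K alternative next constructions, with their scores.
            for construction, lm_score in language_model.propose_k(problem, constructions, K):
                extended = constructions + [construction]
                state = ddar.deduce(problem, extended)
                if state.proves(problem.conclusion):
                    return state.proof()          # a completed proof ends the search
                candidates.append((score_so_far + lm_score, extended))

        # Keep only the highest-scoring paths; lower-scoring options are abandoned.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])

    return None
```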
Inference runs in a highly parallelised setup with 4 V100 GPU workers and 10,000 CPU workers.
The Findings and Conclusion
The system was evaluated on IMO-AG-30, a test set of 30 classical geometry problems adapted from the International Mathematical Olympiad (IMO) since the year 2000.
These are the key findings after running the inference:
1. AlphaGeometry Outperforms:
- It solves the most problems (25) on the IMO-AG-30 benchmark.
- This surpasses previous state-of-the-art and strong baselines that used combinations of symbolic deduction, algebraic reasoning, and even human-designed hints.
- Without the transformer it is far worse: the symbolic engine alone solves only 14 problems.
- Using the transformer for auxiliary constructions solves another 11 – the difference is night and day.
2. AlphaGeometry is Robust:
- It remains effective with just 20% of training data.
- Even with a much smaller beam size (8 instead of 512), performance stays strong.
3. Performance on Diverse Tests:
- AlphaGeometry maintains its superiority on a larger test set (231 problems).
- It solves an impressive 98.7% of those problems.
- Interestingly, it conquers difficult IMO problems from 2000 and 2015.
4. Unexpected Discovery:
- AlphaGeometry revealed a more general version of a translated IMO theorem by finding an unused premise.
I intended to write this as a single blog post, but midway through I turned it into a two-part series, since there were many things to discuss. AlphaGeometry is a beautiful case study in how an LLM can be incorporated into a traditional AI system, and it performs quite well. I think we will see this approach much more going forward. We truly do live in exciting times.
Until next time. Peace!