Absolute Zero Reasoner: Walkthrough, Implementation and No Jargon | Work at a Frontier Lab
How to make your LLM learn math and code using *no* data. A no-jargon deep dive into AZR — the paper that eliminates alignment data by having the model propose and solve its own problems.
Pretraining: Where LLMs learn to predict the next token using cross entropy loss. This creates a model that can only predict tokens — not useful for applications like chatting.
Finetuning: LLMs still predict next tokens, but now with chat-oriented data where the model learns to answer questions. This stage provides the "chatty-ness."
Alignment: Making sure the LLM doesn't say anything harmful and aligns with desired values. No ground truth exists here, so rewards train the model using RL algorithms. Sometimes stages 2 and 3 are combined. The recent sycophancy issue with OpenAI models stemmed from this stage.
This article covers stages 2 and 3. We'll explain what AZR does, why it's important, and walk through a minimal implementation on math problems.
PPO (Proximal Policy Optimization): Generates sequences from the LLM, then computes an advantage over a learned baseline using rewards. Very unstable and not sample efficient. The classic approach, but painful to get right.
DPO (Direct Preference Optimization): Similar goal to PPO, but trained directly on human preference data — labels showing sequence A was preferred over sequence B. These labels provide sufficient signal without explicit rewards, reward-model training, or advantage computation. Much simpler to implement.
GRPO (Group Relative Policy Optimization): A modification of PPO. Instead of a value head estimating the expected reward, GRPO averages the rewards in the current batch (or group) as a proxy for it. This eliminates the value head entirely, making the whole process much cheaper.
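As a concrete illustration, here is a minimal sketch of a group-relative advantage. The function name and the exact normalization are my assumptions, not code from any particular library:

```python
# Sketch of a GRPO-style advantage: reward minus the group mean,
# scaled by the group's standard deviation. No value head needed.
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # -> [1.0, -1.0, 1.0, -1.0]
```

All samples in a group answer the same prompt, so the group mean is a cheap stand-in for the expected reward the value head would have predicted.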
RLVR (Reinforcement Learning with Verifiable Rewards) provides rewards without any extra labeling work, reward model, or external signal. The approach is beautifully simple: just run the damn sequence.
There are three main ways to verify a sequence:
Format verification: If you asked for a number between XML tags, verify the format with regex. Did the model follow instructions? Binary check.
Math verification: Many math problems have a single float answer. Check if the LLM's output matches the ground truth answer from the dataset.
Code verification: Run the code with given inputs and check if the outputs match expected results. The code either works or it doesn't.
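The three checks above can be sketched in a few lines. The `<answer>` tag convention and the function names are my own illustration, not the paper's harness:

```python
import re

def verify_format(text: str) -> bool:
    """Binary check: did the model wrap a number in the requested tags?"""
    return re.search(r"<answer>\s*-?\d+(\.\d+)?\s*</answer>", text) is not None

def verify_math(text: str, ground_truth: float, tol: float = 1e-6) -> bool:
    """Extract the tagged number and compare it to the dataset answer."""
    m = re.search(r"<answer>\s*(-?\d+(?:\.\d+)?)\s*</answer>", text)
    return m is not None and abs(float(m.group(1)) - ground_truth) < tol

def verify_code(fn, inputs: tuple, expected) -> bool:
    """Run the code on the given inputs; it either matches or it doesn't."""
    try:
        return fn(*inputs) == expected
    except Exception:
        return False
```

Each verifier returns a plain boolean, which maps directly onto a binary reward for the RL update.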
AZR removes the need for alignment data entirely by having the model create its own training data. The beauty of the approach: it needs only one simple identity function as a seed, then learns to solve problems and to propose increasingly complex ones.
The model plays two roles:
Proposer: Generates new problems
Solver: Attempts to solve them
This is self-play — the model improves by competing against itself.
Non-deterministic code is excluded. Code using modules like random is hard to verify — you can't check correctness if the output changes every run. Such code is considered invalid.
Programs must halt within a time limit. Code that runs too long (infinite loops, exponential complexity) gets killed. Even otherwise-correct code that exceeds the time limit is treated as invalid — a pragmatic compromise.
First iteration is self-contained. Initial generated code shouldn't use external functions — implement everything from scratch. This constraint relaxes in later iterations, allowing some helper functions inside generated functions. See the config yaml for details.
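The first two constraints are easy to picture in code. Below is a minimal sketch assuming a subprocess-based sandbox; the function names and the two-run determinism test are my own, and the official implementation is considerably more careful:

```python
# Sketch of two validity checks: a hard time limit and a determinism test.
import subprocess
import sys

def run_snippet(code: str, timeout_s: float = 5.0):
    """Execute a candidate program in a subprocess; kill it past the limit."""
    try:
        out = subprocess.run([sys.executable, "-c", code],
                             capture_output=True, text=True, timeout=timeout_s)
        return out.stdout if out.returncode == 0 else None
    except subprocess.TimeoutExpired:
        return None  # too slow -> invalid

def is_deterministic(code: str) -> bool:
    """Run twice: if outputs differ (e.g. the code uses `random`), reject."""
    a, b = run_snippet(code), run_snippet(code)
    return a is not None and a == b
```

Running the program twice is a crude but effective filter: code whose output changes between runs cannot serve as a verifiable problem.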
This is the most interesting bit. All AZR needs is one identity function to start populating the seed dataset.
The identity function code is beautifully minimal — from this tiny seed, the model bootstraps an entire curriculum of increasingly complex problems.
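Concretely, the seed can be stored as a single (program, input, output) triplet, with the output obtained by actually executing the program. The representation below is my sketch, not the repository's exact data format:

```python
# The entire seed dataset: one identity function and one example input.
seed_program = "def f(x):\n    return x"
seed_input = "Hello"

# Execute the program to obtain the ground-truth output for the triplet.
namespace = {}
exec(seed_program, namespace)
seed_output = namespace["f"](seed_input)

seed_triplet = (seed_program, seed_input, seed_output)
print(seed_triplet[2])  # -> Hello
```

Everything else — every harder program, input, and output — is bootstrapped from triplets like this one.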
The seed generation subroutine uses a specific prompt to create the initial functions. Each generated program is checked for validity: can it run? Are the inputs and outputs well-formed? The function is executed on the given input, and the output is verified to match expectations.
The main loop alternates between proposing problems and solving them. Each step populates batches with different problem types (deduction, induction, abduction) to ensure healthy samples from each group — each group's average reward then serves as a meaningful baseline, which stabilizes training.
Special care is taken for induction tasks (input/output given, need to find the function):
A message accompanies the code making it easier for the LLM to understand what's expected.
n test cases are generated, but only n/2 are revealed; the rest are held out to check the proposed function against unseen cases. This prevents the model from simply memorizing test cases.
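The held-out split itself is simple — a minimal sketch, with naming of my own:

```python
# Split induction test cases: reveal the first half, hold out the rest
# so the proposed function is judged on unseen input/output pairs.
def split_test_cases(cases: list) -> tuple[list, list]:
    half = len(cases) // 2
    return cases[:half], cases[half:]  # (visible, hidden)

visible, hidden = split_test_cases([(0, 0), (1, 2), (2, 4), (3, 6)])
print(visible)  # -> [(0, 0), (1, 2)]
print(hidden)   # -> [(2, 4), (3, 6)]
```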
AZR uses REINFORCE++ with a task-specific advantage computation called Task-Relative REINFORCE++ (TRR++).
The key idea: advantage computation needs healthy data for each task and role to produce signal-rich gradients. This gives us 6 groups total (2 roles x 3 tasks), and advantage is computed relative to each group's baseline rather than globally.
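A minimal sketch of that per-group baseline follows. The six (role, task) groups come from the paper; the code structure and names are my own:

```python
# Task-Relative REINFORCE++-style advantages: one baseline per
# (role, task) group instead of a single global baseline.
from collections import defaultdict

def trr_advantages(samples: list[tuple[str, str, float]]) -> list[float]:
    """samples: (role, task, reward) tuples, e.g. ("solve", "deduction", 1.0).
    Advantage = reward minus the mean reward of that sample's group."""
    groups = defaultdict(list)
    for role, task, r in samples:
        groups[(role, task)].append(r)
    baselines = {k: sum(v) / len(v) for k, v in groups.items()}
    return [r - baselines[(role, task)] for role, task, r in samples]

adv = trr_advantages([("solve", "deduction", 1.0),
                      ("solve", "deduction", 0.0),
                      ("propose", "induction", 0.5)])
print(adv)  # -> [0.5, -0.5, 0.0]
```

Because propose and solve rewards live on different scales, comparing each sample only against its own group keeps the gradient signal meaningful for all six groups.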
Anything verifiable can use RL. minAZR applies the same self-play concept to math: problems are proposed then solved using simple GRPO (not TR++).
The implementation is highly sequential and not optimized for performance — but that's not the point. The goal is conveying the key concepts and encouraging you to apply RL to anything where you can verify the output.
I hope this walkthrough helps you understand not just what AZR does, but why each design choice was made. The paper and codebase are both worth reading in full if you want to implement this yourself.
All images referenced come from the original paper and official code repository.