Absolute Zero Reasoner: Walkthrough, Implementation and No Jargon | Work at a Frontier Lab

Absolute Zero Reasoner: Walkthrough, Implementation and No Jargon#

Absolute Zero Reasoner cover

Introduction#

There are 3 stages of LLM training:

Pretraining: Where LLMs learn to predict the next token using cross entropy loss. This creates a model that can only predict tokens — not useful for applications like chatting.
Finetuning: LLMs still predict next tokens, but now with chat-oriented data where the model learns to answer questions. This stage provides the "chatty-ness."
Alignment: Making sure the LLM doesn't say anything harmful and aligns with desired values. No ground truth exists here, so rewards train the model using RL algorithms. Sometimes stages 2 and 3 are combined. The recent sycophancy issue with OpenAI models stemmed from this stage.

This article covers stages 2 and 3. We'll explain what AZR does, why it's important, and walk through a minimal implementation on math problems.

Current Way of Aligning LLMs#

The standard approach uses preference data — lots of it — showing which outputs were preferred by humans. Then you apply one of these algorithms:

PPO (Proximal Policy Optimization)#

Generates sequences from the LLM, then computes advantage over a baseline using rewards. Very unstable and not sample efficient. The classic approach, but painful to get right.

DPO (Direct Preference Optimization)#

Similar goal to PPO but trained directly on human preference data — labels showing sequence A was preferred over sequence B. These labels provide sufficient signal without explicit rewards, reward model training, or advantage computation. Much simpler to implement.

GRPO (Group Relative Policy Optimization)#

A modification of PPO. Instead of a value head computing the average expected reward, GRPO averages the rewards in the current batch (or group) as a proxy for expected reward. This eliminates the value head entirely, making the whole process much cheaper.

Recent Methods Where No Data is Needed#

Papers like AZR (Absolute Zero: Reinforced Self-play Reasoning with Zero Data) and Reinforcement Learning for Reasoning in Large Language Models with One Training Example address this bottleneck head-on.

The key enabler is RLVR — Reinforcement Learning with Verifiable Rewards.

What is RLVR?#

RLVR provides rewards without any extra labeling work from a reward model or external signal. The approach is beautifully simple: just run the damn sequence.

There are three main ways to verify a sequence:

Format verification: If you asked for a number between XML tags, verify the format with regex. Did the model follow instructions? Binary check.
Math verification: Many math problems have a single float answer. Check if the LLM's output matches the ground truth answer from the dataset.
Code verification: Run the code with given inputs and check if the outputs match expected results. The code either works or it doesn't.

Deep Dive of AZR#

What Does AZR Do?#

AZR removes the need for alignment data entirely by having the model create its own training data. The beauty of the approach: it only needs one simple identity function as a seed, then learns to solve problems and proposes increasingly complex ones.

The model plays two roles:

Proposer: Generates new problems
Solver: Attempts to solve them

This is self-play — the model improves by competing against itself.

AZR algorithm overview

The Algorithm#

The algorithm works for code data but the principle applies to anything verifiable (e.g., minAZR does the same for math).

Three tasks derive from a single piece of code:

Abduction: Given function and outputs, find the inputs. ("What input produces this output?")
Deduction: Given function and input, find the outputs. ("What does this function return?")
Induction: Given input and output, find the function. ("What function maps this input to this output?")

All three are verifiable by simply running the code. No human labels needed.

Three reasoning tasks: abduction, deduction, and induction

The Execution Environment#

A robust execution environment is crucial for AZR to work. The repository provides a strong Python execution environment that validates whether strings are valid Python programs, then executes and captures output.

Code executor implementation

Important considerations for verifiable programs:

Non-deterministic code is excluded. Code using modules like random is hard to verify — you can't check correctness if the output changes every run. Such code is considered invalid.
Programs must halt within a time limit. Code that runs too long (infinite loops, exponential complexity) gets killed. Valid code exceeding the time limit is considered valid for now — a pragmatic compromise.
First iteration is self-contained. Initial generated code shouldn't use external functions — implement everything from scratch. This constraint relaxes in later iterations, allowing some helper functions inside generated functions. See the config yaml for details.

Code validation checks

Generating Seed Data#

This is the most interesting bit. All AZR needs is one identity function to start populating the seed dataset.

The identity function code is beautifully minimal — from this tiny seed, the model bootstraps an entire curriculum of increasingly complex problems.

The identity function seed

The seed generation subroutine generates initial seed data using a specific prompt to create functions. Each output code is checked for validity — can it run? Are the inputs/outputs well-formed? The function is executed using given input via the constructor, and the output is verified to match expectations.

Seed data validation logic

Main Training Loop#

Main training loop diagram

The main loop alternates between proposing problems and solving them. Each step populates batches with different problem types (deduction, induction, abduction) to ensure healthy samples from each group — this gives good average rewards per group for stable training.

Training loop code

Special care is taken for induction tasks (input/output given, need to find the function):

A message accompanies the code making it easier for the LLM to understand what's expected.
n test cases are generated but only n/2 are passed to the proposer — the rest are held out to test the proposed code. This prevents the model from memorizing test cases.

Rollout generation handles building these batches and feeding them to the trainer.

Reward Function#

For the solver: Full reward if the output is correct. Simple and clean.

Solver reward formula

For the proposer: This is where it gets clever. The proposer's reward is linked to the solver's performance:

If ALL questions in a batch are solved → proposer gets 0 reward. (Problems are too easy — nothing to learn from.)
If NO questions in a batch are solved → proposer gets 0 reward. (Problems are too hard — also nothing to learn from.)
For everything in between: proposer's reward = 1 - solver's reward.

Proposer reward logic

Partial rewards also exist per role, penalizing more for formatting errors (wrong output structure) than for incorrect answers.

RL Algorithm#

AZR uses REINFORCE++ with a task-specific advantage computation called Task-Relative REINFORCE++ (TR++).

The key idea: advantage computation needs healthy data for each task and role to produce signal-rich gradients. This gives us 6 groups total (2 roles x 3 tasks), and advantage is computed relative to each group's baseline rather than globally.

The TR++ implementation normalizes advantages within each group, and is called during training to ensure each of the 6 groups contributes meaningful gradients.

TR++ implementation code

The results speak for themselves: this approach made Qwen2.5-Base outperform Qwen2.5-Coder, and made Qwen2.5-Coder even better.

minAZR#

Anything verifiable can use RL. minAZR applies the same self-play concept to math: problems are proposed then solved using simple GRPO (not TR++).

The implementation is highly sequential and not optimized for performance — but that's not the point. The goal is conveying the key concepts and encouraging you to apply RL to anything where you can verify the output.

Ending Notes#

I hope this walkthrough helps you understand not just what AZR does, but why each design choice was made. The paper and codebase are both worth reading in full if you want to implement this yourself.

All images referenced come from the original paper and official code repository.

Originally published on ML Research Engineer Substack.