Every frontier lab has tribal knowledge that takes years to absorb — the memory math, the failure modes, the things senior engineers just know. It's also what they test for in interviews.
We wrote it all down.
Free interactive course with executable Python.
See What You're Missing
Built from knowledge earned at Meta AI, DeepMind, and Anthropic
But there are gaps you can't see.
These aren't gaps in your intelligence. They're gaps in your exposure. This is the knowledge you'd absorb after two years inside a frontier lab — and exactly what comes up in their interviews. We compressed it into lessons you can run in your browser.
Each of these has shown up in real interviews. Here's what's actually inside.
KV cache memory = batch × seq × kv_heads × head_dim × 2 × 2 × layers. That head_dim you chose during training? It just locked your inference budget.
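Here's that formula as a few lines of Python, a quick sketch using illustrative Llama 2 7B-style numbers (32 layers, 32 KV heads, head_dim 128, fp16), not the course's own notebook:

```python
# Rough KV cache sizing from the formula above (assumed Llama 2 7B-style config).
def kv_cache_bytes(batch, seq, kv_heads, head_dim, layers, bytes_per_value=2):
    # x2 for K and V, bytes_per_value=2 for fp16/bf16
    return batch * seq * kv_heads * head_dim * 2 * bytes_per_value * layers

size = kv_cache_bytes(batch=32, seq=4096, kv_heads=32, head_dim=128, layers=32)
print(f"{size / 2**30:.0f} GiB")  # ~64 GiB of cache before a single weight is loaded
```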
Before you optimize a single kernel, check if your op is memory-bound or compute-bound. The roofline model tells you in 30 seconds. Most engineers skip this and waste weeks.
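The check itself fits in a few lines. A sketch, assuming A100-class peak numbers (312 TFLOPS bf16, roughly 2 TB/s of HBM bandwidth):

```python
# Illustrative roofline check (assumed A100-class peak numbers).
PEAK_FLOPS = 312e12                   # bf16 tensor-core peak, FLOP/s
PEAK_BANDWIDTH = 2.0e12               # HBM bandwidth, bytes/s
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH   # ~156 FLOPs per byte

def regime(flops, bytes_moved):
    return "compute-bound" if flops / bytes_moved > RIDGE else "memory-bound"

# One decode step of a 7B model in fp16: ~2 FLOPs per weight,
# but every weight byte has to stream in from HBM.
params = 7e9
print(regime(flops=2 * params, bytes_moved=2 * params))  # memory-bound
```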
A 70B model needs ~1,120 GB per GPU in vanilla DDP. With FSDP ZeRO-3 sharded across 8 GPUs, that drops to ~140 GB per GPU. If you can't do this math on a whiteboard, you'll fumble the systems design interview.
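The back-of-envelope version, using the standard mixed-precision Adam accounting (16 bytes per parameter) and assuming an 8-GPU shard group:

```python
# Whiteboard memory math for a 70B model (mixed-precision Adam accounting).
params = 70e9
bytes_per_param = 2 + 2 + 4 + 4 + 4  # fp16 weights + fp16 grads + fp32 master, momentum, variance
total_gb = params * bytes_per_param / 1e9

print(f"Vanilla DDP, full replica per GPU: {total_gb:,.0f} GB")     # ~1,120 GB
print(f"FSDP ZeRO-3, sharded over 8 GPUs: {total_gb / 8:,.0f} GB")  # ~140 GB
```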
Prefill and decode are completely different computational regimes — prefill is compute-bound, decode is memory-bound. Every serving optimization you build is a tradeoff between them.
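A rough way to see it: count FLOPs per byte of weights streamed from HBM (illustrative 7B fp16 numbers, not the course's exact figures):

```python
# Why prefill and decode sit in different regimes (assumed 7B fp16 model).
params, bytes_per_weight = 7e9, 2

def flops_per_byte(tokens):
    flops = 2 * params * tokens               # ~2 FLOPs per weight per token
    weight_bytes = params * bytes_per_weight  # weights read from HBM once per pass
    return flops / weight_bytes

print(f"prefill, 2048-token prompt: {flops_per_byte(2048):,.0f} FLOPs/byte")  # compute-bound
print(f"decode, one token at a time: {flops_per_byte(1):,.0f} FLOPs/byte")    # memory-bound
```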
Start here. 9 lessons — each one ends with something that breaks.
Byte-sized insights on what actually matters in ML interviews and research engineering — straight from the trenches.
A frontier lab interviewer might ask any of these. Could you answer them on a whiteboard?
Your 70B model training is at 40% MFU. Walk me through where the other 60% is going.
We need to serve this model at 200 tokens/sec per user. What's your KV cache memory budget and how does it constrain batch_size?
Why does Chinchilla recommend a different compute-optimal ratio than Kaplan's original scaling laws?
Your model's loss spikes at step 50k. Here's the training log. Diagnose it.
Explain why speculative decoding gives exact samples from the target distribution, not approximate ones.
This operation runs at 2 TFLOPS on an A100 rated for 312 TFLOPS. Is that a problem? Why or why not?
Every one of these is covered in the course. Not as trivia — as understanding you build yourself.
The left column passes a tutorial quiz. The right column passes an interview.
Memorize the attention formula
Know that head_dim chosen at training time locks your inference memory budget
Here's how to use FSDP
Here's the memory math — 1,120 GB → ~140 GB with ZeRO-3, and why
Scaling laws say bigger is better
Kaplan and Chinchilla disagreed. Here's why, and when to break the rules
Watch this video about transformers
Run this 80-line attention implementation, then break it by removing the scaling factor
Speculative decoding speeds up inference
Here's the mathematical proof that it preserves the exact target distribution
Here's what it looks like for one lesson — KV Cache & Memory.
Motivation
You deploy a chat model. Works at batch=1. At batch=32, it OOMs. Why?
Mental Model
A diagram of KV cache growing with sequence length. The formula: batch × seq × kv_heads × head_dim × 2 × 2 × layers.
Toy Code
80 lines of Python. You compute KV cache size for Llama 2 7B. Run it right here in your browser.
Break It
Double the sequence length. Watch the memory explode. Now you understand why long context is expensive.
Scale Thinking
What happens at 70B? At 128k context? When does KV cache dominate your entire GPU memory?
Production
How vLLM uses PagedAttention to solve this. Why GQA exists. What Meta actually ships.
“I spent three months optimizing a training run before realizing the bottleneck was memory bandwidth, not compute. The roofline model lesson would have saved me those three months.”
Sourav Bose
Research Engineer, Observo AI (Now Acquired)
“My team burned $40k on a training run with the wrong parallelism strategy. The FSDP memory math in Track 2 is the kind of thing you only learn after making expensive mistakes — or taking this course.”
Osaid Rehman
Senior Research Engineer, LinkedIn
“I've taken every ML course on the internet. This is the first one where the 'Break It' exercise actually taught me something I couldn't have gotten from reading the docs.”
Tushar Kadam
Senior ML Engineer, EarnIn
The same understanding that frontier labs test for — now free and interactive.
See What You're Missing
Free. No account needed. Start reading in 30 seconds.
The questions you should ask, before you even begin. ML System Design interviews aren't about showcasing the latest research — they're a sophisticated vibe check.
Key insights and interview-ready takeaways from the most influential ML papers — no fluff, just the important bits.
Ask questions, share insights, and discuss ML research engineering with people who are on the same path.
I'm trying to work through the memory math for deploying Llama 2 70B with long context. What's the right formula and where do most people get tripped up?
Curious what people are actually running at scale. The docs make them sound interchangeable but the tradeoffs seem real when you hit multi-node.
Preparing for interviews at frontier labs. Would love to hear what systems-level questions people actually got asked and what they wished they'd studied.
These are the kinds of conversations waiting to happen. Be the first to start one.
Frontier labs are hiring research engineers right now. Master the skills here, then apply to the roles below.