Every frontier lab has tribal knowledge that takes years to absorb — the memory math, the failure modes, the things senior engineers just know. It's also what they test for in interviews.
We wrote it all down.
Free interactive course with executable Python.
See What You're Missing
Built from knowledge earned at Meta AI, DeepMind, and Anthropic
But there are gaps you can't see.
These aren't gaps in your intelligence. They're gaps in your exposure. This is the knowledge you'd absorb after two years inside a frontier lab — and exactly what comes up in their interviews. We compressed it into lessons you can run in your browser.
Each of these has shown up in real interviews. Here's what's actually inside.
KV cache memory = batch × seq × kv_heads × head_dim × 2 × 2 × layers. That head_dim you chose during training? It just locked your inference budget.
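Here's that formula as a few lines of Python, a quick sketch using illustrative Llama 2 7B-style numbers (32 layers, 32 KV heads, head_dim 128, fp16), not the course's own notebook:

```python
# Rough KV cache sizing from the formula above (assumed Llama 2 7B-style config).
def kv_cache_bytes(batch, seq, kv_heads, head_dim, layers, bytes_per_value=2):
    # x2 for K and V, bytes_per_value=2 for fp16/bf16
    return batch * seq * kv_heads * head_dim * 2 * bytes_per_value * layers

size = kv_cache_bytes(batch=32, seq=4096, kv_heads=32, head_dim=128, layers=32)
print(f"{size / 2**30:.0f} GiB")  # ~64 GiB of cache before a single weight is loaded
```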
Before you optimize a single kernel, check if your op is memory-bound or compute-bound. The roofline model tells you in 30 seconds. Most engineers skip this and waste weeks.
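The check itself fits in a few lines. A sketch, assuming A100-class peak numbers (312 TFLOPS bf16, roughly 2 TB/s of HBM bandwidth):

```python
# Illustrative roofline check (assumed A100-class peak numbers).
PEAK_FLOPS = 312e12                   # bf16 tensor-core peak, FLOP/s
PEAK_BANDWIDTH = 2.0e12               # HBM bandwidth, bytes/s
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH   # ~156 FLOPs per byte

def regime(flops, bytes_moved):
    return "compute-bound" if flops / bytes_moved > RIDGE else "memory-bound"

# One decode step of a 7B model in fp16: ~2 FLOPs per weight,
# but every weight byte has to stream in from HBM.
params = 7e9
print(regime(flops=2 * params, bytes_moved=2 * params))  # memory-bound
```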
A 70B model needs ~1,120 GB per GPU in vanilla DDP. With FSDP ZeRO-3 sharded across 8 GPUs, that drops to ~140 GB per GPU. If you can't do this math on a whiteboard, you'll fumble the systems design interview.
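The back-of-envelope version, using the standard mixed-precision Adam accounting (16 bytes per parameter) and assuming an 8-GPU shard group:

```python
# Whiteboard memory math for a 70B model (mixed-precision Adam accounting).
params = 70e9
bytes_per_param = 2 + 2 + 4 + 4 + 4  # fp16 weights + fp16 grads + fp32 master, momentum, variance
total_gb = params * bytes_per_param / 1e9

print(f"Vanilla DDP, full replica per GPU: {total_gb:,.0f} GB")     # ~1,120 GB
print(f"FSDP ZeRO-3, sharded over 8 GPUs: {total_gb / 8:,.0f} GB")  # ~140 GB
```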
Prefill and decode are completely different computational regimes — prefill is compute-bound, decode is memory-bound. Every serving optimization you build is a tradeoff between them.
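A rough way to see it: count FLOPs per byte of weights streamed from HBM (illustrative 7B fp16 numbers, not the course's exact figures):

```python
# Why prefill and decode sit in different regimes (assumed 7B fp16 model).
params, bytes_per_weight = 7e9, 2

def flops_per_byte(tokens):
    flops = 2 * params * tokens               # ~2 FLOPs per weight per token
    weight_bytes = params * bytes_per_weight  # weights read from HBM once per pass
    return flops / weight_bytes

print(f"prefill, 2048-token prompt: {flops_per_byte(2048):,.0f} FLOPs/byte")  # compute-bound
print(f"decode, one token at a time: {flops_per_byte(1):,.0f} FLOPs/byte")    # memory-bound
```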
Start here. 9 lessons — each one ends with something that breaks.
Byte-sized insights on what actually matters in ML interviews and research engineering — straight from the trenches.
A frontier lab interviewer might ask any of these. Could you answer them on a whiteboard?
Your 70B model training is at 40% MFU. Walk me through where the other 60% is going.
We need to serve this model at 200 tokens/sec per user. What's your KV cache memory budget and how does it constrain batch_size?
Why does Chinchilla recommend a different compute-optimal ratio than Kaplan's original scaling laws?
Your model's loss spikes at step 50k. Here's the training log. Diagnose it.
Explain why speculative decoding gives exact samples from the target distribution, not approximate ones.
This operation runs at 2 TFLOPS on an A100 rated for 312 TFLOPS. Is that a problem? Why or why not?
Every one of these is covered in the course. Not as trivia — as understanding you build yourself.
The left column passes a tutorial quiz. The right column passes an interview.
Memorize the attention formula
Know that head_dim chosen at training time locks your inference memory budget
Here's how to use FSDP
Here's the memory math — 1,120 GB → ~140 GB with ZeRO-3, and why
Scaling laws say bigger is better
Kaplan and Chinchilla disagreed. Here's why, and when to break the rules
Watch this video about transformers
Run this 80-line attention implementation, then break it by removing the scaling factor
Speculative decoding speeds up inference
Here's the mathematical proof that it preserves the exact target distribution
Here's what it looks like for one lesson — KV Cache & Memory.
Motivation
You deploy a chat model. Works at batch=1. At batch=32, it OOMs. Why?
Mental Model
A diagram of KV cache growing with sequence length. The formula: batch × seq × kv_heads × head_dim × 2 × 2 × layers.
Toy Code
80 lines of Python. You compute KV cache size for Llama 2 7B. Run it right here in your browser.
Break It
Double the sequence length. Watch the memory explode. Now you understand why long context is expensive.
Scale Thinking
What happens at 70B? At 128k context? When does KV cache dominate your entire GPU memory?
Production
How vLLM uses PagedAttention to solve this. Why GQA exists. What Meta actually ships.
“I spent three months optimizing a training run before realizing the bottleneck was memory bandwidth, not compute. The roofline model lesson would have saved me those three months.”
Sourav Bose
Research Engineer, Observo AI (Now Acquired)
“My team burned $40k on a training run with the wrong parallelism strategy. The FSDP memory math in Track 2 is the kind of thing you only learn after making expensive mistakes — or taking this course.”
Osaid Rehman
Senior Research Engineer, LinkedIn
“I've taken every ML course on the internet. This is the first one where the 'Break It' exercise actually taught me something I couldn't have gotten from reading the docs.”
Tushar Kadam
Senior ML Engineer, EarnIn
The same understanding that frontier labs test for — now free and interactive.
See What You're Missing
Free. No account needed. Start reading in 30 seconds.
The questions you should ask, before you even begin. ML System Design interviews aren't about showcasing the latest research — they're a sophisticated vibe check.
Key insights and interview-ready takeaways from the most influential ML papers — no fluff, just the important bits.
Ask questions, share insights, and discuss ML research engineering with people who are on the same path.
I'm trying to work through the memory math for deploying Llama 2 70B with long context. What's the right formula and where do most people get tripped up?
Curious what people are actually running at scale. The docs make them sound interchangeable but the tradeoffs seem real when you hit multi-node.
Preparing for interviews at frontier labs. Would love to hear what systems-level questions people actually got asked and what they wished they'd studied.
These are the kinds of conversations waiting to happen. Be the first to start one.
Frontier labs are hiring research engineers right now. Master the skills here, then apply to the roles below.