
Track 0: Foundations

Build the mental models that separate research engineers from ML practitioners.

Memory & Compute
  • The Memory Wall (15m)
  • Gradient Flow Under Pressure (18m)

Optimizers
  • SGD & Momentum (15m)
  • Adam, Warmup & Scheduling (18m)

Gradient Mechanics
  • Backprop as Graph Transformation (20m)
  • Initialization & Residual Connections (18m)
  • Scaling Laws & μ-Transfer (20m)

Systems Thinking
  • Bandwidth & Profiling (18m)
  • The Debugging Flowchart (22m)

The Debugging Flowchart

Estimated reading time: 22 minutes

In this tutorial, you will classify training failures by loss curve shape (divergence, plateau, instability), use gradient statistics to pinpoint root causes, and apply a systematic triage checklist to resolve each pattern.

By the end you will be able to:

  • Distinguish divergence, plateau, and instability from a loss curve alone
  • Read gradient norm logs to determine whether the issue is vanishing gradients, exploding gradients, or data corruption
  • Apply the correct first fix for each failure pattern without trial-and-error
💡 Core Idea

Systematic debugging follows a fixed sequence: classify the loss curve shape, inspect gradient statistics per layer, verify the data pipeline, then compare hyperparameters against known working configurations. Changing one variable at a time and logging the result is what separates methodical diagnosis from guessing.
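
To make the "change one variable at a time, log the result" discipline concrete, here is a small hypothetical logging helper; the `Experiment` class, its fields, and the JSONL format are illustrative choices, not part of the course's tooling.

```python
# Hypothetical helper for disciplined debugging: one change per experiment,
# with the hypothesis and outcome recorded so nothing relies on memory.
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class Experiment:
    hypothesis: str   # e.g. "divergence comes from missing warmup"
    change: str       # the single variable changed, e.g. "warmup_steps: 0 -> 500"
    result: str = ""  # what the loss/gradient curves did afterwards
    timestamp: float = field(default_factory=time.time)


def record(log_path: str, exp: Experiment) -> None:
    """Append one experiment to a JSONL log so every change stays traceable."""
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(exp)) + "\n")


# Usage:
# record("debug_log.jsonl", Experiment(
#     hypothesis="loss spikes come from the LR being too high",
#     change="lr: 3e-4 -> 1e-4",
#     result="no spikes over 2K steps; loss resumed decreasing",
# ))
```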

The Master Flowchart

(Flowchart: classify the loss curve shape → inspect per-layer gradient statistics → verify the data pipeline → compare hyperparameters against a known working configuration.)
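
Since the flowchart boils down to a sequence of checks, here is a minimal sketch of its first decision in Python; the function name and all thresholds are illustrative assumptions, not values from the lesson.

```python
# classify_failure.py -- sketch of the flowchart's first branch:
# classify the failure pattern from the recent loss curve alone.
import math


def classify_failure(losses: list[float]) -> str:
    """Return 'divergence', 'plateau', 'instability', or 'healthy'."""
    recent = losses[-200:]

    # Divergence: loss is non-finite or blew up relative to its recent minimum.
    if any(math.isnan(x) or math.isinf(x) for x in recent):
        return "divergence"
    if recent[-1] > 10 * min(recent):  # illustrative heuristic
        return "divergence"

    mean = sum(recent) / len(recent)
    std = (sum((x - mean) ** 2 for x in recent) / len(recent)) ** 0.5
    cv = std / max(mean, 1e-8)  # coefficient of variation

    # Instability: loss oscillates strongly around its mean.
    if cv > 0.2:
        return "instability"

    # Plateau: essentially no improvement over the window.
    improvement = (recent[0] - recent[-1]) / max(recent[0], 1e-8)
    if abs(improvement) < 0.01:
        return "plateau"

    return "healthy"
```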

Pattern 1: Divergence (Loss → inf/nan)

❌ Divergence Symptoms

  • Loss suddenly jumps to inf or nan
  • Gradients contain nan values
  • Parameters contain nan values
  • Often happens suddenly after many stable steps
diagnose_divergence.py
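
A minimal sketch of this kind of divergence triage, assuming a standard PyTorch training loop; the function name and thresholds are illustrative, not the course's code.

```python
# Sketch of divergence triage: find which layers blew up, then apply the
# standard first fixes. Assumes a typical PyTorch model.
import math

import torch


def check_divergence(model: torch.nn.Module, loss: torch.Tensor) -> None:
    # 1. Is the loss itself already non-finite?
    if not torch.isfinite(loss):
        print("loss is non-finite -- save and inspect the batch that produced it")

    # 2. Which layers carry exploding or non-finite gradients?
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.norm().item()
        if math.isnan(g) or math.isinf(g):
            print(f"{name}: non-finite gradient")
        elif g > 100.0:  # illustrative threshold
            print(f"{name}: exploding gradient norm {g:.1f}")

    # 3. First fixes, in order: lower the LR, enable gradient clipping
    #    (torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)),
    #    then re-check the offending batch for corrupted inputs or labels.
```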

Pattern 2: Plateau (Loss Not Decreasing)

⚠️ Plateau Symptoms

  • Loss decreases initially, then stalls
  • Gradients are very small (but not zero)
  • Model predictions don't change
  • Different from convergence: the loss is still well above where it should end up
diagnose_plateau.py
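
A minimal sketch of plateau triage, assuming a PyTorch model; thresholds are illustrative. The key question it answers: are gradients merely small, or effectively zero in specific layers?

```python
# Sketch of plateau triage: distinguish "converged" from "stuck" by looking at
# per-layer gradient magnitudes and the effective update scale.
import torch


def check_plateau(model: torch.nn.Module, lr: float) -> None:
    norms = {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

    total = sum(norms.values())
    print(f"sum of per-layer gradient norms: {total:.2e}")

    # Near-zero gradients concentrated in early layers point to an
    # initialization or normalization problem rather than convergence.
    for name, g in norms.items():
        if g < 1e-7:  # illustrative threshold
            print(f"{name}: near-zero gradient ({g:.1e})")

    # If gradients are tiny but the loss is still far above a sensible floor,
    # the first fix is usually a higher LR or a better init, not more steps.
    print(f"rough update scale ~ lr * grad = {lr * total:.2e}")
```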

Pattern 3: Instability (Oscillating Loss)

💡 Instability Symptoms

  • Loss oscillates up and down
  • Training sometimes makes progress, sometimes loses it
  • Often related to learning rate or batch size
  • May be worse at certain phases of training
diagnose_instability.py
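
A minimal sketch of instability triage, quantifying oscillation the same way the checkpoint questions below do (coefficient of variation and 3-sigma spikes); thresholds are illustrative.

```python
# Sketch of instability triage: measure how much the loss oscillates and how
# often it spikes, then suggest the usual first fixes.
import statistics


def check_instability(losses: list[float], window: int = 1000) -> None:
    recent = losses[-window:]
    mean = statistics.fmean(recent)
    std = statistics.pstdev(recent)
    cv = std / max(mean, 1e-8)  # coefficient of variation

    spikes = sum(1 for x in recent if x > mean + 3 * std)
    print(f"coefficient of variation: {cv:.2f}, spikes above 3-sigma: {spikes}")

    if cv > 0.2 or spikes > 5:  # illustrative thresholds
        # Typical first fixes: lower the LR, add or lengthen warmup,
        # or increase the effective batch size to reduce gradient noise.
        print("unstable: try a lower LR, warmup, or a larger batch")
```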

The Complete Debugging Checklist

  1. Check the loss curve shape. Is it diverging (going to inf), plateauing (stuck), or unstable (oscillating)? Each has different causes.
  2. Inspect gradient statistics per layer: are norms vanishing, exploding, or non-finite, and in which layers?
  3. Verify the data pipeline: rule out corrupted batches and bad labels before blaming the model.
  4. Compare hyperparameters (LR, warmup, batch size, initialization) against a known working configuration.
  5. Change one variable at a time and log the result.

Quick Reference Card

| Symptom | First Check | Likely Fix |
| --- | --- | --- |
| Loss → nan | Gradient norms | Lower LR, gradient clipping |
| Loss stuck | Gradient magnitude | Higher LR, better init |
| Loss oscillates | Batch size, LR | Lower LR, larger batch |
| Val loss rises | Regularization | Dropout, weight decay |
| Slow progress | Learning rate | Increase LR, check warmup |
| Memory error | Batch size | Lower batch, gradient checkpointing |

Production Debugging Tools

debugging_tools.py
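
A minimal sketch of the kind of always-on gradient monitor such a tools file might contain, assuming PyTorch; the class name, thresholds, and window size are illustrative.

```python
# Sketch of a production-style monitor: log the global gradient norm every step
# and fail fast on non-finite values instead of discovering NaNs hours later.
import math

import torch


class GradMonitor:
    def __init__(self, model: torch.nn.Module, spike_factor: float = 10.0):
        self.model = model
        self.spike_factor = spike_factor
        self.history: list[float] = []

    def step(self, step: int) -> float:
        # Global gradient norm across all parameters that received gradients.
        total = sum(
            p.grad.norm().item() ** 2
            for p in self.model.parameters()
            if p.grad is not None
        ) ** 0.5

        if math.isnan(total) or math.isinf(total):
            raise RuntimeError(f"step {step}: non-finite gradient norm")

        # Flag spikes relative to the recent average before they become NaNs.
        if self.history:
            recent = self.history[-100:]
            if total > self.spike_factor * (sum(recent) / len(recent)):
                print(f"step {step}: gradient spike ({total:.1f})")

        self.history.append(total)
        return total


# Usage: call after loss.backward() and before optimizer.step():
#   monitor = GradMonitor(model)
#   grad_norm = monitor.step(global_step)
```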

Scale Thought Experiment

| Scale | Debugging Challenge | Approach |
| --- | --- | --- |
| Local (1 GPU) | Quick iteration | Many small experiments |
| Single node (8 GPUs) | Longer runs | Log everything, catch issues early |
| Multi-node (64+ GPUs) | Expensive failures | Extensive validation before scaling |
| Production (1000+ GPUs) | Can't afford restarts | Automated monitoring, early stopping |

Checkpoint Questions

Each question requires diagnosis from concrete data, not just recall.

  1. A training run shows: loss stable at 2.3 for 500 steps, then jumps to 847 at step 501, then NaN at step 502. Gradient norm at step 500 was 1.2, at step 501 was 4500. Classify this failure pattern, identify the most likely root cause, and list your first three fixes in order.
  2. A 7B model has been training for 10K steps. Loss decreased from 3.1 to 2.8 in the first 2K steps, then has been between 2.79 and 2.81 for the last 8K steps. Average gradient norm is 3e-5. Is this convergence or a plateau? What single measurement would distinguish them? What is your first fix if it is a plateau?
  3. You inherit a training run on 64 GPUs. The loss curve shows coefficient of variation of 0.35 over the last 1000 steps with 12 spikes above 3-sigma. The batch size is 32 (total 2048 across GPUs) and LR is 3e-4 with no warmup. Diagnose the instability and rank three fixes by cost (cheapest first).

Research Hooks

Tools to Master:

  • PyTorch Profiler (torch.profiler); see the snippet after this list
  • NVIDIA Nsight Systems
  • Weights & Biases (wandb)
  • TensorBoard
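
For the profiler specifically, a minimal `torch.profiler` snippet looks like the following; the log directory, step counts, and the `train_loader`/`train_step` names are assumptions for illustration.

```python
# Profile a handful of training steps and export a trace viewable in TensorBoard.
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),  # assumed log dir
    profile_memory=True,
) as prof:
    for step, batch in enumerate(train_loader):  # train_loader/train_step assumed to exist
        train_step(batch)
        prof.step()  # advance the profiler's wait/warmup/active schedule
        if step >= 5:
            break
```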

Papers:

  1. "On the Difficulty of Training Recurrent Neural Networks" (Pascanu et al., 2013) — Classic analysis of gradient problems
  2. "Visualizing and Understanding Recurrent Networks" (Karpathy et al., 2015) — Debugging through visualization

This completes Track 0: Foundations. The mental models from this track — memory hierarchy, gradient flow, initialization, scaling laws, bandwidth analysis, and systematic debugging — form the vocabulary used throughout the remaining tracks on LLM training, parallelism, and inference.