
Track 0: Foundations

Build the mental models that separate research engineers from ML practitioners.

Memory & Compute
  • The Memory Wall (15m)
  • Gradient Flow Under Pressure (18m)

Optimizers
  • SGD & Momentum (15m)
  • Adam, Warmup & Scheduling (18m)

Gradient Mechanics
  • Backprop as Graph Transformation (20m)
  • Initialization & Residual Connections (18m)
  • Scaling Laws & μ-Transfer (20m)

Systems Thinking
  • Bandwidth & Profiling (18m)
  • The Debugging Flowchart (22m)

The Debugging Flowchart

Estimated reading time: 22 minutes

In this tutorial, you will classify training failures by loss curve shape (divergence, plateau, instability), use gradient statistics to pinpoint root causes, and apply a systematic triage checklist to resolve each pattern.

By the end you will be able to:

  • Distinguish divergence, plateau, and instability from a loss curve alone
  • Read gradient norm logs to determine whether the issue is vanishing gradients, exploding gradients, or data corruption
  • Apply the correct first fix for each failure pattern without trial-and-error
💡 Core Idea

Systematic debugging follows a fixed sequence: classify the loss curve shape, inspect gradient statistics per layer, verify the data pipeline, then compare hyperparameters against known working configurations. Changing one variable at a time and logging the result is what separates methodical diagnosis from guessing.
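
To make the "change one variable at a time, log the result" discipline concrete, here is a small hypothetical logging helper; the `Experiment` class, its fields, and the JSONL format are illustrative choices, not part of the course's tooling.

```python
# Hypothetical helper for disciplined debugging: one change per experiment,
# with the hypothesis and outcome recorded so nothing relies on memory.
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class Experiment:
    hypothesis: str   # e.g. "divergence comes from missing warmup"
    change: str       # the single variable changed, e.g. "warmup_steps: 0 -> 500"
    result: str = ""  # what the loss/gradient curves did afterwards
    timestamp: float = field(default_factory=time.time)


def record(log_path: str, exp: Experiment) -> None:
    """Append one experiment to a JSONL log so every change stays traceable."""
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(exp)) + "\n")


# Usage:
# record("debug_log.jsonl", Experiment(
#     hypothesis="loss spikes come from the LR being too high",
#     change="lr: 3e-4 -> 1e-4",
#     result="no spikes over 2K steps; loss resumed decreasing",
# ))
```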

The Master Flowchart

(Flowchart: classify the loss curve shape → inspect per-layer gradient statistics → verify the data pipeline → compare hyperparameters against a known working configuration.)
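
Since the flowchart boils down to a sequence of checks, here is a minimal sketch of its first decision in Python; the function name and all thresholds are illustrative assumptions, not values from the lesson.

```python
# classify_failure.py -- sketch of the flowchart's first branch:
# classify the failure pattern from the recent loss curve alone.
import math


def classify_failure(losses: list[float]) -> str:
    """Return 'divergence', 'plateau', 'instability', or 'healthy'."""
    recent = losses[-200:]

    # Divergence: loss is non-finite or blew up relative to its recent minimum.
    if any(math.isnan(x) or math.isinf(x) for x in recent):
        return "divergence"
    if recent[-1] > 10 * min(recent):  # illustrative heuristic
        return "divergence"

    mean = sum(recent) / len(recent)
    std = (sum((x - mean) ** 2 for x in recent) / len(recent)) ** 0.5
    cv = std / max(mean, 1e-8)  # coefficient of variation

    # Instability: loss oscillates strongly around its mean.
    if cv > 0.2:
        return "instability"

    # Plateau: essentially no improvement over the window.
    improvement = (recent[0] - recent[-1]) / max(recent[0], 1e-8)
    if abs(improvement) < 0.01:
        return "plateau"

    return "healthy"
```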

Pattern 1: Divergence (Loss → inf/nan)

❌ Divergence Symptoms

  • Loss suddenly jumps to inf or nan
  • Gradients contain nan values
  • Parameters contain nan values
  • Often happens suddenly after many stable steps
diagnose_divergence.py
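
A minimal sketch of this kind of divergence triage, assuming a standard PyTorch training loop; the function name and thresholds are illustrative, not the course's code.

```python
# Sketch of divergence triage: find which layers blew up, then apply the
# standard first fixes. Assumes a typical PyTorch model.
import math

import torch


def check_divergence(model: torch.nn.Module, loss: torch.Tensor) -> None:
    # 1. Is the loss itself already non-finite?
    if not torch.isfinite(loss):
        print("loss is non-finite -- save and inspect the batch that produced it")

    # 2. Which layers carry exploding or non-finite gradients?
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.norm().item()
        if math.isnan(g) or math.isinf(g):
            print(f"{name}: non-finite gradient")
        elif g > 100.0:  # illustrative threshold
            print(f"{name}: exploding gradient norm {g:.1f}")

    # 3. First fixes, in order: lower the LR, enable gradient clipping
    #    (torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)),
    #    then re-check the offending batch for corrupted inputs or labels.
```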

Pattern 2: Plateau (Loss Not Decreasing)

⚠️ Plateau Symptoms

  • Loss decreases initially, then stalls
  • Gradients are very small (but not zero)
  • Model predictions don't change
  • Different from convergence: the loss is still well above where it should end up
diagnose_plateau.py
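
A minimal sketch of plateau triage, assuming a PyTorch model; thresholds are illustrative. The key question it answers: are gradients merely small, or effectively zero in specific layers?

```python
# Sketch of plateau triage: distinguish "converged" from "stuck" by looking at
# per-layer gradient magnitudes and the effective update scale.
import torch


def check_plateau(model: torch.nn.Module, lr: float) -> None:
    norms = {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

    total = sum(norms.values())
    print(f"sum of per-layer gradient norms: {total:.2e}")

    # Near-zero gradients concentrated in early layers point to an
    # initialization or normalization problem rather than convergence.
    for name, g in norms.items():
        if g < 1e-7:  # illustrative threshold
            print(f"{name}: near-zero gradient ({g:.1e})")

    # If gradients are tiny but the loss is still far above a sensible floor,
    # the first fix is usually a higher LR or a better init, not more steps.
    print(f"rough update scale ~ lr * grad = {lr * total:.2e}")
```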

Pattern 3: Instability (Oscillating Loss)

💡 Instability Symptoms

  • Loss oscillates up and down
  • Training sometimes makes progress, sometimes loses it
  • Often related to learning rate or batch size
  • May be worse at certain phases of training
diagnose_instability.py
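
A minimal sketch of instability triage, quantifying oscillation the same way the checkpoint questions below do (coefficient of variation and 3-sigma spikes); thresholds are illustrative.

```python
# Sketch of instability triage: measure how much the loss oscillates and how
# often it spikes, then suggest the usual first fixes.
import statistics


def check_instability(losses: list[float], window: int = 1000) -> None:
    recent = losses[-window:]
    mean = statistics.fmean(recent)
    std = statistics.pstdev(recent)
    cv = std / max(mean, 1e-8)  # coefficient of variation

    spikes = sum(1 for x in recent if x > mean + 3 * std)
    print(f"coefficient of variation: {cv:.2f}, spikes above 3-sigma: {spikes}")

    if cv > 0.2 or spikes > 5:  # illustrative thresholds
        # Typical first fixes: lower the LR, add or lengthen warmup,
        # or increase the effective batch size to reduce gradient noise.
        print("unstable: try a lower LR, warmup, or a larger batch")
```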

The Complete Debugging Checklist

  1. Check the loss curve shape. Is it diverging (going to inf), plateauing (stuck), or unstable (oscillating)? Each has different causes.
  2. Inspect gradient statistics per layer: are norms vanishing, exploding, or non-finite, and in which layers?
  3. Verify the data pipeline: rule out corrupted batches and bad labels before blaming the model.
  4. Compare hyperparameters (LR, warmup, batch size, initialization) against a known working configuration.
  5. Change one variable at a time and log the result.

Quick Reference Card

| Symptom | First Check | Likely Fix |
| --- | --- | --- |
| Loss → nan | Gradient norms | Lower LR, gradient clipping |
| Loss stuck | Gradient magnitude | Higher LR, better init |
| Loss oscillates | Batch size, LR | Lower LR, larger batch |
| Val loss rises | Regularization | Dropout, weight decay |
| Slow progress | Learning rate | Increase LR, check warmup |
| Memory error | Batch size | Lower batch, gradient checkpointing |

Production Debugging Tools

debugging_tools.py
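
A minimal sketch of the kind of always-on gradient monitor such a tools file might contain, assuming PyTorch; the class name, thresholds, and window size are illustrative.

```python
# Sketch of a production-style monitor: log the global gradient norm every step
# and fail fast on non-finite values instead of discovering NaNs hours later.
import math

import torch


class GradMonitor:
    def __init__(self, model: torch.nn.Module, spike_factor: float = 10.0):
        self.model = model
        self.spike_factor = spike_factor
        self.history: list[float] = []

    def step(self, step: int) -> float:
        # Global gradient norm across all parameters that received gradients.
        total = sum(
            p.grad.norm().item() ** 2
            for p in self.model.parameters()
            if p.grad is not None
        ) ** 0.5

        if math.isnan(total) or math.isinf(total):
            raise RuntimeError(f"step {step}: non-finite gradient norm")

        # Flag spikes relative to the recent average before they become NaNs.
        if self.history:
            recent = self.history[-100:]
            if total > self.spike_factor * (sum(recent) / len(recent)):
                print(f"step {step}: gradient spike ({total:.1f})")

        self.history.append(total)
        return total


# Usage: call after loss.backward() and before optimizer.step():
#   monitor = GradMonitor(model)
#   grad_norm = monitor.step(global_step)
```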

Scale Thought Experiment

| Scale | Debugging Challenge | Approach |
| --- | --- | --- |
| Local (1 GPU) | Quick iteration | Many small experiments |
| Single node (8 GPUs) | Longer runs | Log everything, catch issues early |
| Multi-node (64+ GPUs) | Expensive failures | Extensive validation before scaling |
| Production (1000+ GPUs) | Can't afford restarts | Automated monitoring, early stopping |

Checkpoint Questions

Each question requires diagnosis from concrete data, not just recall.

  1. A training run shows: loss stable at 2.3 for 500 steps, then jumps to 847 at step 501, then NaN at step 502. Gradient norm at step 500 was 1.2, at step 501 was 4500. Classify this failure pattern, identify the most likely root cause, and list your first three fixes in order.
  2. A 7B model has been training for 10K steps. Loss decreased from 3.1 to 2.8 in the first 2K steps, then has been between 2.79 and 2.81 for the last 8K steps. Average gradient norm is 3e-5. Is this convergence or a plateau? What single measurement would distinguish them? What is your first fix if it is a plateau?
  3. You inherit a training run on 64 GPUs. The loss curve shows coefficient of variation of 0.35 over the last 1000 steps with 12 spikes above 3-sigma. The batch size is 32 (total 2048 across GPUs) and LR is 3e-4 with no warmup. Diagnose the instability and rank three fixes by cost (cheapest first).

Research Hooks

Tools to Master:

  • PyTorch Profiler (torch.profiler); see the snippet after this list
  • NVIDIA Nsight Systems
  • Weights & Biases (wandb)
  • TensorBoard
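
For the profiler specifically, a minimal `torch.profiler` snippet looks like the following; the log directory, step counts, and the `train_loader`/`train_step` names are assumptions for illustration.

```python
# Profile a handful of training steps and export a trace viewable in TensorBoard.
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),  # assumed log dir
    profile_memory=True,
) as prof:
    for step, batch in enumerate(train_loader):  # train_loader/train_step assumed to exist
        train_step(batch)
        prof.step()  # advance the profiler's wait/warmup/active schedule
        if step >= 5:
            break
```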

Papers:

  1. "On the Difficulty of Training Recurrent Neural Networks" (Pascanu et al., 2013) — Classic analysis of gradient problems
  2. "Visualizing and Understanding Recurrent Networks" (Karpathy et al., 2015) — Debugging through visualization

This completes Track 0: Foundations. The mental models from this track — memory hierarchy, gradient flow, initialization, scaling laws, bandwidth analysis, and systematic debugging — form the vocabulary used throughout the remaining tracks on LLM training, parallelism, and inference.