The Debugging Flowchart
Estimated reading time: 22 minutes
Build the mental models that separate research engineers from ML practitioners.
In this tutorial, you will classify training failures by loss-curve shape (divergence, plateau, instability), use gradient statistics to pinpoint root causes, and apply a systematic triage checklist to resolve each pattern. By the end, you will be able to match each symptom below to its first diagnostic check and likely fix:
| Symptom | First Check | Likely Fix |
|---|---|---|
| Loss → NaN | Gradient norms | Lower LR, gradient clipping |
| Loss stuck | Gradient magnitude | Higher LR, better init |
| Loss oscillates | Batch size, LR | Lower LR, larger batch |
| Val loss rises | Regularization | Dropout, weight decay |
| Slow progress | Learning rate | Increase LR, check warmup |
| Out-of-memory error | Batch size | Reduce batch size, gradient checkpointing |
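The first three rows of this table can be turned into an automated check that runs after every backward pass. The sketch below is framework-agnostic (plain Python over a flat list of gradient values); the threshold constants are illustrative defaults, not universal values, and `triage` is a hypothetical helper name, not part of any library.

```python
import math

def global_grad_norm(grads):
    """Global L2 norm over a flat list of per-parameter gradient values."""
    return math.sqrt(sum(g * g for g in grads))

def triage(loss, grads, clip_threshold=1.0):
    """Map one step's loss and gradients to a first debugging action.

    Mirrors the symptom table: NaN loss -> divergence, tiny gradients
    -> plateau, huge gradients -> instability. Thresholds are
    illustrative and should be tuned per model.
    """
    if math.isnan(loss) or math.isinf(loss):
        return "divergence: lower LR, enable gradient clipping"
    norm = global_grad_norm(grads)
    if norm < 1e-6:
        return "plateau: gradients near zero; raise LR or check init"
    if norm > 10 * clip_threshold:
        return "instability: clip gradients or lower LR"
    return "healthy"
```

In a real training loop you would flatten the framework's per-parameter gradients into `grads` (e.g. one entry per element, or reuse the framework's own global-norm utility) and log the returned label every step, so divergence is caught on the step it happens rather than when the run dies.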

| Scale | Debugging Challenge | Approach |
|---|---|---|
| Local (1 GPU) | Quick iteration | Many small experiments |
| Single node (8 GPUs) | Longer runs | Log everything, catch issues early |
| Multi-node (64+ GPUs) | Expensive failures | Extensive validation before scaling |
| Production (1000+ GPUs) | Can't afford restarts | Automated monitoring, early stopping |
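At the multi-node and production scales, "automated monitoring, early stopping" can be as simple as comparing each step's loss against a running baseline and halting on a spike. A minimal sketch, assuming an exponential moving average baseline; the class name `SpikeMonitor` and the 3x spike factor are illustrative choices, not a standard API:

```python
class SpikeMonitor:
    """Flag loss spikes relative to an exponential moving average.

    A step counts as a spike when its loss exceeds spike_factor times
    the EMA of recent losses. Spiky steps are excluded from the EMA so
    they do not contaminate the baseline.
    """

    def __init__(self, alpha=0.1, spike_factor=3.0):
        self.alpha = alpha                # EMA smoothing weight
        self.spike_factor = spike_factor  # illustrative default
        self.ema = None

    def update(self, loss):
        """Record one step's loss; return True if it is a spike."""
        if self.ema is None:
            self.ema = loss
            return False
        spiked = loss > self.spike_factor * self.ema
        if not spiked:
            self.ema = (1 - self.alpha) * self.ema + self.alpha * loss
        return spiked
```

Wired into the training loop, `monitor.update(loss)` returning `True` would trigger a checkpoint rollback or an early stop instead of burning thousands of GPU-hours on a diverged run.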
Each question requires diagnosis from concrete data, not just recall.
Tools to Master:
Papers:
This completes Track 0: Foundations. The mental models from this track — memory hierarchy, gradient flow, initialization, scaling laws, bandwidth analysis, and systematic debugging — form the vocabulary used throughout the remaining tracks on LLM training, parallelism, and inference.