Mathematical Foundations of Large-Scale Machine Learning
Every parallelism strategy exploits a mathematical property. Every communication pattern has an algebraic structure.
This is an investigation-based guide to distributed training. Rather than explaining techniques, we derive them from first principles—starting with the mathematical properties that make each approach possible.
The goal: develop the intuition to reason about any distributed training problem, not just memorize existing solutions.
📖 Read online — Free, no login required
Capacity engineers and ML practitioners who want a deep understanding of:
- Why tensor parallelism requires high-bandwidth interconnects
- How pipeline bubbles arise from the algebra of sequential composition
- When ZeRO stages trade communication for memory
- What makes certain operations shardable and others not
We assume you've trained models on a single GPU. We'll take you from there to reasoning about thousand-GPU clusters.
The mental models—extended roofline, α-β communication costs, estimation as discipline.
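The α-β cost model mentioned here fits in a few lines. The sketch below uses illustrative constants (2 µs latency, a 100 GB/s link) that we chose for the example; they are not measured hardware numbers:

```python
def transfer_time(n_bytes: float, alpha: float = 2e-6, beta: float = 1 / 100e9) -> float:
    """Alpha-beta cost model: T(n) = alpha + beta * n.

    alpha: per-message startup latency in seconds (assumed 2 us).
    beta:  inverse bandwidth in seconds per byte (assumed 100 GB/s link).
    """
    return alpha + beta * n_bytes

# Small messages are latency-bound, large ones bandwidth-bound:
small = transfer_time(1e3)  # ~2.01e-6 s, dominated by alpha
large = transfer_time(1e9)  # ~0.010002 s, dominated by beta * n
```

The two regimes are why collective algorithms are tuned differently for small and large message sizes.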
How compute budgets connect to model size and data through Chinchilla optimality and phase transitions.
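As a taste of that connection, here is a back-of-envelope Chinchilla-style calculation using the common rules of thumb C ≈ 6·N·D and D ≈ 20·N (both approximations, not exact results from the paper):

```python
def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) for a compute budget under the rough
    Chinchilla heuristics C = 6 * N * D and D = 20 * N.

    Solving 6 * N * (20 * N) = C gives N = sqrt(C / 120).
    """
    n_params = (compute_flops / 120) ** 0.5
    return n_params, 20 * n_params

# A 1e23-FLOP budget suggests roughly a 29B-parameter model on ~580B tokens.
n, d = chinchilla_optimal(1e23)
```

The point of the exercise is the shape of the trade-off (params and tokens grow together as the square root of compute), not the exact constants.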
Communication primitives as algebraic operations with formal properties.
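One such property: an all-reduce factors into a reduce-scatter followed by an all-gather. The toy simulation below models each rank's buffer as a Python list (our own names, not any library's API):

```python
def reduce_scatter(shards: list[list[float]]) -> list[list[float]]:
    """Rank i ends with chunk i of the elementwise sum across ranks."""
    world = len(shards)
    total = [sum(vals) for vals in zip(*shards)]
    chunk = len(total) // world
    return [total[i * chunk:(i + 1) * chunk] for i in range(world)]

def all_gather(chunks: list[list[float]]) -> list[list[float]]:
    """Every rank ends with the concatenation of all ranks' chunks."""
    full = [x for c in chunks for x in c]
    return [full[:] for _ in chunks]

def all_reduce(shards: list[list[float]]) -> list[list[float]]:
    """All-reduce expressed as reduce-scatter + all-gather."""
    return all_gather(reduce_scatter(shards))

# Two ranks holding [1, 2] and [3, 4] both end with [4, 6]:
out = all_reduce([[1.0, 2.0], [3.0, 4.0]])  # → [[4.0, 6.0], [4.0, 6.0]]
```

This same factorization underlies ZeRO's sharded gradient reduction.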
Each strategy derived from the mathematical property it exploits:
- Data Parallelism ← Associativity of gradient accumulation
- Tensor Parallelism ← Linearity of matrix multiplication
- Pipeline Parallelism ← Separability of layer composition
- Sequence Parallelism ← Decomposability of attention
- Expert Parallelism ← Sparsity of MoE routing
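The first entry in the list is easy to see concretely: the gradient of a sum-over-examples loss is a sum of per-example gradients, and because addition is associative, per-shard gradients can be reduced in any grouping. A minimal sketch for a 1-D least-squares loss (toy data chosen by us):

```python
def grad(w: float, batch: list[tuple[float, float]]) -> float:
    """d/dw of sum_i (w*x_i - y_i)^2 over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 7.0)]
w = 0.5

full = grad(w, data)                               # single-GPU gradient
shard_sum = grad(w, data[:2]) + grad(w, data[2:])  # per-shard, then reduce
# full == shard_sum  (both -72.0): sharding the batch changes nothing
```

This equality is exactly what licenses data parallelism's all-reduce of gradients.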
ZeRO, activation recomputation, and offloading—techniques that trade communication for memory.
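To make the ZeRO trade-off tangible, here is a rough per-GPU memory model under mixed-precision Adam, using the usual rule of thumb of 16 bytes per parameter (2 fp16 params + 2 fp16 grads + 12 optimizer-state bytes); the constants are approximations, not exact for every setup:

```python
def zero_bytes_per_gpu(n_params: float, world: int, stage: int) -> float:
    """Approximate per-GPU model-state bytes for ZeRO stages 0-3."""
    params, grads, optim = 2 * n_params, 2 * n_params, 12 * n_params
    if stage >= 1:
        optim /= world   # stage 1: shard optimizer states
    if stage >= 2:
        grads /= world   # stage 2: also shard gradients
    if stage >= 3:
        params /= world  # stage 3: also shard parameters
    return params + grads + optim

# 7B params on 64 GPUs: stage 0 needs the full 16N = 112 GB per GPU,
# while stage 3 needs only 16N / 64 = 1.75 GB (activations not included).
```

What the formula hides is the price: higher stages pay extra communication (e.g. stage 3 must gather parameters before each use), which is the trade this part of the book works through.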
Combining parallelism strategies on device meshes, handling failures, configuration search.
Gradient compression, local SGD, reduced precision, overlapping communication.
Profiling methodology and case studies (LLaMA 3, DeepSeek, Mistral).
This book is a companion to The Algebra of Speed, which establishes the core mathematical properties for single-machine optimization. Here we extend those ideas to distributed systems.
```shell
git clone https://github.com/ttsugriy/distributed-training-book.git
cd distributed-training-book
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
mkdocs serve   # Live development server
mkdocs build   # Build static site
```

Contributions welcome! See CONTRIBUTING.md for guidelines.
Types of contributions:
- 🐛 Issue reports and corrections
- 📝 Improved explanations and derivations
- 📊 Interactive visualizations
- 🌍 Translations
- Content: CC BY-NC-SA 4.0
- Code: MIT
Inspired by:
- Pólya's How to Solve It
- Stepanov's From Mathematics to Generic Programming
- The JAX Scaling Book
- The Ultra-Scale Playbook
"The right parallelization follows from understanding what can be decomposed and what must be synchronized."