
The Algebra of Distributed Training

Mathematical Foundations of Large-Scale Machine Learning

Every parallelism strategy exploits a mathematical property. Every communication pattern has an algebraic structure.

What This Book Is

This is an investigation-based guide to distributed training. Rather than presenting techniques as recipes to memorize, we derive them from first principles, starting with the mathematical property that makes each approach possible.

The goal: develop the intuition to reason about any distributed training problem, not just memorize existing solutions.

Read the Book

📖 Read online — Free, no login required

Who This Is For

Capacity Engineers and ML practitioners who want deep understanding of:

  • Why tensor parallelism requires high-bandwidth interconnects
  • How pipeline bubbles arise from the algebra of sequential composition
  • When ZeRO stages trade communication for memory
  • What makes certain operations shardable and others not

We assume you've trained models on a single GPU. We'll take you from there to reasoning about thousand-GPU clusters.

Book Structure

Part I: Foundations

The mental models—extended roofline, α-β communication costs, estimation as discipline.
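The α-β model mentioned above prices a message as a fixed latency α plus β seconds per byte. As an illustration of the kind of estimate Part I builds toward (the 8-GPU numbers below are made up, and the formula is the standard ring-all-reduce cost, one common variant):

```python
def ring_allreduce_time(n_bytes, p, alpha, beta):
    """Estimated time for a ring all-reduce of n_bytes across p devices.

    alpha: per-message latency (s); beta: inverse bandwidth (s/byte).
    A ring all-reduce takes 2(p-1) steps, each moving n/p bytes.
    """
    steps = 2 * (p - 1)
    return steps * alpha + steps * (n_bytes / p) * beta

# Hypothetical cluster: 1 GB of gradients, 8 GPUs, 100 GB/s links, 5 us latency.
t = ring_allreduce_time(1e9, 8, alpha=5e-6, beta=1 / 100e9)  # ~17.6 ms
```

Note how the bandwidth term dominates here; the latency term only matters for small messages or large p.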

Part II: Scaling Laws

How compute budgets connect to model size and data through Chinchilla optimality and phase transitions.
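A rough sketch of the Chinchilla-style accounting, using the common approximations C ≈ 6·N·D and a ~20 tokens-per-parameter optimum (the exact coefficients vary by fit, so treat this as an estimator shape, not the book's derivation):

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Compute-optimal split under C = 6*N*D with D = tokens_per_param * N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# ~5.8e23 FLOPs lands near 70B parameters and 1.4T tokens.
n, d = chinchilla_optimal(5.76e23)
```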

Part III: The Algebra of Collectives

Communication primitives as algebraic operations with formal properties.
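To make the "collectives as algebra" framing concrete, here is a toy all-reduce written as a fold over an associative operator (pure Python, no real communication; the point is only the algebraic contract that schedules like rings and trees rely on):

```python
from functools import reduce
import operator

def all_reduce(per_rank_values, op=operator.add):
    """Every rank ends up with reduce(op, per_rank_values).

    Any reduction schedule (ring, tree, recursive halving) is valid
    only because op is associative: regrouping the partial reductions
    cannot change the result.
    """
    total = reduce(op, per_rank_values)
    return [total] * len(per_rank_values)

assert all_reduce([1, 2, 3]) == [6, 6, 6]
assert all_reduce([1, 2, 3], op=max) == [3, 3, 3]
```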

Part IV: Parallelism from Properties

Each strategy derived from the mathematical property it exploits:

  • Data Parallelism ← Associativity of gradient accumulation
  • Tensor Parallelism ← Linearity of matrix multiplication
  • Pipeline Parallelism ← Separability of layer composition
  • Sequence Parallelism ← Decomposability of attention
  • Expert Parallelism ← Sparsity of MoE routing
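
The first arrow above can be checked numerically: because addition is associative, each data-parallel worker may sum its local per-example gradients first, and the combined result is unchanged (up to floating-point rounding, which is why sums are only approximately associative in practice). A toy check:

```python
grads = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5]     # per-example gradients (toy scalars)

full = sum(grads)                               # single-GPU summation order
shards = [grads[0:2], grads[2:4], grads[4:6]]   # 3 data-parallel workers
sharded = sum(sum(s) for s in shards)           # local sums, then "all-reduce"

assert abs(full - sharded) < 1e-12              # regrouping didn't change the result
```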

Part V: Memory as a Dimension

ZeRO, activation recomputation, and offloading—techniques that trade communication for memory.
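A sketch of the memory side of that trade, assuming the usual mixed-precision Adam accounting (2 bytes/param for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state) and ignoring activations; the stage boundaries below follow the ZeRO paper's description:

```python
def zero_bytes_per_gpu(n_params, n_gpus, stage=0):
    """Rough per-GPU model-state memory under ZeRO stages 0-3."""
    params = 2 * n_params   # fp16 weights
    grads = 2 * n_params    # fp16 gradients
    opt = 12 * n_params     # fp32 master weights + Adam momentum + variance
    if stage >= 1: opt /= n_gpus      # ZeRO-1 shards optimizer states
    if stage >= 2: grads /= n_gpus    # ZeRO-2 also shards gradients
    if stage >= 3: params /= n_gpus   # ZeRO-3 also shards parameters
    return params + grads + opt

# 7B params on 64 GPUs: ~112 GB/GPU at stage 0 vs ~1.75 GB/GPU at stage 3.
```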

Part VI: Composition and Resilience

Combining parallelism strategies on device meshes, handling failures, configuration search.
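The configuration-search problem starts from a simple combinatorial fact: the (data, tensor, pipeline) degrees must multiply to the GPU count. A minimal enumeration of that raw search space, before any cost model prunes it (illustrative only):

```python
from itertools import product

def mesh_configs(n_gpus):
    """All (data, tensor, pipeline) degrees with d * t * p == n_gpus."""
    return [(d, t, p)
            for d, t, p in product(range(1, n_gpus + 1), repeat=3)
            if d * t * p == n_gpus]

# 8 GPUs admit 10 configs: (8,1,1), (4,2,1), (2,2,2), (1,1,8), ...
```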

Part VII: Efficiency Frontiers

Gradient compression, local SGD, reduced precision, overlapping communication.
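One way to see why overlapping matters (an idealized model, not a measurement): if the all-reduce for one layer's gradients is launched while an earlier layer's backward pass is still computing, communication hides behind compute and the step costs the max of the two rather than their sum:

```python
def step_time(compute_s, comm_s, overlap=False):
    """Idealized per-step time: serialized vs. fully overlapped communication."""
    return max(compute_s, comm_s) if overlap else compute_s + comm_s

serialized = step_time(0.10, 0.06)           # 0.16 s
overlapped = step_time(0.10, 0.06, True)     # 0.10 s: comm fully hidden
```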

Part VIII: Synthesis

Profiling methodology and case studies (LLaMA 3, DeepSeek, Mistral).

Connection to The Algebra of Speed

This book is a companion to The Algebra of Speed, which establishes the core mathematical properties for single-machine optimization. Here we extend those ideas to distributed systems.

Local Development

Prerequisites

  • MkDocs + Material
  • Python 3.10+
  • Node.js 18+ (for interactive elements)

Setup

git clone https://github.com/ttsugriy/distributed-training-book.git
cd distributed-training-book
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Build

mkdocs serve      # Live development server
mkdocs build      # Build static site

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

Types of contributions:

  • 🐛 Issue reports and corrections
  • 📝 Improved explanations and derivations
  • 📊 Interactive visualizations
  • 🌍 Translations

License

Acknowledgments

Inspired by:


"The right parallelization follows from understanding what can be decomposed and what must be synchronized."
