Mathematical Foundations of Large-Scale Machine Learning
Every parallelism strategy exploits a mathematical property. Every communication pattern has an algebraic structure.
This is an investigation-based guide to distributed training. Rather than explaining techniques, we derive them from first principles—starting with the mathematical properties that make each approach possible.
The goal: develop the intuition to reason about any distributed training problem, not just memorize existing solutions.
📖 Read online — Free, no login required
Capacity engineers and ML practitioners who want a deep understanding of:
- Why tensor parallelism requires high-bandwidth interconnects
- How pipeline bubbles arise from the algebra of sequential composition
- When ZeRO stages trade communication for memory
- What makes certain operations shardable and others not
We assume you've trained models on a single GPU. We'll take you from there to reasoning about thousand-GPU clusters.
The mental models—extended roofline, α-β communication costs, estimation as discipline.
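The α-β cost model mentioned here fits in a few lines. The sketch below uses illustrative constants (2 µs latency, a 100 GB/s link) that we chose for the example; they are not measured hardware numbers:

```python
def transfer_time(n_bytes: float, alpha: float = 2e-6, beta: float = 1 / 100e9) -> float:
    """Alpha-beta cost model: T(n) = alpha + beta * n.

    alpha: per-message startup latency in seconds (assumed 2 us).
    beta:  inverse bandwidth in seconds per byte (assumed 100 GB/s link).
    """
    return alpha + beta * n_bytes

# Small messages are latency-bound, large ones bandwidth-bound:
small = transfer_time(1e3)  # ~2.01e-6 s, dominated by alpha
large = transfer_time(1e9)  # ~0.010002 s, dominated by beta * n
```

The two regimes are why collective algorithms are tuned differently for small and large message sizes.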
How compute budgets connect to model size and data through Chinchilla optimality and phase transitions.
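As a taste of that connection, here is a back-of-envelope Chinchilla-style calculation using the common rules of thumb C ≈ 6·N·D and D ≈ 20·N (both approximations, not exact results from the paper):

```python
def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) for a compute budget under the rough
    Chinchilla heuristics C = 6 * N * D and D = 20 * N.

    Solving 6 * N * (20 * N) = C gives N = sqrt(C / 120).
    """
    n_params = (compute_flops / 120) ** 0.5
    return n_params, 20 * n_params

# A 1e23-FLOP budget suggests roughly a 29B-parameter model on ~580B tokens.
n, d = chinchilla_optimal(1e23)
```

The point of the exercise is the shape of the trade-off (params and tokens grow together as the square root of compute), not the exact constants.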
Communication primitives as algebraic operations with formal properties.
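One such property: an all-reduce factors into a reduce-scatter followed by an all-gather. The toy simulation below models each rank's buffer as a Python list (our own names, not any library's API):

```python
def reduce_scatter(shards: list[list[float]]) -> list[list[float]]:
    """Rank i ends with chunk i of the elementwise sum across ranks."""
    world = len(shards)
    total = [sum(vals) for vals in zip(*shards)]
    chunk = len(total) // world
    return [total[i * chunk:(i + 1) * chunk] for i in range(world)]

def all_gather(chunks: list[list[float]]) -> list[list[float]]:
    """Every rank ends with the concatenation of all ranks' chunks."""
    full = [x for c in chunks for x in c]
    return [full[:] for _ in chunks]

def all_reduce(shards: list[list[float]]) -> list[list[float]]:
    """All-reduce expressed as reduce-scatter + all-gather."""
    return all_gather(reduce_scatter(shards))

# Two ranks holding [1, 2] and [3, 4] both end with [4, 6]:
out = all_reduce([[1.0, 2.0], [3.0, 4.0]])  # → [[4.0, 6.0], [4.0, 6.0]]
```

This same factorization underlies ZeRO's sharded gradient reduction.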
Each strategy derived from the mathematical property it exploits:
- Data Parallelism ← Associativity of gradient accumulation
- Tensor Parallelism ← Linearity of matrix multiplication
- Pipeline Parallelism ← Separability of layer composition
- Sequence Parallelism ← Decomposability of attention
- Expert Parallelism ← Sparsity of MoE routing
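The first entry in the list is easy to see concretely: the gradient of a sum-over-examples loss is a sum of per-example gradients, and because addition is associative, per-shard gradients can be reduced in any grouping. A minimal sketch for a 1-D least-squares loss (toy data chosen by us):

```python
def grad(w: float, batch: list[tuple[float, float]]) -> float:
    """d/dw of sum_i (w*x_i - y_i)^2 over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 7.0)]
w = 0.5

full = grad(w, data)                               # single-GPU gradient
shard_sum = grad(w, data[:2]) + grad(w, data[2:])  # per-shard, then reduce
# full == shard_sum  (both -72.0): sharding the batch changes nothing
```

This equality is exactly what licenses data parallelism's all-reduce of gradients.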
ZeRO, activation recomputation, and offloading—techniques that trade communication for memory.
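To make the ZeRO trade-off tangible, here is a rough per-GPU memory model under mixed-precision Adam, using the usual rule of thumb of 16 bytes per parameter (2 fp16 params + 2 fp16 grads + 12 optimizer-state bytes); the constants are approximations, not exact for every setup:

```python
def zero_bytes_per_gpu(n_params: float, world: int, stage: int) -> float:
    """Approximate per-GPU model-state bytes for ZeRO stages 0-3."""
    params, grads, optim = 2 * n_params, 2 * n_params, 12 * n_params
    if stage >= 1:
        optim /= world   # stage 1: shard optimizer states
    if stage >= 2:
        grads /= world   # stage 2: also shard gradients
    if stage >= 3:
        params /= world  # stage 3: also shard parameters
    return params + grads + optim

# 7B params on 64 GPUs: stage 0 needs the full 16N = 112 GB per GPU,
# while stage 3 needs only 16N / 64 = 1.75 GB (activations not included).
```

What the formula hides is the price: higher stages pay extra communication (e.g. stage 3 must gather parameters before each use), which is the trade this part of the book works through.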
Combining parallelism strategies on device meshes, handling failures, configuration search.
Gradient compression, local SGD, reduced precision, overlapping communication.
Profiling methodology and case studies (LLaMA 3, DeepSeek, Mistral).
This book is a companion to The Algebra of Speed, which establishes the core mathematical properties for single-machine optimization. Here we extend those ideas to distributed systems.
```shell
git clone https://github.com/ttsugriy/distributed-training-book.git
cd distributed-training-book
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
mkdocs serve   # Live development server
mkdocs build   # Build static site
```

Contributions welcome! See CONTRIBUTING.md for guidelines.
Types of contributions:
- 🐛 Issue reports and corrections
- 📝 Improved explanations and derivations
- 📊 Interactive visualizations
- 🌍 Translations
- Content: CC BY-NC-SA 4.0
- Code: MIT
Inspired by:
- Pólya's How to Solve It
- Stepanov's From Mathematics to Generic Programming
- The JAX Scaling Book
- The Ultra-Scale Playbook
"The right parallelization follows from understanding what can be decomposed and what must be synchronized."