Introduction to parallelism in PyTorch

Fri, 31 Oct 2025 22:00:00 +0000

Training large models inevitably requires a solid understanding of parallelism techniques. In this post, I’ll give a practical, in-depth overview of the most common approaches: distributed data parallel (DDP), fully sharded data parallel (FSDP), and tensor parallelism (TP). I’ll also cover how they’re actually used in real PyTorch training setups.

This article was inspired by the excellent “How to Scale Your Model” blog series. While that series is clear and insightful, I felt it was missing some hands-on perspective and real-world lessons from someone who has trained models in the wild.

Tokenization from first principles

Tue, 07 Oct 2025 00:00:00 +0000

Byte-level BPE from first principles: what matters for speed and quality, how to implement it cleanly, and why a SuperBPE variant can lift sample efficiency.

About on George Grigorev Blog
