Pinned
Batch size is one of the most important and most debated questions in large-scale training.
We still rely on heuristics:
linear scaling (Goyal et al)
critical batch size (McClandlish et al.)
hyperparameter transfer (Yang et al.)
We show that under a fixed token budget, there









