Discussion about this post

Jet New

Yi Tay's post On Emergent Abilities, Scaling and Inductive Bias, which discusses how inductive biases influence scaling and emergent abilities [0], is an interesting read. I think one factor underlying the effectiveness of an inductive bias is computational efficiency. Sara Hooker's The Hardware Lottery discusses how neural networks have benefited from the GPU lottery [1]. I think part of the attention mechanism's success (over alternatives like graph NNs and Bayesian NNs) stems from how efficiently it runs on GPUs. FlashAttention, an engineering optimization that further sped up computation and thus training [2], enabled not just shorter experiment and iteration cycles but also training on much larger datasets. Computational efficiency is not commonly discussed when reflecting on what makes models performant at scale, but it deserves significant credit.

[0] https://www.yitay.net/blog/emergence-and-scaling

[1] https://arxiv.org/abs/2009.06489

[2] https://arxiv.org/abs/2205.14135
