Inductive bias
An easier way to understand model architecture
The first neural nets deployed in a commercial setting were used for optical character recognition. The key insight in training these neural nets was the introduction of convolution layers, which assess pixels (or groups of pixels) in the context of their surrounding pixels. This approach allows for identification of basic features like edges, curves and corners which can then be combined to allow for identification of characters or numbers.
Neural networks are incredibly flexible - the universal approximation theorems show that a sufficiently large NN can approximate any continuous function on a bounded input range to arbitrary accuracy (and any discrete function for which continuous approximations are sufficiently accurate). The challenge in practice is building a sufficiently large dataset to train a neural network that is useful for a given purpose.
One way to look at what researchers do when they explore different model architectures is that they introduce inductive biases: constraints that reduce a neural network's flexibility in return for much faster training, in terms of both data and compute.
Convolutions are one such bias. Forcing a computer vision model to assess a pixel in the context of the pixels adjacent to it enables training models to identify relevant features like edges much more quickly. However, in a generative language modeling context a similar bias - e.g. only use the last 5 words to predict the next word - would lead to very poor results.
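To make the convolution bias concrete, here is a minimal sketch in plain Python. The image, the kernel values and the function name conv2d are illustrative assumptions rather than anything from a real OCR system, and the kernel is hand-written where a trained network would learn its own weights. It shows one small set of weights being reused at every position, with each output depending only on a local 3x3 neighbourhood.

```python
# Toy illustration only: the image, kernel and names are hypothetical.

def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1).

    As in most deep learning libraries, this is technically
    cross-correlation: the kernel is not flipped.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# 5x5 image: dark (0) on the left, bright (1) on the right, so one vertical edge.
image = [[0, 0, 0, 1, 1] for _ in range(5)]

# Hand-written vertical-edge kernel (Sobel-like), assumed for illustration.
kernel = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

response = conv2d(image, kernel)

# The bias in numbers: the same 9 weights are shared across all positions,
# while a fully connected layer mapping 25 pixels to 9 outputs needs 225.
conv_params = len(kernel) * len(kernel[0])
dense_params = (5 * 5) * (3 * 3)

for row in response:
    print(row)
print(conv_params, dense_params)
```

The response is zero where the window covers only dark pixels and non-zero where it straddles the edge. The much smaller parameter count is one way to see why this constraint makes training cheaper when the locality assumption holds.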
Attention is the inductive bias that is common to all cutting-edge generative language models today. Without going into the details, the attention mechanism basically forces a model to assign a different weight to each token in its input when generating a prediction of the most likely next token. While this is a fairly loose constraint, it is sufficient to enable training of powerful models with the currently available scale of data and compute.
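As a rough sketch of what "a different amount of attention to each token" means, the snippet below implements scaled dot-product attention with a causal mask in plain Python. It is a simplification under stated assumptions: real models derive queries, keys and values from learned projections and use many heads, whereas here the raw toy vectors are used directly, and the function names are made up for illustration.

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Scaled dot-product attention where position t only sees positions <= t."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for t, q in enumerate(queries):
        # Score the current token against itself and every earlier token.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # The output is a weighted average of the visible value vectors.
        out = [
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors (hypothetical), used as queries, keys and values
# at once; a real model would apply learned projections first.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(tokens, tokens, tokens)

for t, w in enumerate(weights):
    print(t, [round(x, 3) for x in w])
```

Each row of weights sums to 1 and covers every earlier token, not just a fixed window of recent ones. That is the sense in which the constraint is looser than a convolution-style "last 5 words" rule.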
The stronger the inductive biases in a model, the quicker it is to train. However, the wrong inductive biases may also limit the model’s usefulness and flexibility, particularly in contexts where those biases are clearly incorrect, like convolutions in language modeling.
Useful inductive biases are hard to come by. Convolutions were introduced in the 1980s and remain relevant today. Attention (sans recurrence) was introduced in 2017 and remains the bedrock of all language models today. This is why much discussion has shifted towards how to build an optimal corpus of training data - changes to model architecture have a minimal impact on performance unless they introduce substantially different biases.


Yi Tay's post On Emergent Abilities, Scaling and Inductive Bias, which discusses how inductive biases influence scaling and emergent abilities [0], is an interesting read. I think one factor underlying inductive biases' effectiveness is computational efficiency. Sara Hooker's The Hardware Lottery discusses how neural networks have benefited from the GPU lottery [1]. I think part of the attention mechanism's success (over others like graph NNs, Bayesian NNs) stems from its computational efficiency with GPUs. FlashAttention, an engineering optimization that further sped up computation and thus training [2], enabled not just shorter model experiments and iterations but also training on much larger datasets. Computational efficiency is not commonly discussed when reflecting on what makes models performant at scale, but it definitely deserves significant credit.
[0] https://www.yitay.net/blog/emergence-and-scaling
[1] https://arxiv.org/abs/2009.06489
[2] https://arxiv.org/abs/2205.14135