HOW INFORMATION FLOWS THROUGH TRANSFORMERS
Because I've looked at those "transformers explained" pages and they really suck at explaining.
There are two distinct information highways in the transformer architecture:
- The residual stream (black arrows): Flows vertically through
KV caching overcomes statelessness in a meaningful sense: it gives the model a mechanism for introspection, specifically on the computations it ran at earlier token positions.
the Value representations can encode information from residual streams of past positions without
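
To make the KV cache point concrete, here is a minimal sketch of one attention head with a cache, in plain Python. Everything in it is an assumption for illustration: the class name `CachedAttentionHead`, the helpers, the random weights, a single layer and a single head, no layer norm and no MLP. It only shows that the keys and values stored at earlier positions were computed from those positions' residual streams, and that a new position reads them back without recomputing anything.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all given keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class CachedAttentionHead:
    """Hypothetical single head; the cache is the only state kept between steps."""

    def __init__(self, d_model, seed=0):
        rng = random.Random(seed)
        self.W_q = random_matrix(rng, d_model, d_model)
        self.W_k = random_matrix(rng, d_model, d_model)
        self.W_v = random_matrix(rng, d_model, d_model)
        self.keys = []    # one key per past position
        self.values = []  # one value per past position

    def step(self, residual):
        """Process one new position and return its updated residual stream."""
        # K and V are projections of THIS position's residual stream;
        # they are stored once and never recomputed.
        self.keys.append(matvec(self.W_k, residual))
        self.values.append(matvec(self.W_v, residual))
        q = matvec(self.W_q, residual)
        # The new position reads from every cached position (causal by construction).
        out = attend(q, self.keys, self.values)
        # Attention output is added back into the residual stream.
        return [r + o for r, o in zip(residual, out)]


def recompute_last(head, residuals):
    """Same result for the last position, but rebuilding all keys/values from scratch."""
    keys = [matvec(head.W_k, r) for r in residuals]
    values = [matvec(head.W_v, r) for r in residuals]
    q = matvec(head.W_q, residuals[-1])
    out = attend(q, keys, values)
    return [r + o for r, o in zip(residuals[-1], out)]


rng = random.Random(1)
residuals = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(4)]

head = CachedAttentionHead(d_model=3)
cached_out = None
for r in residuals:
    cached_out = head.step(r)  # one token at a time, reusing the cache

recomputed = recompute_last(head, residuals)
```

In this sketch `cached_out` and `recomputed` agree for the last position, so the cache changes no results. It just keeps what earlier positions computed available to later ones.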