Pinned
In our new paper, we naturally derive a new attention variant based on the surprising finding that deep layers benefit the most from learning a context-free value vectors, without the input from the residual stream.
The attention variant: since the value vector does not depend













































