Your language model is wasting half of its layers to just refine probability distributions rather than doing interesting computations.
In our paper, we found that the second half of the layers of the Llama 3 models have minimal effect on future computations. 1/6
Csordás Róbert
- If you are training Transformers in mixed precision and you experience a systematic explosion in the loss always around the same iteration, consider scaling Q and K values by d_model^(-1/4) before computing the logit matrix instead of scaling the logits by d_model^(-1/2). (1/2)
- I'm happy to announce that I successfully defended my PhD thesis, "Systematic Generalization in Connectionist Models" (robertcsordas.github.io/data/thesis.pdf). I’m thankful to my advisor @SchmidhuberAI and all my wonderful colleagues for this awesome journey!
- Replying to @robert_csordas: In summary, LLMs are *not* using their depth efficiently. Thus, we call for future research on more efficient architectures and training objectives. With @chrmanning and @ChrisGPotts. Paper: arxiv.org/abs/2505.13898 Code: github.com/robertcsordas/… 6/6
- Replying to @robert_csordas: Our results suggest that recurrent architectures, such as MoEUT (arxiv.org/abs/2405.16039), might use their layers more effectively. 5/6
- We are happy to announce that our paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers" got accepted to EMNLP Findings. With our improved MoE Transformers, we can match the performance of parameter-matched dense models.
- Do NNs learn solutions that are modular? Our #ICLR2021 paper investigates functional modularity in NNs and finds that although weights specialize, they are not reused to implement the same functionality elsewhere in the network. This limits certain types of generalization.
- Replying to @robert_csordas: We train linear maps between Qwen 2.5 1.5B and 14B, and find that the layers at identical relative depth correspond to each other the best, indicating that deeper models are not doing new kinds of computation, but only performing more fine-grained adjustments to the residual. 4/6
- Mixture-of-Experts Universal Transformer (MoEUT) is a new UT model that combines MoE MLP and MoE attention with a novel layer norm and grouping, making UTs competitive in language modeling for the first time. Paper: arxiv.org/abs/2405.16039 Code: github.com/robertcsordas/…
- I’m thrilled to announce that starting February 1st, I'm joining @stanfordnlp as a postdoc, under the supervision of @chrmanning and @ChrisGPotts. Excited for this incredible opportunity!
- Replying to @robert_csordas: For inputs involving many steps, the operands for each step remain important until an identical depth. This indicates that the model is *not* breaking down the computation, solving subproblems, and composing their results together. 2/6
- Replying to @robert_csordas: Using our “depth score” to measure the maximal depth of computation for an input, we show that multi-hop questions and math questions of varying difficulty use identical computation depth, confirming the lack of composition. 3/6
- I'm happy to announce that our paper "The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers" has been accepted to #EMNLP2021! paper: arxiv.org/abs/2108.12284 code: github.com/robertcsordas/… 1/4
- Come visit our poster "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" on Thursday at 11 am in East Exhibit Hall A-C on #NeurIPS2024. With @PiotrPiekosAI, Kazuki Irie and @SchmidhuberAI.
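
The mixed-precision tip above (scaling Q and K by d_model^(-1/4) instead of scaling the logits by d_model^(-1/2)) can be illustrated with a small pure-Python sketch. It is not taken from any of the linked repositories: the helper names, the toy vectors, and the use of a single dimension `d` are assumptions made for illustration. The two variants give the same logit in exact arithmetic, but the unscaled dot product in the standard variant can exceed the largest finite float16 value (65504), while the pre-scaled variant never forms that large intermediate.

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def logit_scale_after(q, k):
    """Standard variant: compute q.k first, then multiply by d^(-1/2).

    Returns the logit and the magnitude of the unscaled intermediate.
    """
    d = len(q)
    raw = dot(q, k)  # this intermediate is what may overflow in half precision
    return raw * d ** -0.5, abs(raw)


def logit_scale_before(q, k):
    """Variant from the tip: multiply q and k by d^(-1/4) each, then q.k.

    Returns the logit and the magnitude of the dot product that is formed.
    """
    d = len(q)
    s = d ** -0.25
    q_scaled = [x * s for x in q]
    k_scaled = [x * s for x in k]
    logit = dot(q_scaled, k_scaled)
    return logit, abs(logit)


# Toy example (hypothetical values): d = 64 with large activations.
d = 64
q = [40.0] * d
k = [40.0] * d

after, peak_after = logit_scale_after(q, k)
before, peak_before = logit_scale_before(q, k)

print("same logit:", math.isclose(after, before, rel_tol=1e-9))
print("unscaled q.k exceeds fp16 range:", peak_after > FP16_MAX)
print("pre-scaled q.k exceeds fp16 range:", peak_before > FP16_MAX)
```

In a real training setup the same idea would apply to the Q and K tensors before the matrix multiplication; whether it resolves a given loss explosion depends on the model and is not something this sketch can show.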