Csordás Róbert

239 posts

@robert_csordas

RS @OpenAI. Ex postdoc at Stanford working on systematic generalization and algorithmic reasoning. Ex IDSIA PhD, Ex @DeepMind intern. Views are my own.

Switzerland

robertcsordas.github.io

Joined June 2016

Csordás Róbert
@robert_csordas
May 27, 2025
Your language model is wasting half of its layers to just refine probability distributions rather than doing interesting computations. In our paper, we found that the second half of the layers of the Llama 3 models have minimal effect on future computations. 1/6
121K
Csordás Róbert
@robert_csordas
Nov 21, 2023
If you are training Transformers in mixed precision and you experience a systematic explosion in the loss always around the same iteration, consider scaling Q and K values by d_model^(-1/4) before computing the logit matrix instead of scaling the logits by d_model^(-1/2). (1/2)
52K
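A minimal sketch of the idea in the post above, not the author's code. It assumes that the problem being avoided is the unscaled Q·K product overflowing the float16 range before the usual scaling is applied. The helper names (`to_fp16`, `logit_post_scaled`, `logit_pre_scaled`) and the example values are hypothetical, `d` stands in for the dimension the post calls d_model, and float16 storage is emulated with the standard library's `struct` half-precision format rather than a real mixed-precision kernel.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to float16 and back; return +/-inf if it does not fit."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        # Values beyond the float16 range (about 65504) overflow.
        return math.copysign(math.inf, x)


def logit_post_scaled(q, k):
    """Usual order: compute q.k, store it in float16, then scale by d^(-1/2)."""
    d = len(q)
    raw = to_fp16(sum(a * b for a, b in zip(q, k)))  # may already be inf
    return to_fp16(raw * d ** -0.5)


def logit_pre_scaled(q, k):
    """Suggested order: scale q and k by d^(-1/4) each, then compute the product."""
    d = len(q)
    s = d ** -0.25
    qs = [to_fp16(a * s) for a in q]
    ks = [to_fp16(b * s) for b in k]
    return to_fp16(sum(a * b for a, b in zip(qs, ks)))


# Hypothetical example: large activations with d = 64.
# The exact logit is 64 * 40 * 40 / sqrt(64) = 12800, which fits in float16,
# but the unscaled product 102400 does not.
q_big = [40.0] * 64
k_big = [40.0] * 64
print("post-scaled:", logit_post_scaled(q_big, k_big))
print("pre-scaled: ", logit_pre_scaled(q_big, k_big))
```

Both orderings are mathematically identical, since (q · d^(-1/4)) · (k · d^(-1/4)) = (q · k) · d^(-1/2). The only difference in this sketch is which intermediate value has to fit in float16.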
Csordás Róbert
@robert_csordas
Oct 17, 2023
I'm happy to announce that I successfully defended my PhD thesis, "Systematic Generalization in Connectionist Models" (robertcsordas.github.io/data/thesis.pdf). I’m thankful to my advisor @SchmidhuberAI and all my wonderful colleagues for this awesome journey!
34K
Csordás Róbert
@robert_csordas
May 27, 2025
Replying to @robert_csordas
In summary, LLMs are *not* using their depth efficiently. Thus, we call for future research on more efficient architectures and training objectives. With @chrmanning and @ChrisGPotts. Paper: arxiv.org/abs/2505.13898 Code: github.com/robertcsordas/… 6/6
arxiv.org
Do Language Models Use Their Depth Efficiently?
Modern LLMs are increasingly deep, and depth correlates with performance, albeit with diminishing returns. However, do these models use their depth efficiently? Do they compose more features to...
4.5K
Csordás Róbert
@robert_csordas
May 27, 2025
Replying to @robert_csordas
Our results suggest that recurrent architectures, such as MoEUT (arxiv.org/abs/2405.16039), might use their layers more effectively. 5/6
4.8K
Csordás Róbert
@robert_csordas
Nov 3, 2023
We are happy to announce that our paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers" got accepted to EMNLP Findings. With our improved MoE Transformers, we can match the performance of parameter-matched dense models.
17K
Csordás Róbert
@robert_csordas
May 4, 2021
Do NNs learn solutions that are modular? Our #ICLR2021 paper investigates functional modularity in NNs and finds that although weights specialize, they are not reused to implement the same functionality elsewhere in the network. This limits certain types of generalization.
Csordás Róbert
@robert_csordas
May 27, 2025
Replying to @robert_csordas
We train linear maps between Qwen 2.5 1.5B and 14B, and find that the layers at identical relative depth correspond to each other the best, indicating that deeper models are not doing new kinds of computation, but only performing more fine-grained adjustments to the residual. 4/6
4.8K
Csordás Róbert
@robert_csordas
Jun 18, 2024
Mixture-of-Experts Universal Transformer (MoEUT) is a new UT model that combines MoE MLP and MoE attention with a novel layer norm and grouping, making UTs competitive in language modeling for the first time. Paper: arxiv.org/abs/2405.16039 Code: github.com/robertcsordas/…
8.7K
Csordás Róbert
@robert_csordas
Jan 26, 2024
I’m thrilled to announce that starting February 1st, I'm joining @stanfordnlp as a postdoc, under the supervision of @chrmanning and @ChrisGPotts. Excited for this incredible opportunity!
19K
Csordás Róbert
@robert_csordas
May 27, 2025
Replying to @robert_csordas
For inputs involving many steps, the operands for each step remain important until an identical depth. This indicates that the model is *not* breaking down the computation, solving subproblems, and composing their results together. 2/6
5.6K
Csordás Róbert
@robert_csordas
May 27, 2025
Replying to @robert_csordas
Using our “depth score” to measure the maximal depth of computation for an input, we show that multi-hop questions and math questions of varying difficulty use identical computation depth, confirming the lack of composition. 3/6
5.1K
Csordás Róbert
@robert_csordas
Aug 30, 2021
I'm happy to announce that our paper "The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers" has been accepted to #EMNLP2021! paper: arxiv.org/abs/2108.12284 code: github.com/robertcsordas/… 1/4
Csordás Róbert
@robert_csordas
Dec 11, 2024
Come visit our poster "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" on Thursday at 11 am in East Exhibit Hall A-C on #NeurIPS2024. With @PiotrPiekosAI, Kazuki Irie and @SchmidhuberAI.
12K