Xiangming Gu (@gu_xiangming) / X

200 posts

Prev: SR @GoogleDeepmind, Ph.D. @NUSingapore, B.E. @Tsinghua_Uni.

Singapore

Joined May 2022

Pinned
Xiangming Gu
@gu_xiangming
Aug 5, 2025
I noticed that @OpenAI added a learnable bias to the attention logits before softmax. After softmax, they dropped the bias. This is similar to what I have done in my ICLR2025 paper: openreview.net/forum?id=78Nn4…. I used a learnable key bias and set the corresponding value bias to zero. In this way,
290K
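A minimal sketch of the mechanism the pinned post describes: a bias joins the attention logits for the softmax normalization, and its probability is dropped afterwards. The names `softmax_with_sink` and `sink_logit` and the example numbers are assumptions for illustration; this is not the GPT-OSS code.

```python
import math

def softmax_with_sink(logits, sink_logit):
    """Softmax over `logits` plus one extra sink logit; the sink's share is then dropped.

    `sink_logit` stands in for the learnable bias (hypothetical name, a single scalar here).
    """
    m = max(max(logits), sink_logit)          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    sink_exp = math.exp(sink_logit - m)       # the bias takes part in the normalization...
    total = sum(exps) + sink_exp
    return [e / total for e in exps]          # ...but its probability is not returned

logits = [0.5, -1.0, 2.0]                     # made-up attention logits for one query
weights = softmax_with_sink(logits, sink_logit=3.0)
sink_mass = 1.0 - sum(weights)                # attention mass absorbed by the bias
print(weights, sink_mass)
```

The returned weights sum to less than one, and the remainder is the mass that went to the bias rather than to any real token.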
Xiangming Gu
@gu_xiangming
Nov 12, 2024
Attention sink is important! arxiv.org/pdf/2410.10781 From my understanding, the function of "super weights" is to maintain the massive activations / attention sink in large language models. Please see my comments below.
𝚐𝔪𝟾𝚡𝚡𝟾
@gm8xx8
Nov 12, 2024
The Super Weight in Large Language Models paper: arxiv.org/abs/2411.07191 A tiny subset of critical parameters, called “super weights,” plays a major role in LLM performance. Removing even one can severely impact the model. This study introduces a method to identify these key
18K
Xiangming Gu
@gu_xiangming
Apr 17, 2025
I will be at 🇸🇬 #ICLR2025 during April 24-27 to present my spotlight paper about understanding attention sink in LLMs: openreview.net/forum?id=78Nn4…. Afterwards, I will be a student researcher @GoogleDeepMind in London. I am looking forward to connecting with old friends and making new
12K
Xiangming Gu
@gu_xiangming
Dec 8, 2024
I will be at #NeurIPS2024 and will give an oral presentation about my recent paper "When Attention Sink Emerges in Language Models: An Empirical View" at the workshop on Attributing Model Behavior at Scale. Feel free to chat with me about anything, e.g. understanding LLMs, foundation models.
5.1K
Xiangming Gu
@gu_xiangming
Aug 5, 2025
Replying to @gu_xiangming
Here are screenshots for the code in GPT-OSS.
9.9K
Xiangming Gu
@gu_xiangming
May 1, 2025
I am glad to see such research on mitigating attention sink and massive activations. I have always believed that attention sink provides a new perspective on LLM architecture. Actually, I also designed linear attention operations, which could have no attention sink or massive
zed
@zmkzmkz
Apr 30, 2025
EARLY PREPRINT: Softpick: No Attention Sink, No Massive Activations with Rectified Softmax Why do we use softmax in attention, even though we don’t really need non-zero probabilities that sum to one, causing attention sink and large hidden state activations? Let that sink in.
6K
Xiangming Gu
@gu_xiangming
Aug 5, 2025
Replying to @archanfel_anoth @SonglinYang4 and @OpenAI
Do you think the current way in GPT-OSS is better than the method I mentioned?
6.4K
Xiangming Gu
@gu_xiangming
Aug 5, 2025
Replying to @gu_xiangming
And I think the difference is that GPT-OSS used 1 dimension for each head while I used d_k dimensions for each head. I am not sure whether this performs better. I had no chance to scale up my models previously.
6.7K
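To make the comparison in this reply concrete, here is a small sketch of the two variants: one learnable scalar per head versus a learnable key bias with d_k entries paired with a zero value. The function names, the 1/sqrt(d_k) scaling of the key-bias logit, and the toy numbers are assumptions; neither function is taken from GPT-OSS or from the paper's code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)                                # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend_scalar_sink(q, keys, values, sink_logit):
    """One learnable scalar per head: the same sink logit for every query."""
    scale = math.sqrt(len(q))
    logits = [dot(q, k) / scale for k in keys] + [sink_logit]
    w = softmax(logits)[:-1]                   # drop the sink's weight after softmax
    return [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]

def attend_key_bias(q, keys, values, k_bias):
    """Learnable key bias with d_k entries, paired with an all-zero value."""
    scale = math.sqrt(len(q))
    logits = [dot(q, k) / scale for k in keys + [k_bias]]
    w = softmax(logits)
    vals = values + [[0.0] * len(values[0])]   # the bias slot contributes nothing to the output
    return [sum(wi * v[j] for wi, v in zip(w, vals)) for j in range(len(vals[0]))]

q = [1.0, -0.5]                                # toy query, d_k = 2
keys = [[0.2, 0.1], [1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
k_bias = [2.0, 1.0]                            # hypothetical learned key bias

out_key_bias = attend_key_bias(q, keys, values, k_bias)
out_scalar = attend_scalar_sink(q, keys, values, dot(q, k_bias) / math.sqrt(len(q)))
print(out_key_bias, out_scalar)
```

For a single query the two outputs coincide once the scalar is set to the scaled dot product of the query with the key bias. Across queries they differ: the key-bias sink logit changes with the query, while the per-head scalar stays fixed.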
Xiangming Gu
@gu_xiangming
Aug 5, 2025
Replying to @xidulu and @OpenAI
Actually, this trick does not remove the attention sink. It just moves the attention sink to the learnable bias or learnable k bias. The role of massive activations (or outlier features) is to provide such a bias. Now that we make it learnable, there is no need for these outliers. Please check
2.7K
Xiangming Gu
@gu_xiangming
Aug 5, 2025
The bias does not remove the attention sink. It just moves the attention sink to the learnable bias or learnable k bias. The role of massive activations (or outlier features) is to provide such a bias. Now that we make it learnable, there is no need for these outliers. As for values, this is
Xiangming Gu
@gu_xiangming
Aug 5, 2025
I noticed that @OpenAI added a learnable bias to the attention logits before softmax. After softmax, they dropped the bias. This is similar to what I have done in my ICLR2025 paper: openreview.net/forum?id=78Nn4…. I used a learnable key bias and set the corresponding value bias to zero. In this way,
3K
Xiangming Gu
@gu_xiangming
Aug 8, 2025
A very insightful and comprehensive blog post. A must-read on attention sinks.
Guangxuan Xiao
@Guangxuan_Xiao
Aug 8, 2025
I've written the full story of Attention Sinks — a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI's new OSS models. For those interested in the details: hanlab.mit.edu/blog/streaming…
1.6K
Xiangming Gu
@gu_xiangming
Jul 25, 2025
Please check our latest COLM2025 paper: arxiv.org/abs/2504.02732. The main takeaway is that LLMs use attention sink to avoid over-mixing. I think the idea here is essentially the same.
Parachutes
@mingwuzheng
Jun 17, 2025
Why do transformers excessively attend to the first token? Our new hypothesis: it's the model's way to perform a context-dependent identity operation. We provide strong empirical evidence supporting this explanation. Read more here: quilted-agreement-28c.notion.site/Why-Does-Atten… #AI #NLP #Transformer
1.5K
Xiangming Gu
@gu_xiangming
Jun 27, 2025
Hhh, accidentally unlocked a profession in magic, though not a pro in Dartz. Had a great time with the team @PetarV_93
Petar Veličković
@PetarV_93
Jun 27, 2025
our student researcher @gu_xiangming seems to be a part-time magician -- at least i was taught in probability class that hitting a dart with another dart should be mathematically (near-)impossible
748
Xiangming Gu
@gu_xiangming
Nov 21, 2024
Very solid work from my colleague and schoolmate @Haonan_Wang_. We have discussed that attention sink plays an important role in this phenomenon.
𝚐𝔪𝟾𝚡𝚡𝟾
@gm8xx8
Nov 21, 2024
When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training paper: arxiv.org/abs/2411.13476 code: github.com/haonan3/Anchor…
645