Xiangming Gu
200 posts
@gu_xiangming
Prev: SR @GoogleDeepmind, Ph.D. @NUSingapore, B.E. @Tsinghua_Uni.
Singapore
guxm2021.github.io
Joined May 2022
526 Following · 2,018 Followers

  • Pinned
    Xiangming Gu
    @gu_xiangming
    Aug 5, 2025
    I noticed that @OpenAI added learnable bias to attention logits before softmax. After softmax, they deleted the bias. This is similar to what I have done in my ICLR2025 paper: openreview.net/forum?id=78Nn4…. I used learnable key bias and set corresponding value bias zero. In this way,
    290K
  • Xiangming Gu
    @gu_xiangming
    Nov 12, 2024
    Attention sink is important! arxiv.org/pdf/2410.10781 From my understanding, the functionality of "super weights" is to maintain the massive activations / attention sink in large language models. Please see my following comments.
    𝚐𝔪𝟾𝚡𝚡𝟾
    @gm8xx8
    Nov 12, 2024
    The Super Weight in Large Language Models paper: arxiv.org/abs/2411.07191 A tiny subset of critical parameters, called “super weights,” plays a major role in LLM performance. Removing even one can severely impact the model. This study introduces a method to identify these key
    18K
  • Xiangming Gu
    @gu_xiangming
    Apr 17, 2025
    I will be at 🇸🇬 #ICLR2025 during April 24-27 to present my spotlight paper about understanding attention sink in LLMs: openreview.net/forum?id=78Nn4…. Afterwards, I will be a student researcher @GoogleDeepMind in London. I am looking forward to connecting with old friends and making new
    12K
  • Xiangming Gu
    @gu_xiangming
    Dec 8, 2024
    I will be at #NeurIPS2024 to give an oral presentation on my recent paper "When Attention Sink Emerges in Language Models: An Empirical View" at the workshop on Attributing Model Behavior at Scale. Feel free to chat with me about anything, e.g. understanding LLMs, foundation models.
    5.1K
  • Xiangming Gu
    @gu_xiangming
    Aug 5, 2025
    Replying to @gu_xiangming
    Here are screenshots for the code in GPT-OSS.
    9.9K
  • Xiangming Gu
    @gu_xiangming
    May 1, 2025
    I am glad to see such research on mitigating attention sink and massive activations. I have always believed that attention sink provides a new perspective on LLM architecture. Actually, I also designed linear attention operations, which could have no attention sink or massive
    zed
    @zmkzmkz
    Apr 30, 2025
    EARLY PREPRINT: Softpick: No Attention Sink, No Massive Activations with Rectified Softmax Why do we use softmax in attention, even though we don’t really need non-zero probabilities that sum to one, causing attention sink and large hidden state activations? Let that sink in.
    6K
  • Xiangming Gu
    @gu_xiangming
    Aug 5, 2025
    Replying to @archanfel_anoth @SonglinYang4 and @OpenAI
    Do you think the current way in GPT-OSS is better than the method I mentioned?
    6.4K
  • Xiangming Gu
    @gu_xiangming
    Aug 5, 2025
    Replying to @gu_xiangming
    And I think the difference is that GPT-OSS used 1 dimension for each head while I used d_k dimensions for each head. I am not sure whether this performs better. I had no chance to scale up my models previously.
    6.7K
  • Xiangming Gu
    @gu_xiangming
    Aug 5, 2025
    Replying to @xidulu and @OpenAI
    Actually, this trick does not remove the attention sink. It just moves the attention sink to the learnable bias or learnable k bias. The role of massive activations (or outlier features) is to provide such a bias. Now that we make it learnable, there is no need for these outliers. Please check
    2.7K
  • Xiangming Gu
    @gu_xiangming
    Aug 5, 2025
    The bias does not remove the attention sink. It just moves the attention sink to the learnable bias or learnable k bias. The role of massive activations (or outlier features) is to provide such a bias. Now that we make it learnable, there is no need for these outliers. As for values, this is
    Xiangming Gu
    @gu_xiangming
    Aug 5, 2025
    I noticed that @OpenAI added learnable bias to attention logits before softmax. After softmax, they deleted the bias. This is similar to what I have done in my ICLR2025 paper: openreview.net/forum?id=78Nn4…. I used learnable key bias and set corresponding value bias zero. In this way,
    3K
  • Xiangming Gu
    @gu_xiangming
    Aug 8, 2025
    Very insightful and comprehensive blog. Must read on attention sinks.
    Guangxuan Xiao
    @Guangxuan_Xiao
    Aug 8, 2025
    I've written the full story of Attention Sinks — a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI's new OSS models. For those interested in the details: hanlab.mit.edu/blog/streaming…
    1.6K
  • Xiangming Gu
    @gu_xiangming
    Jul 25, 2025
    Please check our latest COLM2025 paper: arxiv.org/abs/2504.02732. The main takeaway is that LLMs use attention sink to avoid over-mixing. I think the idea here is essentially the same.
    Parachutes
    @mingwuzheng
    Jun 17, 2025
    Why do transformers excessively attend to the first token? Our new hypothesis: it's the model's way to perform a context-dependent identity operation. We provide strong empirical evidence supporting this explanation. Read more here: quilted-agreement-28c.notion.site/Why-Does-Atten… #AI #NLP #Transformer
    1.5K
  • Xiangming Gu
    @gu_xiangming
    Jun 27, 2025
    Hhh, accidentally unlocked a profession in magic, though not a pro in Dartz. Had a great time with the team @PetarV_93
    Petar Veličković
    @PetarV_93
    Jun 27, 2025
    our student researcher @gu_xiangming seems to be a part-time magician -- at least i was taught in probability class that hitting a dart with another dart should be mathematically (near-)impossible
    748
  • Xiangming Gu
    @gu_xiangming
    Nov 21, 2024
    Very solid work from my colleague and schoolmate @Haonan_Wang_. We have discussed how attention sink plays an important role in such a phenomenon.
    𝚐𝔪𝟾𝚡𝚡𝟾
    @gm8xx8
    Nov 21, 2024
    When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training paper: arxiv.org/abs/2411.13476 code: github.com/haonan3/Anchor…
    645
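
The Aug 5, 2025 posts above describe two variants of the same idea: a learnable bias added to the attention logits before softmax and dropped afterwards (one scalar per head), and a learnable key bias of d_k dimensions paired with a zero value bias. The following is only a rough single-head NumPy sketch of how the two might relate. It is not the GPT-OSS code or the paper's implementation; the function names, the causal mask, and the scaling are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_mask(t):
    # True where a query must NOT attend (future positions).
    return np.triu(np.ones((t, t), dtype=bool), k=1)


def attention_with_sink_logit(q, k, v, sink_logit):
    """Hypothetical variant 1: one learnable scalar per head.

    The scalar is appended as an extra logit column before softmax and
    that column is dropped after softmax, so it only absorbs probability mass.
    """
    t, d_k = q.shape
    logits = q @ k.T / np.sqrt(d_k)                      # (t, t)
    logits = np.where(causal_mask(t), -np.inf, logits)
    sink_col = np.full((t, 1), sink_logit)               # (t, 1)
    probs = softmax(np.concatenate([logits, sink_col], axis=-1))
    return probs[:, :-1] @ v, probs[:, -1]               # output, sink mass


def attention_with_key_bias(q, k, v, k_bias):
    """Hypothetical variant 2: a learnable d_k-dim key bias, value bias zero.

    The bias acts as one extra key that every query can attend to; its value
    is the zero vector, so attending to it contributes nothing to the output.
    """
    t, d_k = q.shape
    k_ext = np.concatenate([k, k_bias[None, :]], axis=0)            # (t+1, d_k)
    v_ext = np.concatenate([v, np.zeros((1, v.shape[1]))], axis=0)  # (t+1, d_v)
    logits = q @ k_ext.T / np.sqrt(d_k)                             # (t, t+1)
    mask = np.concatenate([causal_mask(t), np.zeros((t, 1), dtype=bool)], axis=1)
    probs = softmax(np.where(mask, -np.inf, logits))
    return probs @ v_ext, probs[:, -1]                              # output, sink mass
```

In this sketch the only difference is how the extra logit is produced: a query-independent scalar in the first variant, and a query-dependent dot product with `k_bias` in the second. In both cases the attention weights over real tokens sum to less than one, with the remainder going to the learnable sink rather than to an actual token.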