Aurko Roy (@aurko79) / X

125 posts

Aurko Roy

@aurko79

San Francisco

scholar.google.com/citations?user…

Joined February 2025

Pinned
Aurko Roy
@aurko79
Jul 4, 2025
Excited to share what I worked on during my time at Meta. - We introduce a Triton-accelerated Transformer with *2-simplicial attention*—a tri-linear generalization of dot-product attention - We show how to adapt RoPE to tri-linear forms - We show 2-simplicial attention scales
149K
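To make the "tri-linear generalization of dot-product attention" in the pinned post concrete, here is a minimal pure-Python sketch. It assumes the usual 2-simplicial formulation: each query scores a pair of keys through a three-way product, the softmax runs jointly over all key pairs, and the output mixes elementwise products of two value vectors. The function name and the arguments `k1`, `k2`, `v1`, `v2` are illustrative, not taken from the paper's code, and the sketch omits causal masking, the sliding window, RoPE, and every Triton-level optimization.

```python
import math

def two_simplicial_attention(q, k1, k2, v1, v2):
    """Naive O(n^3 * d) reference. Each argument is a list of n vectors of length d."""
    n, d = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i in range(n):
        # Trilinear logit for the key pair (j, k): sum_l q[i][l] * k1[j][l] * k2[k][l]
        logits = [
            [scale * sum(q[i][l] * k1[j][l] * k2[k][l] for l in range(d))
             for k in range(n)]
            for j in range(n)
        ]
        # Softmax jointly over all (j, k) pairs, shifted by the max for stability
        m = max(max(row) for row in logits)
        weights = [[math.exp(x - m) for x in row] for row in logits]
        z = sum(sum(row) for row in weights)
        # Output: weighted sum of elementwise products v1[j] * v2[k]
        o = [0.0] * d
        for j in range(n):
            for k in range(n):
                w = weights[j][k] / z
                for l in range(d):
                    o[l] += w * v1[j][l] * v2[k][l]
        out.append(o)
    return out
```

With all keys set to zero, every pair gets equal weight, so each output row reduces to the elementwise product of the mean of `v1` and the mean of `v2`. That makes a convenient sanity check for the sketch.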
Aurko Roy
@aurko79
Oct 23, 2025
Who would have thought that a multi-trillion-dollar market cap company could have been thrown into such chaos (layoffs) by a single technical decision they made a year ago: using expert-choice MoEs for their frontier model.
256K
Aurko Roy
@aurko79
Jul 18, 2025
Got nerd sniped into checking out karpathy's nanoGPT github, I made the following changes to run 2-simplicial attention on my mac on Shakespeare:
- 6 layers, 6 heads, 384 dim
- reduced ctxt len to 32
- 2-simplicial attention with 32 x 32 x 32 window
- run for 5000 steps
30K
Aurko Roy
@aurko79
Jul 7, 2025
Last week at Meta - looking back on the last 3 months I spent there, feel lucky to have worked with some amazing folks: @vinaysrao, @saanarkethayan, @_t_chou, @__yjc_, @_arohan_, @agarwl_, @brandfonbrener, @afrozenator, @dvsaisurya, @manzilzaheer Excited for what's next!
23K
Aurko Roy
@aurko79
Oct 17, 2025
Really nice extension of 2-simplicial attention from sliding window local attention to content based sparse attention! Bonus: Also draws a nice connection to the Weisfeiler Leman algorithm, which I last had the occasion to think about 13 years ago for my master's thesis. :)
tensorqt
@tensorqt
Oct 16, 2025
we can go beyond attention. as some of you know, higher-order attention methods (and the resulting schizodrawings) have been my focus for a while now, and, despite my earlier plans, they ended up being my choice for the second post in the series titled "the graph side of
11K
Aurko Roy
@aurko79
Sep 5, 2025
Amazing work by the PyTorch team!
PyTorch
@PyTorch
Sep 5, 2025
FlashAttention in 3D? Our latest blog explores the #kernel design of 2-Simplicial #Attention, modeling the algorithm with a hardware aligned design and rewriting the entire kernel in TLX (Triton Low Level Extensions). 🔗 hubs.la/Q03H6S9D0 #PyTorch #OpenSourceAI
6.8K
Aurko Roy
@aurko79
Jul 4, 2025
Replying to @aurko79
Paper link:
arxiv.org
Fast and Simplex: 2-Simplicial Attention in Triton
Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count...
5.8K
Aurko Roy
@aurko79
Jul 4, 2025
Replying to @giffmana
Hanging out with @_arohan_ in SF
6K
Aurko Roy
@aurko79
Aug 6, 2025
Insight I had yesterday talking to someone: inference-time compute is a way to scale attention FLOPs relative to FFN FLOPs, since the ratio between them is n^2 d/(n d^2) = n/d, for sequence length n and model dimension d. In inference-time scaling, n grows while d remains fixed.
2.8K
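The n/d ratio in the post above can be checked with a few lines of arithmetic. This is a rough sketch that drops all constant factors (number of heads, FFN expansion factor, projection costs) and keeps only the leading terms the post uses: n^2 d for attention and n d^2 for the FFN. The function name and the example sizes are made up for illustration.

```python
def attention_to_ffn_ratio(n, d):
    """Ratio of leading-order attention FLOPs to FFN FLOPs, constants dropped."""
    attention_flops = n * n * d   # ~ n^2 * d: every token attends to every token
    ffn_flops = n * d * d         # ~ n * d^2: per-token dense layers
    return attention_flops / ffn_flops  # simplifies to n / d

# Hypothetical model width, held fixed while the sequence grows
d = 4096
for n in (1024, 4096, 32768):
    print(n, attention_to_ffn_ratio(n, d))
```

Holding d fixed, the ratio grows linearly with n, which is the point of the post: longer generations at inference time shift the compute mix toward attention.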
Aurko Roy
@aurko79
Jul 18, 2025
Replying to @keveman
Code snippet and efficient Triton kernels are in our paper
arxiv.org
Fast and Simplex: 2-Simplicial Attention in Triton
Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count...
1.5K
Aurko Roy
@aurko79
Jul 4, 2025
Replying to @aurko79
Special shout-out to @_t_chou for some amazing work on Triton kernels!
4.9K
Aurko Roy
@aurko79
Sep 17, 2025
Had the pleasure of grabbing a beer with the inspiring @danielmurfet and talking about attention, intelligence, math and Grothendieck at Berkeley this weekend! 🍻
1.8K
Aurko Roy
@aurko79
Sep 19, 2025
Agreed, this is why we worked on 2-simplicial attention and OSSed the kernels: Paper: arxiv.org/abs/2507.02754 Blog post: pytorch.org/blog/fast-2-si…
Percy Liang
@percyliang
Sep 19, 2025
-2016 (classic era): focus on data efficiency 2017-2025 (pretraining era): focus on compute efficiency 2026-: focus on data efficiency (again) The standard Transformer paradigm is optimized for compute efficiency. As we look at data efficiency, we'll see very different design
4K
Aurko Roy
@aurko79
Oct 23, 2025
Replying to @suchenzang
Chunked attention, aka Long Range Arena (LRA) "local attention" from github.com/google-researc… @msaffar3 @_arohan_ @ashVaswani and I spent many days trying to figure out how local attention could be worse than some of the methods listed there.
GitHub - google-research/long-range-arena: Long Range Arena for Benchmarking Efficient Transformers
From github.com
6.2K