Excited to present CASPR, a new optimization algorithm that approximates full-matrix Adagrad and is theoretically and empirically better than Shampoo. Drop by our poster at ICLR 2024 on May 7th, 4:30pm-6:30pm. Full paper: openreview.net/pdf?id=8j9hz8D…. Here’s a 🧵:
Excited to share what I worked on during my time at Meta.
- We introduce a Triton-accelerated Transformer with *2-simplicial attention*—a tri-linear generalization of dot-product attention
- We show how to adapt RoPE to tri-linear forms
- We show 2-simplicial attention scales
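A rough sketch of what a tri-linear generalization of dot-product attention can look like, for one head on plain Python lists. It is an illustration under assumptions, not the Triton kernel from the paper: the function name, the second key/value projections (`k2`, `v2`), the elementwise value product, and the missing scaling factor and RoPE handling are all choices made here for clarity.

```python
import math

def two_simplicial_attention(q, k1, k2, v1, v2):
    """Tri-linear attention sketch (hypothetical helper, single head).

    q, k1, k2, v1, v2 are lists of n vectors, each of length d.
    Each query attends over PAIRS of positions (j, m) instead of single keys.
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        # Tri-linear logit for every pair (j, m): sum_t q[i][t] * k1[j][t] * k2[m][t]
        logits = [
            sum(q[i][t] * k1[j][t] * k2[m][t] for t in range(d))
            for j in range(n)
            for m in range(n)
        ]
        # Numerically stable softmax over all n*n pairs
        peak = max(logits)
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        row = [0.0] * d
        for idx, w in enumerate(weights):
            j, m = divmod(idx, n)  # recover the pair from the flat index
            for t in range(d):
                # Assumed value combination: elementwise product v1[j] * v2[m]
                row[t] += (w / total) * v1[j][t] * v2[m][t]
        out.append(row)
    return out
```

Note the cost: the pair softmax makes this O(n^2) per query, so O(n^3) overall, which is why an efficient kernel matters in practice.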
📢 Thrilled to share our new paper, LASER: Attention with Exponential Transformation, accepted at ICML 2025, work done at Google.
Come by our poster presentation! 🗓️ Thurs, July 17th, 4:30-7pm 📍 West Exhibition Hall B2-B3, # W-915
Read the full paper here: arxiv.org/abs/2411.03493
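For readers wondering what "attention with exponential transformation" might mean concretely, here is a minimal sketch under one assumption: that attention is applied to exp(V) and the result is mapped back with a log, i.e. log(softmax(QK^T / sqrt(d)) exp(V)). That reading, the function name, and the max-shift stabilization are assumptions of this sketch; the paper is the authority on the actual formulation.

```python
import math

def laser_attention(q, k, v):
    """Sketch of log(softmax(q k^T / sqrt(d)) @ exp(v)) on plain lists.

    Hypothetical helper: q is a list of query vectors, k and v are lists
    of key and value vectors with the same number of rows.
    """
    d = len(q[0])
    dv = len(v[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for qi in q:
        # Standard scaled dot-product attention probabilities
        logits = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        peak = max(logits)
        w = [math.exp(x - peak) for x in logits]
        z = sum(w)
        probs = [x / z for x in w]
        row = []
        for t in range(dv):
            # Shift by the column max so exp() cannot overflow
            vmax = max(vj[t] for vj in v)
            s = sum(p * math.exp(vj[t] - vmax) for p, vj in zip(probs, v))
            row.append(vmax + math.log(s))
        out.append(row)
    return out
```

If every value in a column is the same constant, the output for that column is that constant, which matches ordinary attention. The two differ once the values vary.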
This highlights the importance of higher-order preconditioning methods for sample-efficient pretraining, especially in smaller models such as Gemini 1.5 Flash.
Just finished reading the Gemini 1.5 report and I'm blown away by the depth of information shared in such a competitive environment! 🤯 Most surprising was the revelation about their optimizer - they didn't just use Adam! Optimization is still alive and kicking! Kudos to the team
Full-matrix adaptive regularization at its best: Distributed Shampoo wins the AlgoPerf competition with a 28% wall-time improvement over strong first-order baselines across 8 diverse benchmarks: 2 ImageNet, 2 LibriSpeech, DLRM, WMT translation, U-Net-FMRI reconstruction, GNN.
this has been in the making for a while and the results are super exciting!!! second-order optimizers are back on top, beating Adam-style methods in a wall-time based benchmark 😄🥳, kudos to @GeorgeEDahl @KasimbegPriya for making this happen!!
I think LoRA-RITE (led by Jui-Nan Yen) is one of the most interesting/novel works I've been part of. Consider a function f(AB^T), where A and B are tall and thin parameter matrices. This function is invariant to any invertible matrix transformation M with A = A'M, B = B'M^{-T}. But is the optimization
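The invariance itself is easy to check numerically: AB^T = A'M (B'M^{-T})^T = A'M M^{-1} B'^T = A'B'^T. A small sketch with made-up 3x2 factors and a 2x2 M (all values and helper names are illustrative, not from the paper):

```python
def matmul(x, y):
    """Plain-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    return [list(r) for r in zip(*x)]

def inv2(m):
    """Inverse of a 2x2 matrix (assumes a non-zero determinant)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical tall-and-thin factors A', B' and an invertible mixing matrix M
A_prime = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B_prime = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
M = [[2.0, 1.0], [0.0, 1.0]]

A = matmul(A_prime, M)                    # A = A' M
B = matmul(B_prime, transpose(inv2(M)))   # B = B' M^{-T}

product_original = matmul(A_prime, transpose(B_prime))
product_transformed = matmul(A, transpose(B))

# Largest entrywise difference between A B^T and A' B'^T
max_gap = max(
    abs(p - t)
    for row_p, row_t in zip(product_original, product_transformed)
    for p, t in zip(row_p, row_t)
)
```

Here `max_gap` is zero up to floating-point error, even though A and B individually look nothing like A' and B'.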
Excited to announce DEXML, dual-encoders for extreme multi-label learning, poster @iclr_conf
TL;DR:
- InfoNCE/BCE are not suitable for multi-label retrieval (we propose appropriate changes)
- Just dual encoders are enough for challenging XMC benchmarks
arxiv.org/abs/2310.10636
Excited to try this new form of attention, plus they put the entire Triton kernel in the paper! 🤯 Some great authors including @Happylemon56775 @dvsaisurya and @_arohan_ arxiv.org/abs/2507.02754