Sainbayar Sukhbaatar (@tesatory) / X

Sainbayar Sukhbaatar

@tesatory

Memory Networks, Asymmetric Self-Play, CommNet, Adaptive-Span, System2Attention, Feedback Transformer, Multi-Token Attention

Joined May 2010

Sainbayar Sukhbaatar
@tesatory
Apr 12, 2025
Ten years ago in 2015 we published a paper called End-to-End Memory Networks (arxiv.org/abs/1503.08895). Looking back, this paper had many of the ingredients of current LLMs. Our model was the first language model that completely replaced RNN with attention. It had dot-product
Andrej Karpathy
@karpathy
Dec 3, 2024
The (true) story of development and inspiration behind the "attention" operator, the one in "Attention is All you Need" that introduced the Transformer. From personal email correspondence with the author @DBahdanau ~2 years ago, published here and now (with permission) following
Sainbayar Sukhbaatar
@tesatory
Jan 26, 2021
We updated our Feedback Transformer paper with new experiments. Transformers fail on very simple algorithmic tasks because they are feedforward models. A simple fix is to attend to higher-level representations (it's like remembering our past thoughts) arxiv.org/abs/2002.09402
Sainbayar Sukhbaatar
@tesatory
Apr 3, 2019
My doctoral dissertation won an award 😊
Sainbayar Sukhbaatar
@tesatory
Jul 10, 2019
We released our code for adaptive-span! It can train a Transformer with a context size of 8k tokens github.com/facebookresear… #ACL2019
GitHub - facebookresearch/adaptive-span: Transformer training code for sequential tasks
Sainbayar Sukhbaatar
@tesatory
May 17, 2021
Our new work on "forgetting" got into ICML (long talk)! TLDR: compute a "date" for each memory, and gradually forget it when it's that date. We can see it learns to remember names (unlike me) ai.facebook.com/research/publi…
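The "date" idea in the post above can be sketched in a few lines. This is only an illustration, not the authors' code: the function name `expire_mask`, the parameter `ramp`, and the linear fade-out are assumptions, and in the real model the span of each memory would be predicted by the network rather than passed in by hand.

```python
def expire_mask(written_at, span, now, ramp=4.0):
    """Soft retention weight in [0, 1] for a single memory.

    written_at: step at which the memory was stored
    span:       number of steps the memory should be kept
                (hypothetical stand-in for a learned value)
    now:        current step
    ramp:       length of the linear fade-out after the expiry date
                (an assumption made for this sketch)
    """
    # Steps left before the memory reaches its expiry "date".
    remaining = span - (now - written_at)
    # Fully kept while remaining >= 0, then fades linearly to 0 over `ramp` steps.
    return max(0.0, min(1.0, 1.0 + remaining / ramp))


# A memory written at step 0 with a span of 10 steps.
weights = [expire_mask(0, 10, t) for t in (5, 12, 14, 20)]
```

The resulting weight would multiply the attention given to that memory, so a memory past its date fades out gradually instead of vanishing in a single step.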
Sainbayar Sukhbaatar
@tesatory
Jul 20, 2021
Just gave a talk at ICML from my home country Mongolia. Doing things remotely is amazing!
Sainbayar Sukhbaatar
@tesatory
May 5, 2025
3 papers accepted to #ICLR2025 🎉 1. Thinking LLMs, which trains LLMs to think before answering on non-verifiable tasks. It came out before R1 and uses DPO instead of GRPO. It also doesn't use any external CoT data (arxiv.org/abs/2410.10630)
Sainbayar Sukhbaatar
@tesatory
Jul 2, 2020
When it comes to bringing home our citizens stranded abroad, it's impossible. Whether they sleep outside or starve to death doesn't matter. But when it comes to Naadam and fountains, it's possible, and billions upon billions get thrown around. Is this humane?
Sainbayar Sukhbaatar
@tesatory
Oct 15, 2024
Thinking is an integral part of general intelligence, not just for solving math problems. We show that you can train your very own Thinking LLM easily, without human data.
Jason Weston
@jaseweston
Oct 15, 2024
🚨New work: Thinking LLMs!🚨 - Introduces Thought Preference Optimization (TPO) - Trains LLMs to think & respond for *all* instruction following tasks, not just math - Gives gains on AlpacaEval (beats GPT-4 & Llama3-70b) & ArenaHard with an 8B model arxiv.org/abs/2410.10630 🧵1/4
Sainbayar Sukhbaatar
@tesatory
Sep 17, 2024
We have released our code for Contextual Position Encoding (CoPE) so you can try it out. Thanks @lanjanice @OlgaNLP for making it happen! github.com/facebookresear…
Jason Weston
@jaseweston
May 30, 2024
🚨 Contextual Position Encoding (CoPE) 🚨 Context matters! CoPE is a new positional encoding method for transformers that takes into account *context*. - Can "count" distances per head dependent on need, e.g. i-th sentence or paragraph, words, verbs, etc. Not just tokens. -
Sainbayar Sukhbaatar
@tesatory
Feb 23, 2024
🎉 New paper 🎉 We teach Transformers to do A* search (I had to relearn how A* works). Then, we're curious to see if it can self-improve, and it did surprisingly well. This direction of search, plan, self-improve is very exciting!
AK
@_akhaliq
Feb 23, 2024
Meta presents "Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping". While Transformers have enabled tremendous progress in various application settings, such architectures still lag behind traditional symbolic planners for solving complex decision making
Sainbayar Sukhbaatar
@tesatory
Nov 21, 2023
Attention‼️ well actually System 2 Attention. Answers from LLMs tend to be affected by their context, even if it's irrelevant. We propose a more deliberate attention mechanism to solve this issue.
Jason Weston
@jaseweston
Nov 21, 2023
🚨 New paper! 🚨 We introduce System 2 Attention (S2A). - Soft attention in Transformers is susceptible to irrelevant/biased info - S2A uses LLM reasoning to generate what to attend to. Improves factuality & objectivity, decreases sycophancy. arxiv.org/abs/2311.11829 🧵(1/5)
Sainbayar Sukhbaatar
@tesatory
Jun 30, 2025
We find semi-online DPO works as well as GRPO!
Jason Weston
@jaseweston
Jun 30, 2025
🌉 Bridging Offline & Online RL for LLMs 🌉 📝: arxiv.org/abs/2506.21495 New paper shows on verifiable & non-verifiable tasks: - Online DPO & GRPO give similar performance. - Semi-online (iterative) DPO with sync every s steps (more efficient!) works very well also. - Offline DPO
Sainbayar Sukhbaatar
@tesatory
Aug 24, 2019
A blog post about our two recent papers on transformer networks is out! Of course with better graphics.
AI at Meta
@AIatMeta
Aug 23, 2019
Facebook AI researchers are sharing an all-attention layer to simplify the Transformer model and an adaptive attention span method to make it more efficient. Even with a much simpler architecture, these methods match or improve state-of-the-art results. ai.facebook.com/blog/making-tr…