Sainbayar Sukhbaatar
@tesatory
Memory Networks, Asymmetric Self-Play, CommNet, Adaptive-Span, System2Attention, Feedback Transformer, Multi-Token Attention
Joined May 2010
Posts
  • Ten years ago in 2015 we published a paper called End-to-End Memory Networks (arxiv.org/abs/1503.08895). Looking back, this paper had many of the ingredients of current LLMs. Our model was the first language model that completely replaced RNN with attention. It had dot-product
    The (true) story of development and inspiration behind the "attention" operator, the one in "Attention is All you Need" that introduced the Transformer. From personal email correspondence with the author @DBahdanau ~2 years ago, published here and now (with permission) following
  • We updated our Feedback Transformer paper with new experiments. Transformers fail on very simple algorithmic tasks because they are feedforward models. A simple fix is to attend to higher-level representations (it's like remembering our past thoughts): arxiv.org/abs/2002.09402
  • My doctoral dissertation received an award 😊
  • We released our code for adaptive-span! It can train a Transformer with a context size of 8k tokens github.com/facebookresear… #ACL2019
  • Our new work on "forgetting" got into ICML (long talk)! TLDR: compute a "date" for each memory, and gradually forget it when that date arrives. We can see it learns to remember names (unlike me) ai.facebook.com/research/publi…
  • Just gave a talk at ICML from my home country Mongolia. Doing things remotely is amazing!
  • 3 papers accepted to #ICLR2025 🎉 1. Thinking LLMs, which trains LLMs to think before answering on non-verifiable tasks. It came out before R1 and uses DPO instead of GRPO. It also doesn't use any external CoT data (arxiv.org/abs/2410.10630)
  • When it comes to bringing home citizens stranded abroad, it's impossible. Whether they sleep outside or starve to death doesn't matter. But when it comes to Naadam and fountains, it's possible, and billions upon billions get thrown around. Is this humane?
  • Thinking is an integral part of general intelligence, not just for solving math problems. We show that you can train your very own Thinking LLM easily, without human data.
    🚨 New work: Thinking LLMs! 🚨 - Introduces Thought Preference Optimization (TPO) - Trains LLMs to think & respond for *all* instruction following tasks, not just math - Gives gains on AlpacaEval (beats GPT-4 & Llama3-70b) & ArenaHard with an 8B model arxiv.org/abs/2410.10630 🧵1/4
  • We have released our code for Contextual Position Encoding (CoPE) so you can try it out. Thanks @lanjanice @OlgaNLP for making it happen! github.com/facebookresear…
    🚨 Contextual Position Encoding (CoPE) 🚨 Context matters! CoPE is a new positional encoding method for transformers that takes into account *context*. - Can "count" distances per head dependent on need, e.g. i-th sentence or paragraph, words, verbs, etc. Not just tokens. -
  • 🎉 New paper 🎉 We teach Transformers to do A* search (I had to relearn how A* works). Then we were curious to see if it could self-improve, and it did surprisingly well. This direction of search, plan, self-improve is very exciting!
    Meta presents Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping. While Transformers have enabled tremendous progress in various application settings, such architectures still lag behind traditional symbolic planners for solving complex decision making
  • Attention‼️ well actually System 2 Attention. Answers from LLMs tend to be affected by their context, even if it's irrelevant. We propose a more deliberate attention mechanism to solve this issue.
    🚨 New paper! 🚨 We introduce System 2 Attention (S2A). - Soft attention in Transformers is susceptible to irrelevant/biased info - S2A uses LLM reasoning to generate what to attend to. Improves factuality & objectivity, decreases sycophancy. arxiv.org/abs/2311.11829 🧵(1/5)
  • We find that semi-online DPO works as well as GRPO!
    🌉 Bridging Offline & Online RL for LLMs 🌉 📝: arxiv.org/abs/2506.21495 New paper shows on verifiable & non-verifiable tasks: - Online DPO & GRPO give similar performance. - Semi-online (iterative) DPO with sync every s steps (more efficient!) works very well also. - Offline DPO
  • A blog post about our two recent papers on transformer networks is out! Of course with better graphics.
    Facebook AI researchers are sharing an all-attention layer to simplify the Transformer model and an adaptive attention span method to make it more efficient. Even with a much simpler architecture, these methods match or improve state-of-the-art results. ai.facebook.com/blog/making-tr…
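For readers curious what an "adaptive attention span" might look like in practice, here is a minimal sketch in plain Python. It assumes the soft-mask form m(x) = min(max((R + z - x) / R, 0), 1), where x is the distance to a key, z is the span, and R is the width of a linear ramp. The function names, the fixed (non-learned) span, and the list-based scores are illustrative assumptions, not the released code.

```python
import math


def adaptive_span_mask(distance, span, ramp):
    """Soft mask in [0, 1]: 1 inside the span, linear decay over `ramp`, then 0."""
    return min(max((ramp + span - distance) / ramp, 0.0), 1.0)


def masked_attention(scores, span, ramp):
    """Softmax over scores, reweighted by the span mask and renormalized.

    scores[d] is the raw attention score of the key d tokens back.
    Assumes span >= 0 and ramp > 0, so the key at distance 0 is never masked out.
    """
    top = max(scores)  # subtract the max for numerical stability
    weighted = [
        adaptive_span_mask(d, span, ramp) * math.exp(s - top)
        for d, s in enumerate(scores)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


# Eight keys with equal scores; a hypothetical span of 4 and ramp of 2.
weights = masked_attention([0.0] * 8, span=4.0, ramp=2.0)
```

In this sketch, keys beyond `span + ramp` tokens back get exactly zero weight, which is what would let a head with a short span skip most of a long context. In a trainable version the span would be a learned parameter per head rather than a fixed float.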