🚨 NEW PAPER DROP!
Wouldn't it be nice if LLMs could spot and correct their own mistakes? And what if we could do so directly from pre-training, without any SFT or RL?
We present a new class of discrete diffusion models, called GIDD, that are able to do just that: 🧵1/12
gpt-oss is probably the most standard MoE transformer that ever was. Couple of details worth noting:
- Uses attention sinks (a.k.a. registers)
- Sliding window attention in every second layer
- YaRN context window extension
- RMSNorm without biases
- No QK norm, no attn. softcap
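For anyone wondering what an attention sink does to the softmax, here is a minimal sketch. It assumes the sink is a single learned logit per head that joins the normalization but carries no value vector. The function name `softmax_with_sink` and the toy numbers are made up for illustration and are not taken from the gpt-oss code.

```python
import math

def softmax_with_sink(scores, sink_logit):
    """Softmax over `scores` with one extra sink logit in the denominator.

    The sink takes part in normalization but has no value vector of its own
    (an assumption of this sketch), so the returned weights sum to less than 1.
    """
    m = max(max(scores), sink_logit)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    sink = math.exp(sink_logit - m)
    total = sum(exps) + sink
    return [e / total for e in exps]

# Two equally scored tokens plus a sink with the same logit:
# each token gets a third of the mass and the sink absorbs the rest.
weights = softmax_with_sink([0.0, 0.0], sink_logit=0.0)
```

The point is that a head with nothing useful to attend to can put its mass on the sink instead of on a real token.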
🚨📜 Announcing FABRIC, a training-free method for using iterative feedback to improve the results of any Stable Diffusion model.
Instead of spending hours to find the right prompt, just click 👍/👎 to tell the model what exactly you want.
🤗 Demo: huggingface.co/spaces/dvruett…
I feel like this completely flew under the radar despite being a huge deal for discrete diffusion models:
DreamOn is a 7B dLLM that can do variable-length generation, solving something that has been a huge challenge!
The idea is clever: Let's just randomly insert <|delete|>
🚨 OpenAssistant has just been released! Dataset and trained models with near-ChatGPT quality are available for download to everyone.
You can even try out our biggest model (based on LLaMA-30B) through a chat interface in your browser right now!
open-assistant.io/chat
✨ FABRIC plugin for SD WebUI is now available in alpha for testing. Check it out and let us know what you think!
github.com/dvruette/sd-we…
Also make sure to share your creations! We're excited to see what you talented folks out there can create with it ❤️
🚨📜 Announcing our latest work on LLM interpretability: We are able to control a model's humor, creativity, quality, truthfulness, and compliance by applying concept vectors to its hidden neural activations. 🧵
arxiv.org/abs/2402.14433
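A rough sketch of what "applying a concept vector to hidden activations" can look like. It assumes additive steering (hidden state plus a scaled vector) and a difference-of-means concept vector, which is one common recipe and not necessarily the one used in the paper. The names `mean_difference`, `steer`, and `alpha` are hypothetical.

```python
def mean_difference(positive, negative):
    """Concept vector as mean(positive activations) - mean(negative activations).

    Assumption: this is one common way to get a steering direction,
    not necessarily the paper's method.
    """
    dim = len(positive[0])
    pos_mean = [sum(row[j] for row in positive) / len(positive) for j in range(dim)]
    neg_mean = [sum(row[j] for row in negative) / len(negative) for j in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]

def steer(hidden, concept, alpha):
    """Shift one hidden activation along the concept direction by `alpha`."""
    if len(hidden) != len(concept):
        raise ValueError("hidden and concept must have the same dimension")
    return [h + alpha * c for h, c in zip(hidden, concept)]

# Toy 2-d activations for examples with and without the concept.
concept = mean_difference([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
steered = steer([1.0, 2.0], concept, alpha=2.0)
```

A positive `alpha` pushes the activation toward the concept and a negative one pushes it away.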
Weight decay is truly evil. It looks worse for the first 400k steps and then suddenly overtakes on the home stretch...
But I can't even be mad, we've been warned about exactly this
Now that gpt-oss has made attention sinks all the rage again, I can't help but wonder why nobody is using attention bias, seemingly a strictly superior solution?
Minimal overhead, no awkward extra tokens, easy to implement.
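The post doesn't define "attention bias", so here is one possible reading as a sketch: a learned key bias and value bias per head, appended to the keys and values so the softmax always has one extra slot to attend to. The names `attend_with_bias`, `k_bias`, and `v_bias` are assumptions for illustration only.

```python
import math

def attend_with_bias(q, keys, values, k_bias, v_bias):
    """Single-query attention with one learned (k_bias, v_bias) pair appended.

    Assumption: "attention bias" means an extra learned key/value slot per
    head. No extra token is added to the input sequence.
    """
    all_keys = keys + [k_bias]
    all_values = values + [v_bias]
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in all_keys]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(all_values[0])
    return [sum(w * v[j] for w, v in zip(weights, all_values)) for j in range(dim)]

# With a zero query every logit is 0, so the real token and the bias slot
# share the attention mass equally; a zero v_bias halves the output.
out = attend_with_bias(
    q=[0.0, 0.0],
    keys=[[1.0, 0.0]],
    values=[[2.0, 4.0]],
    k_bias=[0.0, 1.0],
    v_bias=[0.0, 0.0],
)
```

Under this reading, a zero `v_bias` behaves much like a sink, except that how much mass it absorbs depends on the query through `k_bias`.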