Ready for GPU independence weekend?
PyTorch 2.0 and LLM Foundry now work out of the box on **AMD GPUs!** We profiled MPT 1B-13B models on AMD MI250 and saw perf within 80% of A100-40GB, which could go up to 94% with better software.
It. Just. Works.
Abhi Venigalla
San Francisco, CA
Joined October 2018
- CNBC leaks PaLM2-L training config, says it is: * 340B params * 3.6T tokens * 7.3e24 FLOPs using the (6*N*D) approx
- Replying to @ml_hardware: And yes, you can switch back and forth between NVIDIA and AMD, even within a single training run. It's Christmas in July! 🎄
- We built a new model! 🧱 It's called DBRX 🧱 * mixture of experts * 16 choose 4 experts * 36B active, 132B total * trained on 12T tokens * built e2e in 2 months * using 3072xH100 * served up to 150 tok/s on @Databricks * open weights :)
- This is literally my new LK-99 🙏🙏🙏 Update: more experimental results rolling in. Here it is against SGD with both the step-wise and cosine schedule (both baselines heavily tuned, no cheating). This is something special indeed!
- We're coming for all the models! This week our Vision team profiled Stable Diffusion on @MosaicML Cloud and found that training from scratch costs <$160k, and can be done in under 2 weeks. mosaicml.com/blog/training-…
- If you have Apple silicon and > 70GB of RAM, you can run DBRX on your laptop!! Kudos to @awnihannun :)
- Our Vision team is insane. The original Stable Diffusion reportedly cost $600k... and now we've reproduced it for $50k 🤯 and it took <1 week to train! All the training code is open-source! And we make it super fast + easy to customize on your own private data @MosaicML And now it's < $50k. 🖼️ Announcing @MosaicML's diffusion offering 📷 We replicated Stable Diffusion 2.0, training from scratch with huge speedup, and we can do it on your data too. Human eval showed the model to be indistinguishable from the original. Blog: mosaicml.com/blog/training-…
- Replying to @francoisfleuret: The 30x is real and comes from this technical brief, page 15: nvdam.widen.net/s/xqt56dflgh/n… How is 30x possible given GB200 has only ~2.3x increase in memBW and FLOP/s over H100? It involves comparing per-chip generation throughput = output_tokens/s/chip. The two systems compared are
- i love you all = ilya. Quoted post: i love you all. today was a weird experience in many ways. but one unexpected one is that it has been sorta like reading your own eulogy while you’re still alive. the outpouring of love is awesome. one takeaway: go tell your friends how great you think they are.
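For anyone who wants to sanity-check the PaLM2-L numbers in the post above, here is a minimal sketch of the (6*N*D) approximation, where N is the parameter count and D is the number of training tokens. The function name `training_flops` is made up for illustration, and the inputs are just the figures quoted in the post, not independently verified values.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute using the 6 * N * D rule of thumb.

    n_params: number of model parameters (N)
    n_tokens: number of training tokens (D)
    """
    return 6.0 * n_params * n_tokens


# Figures as quoted in the post: 340B params, 3.6T tokens.
palm2_flops = training_flops(340e9, 3.6e12)
print(f"PaLM2-L (as reported): {palm2_flops:.2e} FLOPs")
```

This prints roughly 7.3e24, which lines up with the FLOPs figure in the post. The same helper works for any other (N, D) pair, with the usual caveat that 6*N*D is only an approximation.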









