Sometimes I think I’m a good engineer then I see dudes like Karpathy and geohot cook and I’m like oh I suck
Day 24 of llm.c: we now do multi-GPU training, in bfloat16, with flash attention, directly in ~3000 lines of C/CUDA, and it is FAST! 🚀
We're running ~7% faster than PyTorch nightly, with no asterisks, i.e. this baseline includes all modern & standard bells-and-whistles: mixed









