mike64_t

4,499 posts

@mike64_t

descending the gradient

Joined October 2022

Pinned
mike64_t
@mike64_t
Oct 9, 2025
Article
Long-horizon Perception requires re-thinking Recurrence
Excited to finally share what I've been working on. TLDR: Attention is not all you need: True depth across time results in a scaling law - not as a function of parameter count - but as a function of...
441K
mike64_t
@mike64_t
May 15, 2024
It always amazes me how this is not the first thing taught in a CS class. It always makes people have Eureka moments whenever I show this to people...
589K
mike64_t
@mike64_t
Nov 20, 2022
I can't explain how amazing @karpathy's lectures are. Andrej's lectures are detailed enough that I could not only follow along, but write my own tensor processing + autograd engine in Java+Kotlin & C++ from scratch. And best of all, it's 2x faster than PyTorch! SIMD for the win!
mike64_t
@mike64_t
Aug 3, 2025
python is actually fast when you control like 80% of the code in your call stack, move 60% to C++, micro-benchmark overhead of various library functions you don't control and build overkill acceleration datastructures like you're in CS undergrad for things you would have nuked
123K
mike64_t
@mike64_t
Jun 11, 2025
Replying to @tritlo
And yet, most UI overhead is CPU-induced, driven by bad abstractions. I can assure you your GPU can render this effect just fine, and it will add close to zero ms of input delay if implemented well.
25K
mike64_t
@mike64_t
Nov 11, 2023
Me at 19 [right now]
krish
@IamIronLAN
Nov 11, 2023
me at 19
96K
mike64_t
@mike64_t
Oct 13, 2025
I'm fairly convinced RL will not get us to end-to-end implementation of huge projects. Codex still has zero smell of "anticipating the future". We will likely have to revisit pre-training on *long*-running agentic data before attempting RL again. And even if we do "get there"
Sam Altman
@sama
Oct 12, 2025
Codex is so good, and is going to get so amazing. I am having a hard time imagining what creating software at the end of 2026 is going to look like.
117K
mike64_t
@mike64_t
Dec 2, 2023
So, neural networks are REALLY robust to errors, as it turns out. Like, I just discovered a bug in my kernel caching logic, where I was computing matmul with completely wrong strides, and it STILL LEARNED. Like... WHAT WAS COMPUTED HERE IS NOT A DERIVATIVE BY ANY MEANS
73K
mike64_t
@mike64_t
Aug 7, 2024
It is done. LibreCUDA can now launch CUDA kernels without relying on the proprietary CUDA runtime / driver api. It does so by communicating directly with the hardware via the ioctl "rm api" and Nvidia's QMD MMIO command queue structure.
50K
mike64_t
@mike64_t
Oct 19, 2025
Perks of writing your training code by rolling your own tensorlib: 900 fps on a 4090 @ batch size 32; the PyTorch impl reaches 1000 fps @ batch size 512 on an 8xH100 node. And yet you can use Triton and CUTLASS while shipping a 4 MiB executable and still supporting AMD.
29K
mike64_t
@mike64_t
Oct 21, 2025
ok codex is useful for this kind of task "add fp16 data type"
47K
mike64_t
@mike64_t
Jun 2, 2025
A 256x256 matrix memory at 2048 tokens of sequence length can achieve 98.6% retrieval accuracy. With random embeddings, accuracy would saturate at 67% and with optimized embeddings at 80%. Combined with a 4-layer LSTM decoder and a 1-layer store gate, the accuracy increases to
37K
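For readers unfamiliar with the term in the post above: one common reading of "matrix memory" is an outer-product associative memory, where key/value pairs are summed into a single matrix and read back with a matrix-vector product. The post does not say which scheme, embeddings, or decoder it uses, so the sketch below is only an illustration under that assumption. The function names, the pair count, and the nearest-value decoding rule are all hypothetical, and it makes no attempt to reproduce the quoted accuracy figures.

```python
import math
import random


def unit_vector(rng, dim):
    """Random direction: Gaussian sample scaled to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def store(keys, values, dim):
    """Build M = sum_i outer(v_i, k_i) as a dim x dim list of rows."""
    memory = [[0.0] * dim for _ in range(dim)]
    for k, v in zip(keys, values):
        for r in range(dim):
            vr = v[r]
            row = memory[r]
            for c in range(dim):
                row[c] += vr * k[c]
    return memory


def retrieve(memory, key):
    """Read out M @ key: the stored value plus crosstalk from other pairs."""
    return [dot(row, key) for row in memory]


def retrieval_accuracy(dim=256, n_pairs=16, seed=0):
    """Fraction of keys whose readout is closest to their own value.

    Assumption: random (unoptimized) unit-norm keys and values, decoded by
    picking the stored value with the highest dot product to the readout.
    """
    rng = random.Random(seed)
    keys = [unit_vector(rng, dim) for _ in range(n_pairs)]
    values = [unit_vector(rng, dim) for _ in range(n_pairs)]
    memory = store(keys, values, dim)
    correct = 0
    for i, k in enumerate(keys):
        readout = retrieve(memory, k)
        scores = [dot(readout, v) for v in values]
        if scores.index(max(scores)) == i:
            correct += 1
    return correct / n_pairs


if __name__ == "__main__":
    print(retrieval_accuracy())
```

In this toy setup the crosstalk term grows with the number of stored pairs relative to the matrix dimension, which is one intuition for why retrieval accuracy with random embeddings eventually saturates as more items are written.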
mike64_t
@mike64_t
Jun 11, 2025
Replying to @tritlo
if there is a platform where you should care the least about sharing memory between the GPU and the CPU, it is Apple Silicon. Traditional iGPUs have to reserve an entire chunk of physical memory dedicated to the iGPU. On Apple Silicon, both CPU and GPU can coherently negotiate
6.5K
mike64_t
@mike64_t
Oct 8, 2025
just so you guys know the bottleneck for getting data out of Minecraft is literally FFmpeg and I can guarantee this model is both slower and worse than actually being ingame. The timer speed can be increased, frame capture can happen in game at scaled rate, and the thing that
109K