I wrote a custom NVFP4 GEMM kernel in CuTeDSL, stripping away almost all of the fancy CuTe layout "headache" from the official examples and handling the PTX, TMA, and tcgen05 parts manually. It's crazy how low-level you can go with this and still be performant! My notes and code are below:














