I am 栗子昂, an MTS @ humans& ai working on the full stack of LLM performance engineering. I graduated from the University of Michigan, Ann Arbor (BS), where I transferred after spending 2 years at the Chinese University of Hong Kong, Shenzhen.
Some things I have been working on recently:
- mxfp8 & nvfp4 RL
- topk for sparse attention
Some of the places I previously worked/interned at:
- NVIDIA GPU architecture simulation team
- NVIDIA DevTech Compute team
- Google Gemini GPU performance team
- Samsung OpenCL compute team
I used to enjoy cycle-level GPU kernel optimization, but I no longer consider it an important problem. My work has shifted more toward low-precision numerics and model co-design.

