Aman Salykov
104 posts
At the intersection of AI/ML systems, low-level kernel optimizations and algebra @AMD
Munich, Germany
Joined November 2019
- Life update: I've recently relocated to Munich and joined @AMD to work on AI inference optimization!
- I'm excited to announce my new blog post on programming Matrix Cores in HIP! The post is educational and covers what you need to start programming Matrix Cores, including modern low-precision floating-point types and the Matrix Core compiler intrinsics.
- I didn't get the hype around PTX. The top-performance kernels have always used inline PTX (mostly for memory and arithmetic operations). You can't build something like CUTLASS/cuBLAS without inline PTX.
  Quoted post: "And these guys worked totally around CUDA and they did something called PTX" Wait till these "deep research" folks figure out most large frameworks use cutlass (wait no cuda?!)! 🤦‍♂️🤦‍♂️
- Replying to @maharshii: Thanks for the repost! Part 2 is a work in progress and will be published soon. I will show how to beat cuBLAS in FP32 matrix multiplication.
- ⚡️Blazing fast Llama3-8B on an 8GB RAM Android device via ExecuTorch
- Live demo: 9 tok/s Llama2 7B on Snapdragon 8 Gen2 (8GB RAM) via ExecuTorch. 1) XNNPACK backend 2) 4-bit groupwise PTQ
- It's your turn now @mobicham @ajhinh. Towards SOTA grayscale kernel 🤣💀 Jokes aside, the GPU Kernel Leaderboard is a great place to sharpen your CUDA skills and learn how to build the fastest kernels. Maybe @__tinygrad__ will make a comeback. Let's see. Link below
- Replying to @salykova_: I plan to publish more educational and in-depth technical blog posts on GPU kernel programming in HIP and on code optimization for CDNA/RDNA architectures. Please let me know if there are any other technical ROCm/HIP-related topics you would like to hear more about!
- Replying to @AstleDsa and @maharshii: NumPy uses high-performance libraries like OpenBLAS or MKL under the hood for matmul. These libraries are written in C, assembly, and Fortran. A simple for loop in C will never be faster than NumPy (= BLAS libraries). You need to carefully design and implement the algorithm.











