Aman Salykov (@salykova_)

104 posts

Aman Salykov

@salykova_

At the intersection of AI/ML systems, low-level kernel optimizations and algebra @AMD

Munich, Germany

Joined November 2019

Following

2,518

Followers

Aman Salykov
@salykova_
Jan 14, 2025
Excited to announce!
71K
Aman Salykov
@salykova_
Oct 1, 2025
Life update: I've recently relocated to Munich and joined @AMD to work on AI inference optimization!
13K
Aman Salykov
@salykova_
Feb 5, 2025
10K
Aman Salykov
@salykova_
Oct 1, 2025
I'm excited to announce my new blog post on programming Matrix Cores in HIP! The post is educational and contains the knowledge needed to start programming Matrix Cores, covering modern low-precision floating-point types and the Matrix Core compiler intrinsics.
7.9K
Aman Salykov
@salykova_
Feb 3, 2025
I didn't get the hype around PTX. The top-performance kernels have always used inline PTX (mostly for memory and arithmetic operations). You can't build something like CUTLASS/cuBLAS without inline PTX.
Parth Chadha
@parth_29
Feb 3, 2025
"And these guys worked totally around CUDA and they did something called PTX" Wait till these "deep research" folks figure out most large frameworks use cutlass (wait no cuda?!)! 🤦‍♂️🤦‍♂️
7.3K
Aman Salykov
@salykova_
Nov 14, 2024
Replying to @maharshii
Thanks for the repost! Part 2 is a work in progress and will be published soon. I will show how to beat cuBLAS in FP32 matrix multiplication.
7.2K
Aman Salykov
@salykova_
Jan 14, 2025
Replying to @salykova_
Blog post: salykova.github.io/sgemm-gpu Code: github.com/salykova/sgemm…
4.6K
Aman Salykov
@salykova_
May 16, 2024
⚡️ Blazing-fast Llama3-8B on an 8GB RAM Android device via executorch
3.7K
Aman Salykov
@salykova_
Jul 5, 2024
Live demo: 9 tok/s Llama2 7B on Snapdragon 8 Gen2 (8GB RAM) via executorch. 1) XNNPACK backend 2) 4-bit groupwise PTQ (post-training quantization)
4.6K
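For readers unfamiliar with the "4-bit groupwise" quantization mentioned in the demo post above, here is a minimal sketch of the general idea in plain Python. It assumes a symmetric scheme with one scale per group and a group size of 32; both are illustrative choices, and the function names are hypothetical. It is not taken from executorch and may differ from what the demo actually used.

```python
def quantize_groupwise_int4(weights, group_size=32):
    """Symmetric per-group quantization to signed 4-bit integers in [-8, 7].

    Assumption: one floating-point scale per group of `group_size` weights.
    """
    q_values, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to 7; guard all-zero groups.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        q_values.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q_values, scales


def dequantize_groupwise(q_values, scales, group_size=32):
    """Recover approximate weights by multiplying each value by its group scale."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


# Toy weights in [-1, 1]; a real model would have millions of them.
weights = [0.02 * ((i * 37) % 101 - 50) for i in range(64)]
q, s = quantize_groupwise_int4(weights, group_size=32)
restored = dequantize_groupwise(q, s, group_size=32)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"{len(s)} groups, max abs error {max_err:.4f}")
```

Because each group gets its own scale, an outlier weight only degrades the precision of its own group rather than the whole tensor, which is the usual motivation for groupwise schemes.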
Aman Salykov
@salykova_
Mar 15, 2025
It's your turn now @mobicham @ajhinh. Towards SOTA grayscale kernel 🤣💀 Jokes aside, the GPU Kernel Leaderboard is a great place to sharpen your CUDA skills and learn how to build the fastest kernels. Maybe @__tinygrad__ will make a comeback. Let's see. Link below
Mobius Labs
@Mobius_Labs
Feb 25, 2025
Our own @mobicham is having a blast writing fast kernels at @GPU_MODE discord. Currently #1 for grayscale on A100. Early days, but it is so much fun and learning to see how people are approaching it. Triton is holding strong :-)
11K
Aman Salykov
@salykova_
Jul 4, 2025
Replying to @cHHillee @vitransformer and 3 others
Instant sakana.ai flashbacks hmmm....
2.1K
Aman Salykov
@salykova_
Oct 1, 2025
Replying to @salykova_
I plan to publish more educational and in-depth technical blog posts on GPU kernel programming in HIP and on code optimization for CDNA/RDNA architectures. Please let me know if there are any other technical ROCm/HIP-related topics you would like to hear more about!
1.8K
Aman Salykov
@salykova_
Nov 14, 2024
Replying to @AstleDsa and @maharshii
numpy uses high-performance libraries like OpenBLAS or MKL under the hood for matmul. These libraries are written in C/Assembly/Fortran. A simple for loop in C will never be faster than numpy (= BLAS libraries). You need to carefully design and implement the algorithm.
873
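The point in the reply above can be tried out with a small sketch that compares a textbook triple loop against numpy's `@` operator, which typically dispatches to a BLAS library for floating-point matrices. The matrix size and the helper name `naive_matmul` are illustrative assumptions. Note that this loop runs in the Python interpreter, so the gap is far larger than it would be for the C loop discussed in the reply; the sketch only illustrates that the two paths give the same result while taking very different routes.

```python
import time

import numpy as np


def naive_matmul(a, b):
    """Textbook triple loop: C[i, j] = sum over p of A[i, p] * B[p, j]."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((n, m), dtype=a.dtype)
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i, p] * b[p, j]
            c[i, j] = acc
    return c


rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)

t0 = time.perf_counter()
c_naive = naive_matmul(a, b)
t1 = time.perf_counter()
c_blas = a @ b  # numpy matmul; usually backed by OpenBLAS or MKL
t2 = time.perf_counter()

print(f"naive loop: {t1 - t0:.4f} s, numpy matmul: {t2 - t1:.6f} s")
print("results agree:", np.allclose(c_naive, c_blas, atol=1e-3))
```

Timings depend on the machine and on which BLAS numpy was built against, so no specific numbers are claimed here.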
Aman Salykov
@salykova_
Nov 14, 2024
Replying to @maharshii
Ahaha no, Austria
1.1K