Midia Reshadi

June 2026

Attention kernels on MI300X: a utilization story

Benchmarking FlashAttention kernels on AMD MI300X — AITER vs Triton vs PyTorch SDPA, with hardware counters explaining the throughput ranking. 60 data points, 3 figures.

MI300X FlashAttention AITER Triton

May 2026

Why Outer-Product Beats Vendor Sparse Libraries by up to 198x for LLM Attention Sparsity

Benchmarking dense dataflow strategies and sparse SpMSpM algorithms on NVIDIA T4 and AMD MI300X. 222 data points, 7 figures.

Triton SpMSpM MI300X LLM