Submit Results

CSV format

Create a CSV file named <device-slug>.csv that starts with the header row below. The second line shows an example entry:

device_id,kernel_id,dtype,input_shape,batch_size,impl_lang,latency_us,driver_version,toolchain,git_sha,submitter
nvidia-h100-sxm,softmax,f32,"[64, 1024]",1,cuda,12.3,CUDA 12.4,nvcc 12.4,abc1234,your-name
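
The file above can also be produced programmatically. The following is a minimal sketch, assuming Python's standard `csv` module; the helper name `format_submission` is hypothetical and is not part of the repo's scripts. It mainly shows that `input_shape` contains a comma and therefore has to be quoted, which the `csv` module does automatically.

```python
import csv
import io

# Column order copied from the header line above.
COLUMNS = [
    "device_id", "kernel_id", "dtype", "input_shape", "batch_size",
    "impl_lang", "latency_us", "driver_version", "toolchain",
    "git_sha", "submitter",
]


def format_submission(rows):
    """Return CSV text: the header followed by one line per row dict.

    Hypothetical helper for illustration; not part of the repo.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Values taken from the example row above.
example = {
    "device_id": "nvidia-h100-sxm",
    "kernel_id": "softmax",
    "dtype": "f32",
    "input_shape": "[64, 1024]",  # contains a comma, so csv quotes it
    "batch_size": 1,
    "impl_lang": "cuda",
    "latency_us": 12.3,
    "driver_version": "CUDA 12.4",
    "toolchain": "nvcc 12.4",
    "git_sha": "abc1234",
    "submitter": "your-name",
}

csv_text = format_submission([example])
print(csv_text, end="")
```

To create the submission file, write `csv_text` to a path such as submissions/<device-slug>.csv.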

Steps

Fork the pu-rs.org repo
Add your CSV to submissions/
Open a pull request
CI validates the CSV format and runs sanity checks
Maintainers review and merge
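
Before opening the pull request, it can help to run a local check similar in spirit to the CI step. The sketch below is an assumption about what such a check might look like: the function `check_submission` and its specific rules (exact header match, complete rows, positive finite `latency_us`) are hypothetical and do not describe the project's actual CI.

```python
import csv
import io
import math

# Header expected by the CSV format section.
EXPECTED_COLUMNS = [
    "device_id", "kernel_id", "dtype", "input_shape", "batch_size",
    "impl_lang", "latency_us", "driver_version", "toolchain",
    "git_sha", "submitter",
]


def check_submission(csv_text):
    """Return a list of problems; an empty list means nothing was flagged.

    Hypothetical local check, not the project's CI implementation.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]

    problems = []
    for row in reader:
        line_no = reader.line_num
        # DictReader stores surplus fields under the key None and fills
        # missing fields with the value None.
        if None in row or None in row.values():
            problems.append(f"line {line_no}: wrong number of fields")
            continue
        try:
            latency = float(row["latency_us"])
        except ValueError:
            problems.append(f"line {line_no}: latency_us is not a number")
            continue
        if not (math.isfinite(latency) and latency > 0):
            problems.append(f"line {line_no}: latency_us must be positive")
    return problems


header = ",".join(EXPECTED_COLUMNS)
good_row = ('nvidia-h100-sxm,softmax,f32,"[64, 1024]",1,cuda,12.3,'
            "CUDA 12.4,nvcc 12.4,abc1234,your-name")
print(check_submission(header + "\n" + good_row + "\n"))
```

An unquoted `input_shape` such as `[64, 1024]` splits into two fields, so this check reports it as a row with the wrong number of fields.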

Requirements

Minimum 20 runs per (kernel, shape) pair
Report the median latency in microseconds (the latency_us column)
Include driver version and toolchain
Device must exist in db/seed_devices.sql (or add it in the same PR)
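
The run-count and median requirements can be sketched as follows. This is a minimal illustration, assuming host-side wall-clock timing with `time.perf_counter_ns`; the function `median_latency_us` and the warm-up count are hypothetical, and the repo's benchmark scripts may measure differently (device kernels usually need device-side timers or an explicit synchronization before the clock is read).

```python
import statistics
import time

MIN_RUNS = 20  # minimum number of runs stated in the requirements above


def median_latency_us(fn, runs=MIN_RUNS, warmup=3):
    """Time fn() `runs` times and return the median latency in microseconds.

    Hypothetical helper; the warm-up count is an assumption.
    """
    if runs < MIN_RUNS:
        raise ValueError(f"need at least {MIN_RUNS} runs, got {runs}")
    for _ in range(warmup):
        fn()  # warm-up iterations are not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        fn()
        elapsed_ns = time.perf_counter_ns() - start
        samples.append(elapsed_ns / 1_000)  # nanoseconds to microseconds
    return statistics.median(samples)


# Stand-in workload; a real submission would time one (kernel, shape) pair.
latency = median_latency_us(lambda: sum(range(10_000)))
print(f"{latency:.1f}")
```

The median is less sensitive than the mean to occasional slow runs, which is one reason to collect at least 20 samples per pair.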

Running the benchmark

All benchmark scripts live in this repo under scripts/.

# Metal (Apple Silicon)
# Requires: ascend_metal_kernels Python module
#   (build: cd ascend-rs/crates/ascend_metal_py && maturin develop --release)
ASCEND_METAL_KERNELS=1 python3 scripts/bench_metal.py --device apple-m2-max-38
ASCEND_METAL_KERNELS=1 python3 scripts/bench_metal.py --device apple-m4-max-40 -o submissions/m4-max.csv

# Ascend NPU (Huawei 910B/910C)
# Requires: CANN SDK + ascend-rs repo cloned locally
bash scripts/bench_ascend.sh --device huawei-910b
bash scripts/bench_ascend.sh --device huawei-910c --only softmax --ascend-rs ~/ascend-rs

Supported backends

| Backend | Script | Prerequisites |
| --- | --- | --- |
| Apple Metal | `scripts/bench_metal.py` | `ascend_metal_kernels` Python module (build command under Running the benchmark) |
| Huawei Ascend | `scripts/bench_ascend.sh` | CANN SDK + ascend-rs repo |
| NVIDIA CUDA | `scripts/bench_cuda.py` | Planned |
| AMD ROCm | `scripts/bench_rocm.py` | Planned |

