perf: Q4_0/Q8_0 dequant throughput 5x below memcpy ceiling — missing SIMD vectorization #386

@noahgift

Benchmark data (aprender-bench-compute)

| Operation | Size | Throughput | vs memcpy |
|---|---|---|---|
| Q8_0 dequant | 262K | 1.22 Gelem/s (4.88 GB/s) | 5.5x below |
| Q4_0 dequant | 262K | 1.25 Gelem/s (5.00 GB/s) | 5.4x below |
| Q8_0 quantize | 262K | 302 Melem/s (1.21 GB/s) | 22x below |
| Q4_0 quantize | 262K | 284 Melem/s (1.14 GB/s) | 24x below |
| memcpy (f32 clone) | 262K | 25 GiB/s (26.8 GB/s) | baseline |
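
The GB/s figures and the "vs memcpy" column follow from the element rates if each output element is a 4-byte f32 and GB means 1e9 bytes. A small sketch of that conversion, with helper names made up for illustration (they are not part of aprender-bench-compute):

```rust
/// Bytes written per dequantized element (assumption: f32 output).
const F32_BYTES: f64 = 4.0;

/// Convert an element rate (elements/s) into f32 output bandwidth in GB/s (1 GB = 1e9 bytes).
fn output_gb_per_s(elems_per_s: f64) -> f64 {
    elems_per_s * F32_BYTES / 1e9
}

/// Convert GiB/s (2^30 bytes) to GB/s (1e9 bytes).
fn gib_to_gb(gib_per_s: f64) -> f64 {
    gib_per_s * 1024.0_f64.powi(3) / 1e9
}

/// How many times slower an operation is than the memcpy baseline.
fn slowdown_vs_memcpy(elems_per_s: f64, memcpy_gb_per_s: f64) -> f64 {
    memcpy_gb_per_s / output_gb_per_s(elems_per_s)
}
```

For example, `slowdown_vs_memcpy(1.22e9, gib_to_gb(25.0))` reproduces the Q8_0 dequant row.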

Key observations

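  1. Q4_0 and Q8_0 dequant run at nearly identical speed (~1.2 Gelem/s). Q4_0 reads roughly half the input bytes, so a kernel limited by input bandwidth would run it noticeably faster (though not a full 2x, since the f32 output writes are the same size for both formats). Equal speed, at roughly a fifth of memcpy throughput, means both kernels are compute-limited on the unpacking logic.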

  2. Quantize is 4x slower than dequant. Quantize needs to find block scales (reduction), but 4x overhead is high.

  3. Dequant at 4.88 GB/s output vs 26.8 GB/s memcpy — 5.5x gap. With AVX2 SIMD unpacking, Q8_0 dequant should approach memcpy speed (just multiply int8 × scale). Q4_0 needs nibble extraction but should still be 3-4x faster with SIMD.
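
To make the "block scales (reduction)" cost in observation 2 concrete, here is a minimal scalar sketch of quantizing one 32-element Q8_0 block. It assumes the common GGML-style scheme (scale = max |x| / 127, then round each element); aprender's actual kernel may differ, and `quantize_q8_0_block` is a hypothetical name. The point is that quantize needs two passes over the block, a reduction and a conversion, while dequant needs only one.

```rust
/// Elements per block, as stated in the issue.
const BLOCK: usize = 32;

/// Quantize one block to (scale, int8 values).
/// Assumption: GGML-style Q8_0, scale = max|x| / 127, stored here as f32.
fn quantize_q8_0_block(x: &[f32; BLOCK]) -> (f32, [i8; BLOCK]) {
    // Pass 1: horizontal reduction to find the largest magnitude in the block.
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = amax / 127.0;
    // Guard against an all-zero block, where the scale would be 0.
    let inv = if scale != 0.0 { 1.0 / scale } else { 0.0 };

    // Pass 2: scale, round to nearest, and narrow to int8.
    let mut q = [0i8; BLOCK];
    for (qi, xi) in q.iter_mut().zip(x.iter()) {
        *qi = (xi * inv).round() as i8;
    }
    (scale, q)
}
```

Both passes vectorize (max-abs with a final horizontal max, then multiply, round, and pack), so the reduction alone should not account for a 4x gap.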

Impact

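For fused dequant+matvec in realizar, dequant overhead is a significant fraction. A 4096×11008 Q4_0 matrix holds 45.1M Q4 values (~22.5 MB of packed nibbles, excluding scales), which takes ~36 ms to dequantize at the current 1.25 Gelem/s. At memcpy-level throughput (~6.7 Gelem/s) that would drop to ~6.7 ms.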
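
The estimate is just element count divided by throughput. A minimal sketch for redoing it with other matrix shapes or rates; the function names are illustrative and not part of aprender or realizar:

```rust
/// Number of quantized values in a rows x cols weight matrix.
fn element_count(rows: u64, cols: u64) -> u64 {
    rows * cols
}

/// Bytes of packed 4-bit data (two values per byte), not counting block scales.
fn packed_nibble_bytes(elements: u64) -> u64 {
    elements / 2
}

/// Time in milliseconds to dequantize `elements` values at `elems_per_s`.
fn dequant_ms(elements: u64, elems_per_s: f64) -> f64 {
    elements as f64 / elems_per_s * 1e3
}
```

For example, `dequant_ms(element_count(4096, 11008), 1.25e9)` evaluates the matrix above at the measured Q4_0 rate.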

Suggested fix

  • AVX2/AVX-512 vectorized Q8_0 dequant: _mm256_cvtepi8_epi32 + _mm256_cvtepi32_ps + _mm256_mul_ps
  • AVX2 Q4_0 dequant: nibble extraction via shift+mask, then cvt + mul
  • Process 32 elements per iteration (one Q4_0/Q8_0 block = 32 elements)
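
A rough sketch of what the bullets above could look like with `std::arch` intrinsics, processing one 32-element block per call. Several things here are assumptions rather than facts about aprender: the scale is passed as an f32, Q4_0 uses the GGML-style layout (low nibbles are elements 0..15, high nibbles are elements 16..31, bias of 8), and all function names are hypothetical. It falls back to scalar code when AVX2 is not available and is not tuned.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

const BLOCK: usize = 32; // elements per Q4_0/Q8_0 block

fn dequant_q8_0_scalar(scale: f32, q: &[i8; BLOCK], out: &mut [f32; BLOCK]) {
    for i in 0..BLOCK {
        out[i] = q[i] as f32 * scale;
    }
}

fn dequant_q4_0_scalar(scale: f32, packed: &[u8; 16], out: &mut [f32; BLOCK]) {
    for j in 0..16 {
        out[j] = ((packed[j] & 0x0F) as i32 - 8) as f32 * scale;
        out[j + 16] = ((packed[j] >> 4) as i32 - 8) as f32 * scale;
    }
}

/// Widen 16 int8 lanes to f32, multiply by the scale, store 16 floats at `dst`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn widen_scale_store(v: __m128i, vscale: __m256, dst: *mut f32) {
    // cvtepi8_epi32 sign-extends the low 8 bytes; unpackhi moves bytes 8..15 down.
    let lo = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(v));
    let hi = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(_mm_unpackhi_epi64(v, v)));
    _mm256_storeu_ps(dst, _mm256_mul_ps(lo, vscale));
    _mm256_storeu_ps(dst.add(8), _mm256_mul_ps(hi, vscale));
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dequant_q8_0_avx2(scale: f32, q: &[i8; BLOCK], out: &mut [f32; BLOCK]) {
    let vscale = _mm256_set1_ps(scale);
    for half in 0..2 {
        let v = _mm_loadu_si128(q.as_ptr().add(16 * half) as *const __m128i);
        widen_scale_store(v, vscale, out.as_mut_ptr().add(16 * half));
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dequant_q4_0_avx2(scale: f32, packed: &[u8; 16], out: &mut [f32; BLOCK]) {
    let vscale = _mm256_set1_ps(scale);
    let mask = _mm_set1_epi8(0x0F);
    let bias = _mm_set1_epi8(8);
    let bytes = _mm_loadu_si128(packed.as_ptr() as *const __m128i);
    // Nibble extraction: mask for the low nibbles, shift + mask for the high nibbles.
    let lo = _mm_sub_epi8(_mm_and_si128(bytes, mask), bias);
    let hi = _mm_sub_epi8(_mm_and_si128(_mm_srli_epi16::<4>(bytes), mask), bias);
    widen_scale_store(lo, vscale, out.as_mut_ptr());
    widen_scale_store(hi, vscale, out.as_mut_ptr().add(16));
}

/// Safe entry point: uses AVX2 when the CPU supports it, scalar otherwise.
fn dequant_q8_0(scale: f32, q: &[i8; BLOCK], out: &mut [f32; BLOCK]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            unsafe { dequant_q8_0_avx2(scale, q, out) };
            return;
        }
    }
    dequant_q8_0_scalar(scale, q, out);
}

/// Safe entry point: uses AVX2 when the CPU supports it, scalar otherwise.
fn dequant_q4_0(scale: f32, packed: &[u8; 16], out: &mut [f32; BLOCK]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            unsafe { dequant_q4_0_avx2(scale, packed, out) };
            return;
        }
    }
    dequant_q4_0_scalar(scale, packed, out);
}
```

The int8-to-f32 conversion is exact and the multiply is a single rounding step in both paths, so the SIMD output should match the scalar reference bit for bit, which makes the scalar functions usable as a test oracle. In a real kernel the feature check would be hoisted out of the per-block call.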

Reproduce

cargo bench -p aprender-bench-compute --bench quantization

Metadata

Labels: P2 (Medium priority), performance (Performance optimization)
