perf: Q4_0/Q8_0 dequant throughput 5x below memcpy ceiling — missing SIMD vectorization #386

## Benchmark data (aprender-bench-compute)

| Operation | Size | Throughput | vs memcpy |
|-----------|------|-----------|-----------|
| Q8_0 dequant | 262K | 1.22 Gelem/s (4.88 GB/s) | 5.5x below |
| Q4_0 dequant | 262K | 1.25 Gelem/s (5.00 GB/s) | 5.4x below |
| Q8_0 quantize | 262K | 302 Melem/s (1.21 GB/s) | 22x below |
| Q4_0 quantize | 262K | 284 Melem/s (1.14 GB/s) | 24x below |
| memcpy (f32 clone) | 262K | 25 GiB/s (26.8 GB/s) | baseline |
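
The GB/s figures in the table count f32 bytes (4 bytes per element): output bytes for dequant, input bytes for quantize. As a quick sanity check, the conversion and the "vs memcpy" ratios can be recomputed from the element rates. This is a minimal sketch; the helper name `gb_per_s` is made up for illustration and is not part of aprender-bench-compute.

```rust
/// Bytes per f32 element; the table's GB/s column counts f32 bytes.
const F32_BYTES: f64 = 4.0;

/// Converts an element rate (elements/s) into f32 traffic in GB/s (10^9 bytes/s).
fn gb_per_s(elems_per_s: f64) -> f64 {
    elems_per_s * F32_BYTES / 1e9
}

fn main() {
    // memcpy baseline from the table: 25 GiB/s expressed in GB/s.
    let memcpy_gb_s = 25.0 * 1024f64.powi(3) / 1e9;
    let rows = [
        ("Q8_0 dequant", 1.22e9),
        ("Q4_0 dequant", 1.25e9),
        ("Q8_0 quantize", 302e6),
        ("Q4_0 quantize", 284e6),
    ];
    for (name, rate) in rows {
        let gb = gb_per_s(rate);
        println!("{name}: {gb:.2} GB/s, {:.1}x below memcpy", memcpy_gb_s / gb);
    }
}
```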

## Key observations

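1. **Q4_0 and Q8_0 dequant run at nearly identical speed** (~1.2 Gelem/s). Q4_0 reads about half the input bytes, so a memory-bandwidth-limited kernel would show Q4_0 ahead (by well under 2x, since both formats write the same f32 output). Near-equal speed at roughly a fifth of memcpy throughput means both are **compute-limited on the unpacking logic**.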

2. **Quantize is ~4x slower than dequant** (302 vs 1220 Melem/s for Q8_0, 284 vs 1250 Melem/s for Q4_0). Quantize has to run a reduction over each block to find its scale, but a 4x overhead is high for that.

3. **Dequant writes 4.88 GB/s of output vs 26.8 GB/s for memcpy**, a 5.5x gap. With AVX2 SIMD unpacking, Q8_0 dequant should approach memcpy speed, since the per-element work is just an int8-to-f32 conversion and a multiply by the block scale. Q4_0 also needs nibble extraction but should still be 3-4x faster with SIMD.
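
To put the bandwidth argument in numbers, here is a rough per-block traffic count. It assumes a GGML-style layout (a 2-byte f16 scale plus packed payload per 32-element block); the issue does not state aprender's exact block layout, so treat `SCALE_BYTES` as an assumption.

```rust
/// Elements per quantized block, as stated in the issue.
const BLOCK_ELEMS: usize = 32;
/// ASSUMPTION: GGML-style layout with one 2-byte f16 scale per block.
const SCALE_BYTES: usize = 2;

/// Bytes read per block for a format storing `bits` bits per element.
fn input_bytes(bits: usize) -> usize {
    SCALE_BYTES + BLOCK_ELEMS * bits / 8
}

/// Bytes written per block: 32 f32 outputs, identical for both formats.
fn output_bytes() -> usize {
    BLOCK_ELEMS * std::mem::size_of::<f32>()
}

fn main() {
    for (name, bits) in [("Q4_0", 4), ("Q8_0", 8)] {
        let (i, o) = (input_bytes(bits), output_bytes());
        println!("{name}: read {i} B, write {o} B, total {} B per block", i + o);
    }
}
```

Under that assumption the input side differs by almost 2x (18 vs 34 bytes), but the f32 output dominates total traffic (146 vs 162 bytes per block). Either way, both kernels sit about 5x below memcpy, which is consistent with a compute bottleneck rather than a bandwidth one.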

## Impact

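For fused dequant+matvec in realizar, dequant overhead is a significant fraction of the total cost. A 4096×11008 Q4_0 matrix holds 45.1M Q4 values (22.5 MB of packed nibbles, excluding scales), so dequantizing it takes ~36 ms at the current 1.25 Gelem/s. With the 3-4x SIMD speedup estimated above: ~9-12 ms. At the memcpy ceiling (~6.7 Gelem/s): ~6.7 ms.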
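
The estimate can be checked with a few lines of arithmetic. Note that 4096×11008 is 45,088,768 elements; 22.5M is the packed byte count at two nibbles per byte. The sketch below uses only the rates from the benchmark table, and `dequant_ms` is a hypothetical helper name.

```rust
/// Milliseconds needed to dequantize `elems` values at `rate` elements per second.
fn dequant_ms(elems: u64, rate: f64) -> f64 {
    elems as f64 / rate * 1e3
}

fn main() {
    let elems: u64 = 4096 * 11008; // 45,088,768 weights
    let current = 1.25e9; // measured Q4_0 dequant rate (elements/s)
    // memcpy baseline (25 GiB/s) expressed in f32 elements per second.
    let ceiling = 25.0 * 1024f64.powi(3) / 4.0;

    println!("elements:       {elems}");
    println!("current:        {:.1} ms", dequant_ms(elems, current));
    println!("3x SIMD:        {:.1} ms", dequant_ms(elems, 3.0 * current));
    println!("4x SIMD:        {:.1} ms", dequant_ms(elems, 4.0 * current));
    println!("memcpy ceiling: {:.1} ms", dequant_ms(elems, ceiling));
}
```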

## Suggested fix

- AVX2 vectorized Q8_0 dequant: `_mm256_cvtepi8_epi32` + `_mm256_cvtepi32_ps` + `_mm256_mul_ps` by the broadcast block scale (AVX-512: the `_mm512_*` equivalents, 16 lanes per op)
- AVX2 Q4_0 dequant: nibble extraction via shift+mask, then `cvt` + `mul`
- Process 32 elements per iteration (one Q4_0/Q8_0 block = 32 elements)
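
The bullets above could look roughly like the following sketch using `std::arch`. It is not aprender's implementation: the function names are hypothetical, the block scale is assumed to be already converted to f32, and the Q4_0 nibble order and the offset of 8 follow the GGML convention, which may differ from aprender's layout. A scalar reference is included so the SIMD path can be checked against it.

```rust
// Sketch only; function names are hypothetical, not aprender's API.
// ASSUMPTIONS: 32-element blocks (per the issue), block scale already
// converted to f32, and GGML-style Q4_0 packing: byte i holds element i in
// its low nibble and element i + 16 in its high nibble, both biased by 8.

/// Scalar reference for Q8_0: out[i] = qs[i] * scale.
pub fn dequant_q8_0_scalar(qs: &[i8; 32], scale: f32, out: &mut [f32; 32]) {
    for (o, &q) in out.iter_mut().zip(qs) {
        *o = f32::from(q) * scale;
    }
}

/// Scalar reference for Q4_0 under the packing assumed above.
pub fn dequant_q4_0_scalar(qs: &[u8; 16], scale: f32, out: &mut [f32; 32]) {
    for (i, &b) in qs.iter().enumerate() {
        out[i] = (i32::from(b & 0x0F) - 8) as f32 * scale;
        out[i + 16] = (i32::from(b >> 4) - 8) as f32 * scale;
    }
}

#[cfg(target_arch = "x86_64")]
mod avx2 {
    use std::arch::x86_64::*;

    /// Widens 16 signed bytes to f32, multiplies by `scale`, stores 16 floats.
    #[target_feature(enable = "avx2")]
    unsafe fn widen_scale_store(v: __m128i, scale: __m256, out: *mut f32) {
        let lo = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(v));
        let hi = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(_mm_srli_si128::<8>(v)));
        _mm256_storeu_ps(out, _mm256_mul_ps(lo, scale));
        _mm256_storeu_ps(out.add(8), _mm256_mul_ps(hi, scale));
    }

    /// One Q8_0 block: two 16-byte loads, each widened and scaled.
    #[target_feature(enable = "avx2")]
    pub unsafe fn dequant_q8_0(qs: &[i8; 32], scale: f32, out: &mut [f32; 32]) {
        let s = _mm256_set1_ps(scale);
        let p = qs.as_ptr() as *const __m128i;
        widen_scale_store(_mm_loadu_si128(p), s, out.as_mut_ptr());
        widen_scale_store(_mm_loadu_si128(p.add(1)), s, out.as_mut_ptr().add(16));
    }

    /// One Q4_0 block: split nibbles with shift + mask, subtract the bias of 8.
    #[target_feature(enable = "avx2")]
    pub unsafe fn dequant_q4_0(qs: &[u8; 16], scale: f32, out: &mut [f32; 32]) {
        let s = _mm256_set1_ps(scale);
        let mask = _mm_set1_epi8(0x0F);
        let v = _mm_loadu_si128(qs.as_ptr() as *const __m128i);
        let lo = _mm_sub_epi8(_mm_and_si128(v, mask), _mm_set1_epi8(8));
        let hi = _mm_sub_epi8(_mm_and_si128(_mm_srli_epi16::<4>(v), mask), _mm_set1_epi8(8));
        widen_scale_store(lo, s, out.as_mut_ptr());
        widen_scale_store(hi, s, out.as_mut_ptr().add(16));
    }
}

/// Uses the AVX2 path when the CPU supports it, else the scalar reference.
pub fn dequant_q8_0(qs: &[i8; 32], scale: f32, out: &mut [f32; 32]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            unsafe { avx2::dequant_q8_0(qs, scale, out) };
            return;
        }
    }
    dequant_q8_0_scalar(qs, scale, out);
}

/// Uses the AVX2 path when the CPU supports it, else the scalar reference.
pub fn dequant_q4_0(qs: &[u8; 16], scale: f32, out: &mut [f32; 32]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            unsafe { avx2::dequant_q4_0(qs, scale, out) };
            return;
        }
    }
    dequant_q4_0_scalar(qs, scale, out);
}

fn main() {
    let q8: [i8; 32] = std::array::from_fn(|i| (i as i32 * 7 - 100) as i8);
    let q4: [u8; 16] = std::array::from_fn(|i| (i as u8).wrapping_mul(37) ^ 0xA5);
    let (mut want, mut got) = ([0f32; 32], [0f32; 32]);
    dequant_q8_0_scalar(&q8, 0.125, &mut want);
    dequant_q8_0(&q8, 0.125, &mut got);
    assert_eq!(want, got);
    dequant_q4_0_scalar(&q4, 0.125, &mut want);
    dequant_q4_0(&q4, 0.125, &mut got);
    assert_eq!(want, got);
    println!("SIMD and scalar paths agree");
}
```

A real kernel would hoist the feature check out of the per-block call and loop over blocks inside a single `#[target_feature]` function, so the compiler can inline the helpers and keep the mask and bias vectors in registers.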

## Reproduce

```bash
cargo bench -p aprender-bench-compute --bench quantization
```
