Head-to-head benchmark: Candle (HuggingFace Rust ML) vs realizr (Sovereign AI Stack inference engine). Same model, same hardware, same methodology.
Both are pure Rust. Both load GGUF Q4_K_M. Does the fused-kernel + CUDA graph architecture outperform Candle's general-purpose approach?
Answer: Yes. Decisively. 1.63x faster.
| Engine | Decode tok/s | vs Candle | Notes |
|---|---|---|---|
| llama.cpp b7746 | 443.6 | 1.95x | -ngl 99, Flash Attention |
| realizr 0.8.6 | 369.9 | 1.63x | CUDA graph, Flash Decoding |
| Candle | 227.4 | 1.00x | CLI native, per-op dispatch |
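The "vs Candle" column is each engine's decode throughput divided by Candle's. A minimal sketch that recomputes it from the tok/s figures in the table (the helper name `speedup_vs` is illustrative and not part of any tool named here):

```python
# Decode throughput (tok/s) copied from the table above.
decode_tok_s = {
    "llama.cpp b7746": 443.6,
    "realizr 0.8.6": 369.9,
    "Candle": 227.4,
}

def speedup_vs(baseline: str, results: dict) -> dict:
    """Return each engine's throughput divided by the baseline engine's."""
    base = results[baseline]
    return {name: round(tok_s / base, 2) for name, tok_s in results.items()}

vs_candle = speedup_vs("Candle", decode_tok_s)
vs_llama = speedup_vs("llama.cpp b7746", decode_tok_s)
print(vs_candle)
print(vs_llama)
```

Switching the baseline to llama.cpp gives the ratio used in the cross-hardware comparison for the RTX 4090.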
| realizr advantage | Candle limitation | Impact |
|---|---|---|
| CUDA graph (647 kernels, 1 launch) | Per-op dispatch (~640 launches) | +26% |
| Flash Decoding (chunked KV) | Standard SDPA (sequential) | +15% long ctx |
| Fused DP4A GEMV (4-bit native) | Separate dequant + matmul | ~10% fewer mem passes |
| Continuous batching (Orca-style) | CLI only, no server | 3,220 tok/s at c=32 |
| GPU-resident KV + FP8 cache | Per-call allocation | Lower TTFT |
Candle cannot close this gap without: (a) CUDA graph support (requires unsafe FFI redesign), (b) fused quantized GEMV kernels (requires custom PTX), (c) a server architecture for batching. These are fundamental design differences, not tuning parameters.
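To make the "separate dequant + matmul" versus "fused DP4A GEMV" distinction concrete, here is a toy sketch in plain Python. It is not the real Q4_K layout and not realizr's or Candle's code: the 32-element block, the `d * q + m` weight encoding, the per-block int8 activation scale, and all function names are assumptions chosen for illustration. `dp4a` is a software stand-in for the CUDA 4-way int8 dot product with integer accumulation.

```python
import random

BLOCK = 32  # elements per quantization block (toy choice, not Q4_K's layout)

def quantize_weights(w):
    """Toy 4-bit block quantization: w ~= d * q + m with q in 0..15."""
    blocks = []
    for i in range(0, len(w), BLOCK):
        chunk = w[i:i + BLOCK]
        lo, hi = min(chunk), max(chunk)
        d = (hi - lo) / 15 or 1.0
        blocks.append((d, lo, [round((v - lo) / d) for v in chunk]))
    return blocks

def quantize_activations(x):
    """Per-block symmetric int8 quantization: x ~= s * a with a in -127..127."""
    blocks = []
    for i in range(0, len(x), BLOCK):
        chunk = x[i:i + BLOCK]
        s = max(abs(v) for v in chunk) / 127 or 1.0
        blocks.append((s, [round(v / s) for v in chunk]))
    return blocks

def dot_separate(wq, x):
    """Two passes over the weights: materialize floats, then multiply."""
    w = [d * q + m for d, m, qs in wq for q in qs]  # pass 1: dequantize
    return sum(wi * xi for wi, xi in zip(w, x))     # pass 2: float dot product

def dp4a(a4, b4, acc):
    """Stand-in for a 4-way int8 dot product accumulated into an integer."""
    return acc + sum(p * q for p, q in zip(a4, b4))

def dot_fused(wq, xq):
    """One pass: integer dot product per block, scales applied once per block."""
    total = 0.0
    for (d, m, qs), (s, a) in zip(wq, xq):
        acc = 0
        for j in range(0, BLOCK, 4):
            acc = dp4a(qs[j:j + 4], a[j:j + 4], acc)
        # sum((d*q + m) * (s*a)) == d*s*sum(q*a) + m*s*sum(a)
        total += d * s * acc + m * s * sum(a)
    return total

random.seed(0)
n = 4 * BLOCK  # input length must be a multiple of BLOCK in this sketch
w = [random.uniform(-1, 1) for _ in range(n)]
x = [random.uniform(-1, 1) for _ in range(n)]
wq, xq = quantize_weights(w), quantize_activations(x)
print(dot_separate(wq, x), dot_fused(wq, xq))
```

The fused path never materializes float weights, which is where the reduction in memory passes comes from. The two results differ slightly because the fused path also rounds activations to int8.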
| c | Agg tok/s | Scaling | Note |
|---|---|---|---|
| 1 | 367.0 | 1.0x | bootstrap N=5 CI [365, 369] |
| 4 | 634.1 | 1.73x | continuous batching |
| 8 | 954.4 | 2.60x | |
| 16 | 1,771.5 | 4.83x | |
| 32 | 3,219.9 | 8.77x | Orca-style iteration scheduling |
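The Scaling column is aggregate throughput at concurrency c divided by the c=1 figure. The sketch below recomputes it and adds two derived quantities that are not in the table: batching efficiency (scaling divided by c) and per-request throughput. The function and field names are illustrative.

```python
# Aggregate tok/s per concurrency level, copied from the table above.
AGG_TOK_S = {1: 367.0, 4: 634.1, 8: 954.4, 16: 1771.5, 32: 3219.9}

def scaling_rows(agg):
    """Derive scaling, efficiency, and per-request tok/s for each concurrency."""
    base = agg[1]
    rows = []
    for c, tps in sorted(agg.items()):
        scaling = tps / base
        rows.append({
            "c": c,
            "scaling": round(scaling, 2),          # matches the Scaling column
            "efficiency": round(scaling / c, 2),   # 1.0 would be linear scaling
            "per_request_tok_s": round(tps / c, 1),
        })
    return rows

rows = scaling_rows(AGG_TOK_S)
for row in rows:
    print(row)
```

Efficiency below 1.0 means each added request costs some per-request speed; aggregate throughput still rises at every step in the table.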
Phase 16 five-whys decomposition:
| Component | % of gap | Root cause | Status |
|---|---|---|---|
| Attention | 51% | Flash Decoding occupancy (3% vs FA2) | 3 multi-warp approaches FALSIFIED |
| GEMV | 21% | DP4A Q4K vs cuBLAS | trueno#239 (Marlin), trueno#175 (half-warp) |
| Other | 28% | RoPE, RmsNorm, residuals | Near parity |
Next steps: FlashInfer TC attention (P1), Marlin GEMV pre-packing (P2). See chain of thought in spec.
Cross-Project: realizr Across Hardware (qcd v6.34.0)
| Hardware | c=1 | c=32 | vs llama.cpp | Graph |
|---|---|---|---|---|
| RTX 4090 (this repo) | 369.9 | 3,220 | 0.83x | Yes (647 nodes) |
| RTX 4060L Yoga (qcd) | 136 | 1,895 | 0.92x | No (driver poison) |
| Blackwell GB10 (qcd) | 101 | 1,677 | — | No |
| Jetson Orin (qcd) | 40.8 | — | 1.13x | No |
realizr beats Candle on every target and is competitive with llama.cpp (0.83-1.13x). The gap to vLLM (0.53-0.88x) is CPU dispatch overhead, not kernel quality: DP4A runs at 92% of its theoretical ceiling (qcd PMAT-110).
Key cross-validated findings:
- 16 kernel fusion approaches falsified in both projects
- BrickProfiler 3.4x fidelity bug caught by contract enforcement
- CPU dispatch ~5ms/step is the c>1 bottleneck (graph doesn't help)
- Orca scaling confirmed: 8.77x (4090), 13.4x (Yoga), 14.3x (qcd)
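One way to read the CPU dispatch finding against the RTX 4090 concurrency numbers is to convert aggregate throughput into an implied time per decode step. This is a back-of-envelope sketch under two assumptions that the source does not state: each step emits exactly one token per active request, and the ~5 ms dispatch figure applies unchanged at every c > 1.

```python
# Aggregate tok/s per concurrency level on the RTX 4090 (from the scaling table).
AGG_TOK_S = {1: 367.0, 4: 634.1, 8: 954.4, 16: 1771.5, 32: 3219.9}
CPU_DISPATCH_MS = 5.0  # "~5ms/step" from the findings list, treated as constant

def implied_step_ms(concurrency, agg_tok_s):
    """Milliseconds per decode step if each step yields one token per request."""
    return 1000.0 * concurrency / agg_tok_s

def dispatch_share(concurrency, agg_tok_s, dispatch_ms=CPU_DISPATCH_MS):
    """Fraction of the implied step time that CPU dispatch would account for."""
    return dispatch_ms / implied_step_ms(concurrency, agg_tok_s)

for c, tps in AGG_TOK_S.items():
    if c > 1:  # the finding concerns the batched path only
        step = implied_step_ms(c, tps)
        share = dispatch_share(c, tps)
        print(f"c={c}: {step:.2f} ms/step, dispatch share {share:.0%}")
```

Under these assumptions dispatch takes a large share of every batched step, which is consistent with the claim that graph capture alone does not remove the c > 1 bottleneck.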
Full analysis: performance.md. Falsification spec (29 F-conditions, 28 tested): candle-vs-apr-spec.md.
| Runtime | Architecture | Server | Formats |
|---|---|---|---|
| Candle | QMatMul dequant, per-op dispatch | CLI only | GGUF, SafeTensors |
| realizr | Fused DP4A GEMV, CUDA graph (647 nodes) | OpenAI API + SSE | GGUF, SafeT, APR v2 |
Qwen2.5-Coder-1.5B-Instruct Q4_K_M, the same model as qwen-coder-deploy.
APR v2 is prepared via `apr import` (aprender). The default import produces Q4K (raw passthrough; `--preserve-q4k` is deprecated).
| Path | PPL | Method |
|---|---|---|
| realizr CPU FP32 | 12.72 | Q4K dequant + FP32 matmul |
| llama.cpp GPU | 12.97 | cuBLAS FP32 dequant |
| realizr GPU FP8 | 41.31 | Batched prefill cuBLASLt |
| realizr GPU DP4A | 42.94 | Sequential int8 accumulation |
CPU FP32 matches llama.cpp (2% delta). GPU DP4A is about 3.3x worse (42.94 vs 12.97); the gap comes entirely from int8 accumulation precision, not dequantization. FP8 prefill narrows it by 3.8%.
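The comparisons in this paragraph are simple ratios over the PPL table. A short sketch for recomputing them (the helper names are illustrative):

```python
# Perplexity per execution path, copied from the table above.
PPL = {
    "realizr CPU FP32": 12.72,
    "llama.cpp GPU": 12.97,
    "realizr GPU FP8": 41.31,
    "realizr GPU DP4A": 42.94,
}

def ratio(path, baseline):
    """How many times higher (worse) the perplexity of path is than baseline."""
    return PPL[path] / PPL[baseline]

def pct_change(path, baseline):
    """Signed percentage difference of path relative to baseline."""
    return 100.0 * (PPL[path] - PPL[baseline]) / PPL[baseline]

print(f"CPU FP32 vs llama.cpp: {pct_change('realizr CPU FP32', 'llama.cpp GPU'):+.2f}%")
print(f"GPU DP4A vs llama.cpp: {ratio('realizr GPU DP4A', 'llama.cpp GPU'):.2f}x")
print(f"GPU FP8 vs GPU DP4A:   {pct_change('realizr GPU FP8', 'realizr GPU DP4A'):+.2f}%")
```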
| Platform | GPU | Role |
|---|---|---|
| Lambda Vector | RTX 4090, 2520 MHz locked | Primary |
| Yoga | RTX 4060 Laptop, 1900 MHz | Scaling validation |
v2 (current): `probador llm load`, the same tool as qwen-coder-deploy, run with `--concurrency 1 --duration 30s --warmup 5s --max-tokens 256 --stream false --num-layers 28 --gpu-telemetry`.
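For readers unfamiliar with this style of load test, the sketch below shows what a warmup-then-measure loop at concurrency 1 conventionally does. It is an assumption about the general technique, not probador's implementation: `measure_tok_s` and `FakeServer` are hypothetical names, and the fake server replaces a real HTTP call so the example runs instantly.

```python
import time
from typing import Callable

def measure_tok_s(
    send_request: Callable[[], int],
    duration_s: float,
    warmup_s: float,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Run blocking requests back to back (concurrency 1) and return tok/s.

    send_request performs one non-streaming completion and returns the number
    of tokens it generated. Requests sent during warmup are not counted.
    """
    warmup_end = clock() + warmup_s
    while clock() < warmup_end:
        send_request()  # result discarded: lets clocks and caches settle

    start = clock()
    tokens = 0
    while clock() - start < duration_s:
        tokens += send_request()
    return tokens / (clock() - start)


class FakeServer:
    """Hypothetical stand-in: every request takes 0.5 s and yields 128 tokens."""

    def __init__(self) -> None:
        self.now = 0.0

    def clock(self) -> float:
        return self.now

    def request(self) -> int:
        self.now += 0.5
        return 128


server = FakeServer()
rate = measure_tok_s(server.request, duration_s=30.0, warmup_s=5.0, clock=server.clock)
print(f"{rate:.1f} tok/s")
```

Injecting the clock keeps the loop testable without waiting 35 seconds. With a real server, `send_request` would issue one completion request and return the generated-token count from the response.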
CRITICAL: llama.cpp requires `-ngl 99` (all layers on GPU). With `-ngl 28` it drops to 310 tok/s (a 29% penalty from CPU embedding transfer). realizr loads all weights to GPU natively.
Common controls:
- GPU clocks locked at 2520 MHz (eliminates thermal variance)
- Temperature 0 (greedy, deterministic)
- `nvidia-smi` pre-flight (GPU isolation, realizr#190 lesson)
- Binary fingerprinting (PATH ordering, 0.4.11 vs 0.4.12 lesson)
- Upstream bugs filed via `gh` + provable-contracts
# Start realizr
realizr serve --model /path/to/model.gguf --gpu --port 8081 \
--openai-api --context-length 4096
# Benchmark
probador llm load --url http://127.0.0.1:8081 \
--concurrency 1 --duration 30s --warmup 5s \
--max-tokens 256 --stream false --num-layers 28 \
--gpu-telemetry --expected-clock-mhz 2520 \
--runtime-name realizr -o results/benchmark.json
# Candle (CLI only)
quantized-qwen2-instruct --model model.gguf \
--prompt "Write fibonacci" \
--sample-len 256 --temperature 0
# llama.cpp
llama-server --model model.gguf --port 8082 \
-ngl 99 --parallel 1 --flash-attn on --ctx-size 4096

| Path | Purpose |
|---|---|
| `scripts/bootstrap-ci.sh` | Bootstrap CIs + Mann-Whitney U |
| `scripts/bottleneck-gate.sh` | Pre-experiment roofline validation |
| `scripts/run-showdown.sh` | Multi-framework showdown runner |
| `configs/showdown.yaml` | Framework definitions (realizr, llama.cpp, vLLM, ollama) |
| `results/` | JSON results (git-tracked) |
| `docs/specifications/` | Popperian falsification spec |
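The two statistics named for `scripts/bootstrap-ci.sh` can be sketched with the Python standard library. This is a generic illustration of a percentile bootstrap confidence interval and the Mann-Whitney U statistic, not the script's actual contents. The sample values are made-up placeholders, not measurements from this benchmark.

```python
import random
import statistics

def bootstrap_ci(samples, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of samples."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def mann_whitney_u(a, b):
    """U statistic for a: count of pairs (x, y) with x > y; ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

# Placeholder tok/s samples for two hypothetical runs (N=5 each).
run_a = [365.2, 366.8, 367.0, 367.9, 368.6]
run_b = [226.1, 227.0, 227.4, 227.9, 228.3]

ci_low, ci_high = bootstrap_ci(run_a)
u_stat = mann_whitney_u(run_a, run_b)
print(f"95% CI for mean of run_a: [{ci_low:.1f}, {ci_high:.1f}]")
print(f"U = {u_stat} out of {len(run_a) * len(run_b)} pairs")
```

A U equal to the full pair count means every sample in the first run exceeds every sample in the second. With only five samples per run, the bootstrap interval can never extend beyond the observed minimum and maximum.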
Source of truth: candle-vs-apr-spec.md §7.
Score (v15.0.0, 29 F-conditions): 28 tested (12 confirmed, 6 revised, 4 falsified, 2 weakened, 2 fixed, 1 measured, 1 wired). 1 proposed.
Headline results:
- F-1.5X-01 CONFIRMED: realizr 369.9 tok/s = 1.63x Candle
- F-SCALE-01 CONFIRMED: c=32 at 3,220 tok/s on RTX 4090 (8.77x)
- F-PARITY-02 FIXED: c=4 scaling 1.03x → 1.73x (realizr#211)
- F-STREAM-01 CONFIRMED: stream=false within 1% of true (realizr#212)
- F-MULTIWARPC-01 FALSIFIED: 2-warp chunk kernel -1.7%/+1.9% (barrier O(n))
- F-TCATTN-01 FALSIFIED: multi-warp block-level -13.7% (12 blocks on 128 SMs)
- F-QUALITY-01 FALSIFIED: DP4A PPL 42.94 vs FP32 12.97 (int8 precision)
- F-NCU-01 CONFIRMED: 2.15% occupancy root cause identified via NCU
| Repo | Issue | Fix | Impact |
|---|---|---|---|
| realizr | #198 | Graph capture missing SwiGLU recording | +26% (262→329 tok/s) |
| realizr | #211 | Non-streaming batch scheduler routing | +82% c=4 |
| realizr | #212 | stream=false bulk-send after generation | +4.3% c=1 |
| realizr | #203 | FP8 batched prefill PPL | 3.8% PPL improvement |
| trueno | #246 | chunk_size 32→16 | +7.4% short / +45% long ctx |
| trueno | #253 | 2-warp flash decode (falsified) | Correct but not faster |
See spec for full register and evidence.