qwen-coder-deploy

What This Is

A head-to-head benchmark of five different ways to run the same AI model on your GPU. We take Qwen2.5-Coder (a code-generation model) and deploy it across five inference runtimes to answer a simple question: which engine gives you the best throughput, latency, and quality — and at what concurrency?

If you're running local AI and wondering whether Ollama is fast enough, or whether you should switch to llama.cpp or vLLM, this repo has the data.

The Five Runtimes

Runtime	What It Is	Best For
realizar	Rust inference engine with CUDA kernels	Single-user quality (lowest ITL)
Ollama	Popular local AI runtime	Simplicity — `ollama pull` and go
llama.cpp	C++ GGUF inference, the gold standard	CPU inference, stable throughput
vLLM	Python/CUDA production serving	High-concurrency throughput (c=8+)
realizar-wgpu	Rust + Vulkan (AMD/Intel GPUs)	Non-NVIDIA hardware

All five serve the same model (Qwen2.5-Coder-1.5B Q4_K_M) via OpenAI-compatible endpoints, so you can swap runtimes without changing your application code.

Key Findings

Should I ditch Ollama?

For single-user use: Ollama is fine. At concurrency=1, all four GPU runtimes produce nearly identical throughput (136-163 tok/s). Ollama is the easiest to set up.

For multi-user serving: yes, switch. Ollama is serial — it processes one request at a time. At 4 concurrent users, vLLM delivers 598 tok/s vs Ollama's 635 (Ollama batches differently). But at 8+ users, Ollama can't scale at all while vLLM reaches 3,000 tok/s. llama.cpp and realizr both scale to 1,900 tok/s.

The Throughput Picture

RTX 4060 Laptop GPU, 1900 MHz locked, production workload (streaming, mixed output lengths, 60s runs):

Concurrent Users	realizar	llama.cpp	vLLM	Ollama
1	136 tok/s	160	154	163
4	351	351	598	635
8	610	419	1,142	--
16	1,072	912	2,037	--
32	1,895	1,949	2,998	--

What this means:

At 1 user: pick any runtime — they're all within 20% of each other
At 4 users: vLLM pulls ahead (1.7x faster than llama.cpp)
At 8+ users: vLLM dominates on raw throughput, but realizr and llama.cpp offer better per-request quality (lower latency jitter)
At 128 users: realizr actually beats vLLM on quality score (66 vs 64) because vLLM's per-request latency degrades

Larger Models on Blackwell GB10

On NVIDIA's Grace Blackwell (120 GB unified memory), we also tested 7B and 32B models:

Model	1 User	32 Users	HumanEval pass@1
1.5B	101 tok/s	1,677 tok/s	--
7B	31 tok/s	472 tok/s	84.76%
32B	8.4 tok/s	--	90.85% (149/164)

The 32B model achieves 90.85% on HumanEval — near state-of-the-art code generation quality.

How to Replicate

Prerequisites

Linux with NVIDIA GPU (CUDA 12.0+)
forjar — declarative deployment tool
probador — load testing and scoring tool
The model: qwen2.5-coder-1.5b-instruct-q4_k_m.gguf from HuggingFace

Quick Benchmark (single runtime)

# 1. Deploy realizr on your GPU
forjar apply -f forjar-yoga-realizr.yaml

# 2. Run a load test
probador llm load \
    --url http://your-gpu-host:8081 \
    --concurrency 4 \
    --duration 60 \
    --stream true

# 3. Tear down
forjar apply -f forjar-yoga-teardown.yaml

Full Comparative Benchmark

# Run all 4 GPU runtimes in isolated serial mode (deploy one, bench, teardown, repeat)
make bench-yoga-serial

# Generate quality scorecards
make score-prod

# Stop everything
make teardown-yoga

What Gets Measured

Each benchmark captures:

Aggregate throughput (tok/s across all concurrent users)
Decode speed (tok/s per user — how fast text appears)
TTFT (time to first token — how long before you see any output)
ITL (inter-token latency — consistency of token delivery)
Tail latency (P99, P99.9 — worst-case experience)
Error rate (should be 0%)

Results are saved as JSON in results/ and scored via probador's scoring system.

Repository Structure

Path	Purpose
`forjar-yoga-*.yaml`	Deployment configs for each runtime on Yoga (RTX 4060L)
`forjar-gx10.yaml`	Grace Blackwell GB10 deployment
`forjar.yaml`	CPU-only deployment (Intel Xeon)
`prompts/correctness.yaml`	6 test prompts (math, code gen, explanation, JSON, SQL)
`scripts/nightly.sh`	Automated benchmark pipeline (`yoga`, `gx10`, `wgpu`, `cpu`)
`results/`	JSON benchmark results (git-tracked)
`docs/specifications/`	Performance spec (v6.34.0, 414 work items) and scoring contracts

Available Make Targets

# Yoga (RTX 4060L) — primary benchmark platform
make bench-yoga-prod          # All 4 runtimes, production methodology
make bench-yoga-prod-realizr  # realizr only
make bench-yoga-prod-vllm     # vLLM only
make score-prod               # Production scorecards

# GB10 (Blackwell) — larger model testing
make bench-gx10               # realizr on GB10 (requires SSH tunnel)
make test-gx10                # Correctness tests

# CPU (Intel Xeon)
make deploy && make test && make load

# WGPU (AMD Radeon)
make build-wgpu && make deploy-wgpu && make test-wgpu

# Scoring
make score                    # All scorecards
make score-gate               # CI gate: fail if any runtime below C grade

Methodology

All benchmarks use production methodology (introduced at PMAT-177):

Medium prompts (~102 tokens) — realistic, not cherry-picked short prompts
Uniform output distribution (16-256 tokens) — simulates real traffic
Streaming enabled — measures TTFT and ITL, not just batch throughput
60-second runs with 5-second warmup — steady-state, not burst
Locked GPU clocks (1900 MHz) — eliminates thermal throttle variance
Isolated serial — one runtime at a time, clean GPU state between runs

Quality scoring uses absolute thresholds (not relative rankings), with jitter penalties and best-in-class bonuses. Full scoring contract: scoring.yaml.

Deep Dive

The full 414-item performance specification documents every optimization attempt, including 16 kernel fusion approaches that were tested and falsified:

gpu-performance-spec.md (v6.34.0)

This follows Popperian falsification methodology — every claim has a prediction that can be disproved by measurement. The spec includes profiling data, root cause analyses, and academic references for the architectural decisions.

Name		Name	Last commit message	Last commit date
Latest commit History 475 Commits
.claude/commands		.claude/commands
.github/workflows		.github/workflows
bench-results-v2		bench-results-v2
docs		docs
prompts		prompts
results		results
scripts		scripts
systemd		systemd
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
Makefile		Makefile
README.md		README.md
forjar-gpu-llamacpp.yaml		forjar-gpu-llamacpp.yaml
forjar-gpu-realizr.yaml		forjar-gpu-realizr.yaml
forjar-gpu-teardown.yaml		forjar-gpu-teardown.yaml
forjar-gpu.yaml		forjar-gpu.yaml
forjar-gx10.yaml		forjar-gx10.yaml
forjar-intel-wgpu.yaml		forjar-intel-wgpu.yaml
forjar-jetson-llamacpp.yaml		forjar-jetson-llamacpp.yaml
forjar-jetson-ollama.yaml		forjar-jetson-ollama.yaml
forjar-jetson-realizr.yaml		forjar-jetson-realizr.yaml
forjar-jetson-teardown.yaml		forjar-jetson-teardown.yaml
forjar-jetson.yaml		forjar-jetson.yaml
forjar-teardown.yaml		forjar-teardown.yaml
forjar-yoga-llamacpp.yaml		forjar-yoga-llamacpp.yaml
forjar-yoga-ollama.yaml		forjar-yoga-ollama.yaml
forjar-yoga-realizr.yaml		forjar-yoga-realizr.yaml
forjar-yoga-teardown.yaml		forjar-yoga-teardown.yaml
forjar-yoga-vllm.yaml		forjar-yoga-vllm.yaml
forjar.yaml		forjar.yaml
performance.md		performance.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

qwen-coder-deploy

What This Is

The Five Runtimes

Key Findings

Should I ditch Ollama?

The Throughput Picture

Larger Models on Blackwell GB10

How to Replicate

Prerequisites

Quick Benchmark (single runtime)

Full Comparative Benchmark

What Gets Measured

Repository Structure

Available Make Targets

Methodology

Deep Dive

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

qwen-coder-deploy

What This Is

The Five Runtimes

Key Findings

Should I ditch Ollama?

The Throughput Picture

Larger Models on Blackwell GB10

How to Replicate

Prerequisites

Quick Benchmark (single runtime)

Full Comparative Benchmark

What Gets Measured

Repository Structure

Available Make Targets

Methodology

Deep Dive

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages