Pascal/p5200 build#99
Merged
Merged
Conversation
Working notes for getting ht-llama.cpp running on the Quadro P5200 (Pascal sm_61, 16 GB). Toolkit wall: CUDA 13 dropped sm_61, so CUDA backend requires aur/cuda-pascal 12.9.1 + gcc14. Driver 580 still runs sm_61 binaries fine. Vulkan baseline (Llama-3.1-8B Q4_K_M, ngl=99, fa=0, build f6feddb): pp128 269 t/s pp512 278 t/s pp2048 251 t/s tg32 35 t/s tg128 35 t/s CUDA results pending cuda-pascal install (gcc14 source build dominates). Untracked primer (quadro-p5200-llamacpp-primer.md) referenced as the source for the FP16-1/64-FP32, INT8 DP4A, and ggml-org#7188 FA-fix facts.
CUDA 12.9 toolkit built and benched on the Quadro P5200 (Pascal sm_61). Five obstacles climbed on stock Arch: 1) CUDA 13 dropped Pascal → installed 12.9 from runfile --extract 2) Runfile libxml2.so.2 missing → bypassed installer with --extract 3) gcc-15/16 too new for nvcc → gcc-14 from archlinux-archive 4) gcc14 AUR source-build slow → 51MB binary pkg.tar.zst (30s install) 5) glibc 2.43 cospi/sinpi clash → noexcept(true) patch on CUDA math.h+hpp Full recipe in scripts/build-pascal-p5200.md. Bench summary (Llama-2-7B Q4_0, ngl=99, build 5159fee, P5200 sm_61): CUDA fa=1: pp512=795, tg128=45.8 t/s Vulkan: pp512=418, tg128=43.0 t/s CUDA wins pp ~1.9x, tg within 7% (bandwidth-bound). ggml_cuda_init confirms: compute capability 6.1, VMM yes, GGML_CUDA_FORCE_MMQ baked in (visible in nvcc cmdline). CC6.1 + MMQ-only + no cuBLAS fallback = the INT8 dp4a path is what is running. JSON artifacts committed alongside for replay/comparison.
…marchy ISO Adds §7 Packaging covering the cmake-install + patchelf + symlink-chain + stage-and-tar pipeline that produces pascal-cuda-artifacts.tar.zst (the runtime fast-path consumed by hai-os-dev's autoinstall). Also drops the stale "TODO — fill in once build-cuda completes" placeholder and moves Sources to the true end of the doc. Recipe reproduces the verified-clean tarball: rpath stripped on all installed targets, libllama/libllama-common copied + patchelfed, symlinks recreated, members rooted at opt/ for `tar -C / -xf` extraction, ld.so.conf.d snippet documented so no LD_LIBRARY_PATH is needed at runtime.
Was: prose-estimate "~810 MB before zstd, ~470 MB after" — actual is ~816 MB unpacked, 512 MB compressed (110 members). Adds the reference sha256 from the verified crystal build for hai-os-dev to byte-check against. Notes zstd non-determinism so re-runs are expected to differ.
Round out the Pascal/P5200 enablement bundle (PR #99) with the two human-facing companions to scripts/build-pascal-p5200.md: - quadro-p5200-llamacpp-primer.md: Pascal/GP104 + llama.cpp field guide (the two facts that drive every decision, CUDA vs Vulkan, measured 1080-parity numbers, 16 GB VRAM sizing, optimization checklist). - quadro-p5200-omarchy-autoinstall.md: 7-question handoff guide for hai-os-dev — extra packages (no AUR), CUDA-12.9 runfile pin, build flags, the five obstacles + fixes (glibc 2.43 noexcept patch incl.), pre-build at image time, HaiOS integration points, verified 512 MB / sha256 0efed65... reference tarball, measured baseline. Both docs reference the canonical recipe at scripts/build-pascal-p5200.md and the verified tarball cached at crystal:/home/me/pascal-cuda-artifacts.tar.zst.
Recipe update: add -DLLAMA_BUILD_SERVER=ON + -DLLAMA_BUILD_TESTS=OFF to
the CUDA configure step. Required for the llama-app unified router
(bin/llama) to link — without server-on, libllama-server-impl.so is not
built and llama-app link fails with `cannot find -lllama-server-impl`.
Also required for Gemma4 MTP: ctx_other wiring for the Gemma4Assistant
draft class lives only in tools/server/server-context.cpp; the
standalone llama-speculative-simple binary segfaults with
"Gemma4Assistant requires ctx_other to be set".
Rationale block also captures the spec-decode footgun: --spec-type
defaults to `none`, so -md <draft> alone is silently ignored. Must pass
--spec-type draft-mtp to engage. The /props
default_generation_settings.params["speculative.types"] field is
per-REQUEST sampler default, NOT the server engine state — the
canonical engagement read is server stderr (draft acceptance line +
statistics draft-mtp: ... summary).
Bench JSONs (crystal Pascal P5200, Gemma4 12B QAT Q4_K_XL, sm_61 CUDA
FORCE_MMQ, -fa on, -ngl 99, ctx 4096, greedy temp=0/top_k=1):
baseline (no MTP, llama-bench):
pp128=465.71 t/s, pp512=456.37 t/s
tg32=25.54 t/s, tg128=25.54 t/s (flat — bandwidth-bound)
MTP A/B via `llama serve` /completion (degenerate "0"×128 output):
A baseline (--spec-type none): 25.26 t/s
B MTP (--spec-type draft-mtp): 103.72 t/s ← 4.11× CEILING
draft acceptance: 1.00 (118/118) — trivially predictable, not deployment
MTP A/B via /v1/chat/completions (non-degenerate, 256 tokens):
A baseline: 25.18 t/s
B MTP: 76.06 t/s ← 3.02× REPRESENTATIVE greedy speedup
draft acceptance: 0.7627 (225/295)
bit-identical content sha A vs B (greedy lossless property)
All three regimes labeled in the JSON so 4.11× isn't quoted as the
deployment number — the representative ~3× greedy or the
memory-recorded titan 1.66× (default sampling) are the honest reads.
…umbers Follow-on to 3662be4 (v2 build flags). Lands the doc side of the LLAMA_BUILD_SERVER=ON v2 build into the two human-facing companions. omarchy autoinstall guide §6: - v1/v2 tarball table: v2 = pascal-cuda-artifacts-v2-server.tar.zst, sha 2528d952..., 515.5 MB, 121 members, server+router scope. v1 (0efed65..., untouched) stays valid for non-serving bakes; v2 is the additive serving-capable successor, not a recall. - serving footgun: --spec-type defaults to `none` (-md silently ignored); engagement proof is server stderr, not /props. - Gemma4 MTP results, three clearly-labeled regimes (lossless A/B): 4.11x degenerate ceiling / 3.02x representative greedy (headline) / 1.66x sampling deployment ref. build-pascal §7 packaging: - version the tarball filename; never overwrite a live pull source. - v1/v2 size+sha table. - reconcile the stale "router not in this tarball" section to v2 reality: member-delta (+11), single-.so impls, lib64 prune, extraction-validate. - note that bin/llama-server / bin/llama-cli are separate targets, not in llama-app's dep closure (reproducing v2 needs them in --target). Also folds in a one-line build-target fix (line 80: add llama-server + llama-cli to --target) that landed in the shared tree from the fork-manager session concurrently — verified correct, kept so the recipe reproduces v2.
…gpu-only, vision+MTP Closes 4 of 10 issue #100 bullets on Pascal P5200 (v2 server-capable build): - bullet 1 (llama-server): standalone /opt/ht-llama-cuda/bin/llama-server → ready in 4s, /health 200, /completion 40 tok @ 25.84 t/s - bullet 2 (llama-server router): unified `bin/llama serve` shim → ready in 4s, /health 200, /completion 40 tok @ 25.82 t/s - bullet 3 (gpu only works): both runs above use -ngl 99 -fa on - bullet 6 (gemma4 12b qat mtp all modalities): combined mmproj + draft-mtp + spec engine → A. coexistence: /v1/chat with image_url + --spec-type draft-mtp engaged → predicted=96, stderr draft acceptance = 0.66102 (78/118) → B. grounding (decoupled to mtmd-cli, avoids Gemma4 chat-template empty-content quirk): all 3 ground-truth features matched (PASCAL, P5200, red rectangle); requires --jinja (otherwise std::runtime_error custom-template-not-supported abort). Methodology: - regression band ±3% pinned vs committed baseline 25.54 tg / 76.05 MTP; both server-router runs in band (24.77-26.31 t/s window). - engagement read on stderr (draft acceptance / draft-mtp stats), NOT /props (per --spec-type footgun memory). - chat-content quirk explicitly noted in JSON so empty content does not read as fail or regression. Bullets 4 (gpu+cpu offload) + 7-10 (qwen 27B/35B-MoE / gemma 26B/31B) land in subsequent commits once the lithium IQ3-class + titan 31B IQ4_XS transfers complete on crystal.
…ss-harness note) Regression bench (task #15): re-ran the gemma4-12B-QAT bench from regime-2 on the v2 server-capable build to lock the no-regression gate against the committed baseline (25.54 t/s) and MTP reference (76.05 t/s). - Baseline /v1/chat greedy: mean 24.96 t/s across 3 reps (-2.31% vs 25.54), in band. - MTP /v1/chat greedy: mean 75.07 t/s across 3 reps (-1.29% vs 76.05), in band. - Draft acceptance: 0.76271 — bit-identical to committed regime-2 (225/295 accepted/generated). Strong determinism proof. Fold-in nits on the 2 already-committed JSONs (no dedicated fix commit per crystal-assist's review): - Memory slug citations switched from hyphen-form to underscore-form to match the actual slug names (feedback_spec_type_footgun, reference_mtmd_cli_jinja_required) — resolves to exact-match in tooling. - bench-pascal-server-router-smoke.json: added cross_harness_note clarifying the +1.2% server-endpoint vs llama-bench tg128 agreement STRENGTHENS the no-regression claim (different harnesses, in band).
…t fold-ins
Three model bench JSONs from the v2 server-capable Pascal build:
bench-pascal-qwen3.6-27b-iq3-xxs.json (bullet 8): Qwen3.6-27B at
UD-IQ3_XXS (11.17 GiB), -ngl 99 -fa on -c 4096 → mode=full-gpu,
65/0/65 layers, gpu_residency_pct=95.45%. /completion 11.27 t/s,
/v1/chat 10.44 t/s, gpu free 4 GiB after load. Content reply:
"The capital of France is Paris." (qwen3.6 thinking mode active).
bench-pascal-gemma4-31b-iq4-xs-offload-{ngl40,ngl48}.json (bullets 4 + 10):
ngl=40 phase-1 → ngl=48 phase-2-verify accelerator (crystal-assist's
recipe): per_layer_combined = (gpu.model + gpu.context) / layers_gpu
at ngl=40 = 309 MiB; ngl_max = 40 + floor((2954 - 400) / 309) = 48.
Phase-2 verify PASS at -ngl=48: 49/62 layers GPU, 13/62 layers CPU,
4.95 t/s /completion, 4.98 t/s /v1/chat. Dense-layer partial-offload,
host_model=3967 MiB, host_context=768 MiB. Card 96% utilized.
gemma-4-31B-IQ4_XS is the smallest 31B quant available anywhere on
titan or lithium (sweep done by crystal-assist) — confirms 31B = the
documented offload-demo model, closes bullets 4 AND 10 in one bench.
Regression rerun JSON: minor wording fix — the bit-identical content
sha 01ba4719c80b6fe9 is sha256(b"\n") (single newline), not empty
string or null. Banks the harness blind-spot that hashing `jq -r .content`
output cannot distinguish JSON-null vs "" vs literal "null" vs "\n".
A==B determinism conclusion stands (per crystal-assist review).
…emma4 26B MoE Closes the last two model bullets: bench-pascal-qwen3.6-35b-a3b-iq3-xxs.json (bullet 7): Qwen3.6-35B-A3B (MoE, 3B active) at UD-IQ3_XXS (12.30 GiB), -ngl 99 -fa on -c 4096 → mode=full-gpu, 41/0/41 layers, gpu_residency_pct=96.85%. /completion 44.45 t/s, /v1/chat 40.96 t/s — fastest of any tested model (3B active keeps per-token compute light). Content reply: "The capital of France is Paris." VRAM 13003/16384 MiB after load (3 GiB headroom). bench-pascal-gemma4-26b-a4b-iq4-xs.json (bullet 9): Gemma4-26B-A4B (MoE, 128 experts / 8 active per token) at UD-IQ4_XS (12.66 GiB), -ngl 99 -fa on -c 4096 → 31/0/31 layers on GPU, /completion 42.10 t/s, /v1/chat 42.36 t/s. Content reply: "### Answer: The capital of France is Paris." VRAM 14345/16384 MiB after load. Classifier note (banked in JSON): the 26B host_model=748 MiB tripped the harness's 600-MiB expert-MoE threshold. Inspection of the gemma4 config (vocab=262144, hidden=5120, IQ4_XS bytes/weight) confirms 748 is the embedding tensor + boundary buffers (≈671 MiB pure embedding), NOT expert offload — all 128 experts are in gpu.model_mib=12952. PRIMARY layer-count signal (31/0/31) correctly reads full-GPU. The 600 MiB threshold was calibrated to 12B embeddings (~540 MiB) and under-scales for larger vocab×hidden_dim products. Mode patched to full-gpu with classifier_note explaining the misfire + suggested remediation (host_model_pct_of_total < 10-15% = embedding-pattern; ≥ that = real expert offload). All 10 issue #100 bullets now have committed bench evidence.
10 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overview
Additional information
Requirements