Pascal/p5200 build by marksverdhei · Pull Request #99 · heiervang-technologies/ht-llama.cpp

marksverdhei · 2026-06-09T09:51:19Z

Overview

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:

Working notes for getting ht-llama.cpp running on the Quadro P5200 (Pascal sm_61, 16 GB). Toolkit wall: CUDA 13 dropped sm_61, so CUDA backend requires aur/cuda-pascal 12.9.1 + gcc14. Driver 580 still runs sm_61 binaries fine. Vulkan baseline (Llama-3.1-8B Q4_K_M, ngl=99, fa=0, build f6feddb): pp128 269 t/s pp512 278 t/s pp2048 251 t/s tg32 35 t/s tg128 35 t/s CUDA results pending cuda-pascal install (gcc14 source build dominates). Untracked primer (quadro-p5200-llamacpp-primer.md) referenced as the source for the FP16-1/64-FP32, INT8 DP4A, and ggml-org#7188 FA-fix facts.

CUDA 12.9 toolkit built and benched on the Quadro P5200 (Pascal sm_61). Five obstacles climbed on stock Arch: 1) CUDA 13 dropped Pascal → installed 12.9 from runfile --extract 2) Runfile libxml2.so.2 missing → bypassed installer with --extract 3) gcc-15/16 too new for nvcc → gcc-14 from archlinux-archive 4) gcc14 AUR source-build slow → 51MB binary pkg.tar.zst (30s install) 5) glibc 2.43 cospi/sinpi clash → noexcept(true) patch on CUDA math.h+hpp Full recipe in scripts/build-pascal-p5200.md. Bench summary (Llama-2-7B Q4_0, ngl=99, build 5159fee, P5200 sm_61): CUDA fa=1: pp512=795, tg128=45.8 t/s Vulkan: pp512=418, tg128=43.0 t/s CUDA wins pp ~1.9x, tg within 7% (bandwidth-bound). ggml_cuda_init confirms: compute capability 6.1, VMM yes, GGML_CUDA_FORCE_MMQ baked in (visible in nvcc cmdline). CC6.1 + MMQ-only + no cuBLAS fallback = the INT8 dp4a path is what is running. JSON artifacts committed alongside for replay/comparison.

…marchy ISO Adds §7 Packaging covering the cmake-install + patchelf + symlink-chain + stage-and-tar pipeline that produces pascal-cuda-artifacts.tar.zst (the runtime fast-path consumed by hai-os-dev's autoinstall). Also drops the stale "TODO — fill in once build-cuda completes" placeholder and moves Sources to the true end of the doc. Recipe reproduces the verified-clean tarball: rpath stripped on all installed targets, libllama/libllama-common copied + patchelfed, symlinks recreated, members rooted at opt/ for `tar -C / -xf` extraction, ld.so.conf.d snippet documented so no LD_LIBRARY_PATH is needed at runtime.

Was: prose-estimate "~810 MB before zstd, ~470 MB after" — actual is ~816 MB unpacked, 512 MB compressed (110 members). Adds the reference sha256 from the verified crystal build for hai-os-dev to byte-check against. Notes zstd non-determinism so re-runs are expected to differ.

Round out the Pascal/P5200 enablement bundle (PR #99) with the two human-facing companions to scripts/build-pascal-p5200.md: - quadro-p5200-llamacpp-primer.md: Pascal/GP104 + llama.cpp field guide (the two facts that drive every decision, CUDA vs Vulkan, measured 1080-parity numbers, 16 GB VRAM sizing, optimization checklist). - quadro-p5200-omarchy-autoinstall.md: 7-question handoff guide for hai-os-dev — extra packages (no AUR), CUDA-12.9 runfile pin, build flags, the five obstacles + fixes (glibc 2.43 noexcept patch incl.), pre-build at image time, HaiOS integration points, verified 512 MB / sha256 0efed65... reference tarball, measured baseline. Both docs reference the canonical recipe at scripts/build-pascal-p5200.md and the verified tarball cached at crystal:/home/me/pascal-cuda-artifacts.tar.zst.

Recipe update: add -DLLAMA_BUILD_SERVER=ON + -DLLAMA_BUILD_TESTS=OFF to the CUDA configure step. Required for the llama-app unified router (bin/llama) to link — without server-on, libllama-server-impl.so is not built and llama-app link fails with `cannot find -lllama-server-impl`. Also required for Gemma4 MTP: ctx_other wiring for the Gemma4Assistant draft class lives only in tools/server/server-context.cpp; the standalone llama-speculative-simple binary segfaults with "Gemma4Assistant requires ctx_other to be set". Rationale block also captures the spec-decode footgun: --spec-type defaults to `none`, so -md <draft> alone is silently ignored. Must pass --spec-type draft-mtp to engage. The /props default_generation_settings.params["speculative.types"] field is per-REQUEST sampler default, NOT the server engine state — the canonical engagement read is server stderr (draft acceptance line + statistics draft-mtp: ... summary). Bench JSONs (crystal Pascal P5200, Gemma4 12B QAT Q4_K_XL, sm_61 CUDA FORCE_MMQ, -fa on, -ngl 99, ctx 4096, greedy temp=0/top_k=1): baseline (no MTP, llama-bench): pp128=465.71 t/s, pp512=456.37 t/s tg32=25.54 t/s, tg128=25.54 t/s (flat — bandwidth-bound) MTP A/B via `llama serve` /completion (degenerate "0"×128 output): A baseline (--spec-type none): 25.26 t/s B MTP (--spec-type draft-mtp): 103.72 t/s ← 4.11× CEILING draft acceptance: 1.00 (118/118) — trivially predictable, not deployment MTP A/B via /v1/chat/completions (non-degenerate, 256 tokens): A baseline: 25.18 t/s B MTP: 76.06 t/s ← 3.02× REPRESENTATIVE greedy speedup draft acceptance: 0.7627 (225/295) bit-identical content sha A vs B (greedy lossless property) All three regimes labeled in the JSON so 4.11× isn't quoted as the deployment number — the representative ~3× greedy or the memory-recorded titan 1.66× (default sampling) are the honest reads.

…umbers Follow-on to 3662be4 (v2 build flags). Lands the doc side of the LLAMA_BUILD_SERVER=ON v2 build into the two human-facing companions. omarchy autoinstall guide §6: - v1/v2 tarball table: v2 = pascal-cuda-artifacts-v2-server.tar.zst, sha 2528d952..., 515.5 MB, 121 members, server+router scope. v1 (0efed65..., untouched) stays valid for non-serving bakes; v2 is the additive serving-capable successor, not a recall. - serving footgun: --spec-type defaults to `none` (-md silently ignored); engagement proof is server stderr, not /props. - Gemma4 MTP results, three clearly-labeled regimes (lossless A/B): 4.11x degenerate ceiling / 3.02x representative greedy (headline) / 1.66x sampling deployment ref. build-pascal §7 packaging: - version the tarball filename; never overwrite a live pull source. - v1/v2 size+sha table. - reconcile the stale "router not in this tarball" section to v2 reality: member-delta (+11), single-.so impls, lib64 prune, extraction-validate. - note that bin/llama-server / bin/llama-cli are separate targets, not in llama-app's dep closure (reproducing v2 needs them in --target). Also folds in a one-line build-target fix (line 80: add llama-server + llama-cli to --target) that landed in the shared tree from the fork-manager session concurrently — verified correct, kept so the recipe reproduces v2.

…gpu-only, vision+MTP Closes 4 of 10 issue #100 bullets on Pascal P5200 (v2 server-capable build): - bullet 1 (llama-server): standalone /opt/ht-llama-cuda/bin/llama-server → ready in 4s, /health 200, /completion 40 tok @ 25.84 t/s - bullet 2 (llama-server router): unified `bin/llama serve` shim → ready in 4s, /health 200, /completion 40 tok @ 25.82 t/s - bullet 3 (gpu only works): both runs above use -ngl 99 -fa on - bullet 6 (gemma4 12b qat mtp all modalities): combined mmproj + draft-mtp + spec engine → A. coexistence: /v1/chat with image_url + --spec-type draft-mtp engaged → predicted=96, stderr draft acceptance = 0.66102 (78/118) → B. grounding (decoupled to mtmd-cli, avoids Gemma4 chat-template empty-content quirk): all 3 ground-truth features matched (PASCAL, P5200, red rectangle); requires --jinja (otherwise std::runtime_error custom-template-not-supported abort). Methodology: - regression band ±3% pinned vs committed baseline 25.54 tg / 76.05 MTP; both server-router runs in band (24.77-26.31 t/s window). - engagement read on stderr (draft acceptance / draft-mtp stats), NOT /props (per --spec-type footgun memory). - chat-content quirk explicitly noted in JSON so empty content does not read as fail or regression. Bullets 4 (gpu+cpu offload) + 7-10 (qwen 27B/35B-MoE / gemma 26B/31B) land in subsequent commits once the lithium IQ3-class + titan 31B IQ4_XS transfers complete on crystal.

…ss-harness note) Regression bench (task #15): re-ran the gemma4-12B-QAT bench from regime-2 on the v2 server-capable build to lock the no-regression gate against the committed baseline (25.54 t/s) and MTP reference (76.05 t/s). - Baseline /v1/chat greedy: mean 24.96 t/s across 3 reps (-2.31% vs 25.54), in band. - MTP /v1/chat greedy: mean 75.07 t/s across 3 reps (-1.29% vs 76.05), in band. - Draft acceptance: 0.76271 — bit-identical to committed regime-2 (225/295 accepted/generated). Strong determinism proof. Fold-in nits on the 2 already-committed JSONs (no dedicated fix commit per crystal-assist's review): - Memory slug citations switched from hyphen-form to underscore-form to match the actual slug names (feedback_spec_type_footgun, reference_mtmd_cli_jinja_required) — resolves to exact-match in tooling. - bench-pascal-server-router-smoke.json: added cross_harness_note clarifying the +1.2% server-endpoint vs llama-bench tg128 agreement STRENGTHENS the no-regression claim (different harnesses, in band).

…t fold-ins Three model bench JSONs from the v2 server-capable Pascal build: bench-pascal-qwen3.6-27b-iq3-xxs.json (bullet 8): Qwen3.6-27B at UD-IQ3_XXS (11.17 GiB), -ngl 99 -fa on -c 4096 → mode=full-gpu, 65/0/65 layers, gpu_residency_pct=95.45%. /completion 11.27 t/s, /v1/chat 10.44 t/s, gpu free 4 GiB after load. Content reply: "The capital of France is Paris." (qwen3.6 thinking mode active). bench-pascal-gemma4-31b-iq4-xs-offload-{ngl40,ngl48}.json (bullets 4 + 10): ngl=40 phase-1 → ngl=48 phase-2-verify accelerator (crystal-assist's recipe): per_layer_combined = (gpu.model + gpu.context) / layers_gpu at ngl=40 = 309 MiB; ngl_max = 40 + floor((2954 - 400) / 309) = 48. Phase-2 verify PASS at -ngl=48: 49/62 layers GPU, 13/62 layers CPU, 4.95 t/s /completion, 4.98 t/s /v1/chat. Dense-layer partial-offload, host_model=3967 MiB, host_context=768 MiB. Card 96% utilized. gemma-4-31B-IQ4_XS is the smallest 31B quant available anywhere on titan or lithium (sweep done by crystal-assist) — confirms 31B = the documented offload-demo model, closes bullets 4 AND 10 in one bench. Regression rerun JSON: minor wording fix — the bit-identical content sha 01ba4719c80b6fe9 is sha256(b"\n") (single newline), not empty string or null. Banks the harness blind-spot that hashing `jq -r .content` output cannot distinguish JSON-null vs "" vs literal "null" vs "\n". A==B determinism conclusion stands (per crystal-assist review).

…emma4 26B MoE Closes the last two model bullets: bench-pascal-qwen3.6-35b-a3b-iq3-xxs.json (bullet 7): Qwen3.6-35B-A3B (MoE, 3B active) at UD-IQ3_XXS (12.30 GiB), -ngl 99 -fa on -c 4096 → mode=full-gpu, 41/0/41 layers, gpu_residency_pct=96.85%. /completion 44.45 t/s, /v1/chat 40.96 t/s — fastest of any tested model (3B active keeps per-token compute light). Content reply: "The capital of France is Paris." VRAM 13003/16384 MiB after load (3 GiB headroom). bench-pascal-gemma4-26b-a4b-iq4-xs.json (bullet 9): Gemma4-26B-A4B (MoE, 128 experts / 8 active per token) at UD-IQ4_XS (12.66 GiB), -ngl 99 -fa on -c 4096 → 31/0/31 layers on GPU, /completion 42.10 t/s, /v1/chat 42.36 t/s. Content reply: "### Answer: The capital of France is Paris." VRAM 14345/16384 MiB after load. Classifier note (banked in JSON): the 26B host_model=748 MiB tripped the harness's 600-MiB expert-MoE threshold. Inspection of the gemma4 config (vocab=262144, hidden=5120, IQ4_XS bytes/weight) confirms 748 is the embedding tensor + boundary buffers (≈671 MiB pure embedding), NOT expert offload — all 128 experts are in gpu.model_mib=12952. PRIMARY layer-count signal (31/0/31) correctly reads full-GPU. The 600 MiB threshold was calibrated to 12B embeddings (~540 MiB) and under-scales for larger vocab×hidden_dim products. Mode patched to full-gpu with classifier_note explaining the misfire + suggested remediation (host_model_pct_of_total < 10-15% = embedding-pattern; ≥ that = real expert offload). All 10 issue #100 bullets now have committed bench evidence.

marksverdhei and others added 11 commits June 9, 2026 10:43

marksverdhei mentioned this pull request Jun 9, 2026

Pascal / P5200: Desired model list / milestone #100

Closed

10 tasks

marksverdhei marked this pull request as ready for review June 9, 2026 15:34

marksverdhei merged commit 626a757 into ht Jun 9, 2026

marksverdhei deleted the pascal/p5200-build branch June 9, 2026 15:34

marksverdhei mentioned this pull request Jun 12, 2026

docs(readme): complete HT Fork Changes inventory with per-change justifications #106

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pascal/p5200 build#99

Pascal/p5200 build#99
marksverdhei merged 11 commits into
htfrom
pascal/p5200-build

marksverdhei commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

marksverdhei commented Jun 9, 2026

Overview

Additional information

Requirements

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant