Skip to content

Pascal/p5200 build#99

Merged
marksverdhei merged 11 commits into
htfrom
pascal/p5200-build
Jun 9, 2026
Merged

Pascal/p5200 build#99
marksverdhei merged 11 commits into
htfrom
pascal/p5200-build

Conversation

@marksverdhei

Copy link
Copy Markdown

Overview

Additional information

Requirements

marksverdhei and others added 11 commits June 9, 2026 10:43
Working notes for getting ht-llama.cpp running on the Quadro P5200
(Pascal sm_61, 16 GB). Toolkit wall: CUDA 13 dropped sm_61, so CUDA
backend requires aur/cuda-pascal 12.9.1 + gcc14. Driver 580 still runs
sm_61 binaries fine.

Vulkan baseline (Llama-3.1-8B Q4_K_M, ngl=99, fa=0, build f6feddb):
  pp128  269 t/s  pp512  278 t/s  pp2048 251 t/s
  tg32    35 t/s  tg128   35 t/s

CUDA results pending cuda-pascal install (gcc14 source build dominates).

Untracked primer (quadro-p5200-llamacpp-primer.md) referenced as the
source for the FP16-1/64-FP32, INT8 DP4A, and ggml-org#7188 FA-fix facts.
CUDA 12.9 toolkit built and benched on the Quadro P5200 (Pascal sm_61).
Five obstacles climbed on stock Arch:
1) CUDA 13 dropped Pascal       → installed 12.9 from runfile --extract
2) Runfile libxml2.so.2 missing → bypassed installer with --extract
3) gcc-15/16 too new for nvcc   → gcc-14 from archlinux-archive
4) gcc14 AUR source-build slow  → 51MB binary pkg.tar.zst (30s install)
5) glibc 2.43 cospi/sinpi clash → noexcept(true) patch on CUDA math.h+hpp

Full recipe in scripts/build-pascal-p5200.md.

Bench summary (Llama-2-7B Q4_0, ngl=99, build 5159fee, P5200 sm_61):
  CUDA fa=1: pp512=795, tg128=45.8 t/s
  Vulkan:    pp512=418, tg128=43.0 t/s
  CUDA wins pp ~1.9x, tg within 7% (bandwidth-bound).

ggml_cuda_init confirms: compute capability 6.1, VMM yes, GGML_CUDA_FORCE_MMQ
baked in (visible in nvcc cmdline). CC6.1 + MMQ-only + no cuBLAS fallback =
the INT8 dp4a path is what is running.

JSON artifacts committed alongside for replay/comparison.
…marchy ISO

Adds §7 Packaging covering the cmake-install + patchelf + symlink-chain +
stage-and-tar pipeline that produces pascal-cuda-artifacts.tar.zst (the
runtime fast-path consumed by hai-os-dev's autoinstall). Also drops the
stale "TODO — fill in once build-cuda completes" placeholder and moves
Sources to the true end of the doc.

Recipe reproduces the verified-clean tarball: rpath stripped on all
installed targets, libllama/libllama-common copied + patchelfed, symlinks
recreated, members rooted at opt/ for `tar -C / -xf` extraction, ld.so.conf.d
snippet documented so no LD_LIBRARY_PATH is needed at runtime.
Was: prose-estimate "~810 MB before zstd, ~470 MB after" — actual is
~816 MB unpacked, 512 MB compressed (110 members). Adds the reference
sha256 from the verified crystal build for hai-os-dev to byte-check
against. Notes zstd non-determinism so re-runs are expected to differ.
Round out the Pascal/P5200 enablement bundle (PR #99) with the two
human-facing companions to scripts/build-pascal-p5200.md:

- quadro-p5200-llamacpp-primer.md: Pascal/GP104 + llama.cpp field guide
  (the two facts that drive every decision, CUDA vs Vulkan, measured
  1080-parity numbers, 16 GB VRAM sizing, optimization checklist).
- quadro-p5200-omarchy-autoinstall.md: 7-question handoff guide for
  hai-os-dev — extra packages (no AUR), CUDA-12.9 runfile pin, build
  flags, the five obstacles + fixes (glibc 2.43 noexcept patch incl.),
  pre-build at image time, HaiOS integration points, verified
  512 MB / sha256 0efed65... reference tarball, measured baseline.

Both docs reference the canonical recipe at scripts/build-pascal-p5200.md
and the verified tarball cached at crystal:/home/me/pascal-cuda-artifacts.tar.zst.
Recipe update: add -DLLAMA_BUILD_SERVER=ON + -DLLAMA_BUILD_TESTS=OFF to
the CUDA configure step. Required for the llama-app unified router
(bin/llama) to link — without server-on, libllama-server-impl.so is not
built and llama-app link fails with `cannot find -lllama-server-impl`.
Also required for Gemma4 MTP: ctx_other wiring for the Gemma4Assistant
draft class lives only in tools/server/server-context.cpp; the
standalone llama-speculative-simple binary segfaults with
"Gemma4Assistant requires ctx_other to be set".

Rationale block also captures the spec-decode footgun: --spec-type
defaults to `none`, so -md <draft> alone is silently ignored. Must pass
--spec-type draft-mtp to engage. The /props
default_generation_settings.params["speculative.types"] field is
per-REQUEST sampler default, NOT the server engine state — the
canonical engagement read is server stderr (draft acceptance line +
statistics draft-mtp: ... summary).

Bench JSONs (crystal Pascal P5200, Gemma4 12B QAT Q4_K_XL, sm_61 CUDA
FORCE_MMQ, -fa on, -ngl 99, ctx 4096, greedy temp=0/top_k=1):

  baseline (no MTP, llama-bench):
    pp128=465.71 t/s, pp512=456.37 t/s
    tg32=25.54 t/s,  tg128=25.54 t/s  (flat — bandwidth-bound)

  MTP A/B via `llama serve` /completion (degenerate "0"×128 output):
    A baseline (--spec-type none):     25.26 t/s
    B MTP (--spec-type draft-mtp):    103.72 t/s   ← 4.11× CEILING
    draft acceptance: 1.00 (118/118)  — trivially predictable, not deployment

  MTP A/B via /v1/chat/completions (non-degenerate, 256 tokens):
    A baseline: 25.18 t/s
    B MTP:      76.06 t/s   ← 3.02× REPRESENTATIVE greedy speedup
    draft acceptance: 0.7627 (225/295)
    bit-identical content sha A vs B (greedy lossless property)

All three regimes labeled in the JSON so 4.11× isn't quoted as the
deployment number — the representative ~3× greedy or the
memory-recorded titan 1.66× (default sampling) are the honest reads.
…umbers

Follow-on to 3662be4 (v2 build flags). Lands the doc side of the
LLAMA_BUILD_SERVER=ON v2 build into the two human-facing companions.

omarchy autoinstall guide §6:
- v1/v2 tarball table: v2 = pascal-cuda-artifacts-v2-server.tar.zst,
  sha 2528d952..., 515.5 MB, 121 members, server+router scope. v1
  (0efed65..., untouched) stays valid for non-serving bakes; v2 is the
  additive serving-capable successor, not a recall.
- serving footgun: --spec-type defaults to `none` (-md silently ignored);
  engagement proof is server stderr, not /props.
- Gemma4 MTP results, three clearly-labeled regimes (lossless A/B):
  4.11x degenerate ceiling / 3.02x representative greedy (headline) /
  1.66x sampling deployment ref.

build-pascal §7 packaging:
- version the tarball filename; never overwrite a live pull source.
- v1/v2 size+sha table.
- reconcile the stale "router not in this tarball" section to v2 reality:
  member-delta (+11), single-.so impls, lib64 prune, extraction-validate.
- note that bin/llama-server / bin/llama-cli are separate targets, not in
  llama-app's dep closure (reproducing v2 needs them in --target).

Also folds in a one-line build-target fix (line 80: add llama-server +
llama-cli to --target) that landed in the shared tree from the
fork-manager session concurrently — verified correct, kept so the recipe
reproduces v2.
…gpu-only, vision+MTP

Closes 4 of 10 issue #100 bullets on Pascal P5200 (v2 server-capable build):

- bullet 1 (llama-server): standalone /opt/ht-llama-cuda/bin/llama-server
  → ready in 4s, /health 200, /completion 40 tok @ 25.84 t/s
- bullet 2 (llama-server router): unified `bin/llama serve` shim
  → ready in 4s, /health 200, /completion 40 tok @ 25.82 t/s
- bullet 3 (gpu only works): both runs above use -ngl 99 -fa on
- bullet 6 (gemma4 12b qat mtp all modalities): combined mmproj +
  draft-mtp + spec engine
  → A. coexistence: /v1/chat with image_url + --spec-type draft-mtp
    engaged → predicted=96, stderr draft acceptance = 0.66102 (78/118)
  → B. grounding (decoupled to mtmd-cli, avoids Gemma4 chat-template
    empty-content quirk): all 3 ground-truth features matched (PASCAL,
    P5200, red rectangle); requires --jinja (otherwise std::runtime_error
    custom-template-not-supported abort).

Methodology:
- regression band ±3% pinned vs committed baseline 25.54 tg / 76.05 MTP;
  both server-router runs in band (24.77-26.31 t/s window).
- engagement read on stderr (draft acceptance / draft-mtp stats), NOT
  /props (per --spec-type footgun memory).
- chat-content quirk explicitly noted in JSON so empty content does not
  read as fail or regression.

Bullets 4 (gpu+cpu offload) + 7-10 (qwen 27B/35B-MoE / gemma 26B/31B)
land in subsequent commits once the lithium IQ3-class + titan 31B IQ4_XS
transfers complete on crystal.
…ss-harness note)

Regression bench (task #15): re-ran the gemma4-12B-QAT bench from regime-2 on
the v2 server-capable build to lock the no-regression gate against the
committed baseline (25.54 t/s) and MTP reference (76.05 t/s).
- Baseline /v1/chat greedy: mean 24.96 t/s across 3 reps (-2.31% vs 25.54),
  in band.
- MTP /v1/chat greedy:  mean 75.07 t/s across 3 reps (-1.29% vs 76.05),
  in band.
- Draft acceptance: 0.76271 — bit-identical to committed regime-2
  (225/295 accepted/generated). Strong determinism proof.

Fold-in nits on the 2 already-committed JSONs (no dedicated fix commit per
crystal-assist's review):
- Memory slug citations switched from hyphen-form to underscore-form to
  match the actual slug names (feedback_spec_type_footgun,
  reference_mtmd_cli_jinja_required) — resolves to exact-match in tooling.
- bench-pascal-server-router-smoke.json: added cross_harness_note clarifying
  the +1.2% server-endpoint vs llama-bench tg128 agreement STRENGTHENS
  the no-regression claim (different harnesses, in band).
…t fold-ins

Three model bench JSONs from the v2 server-capable Pascal build:

bench-pascal-qwen3.6-27b-iq3-xxs.json (bullet 8): Qwen3.6-27B at
UD-IQ3_XXS (11.17 GiB), -ngl 99 -fa on -c 4096 → mode=full-gpu,
65/0/65 layers, gpu_residency_pct=95.45%. /completion 11.27 t/s,
/v1/chat 10.44 t/s, gpu free 4 GiB after load. Content reply:
"The capital of France is Paris." (qwen3.6 thinking mode active).

bench-pascal-gemma4-31b-iq4-xs-offload-{ngl40,ngl48}.json (bullets 4 + 10):
ngl=40 phase-1 → ngl=48 phase-2-verify accelerator (crystal-assist's
recipe): per_layer_combined = (gpu.model + gpu.context) / layers_gpu
at ngl=40 = 309 MiB; ngl_max = 40 + floor((2954 - 400) / 309) = 48.
Phase-2 verify PASS at -ngl=48: 49/62 layers GPU, 13/62 layers CPU,
4.95 t/s /completion, 4.98 t/s /v1/chat. Dense-layer partial-offload,
host_model=3967 MiB, host_context=768 MiB. Card 96% utilized.
gemma-4-31B-IQ4_XS is the smallest 31B quant available anywhere on
titan or lithium (sweep done by crystal-assist) — confirms 31B = the
documented offload-demo model, closes bullets 4 AND 10 in one bench.

Regression rerun JSON: minor wording fix — the bit-identical content
sha 01ba4719c80b6fe9 is sha256(b"\n") (single newline), not empty
string or null. Banks the harness blind-spot that hashing `jq -r .content`
output cannot distinguish JSON-null vs "" vs literal "null" vs "\n".
A==B determinism conclusion stands (per crystal-assist review).
…emma4 26B MoE

Closes the last two model bullets:

bench-pascal-qwen3.6-35b-a3b-iq3-xxs.json (bullet 7): Qwen3.6-35B-A3B
(MoE, 3B active) at UD-IQ3_XXS (12.30 GiB), -ngl 99 -fa on -c 4096 →
mode=full-gpu, 41/0/41 layers, gpu_residency_pct=96.85%. /completion
44.45 t/s, /v1/chat 40.96 t/s — fastest of any tested model (3B active
keeps per-token compute light). Content reply: "The capital of France
is Paris." VRAM 13003/16384 MiB after load (3 GiB headroom).

bench-pascal-gemma4-26b-a4b-iq4-xs.json (bullet 9): Gemma4-26B-A4B
(MoE, 128 experts / 8 active per token) at UD-IQ4_XS (12.66 GiB),
-ngl 99 -fa on -c 4096 → 31/0/31 layers on GPU, /completion 42.10 t/s,
/v1/chat 42.36 t/s. Content reply: "### Answer: The capital of France
is Paris." VRAM 14345/16384 MiB after load.

Classifier note (banked in JSON): the 26B host_model=748 MiB tripped
the harness's 600-MiB expert-MoE threshold. Inspection of the gemma4
config (vocab=262144, hidden=5120, IQ4_XS bytes/weight) confirms 748
is the embedding tensor + boundary buffers (≈671 MiB pure embedding),
NOT expert offload — all 128 experts are in gpu.model_mib=12952.
PRIMARY layer-count signal (31/0/31) correctly reads full-GPU. The
600 MiB threshold was calibrated to 12B embeddings (~540 MiB) and
under-scales for larger vocab×hidden_dim products. Mode patched to
full-gpu with classifier_note explaining the misfire + suggested
remediation (host_model_pct_of_total < 10-15% = embedding-pattern;
≥ that = real expert offload).

All 10 issue #100 bullets now have committed bench evidence.
@marksverdhei marksverdhei marked this pull request as ready for review June 9, 2026 15:34
@marksverdhei marksverdhei merged commit 626a757 into ht Jun 9, 2026
@marksverdhei marksverdhei deleted the pascal/p5200-build branch June 9, 2026 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant