fix(distill): Phase 3 gx10 dispatch — staging + path layout (PMAT-698d) by noahgift · Pull Request #1800 · paiml/aprender

noahgift · 2026-05-18T20:31:08Z

Summary

Three real defects surfaced when live-dispatching the #1799-fixed script to gx10:

1. /mnt/nvme-raid0/runs/ is the lambda-vector layout and doesn't exist on gx10 (916GB root, no /mnt). → New GX10_RUN_PREFIX env var (default $HOME/runs).
2. apr pull uses the pacha cache, not the HF hub. The hf_repo_to_dir helper from #1799 looked for ~/.cache/huggingface/hub/models--<sanitized>/snapshots/<sha>/, but the actual layout is ~/.cache/pacha/models/<sha>.safetensors.
3. apr distill --backend cuda expects a directory: CudaTrainerTeacher::for_inference reads model.apr or model.safetensors at the dir root. Pacha caches loose files, so they need to be symlink-staged into a directory.
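The run-prefix fallback in the first item can be sketched with ordinary shell parameter expansion. This is a minimal illustration, not the script's actual code: the function name `resolve_run_dir` and its run-name argument are hypothetical, and only `GX10_RUN_PREFIX` and the `$HOME/runs` default come from this PR.

```shell
# Sketch only: resolve_run_dir is a hypothetical helper name.
# GX10_RUN_PREFIX and the $HOME/runs default are taken from the PR description.
resolve_run_dir() {
  local run_name="$1"
  # Use the override when it is set and non-empty, otherwise fall back to $HOME/runs.
  local prefix="${GX10_RUN_PREFIX:-$HOME/runs}"
  # Strip one trailing slash so the joined path has no double slash.
  printf '%s/%s\n' "${prefix%/}" "$run_name"
}
```

With the variable unset, the result lands under `$HOME/runs`; exporting `GX10_RUN_PREFIX=/mnt/nvme-raid0/runs` would restore the lambda-vector layout on hosts that have it.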

Fixes

- New stage_repo() shell helper inside the SSH heredoc that:
  - Captures the apr pull Path: line from stdout
  - mkdirs the RUN_DIR_REMOTE/teacher and RUN_DIR_REMOTE/student stage dirs
  - Symlinks the pacha-cached file as model.<ext>
  - Symlinks the companion tokenizer.json / config.json / tokenizer_config.json
  - For GGUF teachers, runs apr import --preserve-q4k --arch qwen2 to convert to APR
- GX10_RUN_PREFIX env var (default $HOME/runs)
- Default TEACHER_REPO → Qwen/Qwen2.5-Coder-1.5B-Instruct (SafeTensors, loads directly). The original 7B-q4k GGUF teacher needs the convert path, which is slow and disk-heavy; it is deferred to PMAT-698e, after the smoke run validates the pipeline.
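The symlink step in the list above can be sketched as follows. This is an assumption-laden illustration rather than the PR's stage_repo(): the helper name `stage_model_link` is hypothetical, and it assumes the cached file is named `<sha>.<ext>` as described for the pacha cache. It covers only the model-file link; the follow-up commits in this PR add an apr import step on top of it.

```shell
# Sketch only: stage_model_link is a hypothetical helper, not the PR's stage_repo().
# Assumes the cached file is named <sha>.<ext>, per the pacha layout described above.
stage_model_link() {
  local cached_file="$1" stage_dir="$2"
  # Everything after the last dot, e.g. safetensors or gguf.
  local ext="${cached_file##*.}"
  mkdir -p "$stage_dir"
  # -s: symbolic link; -f: replace a stale link from an earlier run;
  # -n: treat an existing link at the destination as a file, not a directory to enter.
  ln -sfn "$cached_file" "$stage_dir/model.$ext"
  printf '%s\n' "$stage_dir/model.$ext"
}
```

Linking instead of copying keeps the stage directory cheap on disk, which matters given the limited free space on gx10 mentioned in the commit message.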

Test plan

- DRY_RUN=1 STEPS=50 bash scripts/... exits cleanly
- bashrs lint count unchanged (11 pre-existing heredoc/string mis-parses, no new errors)
- Live STEPS=50 dispatch on gx10 reaches the training loop (in flight as background task bv3zayk9b; result lands in evidence/distill-phase-3-sanity-50-v2/)
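For readers unfamiliar with the DRY_RUN convention used in the first check, one common way to implement it is a small wrapper that prints commands instead of running them. The sketch below is an assumption about how such a guard might look; `run_cmd` is a hypothetical name and the dispatch script may implement dry-run differently.

```shell
# Sketch only: run_cmd is a hypothetical wrapper; only DRY_RUN=1 comes from the test plan.
run_cmd() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command line instead of executing it.
    printf 'DRY_RUN: %s\n' "$*"
  else
    # Execute the command exactly as passed, preserving argument boundaries.
    "$@"
  fi
}
```

Routing every side-effecting call (ssh, apr pull, apr distill) through a wrapper like this lets a dry run exercise argument construction without touching the remote host.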

Evolution from #1799

PR #1799 fixed the FLAG NAMES (--num-steps → --epochs, etc.) and added a (broken) HF cache lookup. The flag part was correct. The cache lookup wasn't tested live and didn't survive contact with the real pacha-backed apr pull output. This PR replaces it with parsing of the apr pull output plus symlink staging, which works against the actual cache layout.
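The output-parsing half of that replacement can be sketched as below. The exact shape of the apr pull output is not shown in this PR, so the `Path: <file>` line format is an assumption, and `extract_pull_path` is a hypothetical name.

```shell
# Sketch only: extract_pull_path is a hypothetical name, and the
# "Path: <file>" line format is an assumption about `apr pull` output.
extract_pull_path() {
  local path
  # Read captured output on stdin, keep the value of the last "Path:" line.
  path="$(sed -n 's/^[[:space:]]*Path:[[:space:]]*//p' | tail -n 1)"
  # Fail when no such line was found, so the caller can abort early.
  [ -n "$path" ] || return 1
  printf '%s\n' "$path"
}
```

Returning a non-zero status on a missing line means a changed output format fails at the parsing step instead of surfacing later as an empty path passed to the staging code.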

🤖 Generated with Claude Code

Three real defects surfaced by live #1799 dispatch attempt to gx10:

1. /mnt/nvme-raid0/runs/ is lambda-vector layout; it doesn't exist on gx10 (916GB root, no /mnt). Switch default to $HOME/runs, env-overridable via GX10_RUN_PREFIX.

2. The HF cache lookup (hf_repo_to_dir from #1799) targeted the wrong layout. `apr pull` uses pacha, which caches as:
     ~/.cache/pacha/models/<sha>.safetensors
     ~/.cache/pacha/models/<sha>.tokenizer.json
     ...
   not HF hub's snapshots/<sha>/ directory structure.

3. apr distill --backend cuda calls CudaTrainerTeacher::for_inference which expects a directory containing model.apr or model.safetensors. The pacha cache is a flat file. Need to symlink-stage into a dir.

Fixes:
- GX10_RUN_PREFIX env var (default $HOME/runs)
- New stage_repo() shell function inside SSH heredoc that:
  - captures `apr pull` Path: from stdout
  - mkdirs RUN_DIR_REMOTE/teacher and /student stage dirs
  - symlinks pacha-cached files as model.<ext>
  - symlinks companion tokenizer.json / config.json / tokenizer_config.json
  - for GGUF teachers, runs apr import --preserve-q4k to convert to APR
- Default TEACHER_REPO changed to Qwen/Qwen2.5-Coder-1.5B-Instruct (SafeTensors, loads directly into CudaTrainerTeacher). The original paiml/qwen2.5-coder-7b-apache-q4k-v1 (GGUF) needs the apr import conversion path, which works but is slow and disk-intensive on gx10 (58GB free). Defer real-MODEL-1 dispatch to PMAT-698e after smoke validates the pipeline.

Test plan:
- [x] DRY_RUN=1 STEPS=50 bash scripts/... exits cleanly
- [x] bashrs lint: 11 errors (pre-existing heredoc/string mis-parses, no regression)
- [ ] STEPS=50 dispatch on gx10 reaches the training loop (verified live via the PR description, not in CI since this is a script)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…98d cont.)

Second live-dispatch attempt revealed another defect: apr distill --backend cuda reads teacher_path AS A FILE via std::fs::read + AprV2Reader::from_bytes, THEN uses its parent directory for for_inference. Two expectations are tied together:
- the file at teacher_path must be a valid APR v2 binary (for metadata)
- the parent directory must contain a loadable checkpoint (model.apr or model.safetensors) for CudaTransformerTrainer

The previous staging symlinked .safetensors at stage_dir/model.safetensors which satisfied for_inference but failed the AprV2Reader read step (symlink target had wrong magic bytes).

Fix: always run `apr import` (with --preserve-q4k for GGUF) to produce a real APR v2 file at stage_dir/model.apr. The teacher_path passed to apr distill is that .apr file. Its parent dir is the stage dir, which now satisfies both expectations.

Also rename the dispatch output from student.apr to student-trained.apr to disambiguate from the staged input checkpoints.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
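The two tied expectations in this commit message can be turned into a cheap preflight before dispatching. The sketch below is hypothetical (`check_teacher_layout` is not part of apr or of this PR) and checks file presence only; it does not inspect magic bytes or otherwise validate that the file is a real APR v2 binary.

```shell
# Sketch only: check_teacher_layout is a hypothetical preflight, not part of apr.
# It checks file presence only and does not validate APR v2 contents.
check_teacher_layout() {
  local teacher_path="$1"
  local parent
  parent="$(dirname "$teacher_path")"
  # Expectation 1: teacher_path must be a readable regular file (symlinks are followed).
  if [ ! -f "$teacher_path" ] || [ ! -r "$teacher_path" ]; then
    echo "teacher file missing or unreadable: $teacher_path" >&2
    return 1
  fi
  # Expectation 2: the parent directory must hold a loadable checkpoint.
  if [ ! -f "$parent/model.apr" ] && [ ! -f "$parent/model.safetensors" ]; then
    echo "no model.apr or model.safetensors in: $parent" >&2
    return 1
  fi
  return 0
}
```

When teacher_path is itself stage_dir/model.apr, as in the fix described above, both checks are satisfied by the same file.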

…mport (PMAT-698d cont.)

Fourth defect surfaced live: apr import requires config.json next to the source file (default search path), but pacha caches them sha-prefixed at:
  ~/.cache/pacha/models/<sha>.config.json
  ~/.cache/pacha/models/<sha>.tokenizer.json

The previous staging copied them to stage_dir/ AFTER running apr import, so import couldn't find them and failed:
  error: Validation failed: Invalid model format: config.json not found at /home/noah/.cache/pacha/models/config.json

Fix: stage all companion files into stage_dir BEFORE apr import, and also symlink the source file itself (.safetensors or .gguf) into stage_dir as source.<ext>. apr import then finds everything in the same directory. Result is still stage_dir/model.apr; that's what apr distill consumes.

Reordering only; no semantic change to the directory layout consumed by apr distill.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
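The reordering described in this commit message can be sketched as below. It is an illustration under assumptions: `stage_import_inputs` is a hypothetical name, the companion list is taken from the PR description, and the `<sha>.<name>` naming follows the cache paths quoted above. The apr import call itself is left out because its full argument list is not shown in this PR.

```shell
# Sketch only: stage_import_inputs is a hypothetical helper. The companion list
# and the <sha>.<name> cache naming follow the commit message above.
stage_import_inputs() {
  local cached_file="$1" stage_dir="$2"
  local cache_dir sha ext name
  cache_dir="$(dirname "$cached_file")"
  ext="${cached_file##*.}"
  # Strip the directory and the extension to recover the <sha> prefix.
  sha="$(basename "$cached_file" ".$ext")"
  mkdir -p "$stage_dir"
  # Companions first, under their un-prefixed names, so an importer that looks
  # next to the source file finds config.json and the tokenizer files.
  for name in config.json tokenizer.json tokenizer_config.json; do
    if [ -f "$cache_dir/$sha.$name" ]; then
      ln -sfn "$cache_dir/$sha.$name" "$stage_dir/$name"
    fi
  done
  # Then the source weights, linked as source.<ext> in the same directory.
  ln -sfn "$cached_file" "$stage_dir/source.$ext"
  # The real script would run `apr import` at this point to write
  # $stage_dir/model.apr; that call is intentionally omitted here.
}
```

Companions that are absent from the cache are skipped silently in this sketch; a stricter version might fail when config.json is missing, since that is the file the import error above complains about.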

… (PMAT-698d cont.)

Fifth defect surfaced live: 1.5B Qwen teacher loaded fine for inference but hit CUDA_ERROR_OUT_OF_MEMORY at "Block 0 upload" during the for_inference GPU upload path. Blackwell's unified 128GB pool reports correctly but the training-time peak (weights + gradients + Adam optimizer state + per-block activations + workspace) overflows the actual VRAM budget for >1B models.

For a Phase 3 SMOKE (whose contract is just "val_loss decreases over N steps"), teacher and student don't have to be different. Using the 0.5B Qwen for both exercises every KD-loop branch (forward, kd_step, gradient, optimizer) at minimal memory.

This lets us validate the engineering tower (PMAT-693 through PMAT-697 + 698b + 698d) end-to-end before scoping the GB10 memory budget for the real 7B-Q4K MODEL-1 teacher (deferred to PMAT-698e).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 18, 2026 20:31

noahgift and others added 3 commits May 18, 2026 22:33

noahgift merged commit 00a9bca into main May 18, 2026
10 checks passed

noahgift deleted the fix/distill-phase-3-gx10-staging-pmat-698d branch May 18, 2026 21:03

noahgift merged 4 commits into main from fix/distill-phase-3-gx10-staging-pmat-698d
