fix(distill): Phase 3 gx10 dispatch — staging + path layout (PMAT-698d)#1800
Merged
Three real defects surfaced by live #1799 dispatch attempt to gx10:

1. /mnt/nvme-raid0/runs/ is lambda-vector layout: it doesn't exist on gx10 (916GB root, no /mnt). Switch default to $HOME/runs, env-overridable via GX10_RUN_PREFIX.
2. The HF cache lookup (hf_repo_to_dir from #1799) targeted the wrong layout. `apr pull` uses pacha, which caches as:
     ~/.cache/pacha/models/<sha>.safetensors
     ~/.cache/pacha/models/<sha>.tokenizer.json
     ...
   not HF hub's snapshots/<sha>/ directory structure.
3. apr distill --backend cuda calls CudaTrainerTeacher::for_inference which expects a directory containing model.apr or model.safetensors. The pacha cache is a flat file. Need to symlink-stage into a dir.

Fixes:
- GX10_RUN_PREFIX env var (default $HOME/runs)
- New stage_repo() shell function inside SSH heredoc that:
  - captures `apr pull` Path: from stdout
  - mkdirs RUN_DIR_REMOTE/teacher and /student stage dirs
  - symlinks pacha-cached files as model.<ext>
  - symlinks companion tokenizer.json / config.json / tokenizer_config.json
  - for GGUF teachers, runs apr import --preserve-q4k to convert to APR
- Default TEACHER_REPO changed to Qwen/Qwen2.5-Coder-1.5B-Instruct (SafeTensors, loads directly into CudaTrainerTeacher). The original paiml/qwen2.5-coder-7b-apache-q4k-v1 (GGUF) needs the apr import conversion path, which works but is slow and disk-intensive on gx10 (58GB free). Defer real-MODEL-1 dispatch to PMAT-698e after smoke validates the pipeline.

Test plan:
- [x] DRY_RUN=1 STEPS=50 bash scripts/... exits cleanly
- [x] bashrs lint: 11 errors (pre-existing heredoc/string mis-parses, no regression)
- [ ] STEPS=50 dispatch on gx10 reaches the training loop (verified live via the PR description, not in CI since this is a script)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
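The staging described in this commit can be sketched roughly as follows. This is a minimal illustration, not the script's actual stage_repo(): the helper names parse_pull_path and stage_model_file are hypothetical, and the assumption that `apr pull` prints a single line of the form "Path: <absolute path>" is inferred from the commit message rather than confirmed against the CLI.

```shell
# Run prefix: defaults to $HOME/runs, overridable via GX10_RUN_PREFIX.
RUN_PREFIX="${GX10_RUN_PREFIX:-$HOME/runs}"

# parse_pull_path: read text shaped like `apr pull` stdout on stdin and
# print the first cached-file path found.
# ASSUMPTION: the output has a line of the form "Path: <absolute path>".
parse_pull_path() {
  sed -n 's/^[[:space:]]*Path:[[:space:]]*//p' | head -n 1
}

# stage_model_file <cached_file> <stage_dir>
# Symlink a flat cache file into a directory as model.<ext>, so that a
# loader expecting "a directory containing model.<ext>" can find it.
stage_model_file() {
  src="$1"
  stage_dir="$2"
  ext="${src##*.}"                       # e.g. safetensors or gguf
  mkdir -p "$stage_dir" || return 1
  ln -sfn "$src" "$stage_dir/model.$ext"
}
```

In the real script these steps would run on the remote host inside the SSH heredoc, with the stage directories placed under RUN_DIR_REMOTE/teacher and RUN_DIR_REMOTE/student.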
…98d cont.)
Second live-dispatch attempt revealed another defect: apr distill --backend
cuda reads teacher_path AS A FILE via std::fs::read + AprV2Reader::from_bytes,
THEN uses its parent directory for for_inference. Two expectations are tied
together:
- the file at teacher_path must be a valid APR v2 binary (for metadata)
- the parent directory must contain a loadable checkpoint (model.apr
or model.safetensors) for CudaTransformerTrainer
The previous staging symlinked .safetensors at stage_dir/model.safetensors
which satisfied for_inference but failed the AprV2Reader read step
(symlink target had wrong magic bytes).
Fix: always run `apr import` (with --preserve-q4k for GGUF) to produce a
real APR v2 file at stage_dir/model.apr. The teacher_path passed to apr
distill is that .apr file. Its parent dir is the stage dir, which now
satisfies both expectations.
Also rename the dispatch output from student.apr to student-trained.apr
to disambiguate from the staged input checkpoints.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
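The two tied expectations above could be guarded before dispatch with a pre-flight check along these lines. This is a sketch only: check_teacher_layout is a hypothetical helper, and it tests file names and locations, not the APR v2 magic bytes, so it would not catch a mislabeled file on its own.

```shell
# check_teacher_layout <teacher_path>
# Returns 0 when teacher_path is an existing .apr file AND its parent
# directory holds a loadable checkpoint name (model.apr or model.safetensors).
# NOTE: this checks names only; it does not validate the APR v2 contents.
check_teacher_layout() {
  teacher_path="$1"
  stage_dir="$(dirname "$teacher_path")"

  if [ ! -f "$teacher_path" ]; then
    echo "missing teacher file: $teacher_path" >&2
    return 1
  fi

  case "$teacher_path" in
    *.apr) ;;
    *)
      echo "teacher_path must point at a .apr file: $teacher_path" >&2
      return 1
      ;;
  esac

  if [ -f "$stage_dir/model.apr" ] || [ -f "$stage_dir/model.safetensors" ]; then
    return 0
  fi

  echo "no model.apr or model.safetensors in $stage_dir" >&2
  return 1
}
```

With the fix in this commit, passing stage_dir/model.apr satisfies both branches at once, because the file itself is the checkpoint found in the parent directory.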
…mport (PMAT-698d cont.)

Fourth defect surfaced live: apr import requires config.json next to the source file (default search path), but pacha caches them sha-prefixed at:
  ~/.cache/pacha/models/<sha>.config.json
  ~/.cache/pacha/models/<sha>.tokenizer.json

The previous staging copied them to stage_dir/ AFTER running apr import, so import couldn't find them and failed:
  error: Validation failed: Invalid model format: config.json not found at /home/noah/.cache/pacha/models/config.json

Fix: stage all companion files into stage_dir BEFORE apr import, and also symlink the source file itself (.safetensors or .gguf) into stage_dir as source.<ext>. apr import then finds everything in the same directory. Result is still stage_dir/model.apr; that's what apr distill consumes.

Reordering only; no semantic change to the directory layout consumed by apr distill.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
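The reordered staging can be sketched as below. The helper name stage_import_inputs is hypothetical, and the sketch assumes every companion file is cached as <sha>.<name> next to the model file, which the commit message shows for config.json and tokenizer.json but not explicitly for tokenizer_config.json. The apr import invocation itself is left out because its exact arguments are not given in the source.

```shell
# stage_import_inputs <cached_file> <stage_dir>
# Symlink the sha-prefixed companions and the source file into stage_dir
# so that a later import step finds config.json beside the source file.
# ASSUMPTION: companions live at <cache_dir>/<sha>.<name>.
stage_import_inputs() {
  src="$1"
  stage_dir="$2"
  cache_dir="$(dirname "$src")"
  base="$(basename "$src")"
  sha="${base%.*}"                       # strip the final extension
  ext="${base##*.}"                      # safetensors or gguf

  mkdir -p "$stage_dir" || return 1

  # Companions first: they must be in place before the import runs.
  for name in config.json tokenizer.json tokenizer_config.json; do
    if [ -f "$cache_dir/$sha.$name" ]; then
      ln -sfn "$cache_dir/$sha.$name" "$stage_dir/$name"
    fi
  done

  # Then the source file itself, under a stable name.
  ln -sfn "$src" "$stage_dir/source.$ext"
}
```

After this step the import would be run against stage_dir/source.<ext> to produce stage_dir/model.apr; companions missing from the cache are skipped rather than treated as errors here, which is a design choice of the sketch.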
… (PMAT-698d cont.)

Fifth defect surfaced live: 1.5B Qwen teacher loaded fine for inference but hit CUDA_ERROR_OUT_OF_MEMORY at "Block 0 upload" during the for_inference GPU upload path. Blackwell's unified 128GB pool reports correctly but the training-time peak (weights + gradients + Adam optimizer state + per-block activations + workspace) overflows the actual VRAM budget for >1B models.

For a Phase 3 SMOKE (whose contract is just "val_loss decreases over N steps"), teacher and student don't have to be different. Using the 0.5B Qwen for both exercises every KD-loop branch (forward, kd_step, gradient, optimizer) at minimal memory.

This lets us validate the engineering tower (PMAT-693 through PMAT-697 + 698b + 698d) end-to-end before scoping the GB10 memory budget for the real 7B-Q4K MODEL-1 teacher (deferred to PMAT-698e).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Three real defects surfaced when live-dispatching the #1799-fixed script to gx10:
1. `/mnt/nvme-raid0/runs/` is lambda-vector layout: it doesn't exist on gx10 (916GB root, no /mnt). → New `GX10_RUN_PREFIX` env var (default `$HOME/runs`).
2. `apr pull` uses pacha cache, not HF hub: the `hf_repo_to_dir` from #1799 ("fix(distill): Phase 3 dispatch script CLI-flag alignment (PMAT-698b)") looked for `~/.cache/huggingface/hub/models--<sanitized>/snapshots/<sha>/`, but actual layout is `~/.cache/pacha/models/<sha>.safetensors`.
3. `apr distill --backend cuda` expects a directory: `CudaTrainerTeacher::for_inference` reads `model.apr` or `model.safetensors` at the dir root. Pacha caches loose files. Need to symlink-stage.

Fixes
- `stage_repo()` shell helper inside SSH heredoc that:
  - captures `apr pull` `Path:` from stdout
  - mkdirs `RUN_DIR_REMOTE/teacher` and `/student` stage dirs
  - symlinks pacha-cached files as `model.<ext>`
  - symlinks companion `tokenizer.json` / `config.json` / `tokenizer_config.json`
  - for GGUF teachers, runs `apr import --preserve-q4k --arch qwen2` to convert to APR
- `GX10_RUN_PREFIX` env var (default `$HOME/runs`)
- `TEACHER_REPO` → `Qwen/Qwen2.5-Coder-1.5B-Instruct` (SafeTensors, loads directly). Original 7B-q4k GGUF teacher needs the convert path which is slow/disk-heavy; deferred to PMAT-698e after the smoke validates the pipeline.

Test plan
- `DRY_RUN=1 STEPS=50 bash scripts/...` exits cleanly
- `bashrs lint` count unchanged (11 pre-existing heredoc/string mis-parses, no new errors)
- `STEPS=50` dispatch on gx10 reaches the training loop (in flight as background task `bv3zayk9b`; result lands in evidence/distill-phase-3-sanity-50-v2/)

Evolution from #1799
PR #1799 fixed the FLAG NAMES (--num-steps → --epochs, etc.) and added a (broken) HF cache lookup. The flag part was correct. The cache lookup wasn't tested live and didn't survive contact with the real `apr pull` pacha output. This PR replaces it with `apr pull`-output-parsing + symlink staging, which works against the actual cache layout.

🤖 Generated with Claude Code