fix(distill): Phase 3 gx10 dispatch — staging + path layout (PMAT-698d)#1800

Merged: noahgift merged 4 commits into main from fix/distill-phase-3-gx10-staging-pmat-698d on May 18, 2026

@noahgift

Summary

Three real defects surfaced when live-dispatching the #1799-fixed script to gx10:

  1. /mnt/nvme-raid0/runs/ is lambda-vector layout — doesn't exist on gx10 (916GB root, no /mnt). → New GX10_RUN_PREFIX env var (default $HOME/runs).

  2. apr pull uses the pacha cache, not the HF hub. The hf_repo_to_dir helper from #1799 looked for ~/.cache/huggingface/hub/models--<sanitized>/snapshots/<sha>/, but the actual layout is ~/.cache/pacha/models/<sha>.safetensors.

  3. apr distill --backend cuda expects a directory: CudaTrainerTeacher::for_inference reads model.apr or model.safetensors at the dir root. Pacha caches loose files, so they need to be symlink-staged into a directory.
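
The run-prefix change in item 1 can be sketched as a small helper. This is only an illustration: `GX10_RUN_PREFIX` and its `$HOME/runs` default come from this PR, while the function name `resolve_run_dir` and the run-id argument are hypothetical and may not match the real script.

```shell
#!/usr/bin/env bash

# Build the remote run directory from an overridable prefix.
# GX10_RUN_PREFIX and the $HOME/runs default are from the PR;
# resolve_run_dir and its run_id argument are illustrative names.
resolve_run_dir() {
  local run_id="$1"
  local prefix="${GX10_RUN_PREFIX:-$HOME/runs}"
  # Strip one trailing slash so the joined path has no "//".
  printf '%s/%s\n' "${prefix%/}" "$run_id"
}

# Prints the directory that would be used for a run with this id.
resolve_run_dir "sanity-50"
```

On a host that still uses the old layout, exporting `GX10_RUN_PREFIX=/mnt/nvme-raid0/runs` before the call would keep the previous location.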

Fixes

  • New stage_repo() shell helper inside SSH heredoc that:
    • Captures apr pull Path: from stdout
    • mkdirs RUN_DIR_REMOTE/teacher and /student stage dirs
    • Symlinks pacha-cached file as model.<ext>
    • Symlinks companion tokenizer.json / config.json / tokenizer_config.json
    • For GGUF teachers, runs apr import --preserve-q4k --arch qwen2 to convert to APR
  • GX10_RUN_PREFIX env var (default $HOME/runs)
  • Default TEACHER_REPO → Qwen/Qwen2.5-Coder-1.5B-Instruct (SafeTensors, loads directly). The original 7B-q4k GGUF teacher needs the convert path, which is slow and disk-heavy; deferred to PMAT-698e after the smoke validates the pipeline.
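
The first two steps of stage_repo() (capturing the `Path:` line and creating the stage directories) might look roughly like the sketch below. It assumes the `apr pull` output contains a line of the form `Path: <file>`; the helper names `parse_pull_path` and `make_stage_dirs` and the sample output are made up for illustration and are not taken from the script.

```shell
#!/usr/bin/env bash

# Pull the cached path out of `apr pull` output read from stdin.
# ASSUMPTION: one line looks like "Path: <absolute path>"; leading
# whitespace is tolerated and only the first match is used.
parse_pull_path() {
  sed -n 's/^[[:space:]]*Path:[[:space:]]*//p' | head -n 1
}

# Create the per-role stage directories under the run directory.
make_stage_dirs() {
  local run_dir="$1"
  mkdir -p "$run_dir/teacher" "$run_dir/student"
}

# Demo with a made-up stand-in for the real command output.
sample_output='Path: /home/user/.cache/pacha/models/0123abcd.safetensors'
cached_file="$(printf '%s\n' "$sample_output" | parse_pull_path)"
echo "cached file: $cached_file"

# A temporary directory stands in for RUN_DIR_REMOTE in this demo.
RUN_DIR_REMOTE="$(mktemp -d)/run"
make_stage_dirs "$RUN_DIR_REMOTE"
ls "$RUN_DIR_REMOTE"
```

Parsing a labelled line rather than guessing the cache location keeps the script independent of how pacha names its files, which is the failure mode the #1799 lookup ran into.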

Test plan

  • DRY_RUN=1 STEPS=50 bash scripts/... exits cleanly
  • bashrs lint count unchanged (11 pre-existing heredoc/string mis-parses, no new errors)
  • Live STEPS=50 dispatch on gx10 reaches the training loop (in flight as background task bv3zayk9b — result lands in evidence/distill-phase-3-sanity-50-v2/)

Evolution from #1799

PR #1799 fixed the FLAG NAMES (--num-steps → --epochs, etc.) and added a (broken) HF cache lookup. The flag part was correct. The cache lookup wasn't tested live and didn't survive contact with the real apr pull pacha output. This PR replaces it with apr pull-output-parsing + symlink staging, which works against the actual cache layout.

🤖 Generated with Claude Code

Three real defects surfaced by live #1799 dispatch attempt to gx10:

1. /mnt/nvme-raid0/runs/ is lambda-vector layout — doesn't exist on gx10
   (916GB root, no /mnt). Switch default to $HOME/runs, env-overridable
   via GX10_RUN_PREFIX.

2. The HF cache lookup (hf_repo_to_dir from #1799) targeted the wrong
   layout. `apr pull` uses pacha, which caches as:
     ~/.cache/pacha/models/<sha>.safetensors
     ~/.cache/pacha/models/<sha>.tokenizer.json
     ...
   not HF hub's snapshots/<sha>/ directory structure.

3. apr distill --backend cuda calls CudaTrainerTeacher::for_inference
   which expects a directory containing model.apr or model.safetensors.
   The pacha cache is a flat file. Need to symlink-stage into a dir.

Fixes:
- GX10_RUN_PREFIX env var (default $HOME/runs)
- New stage_repo() shell function inside SSH heredoc that:
  - captures `apr pull` Path: from stdout
  - mkdirs RUN_DIR_REMOTE/teacher and /student stage dirs
  - symlinks pacha-cached files as model.<ext>
  - symlinks companion tokenizer.json / config.json / tokenizer_config.json
  - for GGUF teachers, runs apr import --preserve-q4k to convert to APR
- Default TEACHER_REPO changed to Qwen/Qwen2.5-Coder-1.5B-Instruct
  (SafeTensors, loads directly into CudaTrainerTeacher). The original
  paiml/qwen2.5-coder-7b-apache-q4k-v1 (GGUF) needs the apr import
  conversion path, which works but is slow and disk-intensive on gx10
  (58GB free). Defer real-MODEL-1 dispatch to PMAT-698e after smoke
  validates the pipeline.

Test plan:
- [x] DRY_RUN=1 STEPS=50 bash scripts/...  exits cleanly
- [x] bashrs lint: 11 errors (pre-existing heredoc/string mis-parses, no
      regression)
- [ ] STEPS=50 dispatch on gx10 reaches the training loop (verified
      live via the PR description, not in CI since this is a script)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 18, 2026 20:31
noahgift and others added 3 commits May 18, 2026 22:33
…98d cont.)

Second live-dispatch attempt revealed another defect: apr distill --backend
cuda reads teacher_path AS A FILE via std::fs::read + AprV2Reader::from_bytes,
THEN uses its parent directory for for_inference. Two expectations are tied
together:

  - the file at teacher_path must be a valid APR v2 binary (for metadata)
  - the parent directory must contain a loadable checkpoint (model.apr
    or model.safetensors) for CudaTransformerTrainer

The previous staging symlinked .safetensors at stage_dir/model.safetensors
which satisfied for_inference but failed the AprV2Reader read step
(symlink target had wrong magic bytes).

Fix: always run `apr import` (with --preserve-q4k for GGUF) to produce a
real APR v2 file at stage_dir/model.apr. The teacher_path passed to apr
distill is that .apr file. Its parent dir is the stage dir, which now
satisfies both expectations.
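
A pre-flight check for the two expectations could be sketched as follows. It is an approximation under stated assumptions: the function name `check_teacher_layout` is hypothetical, and the `.apr` extension is used only as a proxy for "valid APR v2 binary", since this sketch does not inspect the file header.

```shell
#!/usr/bin/env bash

# Check the layout that `apr distill --backend cuda` is described as needing.
# ASSUMPTION: a ".apr" suffix stands in for "is an APR v2 file"; the
# actual header bytes are not validated here.
check_teacher_layout() {
  local teacher_path="$1"
  local parent
  parent="$(dirname "$teacher_path")"

  # Expectation 1: teacher_path is a readable .apr file.
  case "$teacher_path" in
    *.apr) ;;
    *) echo "not an .apr file: $teacher_path" >&2; return 1 ;;
  esac
  if [ ! -f "$teacher_path" ] || [ ! -r "$teacher_path" ]; then
    echo "missing or unreadable: $teacher_path" >&2
    return 1
  fi

  # Expectation 2: the parent directory holds a loadable checkpoint.
  if [ ! -e "$parent/model.apr" ] && [ ! -e "$parent/model.safetensors" ]; then
    echo "no model.apr or model.safetensors in $parent" >&2
    return 1
  fi
}
```

Calling `check_teacher_layout "$stage_dir/model.apr"` before dispatch would reject the earlier staging, where the path handed to apr distill was the symlinked `model.safetensors`.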

Also rename the dispatch output from student.apr to student-trained.apr
to disambiguate from the staged input checkpoints.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…mport (PMAT-698d cont.)

Fourth defect surfaced live: apr import requires config.json next to the
source file (default search path), but pacha caches them sha-prefixed at:

  ~/.cache/pacha/models/<sha>.config.json
  ~/.cache/pacha/models/<sha>.tokenizer.json

The previous staging copied them to stage_dir/ AFTER running apr import,
so import couldn't find them and failed:

  error: Validation failed: Invalid model format: config.json not found
  at /home/noah/.cache/pacha/models/config.json

Fix: stage all companion files into stage_dir BEFORE apr import, and
also symlink the source file itself (.safetensors or .gguf) into stage_dir
as source.<ext>. apr import then finds everything in the same directory.
Result is still stage_dir/model.apr — that's what apr distill consumes.

Reordering only; no semantic change to the directory layout consumed
by apr distill.
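
The reordered staging step could be sketched like this. The sha-prefixed names for config.json and tokenizer.json are from the commit message; the same naming for tokenizer_config.json is an assumption, and `stage_inputs` is a hypothetical helper name. The conversion call itself is left as a comment because its exact arguments are not shown in this PR.

```shell
#!/usr/bin/env bash

# Stage pacha's sha-prefixed files into one directory under plain names,
# so a later import step finds them next to the source file.
# ASSUMPTION: companions live beside the source as <sha>.<name>.
stage_inputs() {
  local src="$1" stage_dir="$2"
  local cache_dir ext sha name
  cache_dir="$(dirname "$src")"
  ext="${src##*.}"                    # e.g. safetensors or gguf
  sha="$(basename "$src" ".$ext")"    # file name without the extension

  mkdir -p "$stage_dir"
  for name in config.json tokenizer.json tokenizer_config.json; do
    # Link only the companions that are actually present in the cache.
    if [ -e "$cache_dir/$sha.$name" ]; then
      ln -sfn "$cache_dir/$sha.$name" "$stage_dir/$name"
    fi
  done

  # Link the weights themselves as source.<ext>.
  ln -sfn "$src" "$stage_dir/source.$ext"
}

# Only after stage_inputs returns would the script run the apr import
# step that writes "$stage_dir/model.apr".
```

Using symlinks rather than copies avoids duplicating multi-gigabyte weight files on a host with limited free disk.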

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… (PMAT-698d cont.)

Fifth defect surfaced live: 1.5B Qwen teacher loaded fine for inference but
hit CUDA_ERROR_OUT_OF_MEMORY at "Block 0 upload" during the for_inference
GPU upload path. Blackwell's unified 128GB pool reports correctly but the
training-time peak (weights + gradients + Adam optimizer state + per-block
activations + workspace) overflows the actual VRAM budget for >1B models.

For a Phase 3 SMOKE (whose contract is just "val_loss decreases over N
steps"), teacher and student don't have to be different. Using the 0.5B
Qwen for both exercises every KD-loop branch (forward, kd_step, gradient,
optimizer) at minimal memory.

This lets us validate the engineering tower (PMAT-693 through PMAT-697 +
698b + 698d) end-to-end before scoping the GB10 memory budget for the
real 7B-Q4K MODEL-1 teacher (deferred to PMAT-698e).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift merged commit 00a9bca into main May 18, 2026
10 checks passed
noahgift deleted the fix/distill-phase-3-gx10-staging-pmat-698d branch May 18, 2026 21:03