Releases: paiml/aprender
v0.35.2 — Hiatus close-out drain
Bug fixes + DX (last release for 3 months)
Two-PR drain before the 3-month hiatus: mega-bundle PR #1898 (subsumed
#1880/#1883/#1886/#1891/#1896/#1897, which in turn subsumed #1874/#1877/#1879/#1881)
plus PR #1888 (PMAT-706 smoke mode).
Fixed
- #1874 / PMAT-702: `apr eval` no longer reports fake `pass@1=1.0` on broken models. Eval now distinguishes inference failure from test failure.
- #1877 / PMAT-703: distill teacher logits are vocab-aligned for the Qwen2.5-Coder 7B → 0.5B KD pair (152064 → 151936). New contract `apr-distill-teacher-vocab-alignment-v1.yaml`.
- #1879 / PMAT-704: `apr distill --backend cuda` defaults Q4K teacher to `CudaTrainerTeacher` (cuBLAS), reverting the PMAT-704 Bug B slow-path. New contract `apr-distill-teacher-backend-selection-v1.yaml`.
- #1891: `apr pull qwen2.5-coder-1.5b` and `apr run` short-name resolution + Pacha cache alignment. Adds `qwen2.5-coder-1.5b` alias. Filesystem paths that don't exist preserve the original path in the FileNotFound error (regression: chat tests).
- #1897 (clippy): removed `..Default::default()` from 5 CallbackContext literals in pipeline.rs where all fields are explicitly listed (clippy::needless_update).
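The vocab-alignment fix (#1877) concerns a teacher that emits logits over 152064 entries and a student that expects 151936. The release notes do not say how the alignment is done; the sketch below assumes the simplest option, truncating each teacher logit row to the student's vocabulary size. `align_teacher_logits` is a hypothetical name for illustration only.

```rust
/// Hypothetical helper: trim each teacher logit row to the student's vocab size.
/// Assumes both vocabularies share token ids in 0..student_vocab and that the
/// teacher's trailing entries are unused padding (an assumption, not from the source).
fn align_teacher_logits(
    teacher: &[f32],
    teacher_vocab: usize,
    student_vocab: usize,
) -> Result<Vec<f32>, String> {
    if teacher_vocab == 0 || student_vocab > teacher_vocab {
        return Err(format!("cannot align {teacher_vocab} -> {student_vocab}"));
    }
    if teacher.len() % teacher_vocab != 0 {
        return Err("logit buffer is not a whole number of rows".to_string());
    }
    let rows = teacher.len() / teacher_vocab;
    let mut out = Vec::with_capacity(rows * student_vocab);
    for row in teacher.chunks_exact(teacher_vocab) {
        // Keep only the columns the student also has.
        out.extend_from_slice(&row[..student_vocab]);
    }
    Ok(out)
}

fn main() {
    // Vocab sizes from the release notes: 152064 (7B teacher) -> 151936 (0.5B student).
    let (teacher_vocab, student_vocab) = (152_064, 151_936);
    let teacher = vec![0.0_f32; 2 * teacher_vocab]; // logits for two positions
    let aligned = align_teacher_logits(&teacher, teacher_vocab, student_vocab).unwrap();
    println!("aligned {} logits per position", aligned.len() / 2);
}
```

Returning a `Result` rather than panicking keeps a shape mismatch visible as an error instead of silently training against misaligned logits.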
Added
- #1881 / PMAT-705: `ProgressCallback` wired into distill `Pipeline`. Operators can now subscribe to `on_train_begin / on_epoch_begin / on_step_end / on_epoch_end / on_train_end` events. New contract `distill-pipeline-observability-v1.yaml`.
- #1888 / PMAT-706: `APR_DISTILL_MAX_STEPS=N` smoke-validation mode. Training loop early-breaks after N steps and prints a `[SMOKE]` summary with projected full-run wall time. Use case: 60s smoke before committing to a 50K-step run, catching cascade defects (e.g., PMAT-704 Realizar CPU-bound hang) without waiting hours. New contract `apr-distill-smoke-validation-v1.yaml`.
- #1883 / #1886: dispatch wrappers for `apr distill` Stage D and Phase 5 HumanEval, baked with PMAT-701 lessons.
- #1880: SPEC-DISTILL-001 §87 post-mortem on PMAT-704 Bug B wrong turn.
- #1885: `scripts/gx10-disk-cleanup-distill-runs.sh` for the gx10 host.
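The `APR_DISTILL_MAX_STEPS` smoke mode (#1888) can be pictured as a step cap plus an extrapolation of wall time. The sketch below assumes the projection is a plain linear scaling of elapsed time; `parse_max_steps` and `project_full_run_secs` are hypothetical names, and the real `[SMOKE]` summary format is not shown in these notes.

```rust
use std::time::{Duration, Instant};

/// Hypothetical: parse the env var value; `None` means no smoke cap is active.
fn parse_max_steps(raw: Option<&str>) -> Option<u64> {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&n| n > 0)
}

/// Hypothetical: linear projection of full-run wall time from a partial run.
fn project_full_run_secs(elapsed: Duration, steps_run: u64, total_steps: u64) -> Option<f64> {
    if steps_run == 0 {
        return None;
    }
    Some(elapsed.as_secs_f64() * total_steps as f64 / steps_run as f64)
}

fn main() {
    let total_steps: u64 = 50_000;
    let raw = std::env::var("APR_DISTILL_MAX_STEPS").ok();
    let cap = parse_max_steps(raw.as_deref());

    let start = Instant::now();
    let mut steps_run: u64 = 0;
    for step in 0..total_steps {
        if let Some(n) = cap {
            if step >= n {
                break; // early-break once the smoke budget is spent
            }
        }
        // A real training step would run here.
        steps_run += 1;
    }

    if cap.is_some() {
        if let Some(secs) = project_full_run_secs(start.elapsed(), steps_run, total_steps) {
            println!("[SMOKE] ran {steps_run}/{total_steps} steps; projected full run: {secs:.1}s");
        }
    }
}
```

A linear projection ignores warm-up and checkpoint costs, so it is best read as a rough lower bound on the full run.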
Chore
- #1896: `Cargo.lock` synced to `aprender@0.35.1` (subsumed by #1891 in the mega-bundle).
- README: hiatus banner updated to reflect v0.35.2 as the last release.
Versioning
- Root facade `aprender`: `0.35.1 → 0.35.2`
- Sub-crates: stay at `0.35.0`
- `aprender@0.35.2` depends on `apr-cli@0.35.0`: no transitive churn.
- Only the root facade republishes to crates.io.
Verification (post-merge dogfood, 2026-05-23)
- ✓ `apr qa <qwen2.5-coder-7b-instruct-q4_k_m.gguf>` Golden Output PASS (closes #1864 confirmed live)
- ✓ `apr run <1.5B>` produces "4" via wgpu→CPU parity-fallback safety net
- ✓ `apr` v0.35.x stable; ~16 merged this session
v0.35.1 — Root facade feature passthroughs
Summary
Patch release closing a packaging gap surfaced by post-publish dogfood of v0.35.0: cargo install aprender --features cuda failed because the root facade exposed only cli and default features. Per memory/feedback_cuda_feature_footgun, --features cuda is the documented install command for GPU users (20 vs 400 tok/s on RTX 4090).
What changed
Root facade aprender now exposes 11 passthrough features that forward to apr-cli/*:
cuda, cuda-batch, wgpu, inference, training, training-gpu,
visualization, zram, xet, whisper, full
Versioning
- `aprender` root facade: `0.35.0 → 0.35.1`
- All other workspace crates: stay at `0.35.0`
- `aprender@0.35.1` depends on `apr-cli@0.35.0` (no transitive churn)
End-user impact
cargo install aprender # CPU + wgpu (default)
cargo install aprender --features cuda # GPU acceleration (newly enabled)
cargo install aprender --features full # everything
Also in this release
- README hiatus banner: v0.35.x is the last release until 2026-08-22
- README contract count drift fix (1151 → 1153)
- v0.35.0 + v0.35.1 release callouts
🤖 Generated with Claude Code
v0.35.0 — Dogfood-driven release: Qwen story + multi-step parity + #1864 closed
🎉 Dogfood-driven release — Qwen end-to-end story, multi-step parity safety net, #1864 closed (it was a 5-line config gap, not a deep numerical bug)
81 commits since v0.34.0. Major work landed across three threads:
- Distill on NVIDIA GB10 Blackwell: Phase 1-3 of SPEC-DISTILL-001 working end-to-end on sm_121, 62 steps in 82.1s, after the 8-PR PMAT-698 cascade unwound a single one-character bug (`warm!` macro hardcoded `"silu_forward"` for every kernel cache key). Phase 4 ladder running.
- MoE (Qwen3) inference: M32d KV cache (19× speedup), streaming SSE per-token emit, full temperature/top_k/top_p sampling. New contracts `qwen3-moe-streaming-sse-v1` and `qwen3-moe-sampling-v1`. M-GPU-MOE-3 cascade identified that CPU uses Q8K activation quant while CUDA uses f32 → 237,775× Q4_K matvec divergence (still tracked as #1583).
- 2026-05-22 dogfood pass: 8 bugs filed, 7 fixed (#1862 #1865 #1866 README drift + serve syntax + deny advisory). The eighth, #1864 cuBLAS FP8 "gibberish", turned out to be a missing `stop_tokens` in the QA gate, not a numerical bug. The user-visible `apr serve` path was never affected.
Added
- `apr` Qwen end-to-end story in README (#1875): 8-beat narrative (Discover → Trust → Explore → Adapt → Use → Serve → Operate → Scale) anchored on the Qwen scale ladder (0.5B safetensors → 30B-MoE GGUF). Every beat is a falsifier in `contracts/qwen-story-v1.yaml`; runnable as `scripts/qwen-story.sh`; nightly cron in `.github/workflows/qwen-story-daily.yml` with pmat bug-hunt manifest emitted per beat.
- Multi-step wgpu parity gate (#1876): closes the wgpu side of #1864. The pre-existing single-step gate passed at step 0 cosine ≥ 0.99 but missed autoregressive KV-cache drift on 7B Q4K. New `multi_step_parity_gate` equation runs CPU vs wgpu in lockstep for N=3 steps (default; configurable via `APR_WGPU_PARITY_STEPS` ∈ [1,16]). Live-discharged on 7B Q4K Vulkan: cos drops to 0.722 at step 1/3 → CPU fallback returns correct "2 + 2 equals 4." Contract `apr-cpu-vs-gpu-output-parity-v1` → v1.6.0 + FALSIFY-CPU-GPU-006.
- `/dogfood` Gates 13-17 (#1872): five new falsifier gates: G13 worktree HEAD sanity, G14 APR→GGUF export round-trip, G15 `apr validate --quality` consistency vs `apr qa`, G16 `apr run` exit-code on chat-template gibberish, G17 7B inference smoke. Pre-Gate methodology note locks the `OUT=$(cmd); EC=$?` exit-code-capture pattern.
- M32d qwen3-moe KV cache (#1832): 19× speedup for qwen3-moe inference, KV reuse across decode steps.
- qwen3-moe streaming SSE (#1854): per-token emit when `stream=true` on `/v1/chat/completions`. Contract `qwen3-moe-streaming-sse-v1`.
- qwen3-moe sampling (#1842): temperature / top_k / top_p for qwen3-moe (was greedy-only).
- clean-chat-output sanitization contract (#1859): codifies the M287 cascade prefix-stripping invariants for `apr code` chat output.
- Blackwell GB10 distill enablement (#1797 + #1804-#1820 cascade): `apr distill --backend cuda` runs end-to-end on sm_121. SPEC-BLACKWELL-FIX-001 + PMAT-700 (autodetect Grace Blackwell, skip PTX GEMM pre-warm when cuBLAS bound).
- HTTP 3-knob wire-up (#1846): operator-actionable temperature/top_p/repeat_penalty env vars for `apr code`.
- cuBLAS FP8 reproducer + per-layer parity infrastructure (#1884 + #1887): general-purpose diagnostic tooling that survived the #1864 phantom investigation. The Stage A reproducer pins FP8 forward output to a bit-identical FNV-1a signature for any future numerical comparison; Stage B uncaps `CPU_DEBUG_LAYERS=1` to dump all 28 layers + ships `scripts/cublas_fp8_per_layer_diff.sh` to split CPU/GPU streams.
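The multi-step parity gate (#1876) compares CPU and wgpu outputs at every decode step rather than only at step 0. The sketch below assumes the gate compares per-step logit vectors by cosine similarity against a fixed threshold; the 0.99 value is borrowed from the single-step gate, and `first_divergent_step` is a hypothetical name rather than the contract's actual equation.

```rust
/// Cosine similarity between two equal-length vectors; 0.0 if either is all zeros.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Hypothetical gate: index of the first step whose cosine falls below `threshold`,
/// or `None` if every compared step stays in parity.
fn first_divergent_step(cpu: &[Vec<f32>], gpu: &[Vec<f32>], threshold: f32) -> Option<usize> {
    cpu.iter()
        .zip(gpu)
        .position(|(c, g)| cosine(c, g) < threshold)
}

fn main() {
    // Toy per-step logits: step 0 agrees, step 1 drifts (as KV-cache drift would).
    let cpu = vec![vec![1.0, 0.0], vec![1.0, 0.0], vec![1.0, 0.0]];
    let gpu = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![1.0, 0.0]];
    match first_divergent_step(&cpu, &gpu, 0.99) {
        Some(step) => println!("parity lost at step {step}; fall back to CPU"),
        None => println!("parity held for all steps"),
    }
}
```

A single-step check on this toy data would pass, which is the failure mode the multi-step gate was added to catch.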
Fixed
- #1864 cuBLAS FP8 7B Q4K "gibberish" (#1890): was not a numerical bug. Root cause: the Golden Output gate's `gen_config` used `..Default::default()` without overriding `stop_tokens`. Default = `Vec::new()`, so generation ran the full 512-token budget. After emitting the correct answer "4", the model continued from in-distribution chat-template noise → `<|im_start|>` repeats → `verify_output` flagged as gibberish. Fix: 5 lines, adding `stop_tokens: vec![specials.eos_id]` to both CPU and GPU gen_configs. User-visible `apr serve` was never affected (it populated stop_tokens correctly at `cuda_chat_backend.rs:113`). Methodology lesson saved to `memory/feedback_falsify_simple_before_deep.md`.
- #1862 `apr --version` stale SHA in git worktrees (#1867): build.rs watched a hardcoded `../../.git/HEAD` path that doesn't exist in worktree layout (`.git` is a file pointer there, not a dir). Replaced with `git rev-parse --git-dir` for per-worktree HEAD + `--git-common-dir` for shared refs. Contract `apr-version-traceability-v1` → v1.1.0 + FALSIFY-VERSION-004.
- #1865 `apr export <model>.apr --format gguf` panic (#1868): `.expect()` on `apr_metadata.num_layers` aborted the process (exit 101) on APR files that didn't carry the field. Replaced with `Result`-propagating `ok_or_else` + fallback that infers `num_layers` from `blk.N.*` tensor names. Exit code 5 (clean validation error), not 101 (panic). New contract `apr-export-num-layers-v1`.
- #1866 `apr validate --quality` Grade F on working models (#1870): gate compared `total_score` against a 100-point ceiling, but 22 of 25 quality checks were stubbed `Skip("Not implemented")`. New `ValidationReport::implemented_score_pct() -> Option<f64>` gates the threshold on the runnable denominator. Working models now exit 0; fully-stubbed suites treated as informational. New contract `apr-validate-quality-threshold-v1`.
- README drift + `apr serve` example syntax (#1873): contract claimed 1134 contracts / 82 CLI commands; actual was 1151 / 103. `apr serve model.gguf` example errored ("unrecognized subcommand"); correct usage is `apr serve run model.gguf`. Both fixed; CLAUDE.md status line bumped to reflect v0.34.0 ship.
- `cargo deny check advisories` blocker (#1878): RUSTSEC-2026-0105 (`core2` unmaintained + yanked, transitive via `bitstream-io`) started failing ALL PRs simultaneously the morning of 2026-05-22. Added to ignore list with recovery note.
- distill GPU checkpoint export (#1856 / PMAT-699): `apr distill` now saves trained GPU weights at the end of each phase + periodic checkpoints; previously the trained weights stayed on the GPU and were lost on process exit.
- M-GPU-MOE-3 Q4_K root cause documented (#1822): CPU uses Q8K activation quantization while CUDA uses f32 → different algorithms. Closed FALSIFY-Q4K-BISECT-007. Fix still tracked as #1583 (cuda f32→Q8K activation quant kernel).
- Eight stale-path / contract-registry fixes (#1857 #1860 #1861): repair stale `include_str!` paths after the monorepo consolidation; repair stale CARGO_MANIFEST_DIR in fusion contract test; register 5 missing fused kernels in `kernel-fusion-v1.yaml` (closes #1858).
- clean_chat_output prefix stripping (#1853): strip leading `Human:` / `User:` / `Assistant:` from model output before returning to chat client.
- try_qwen3_moe_backend EOS stop_tokens (#1852): populate `stop_tokens` with EOS for qwen3-moe HTTP path (fixes M287 runaway generation).
- qwen3_moe arch guard at /v1/chat/completions (#1806): guard at HTTP handler so qwen3_moe traffic routes to the MoE-aware forward; prevents `Buffer with 'layer.0.up_proj' label binding size is zero` panic.
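The #1864 root cause is easiest to see in a toy decode loop: with an empty `stop_tokens` list nothing ends generation, so the loop always spends the whole token budget. The sketch below is illustrative only; `generate` and `toy_model` are hypothetical stand-ins (not the apr API), and the token ids are arbitrary placeholders.

```rust
/// Hypothetical decode loop: stops on a stop token or when the budget is spent.
fn generate(
    mut next_token: impl FnMut() -> u32,
    max_tokens: usize,
    stop_tokens: &[u32],
) -> Vec<u32> {
    let mut out = Vec::new();
    while out.len() < max_tokens {
        let tok = next_token();
        if stop_tokens.contains(&tok) {
            break; // an empty list never matches, so this never fires
        }
        out.push(tok);
    }
    out
}

/// Toy stand-in for a model: emits one answer token, then EOS, then noise forever.
fn toy_model(eos: u32) -> impl FnMut() -> u32 {
    let mut i = 0usize;
    move || {
        let tok = match i {
            0 => 4,   // the "answer"
            1 => eos, // the model signals it is done
            _ => 9,   // runaway noise if nobody stops it
        };
        i += 1;
        tok
    }
}

fn main() {
    const EOS: u32 = 2; // placeholder id
    let without_stop = generate(toy_model(EOS), 512, &[]);
    let with_stop = generate(toy_model(EOS), 512, &[EOS]);
    println!("no stop tokens: {} tokens", without_stop.len());
    println!("with EOS stop:  {} tokens", with_stop.len());
}
```

The model in this sketch is correct in both runs; only the harness configuration differs, which mirrors why the bug showed up in the QA gate but not in `apr serve`.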
Verification
- End-to-end 7B Q4K GGUF on RTX 4090: `apr qa /home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf` → ✓ ALL GATES PASSED (pre-fix: ✗ FAIL Golden Output `<|im_start|>` repeats)
- End-to-end 7B Q4K HTTP: `apr serve run <7B> --port 8080` + curl `/v1/chat/completions` with `{"role":"user","content":"What is 2+2?"}` → '2+2 equals 4.'
- End-to-end 1.5B Q4K APR: `apr run` produces "2 + 2 equals 4." via the multi-step parity gate → CPU fallback safety net
- End-to-end 0.5B SafeTensors: `apr pull` → `apr inspect` → `apr convert` → `apr export` round-trip; all commands clean exit, no panics
- `/dogfood` Gates 1-17: all GO on this host with canonical Qwen scale ladder (0.5B / 1.5B / 7B / 30B-MoE)
- Tests: 25,300+ workspace lib tests + 5968 apr-cli tests + 13,805 aprender-core tests pass; 1153 provable contracts lint-clean
Methodology notes saved to memory
- `feedback_falsify_simple_before_deep`: when a test gate FAILs with a complex symptom, first check whether the user-visible path that the test purports to verify ALSO fails. Saved the session from days of phantom investigation.
- `feedback_release_only_after_bug_hunt`: for releases, dogfood → wait for in-flight fixes → bug-hunt → THEN cut.
- (existing `feedback_test_methodology_can_fake_bugs` and `feedback_falsifier_cascade_decomposes_magnitude` rules reinforced by this session)
v0.34.0 — MODEL-2 §88 stack-existence-proof + apr publish defect cascade
🎉 MODEL-2 §88 stack-existence-proof published
End-to-end publish of the first model trained with the pure-Rust Sovereign AI Stack: https://huggingface.co/paiml/albor-370m-v1. 494M-parameter Qwen2 architecture (init from Qwen2.5-Coder-0.5B-Instruct, fine-tuned on bigcode/the-stack-dedup + codeparrot/codeparrot-clean Python permissive subset), val_loss=4.6227, all 3 binary artifacts (.apr, .gguf, .safetensors) + tokenizer + config + 11.6KB model card. GGUF verified loadable by llama.cpp.
PMAT-690 P3-C-prep defect cascade (Class-3 wave of 5)
| # | Title | PR |
|---|---|---|
| 1 | `apr stamp --tokenizer` for embedded vocab | #1769 |
| 2 | GGUF Q4_K K-divisibility fallback | #1771 |
| 3 | GGUF Q4_K shape pass-through (fix llama-cli offsets) | #1771 |
| 5 | apr publish LFS batch + NDJSON commits + model-index YAML | #1772 |
Install
cargo install aprender
apr --version # apr 0.34.0
Highlights
- `apr stamp --tokenizer <DIR>`: embeds `vocab.json` + `merges.txt` (or `tokenizer.json`) into APR `custom.tokenizer.vocabulary` + sets `HAS_VOCAB` flag. Closes the §86 salvage workflow for pre-P0-K APRs.
- GGUF Q4_K compatibility for non-256-divisible architectures: Qwen2 0.5B (hidden=896) now exports a llama.cpp-loadable GGUF. Tensors with K%256 ≠ 0 fall back to F32; others quantize with corrected APR-native shape (was producing transposed/inflated bytes).
- `apr publish` correctness: three sub-defects fixed: missing LFS batch API call (5MB-5GB band), JSON `addOrUpdate` commits silently dropping files, and `model-index` YAML missing `results:` triggering HTTP 400.
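The K-divisibility fallback can be sketched as a per-tensor dtype choice. Q4_K packs weights in 256-element super-blocks, so a tensor whose inner dimension K is not a multiple of 256 cannot be tiled cleanly. `ExportDtype` and `choose_export_dtype` below are hypothetical names; only the rule (K%256 ≠ 0 falls back to F32) comes from the notes.

```rust
/// Hypothetical export dtype for a single tensor.
#[derive(Debug, PartialEq, Clone, Copy)]
enum ExportDtype {
    Q4K,
    F32,
}

/// Q4_K super-block size in elements.
const Q4K_BLOCK: usize = 256;

/// Hypothetical selector: `k` is the tensor's inner (reduction) dimension.
fn choose_export_dtype(k: usize) -> ExportDtype {
    if k != 0 && k % Q4K_BLOCK == 0 {
        ExportDtype::Q4K
    } else {
        ExportDtype::F32 // cannot tile into whole super-blocks
    }
}

fn main() {
    // hidden=896 is the Qwen2 0.5B case from the notes: 896 = 3 * 256 + 128.
    for k in [896usize, 1024, 4096] {
        println!("K={k}: {:?}", choose_export_dtype(k));
    }
}
```

Falling back to F32 costs file size for the affected tensors but keeps the exported GGUF loadable.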
See CHANGELOG.md for the full release notes.
Nightly Build
Automated nightly build from main.
Date: 2026-05-16T04:06:12Z
Commit: 63a6562
This is a prerelease. For stable releases, see the latest tagged version.
🎉 v0.33.0 — MODEL-1 SHIP % = 100% (SHIP-007 LIVE-DISCHARGED)
🎉 MODEL-1 SHIP % = 100%
This release marks SHIP-TWO-001 MODEL-1 fully ship-ready. All 10 acceptance criteria (AC-SHIP1-001 through AC-SHIP1-010) are LIVE-discharged on the canonical 7B Qwen2.5-Coder-Instruct Q4_K_M teacher (RTX 4090, --features cuda).
| AC | What | Discharge § |
|---|---|---|
| SHIP-001 | apr run safetensors load via realizar | §72 |
| SHIP-002 | apr run "def fib(n):" valid Python | §61 |
| SHIP-003 | q4_k_m round-trip cos ≥ 0.999 | §72 |
| SHIP-004 | GGUF loads in llama.cpp | §72 |
| SHIP-005 | HumanEval pass@1 = 86.59% (gx10 164-run) | §71 |
| SHIP-006 | apr qa 12-gate aggregate PASS | §61.8 |
| SHIP-007 | PARITY-GATE PASS + 124.6 tok/s @ 128-tok decode | §75 |
| SHIP-008 | Chat template render | §61 |
| SHIP-009 | License + provenance in model.apr metadata | §72 |
| SHIP-010 | Published HF URL + sha256 match | §72 |
SHIP-007 root cause + fix (PR #1651, §75)
F32 GEMV PTX kernel layout bug. The kernel assumed weight matrix A is [K rows × N cols] row-major. The standard ML weight convention is [output_dim=N, input_dim=K] row-major. Kernel was reading TRANSPOSED weights → cos = -0.005190 vs CPU.
Fix: rewrite inner loop to iterate K within row block_id. Empirical: 124.6 tok/s @ 128-tok decode on RTX 4090 (4.15× the AC-SHIP1-007 30 tok/s floor). PARITY-GATE PASS, default path, no workarounds needed.
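The layout bug can be reproduced on the CPU with a tiny matrix. The sketch below is not the PTX kernel; it only contrasts indexing a weight buffer as [N, K] row-major (the convention described above) with misreading the same buffer as [K, N]. Both function names are hypothetical.

```rust
/// y = W x with W stored as [n, k] row-major (n output rows, k input columns).
fn gemv_nk(w: &[f32], x: &[f32], n: usize, k: usize) -> Vec<f32> {
    assert_eq!(w.len(), n * k);
    assert_eq!(x.len(), k);
    (0..n)
        .map(|row| (0..k).map(|j| w[row * k + j] * x[j]).sum::<f32>())
        .collect()
}

/// The misreading: the same buffer indexed as if it were [k, n] row-major.
fn gemv_kn_misread(w: &[f32], x: &[f32], n: usize, k: usize) -> Vec<f32> {
    assert_eq!(w.len(), n * k);
    assert_eq!(x.len(), k);
    (0..n)
        .map(|col| (0..k).map(|j| w[j * n + col] * x[j]).sum::<f32>())
        .collect()
}

fn main() {
    // W = [[1, 2, 3], [4, 5, 6]] stored row-major, so n = 2 and k = 3.
    let w = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let x = [1.0, 0.0, 1.0];
    println!("correct: {:?}", gemv_nk(&w, &x, 2, 3));
    println!("misread: {:?}", gemv_kn_misread(&w, &x, 2, 3));
}
```

Both versions stay in bounds and return a vector of the right length, so nothing crashes; only a numerical comparison against a reference exposes the error, which is what the parity gate provides.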
Other major fixes
SHIP-005 — HumanEval harness RC3 fix (PR #1635, §70/§71)
The ChatML branch dropped problem.prompt's typing imports (e.g., from typing import List). ~70% of HumanEval signatures use typing aliases → NameError at line 1.
Fix: new extract_prompt_preamble(prompt, entry_point) helper prepends prompt preamble before extracted code block.
Empirical: pre-fix (§67 H4 only) pass@1 = 80.49% → post-fix (§71 +RC3) pass@1 = 86.59% (+6.10pp; clears 84.80% floor by +1.79pp).
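The preamble fix can be sketched as a small string helper. The name and argument order follow the notes, but the body is an assumption: here the preamble is taken to be every line of the prompt before the line that starts with `def <entry_point>(`, and an empty string is returned if no such line exists.

```rust
/// Sketch only: returns everything in `prompt` before the `def <entry_point>(` line.
/// Returns an empty string when the entry point is not found (an assumed policy).
fn extract_prompt_preamble(prompt: &str, entry_point: &str) -> String {
    let needle = format!("def {entry_point}(");
    let mut preamble = String::new();
    for line in prompt.lines() {
        if line.trim_start().starts_with(&needle) {
            return preamble;
        }
        preamble.push_str(line);
        preamble.push('\n');
    }
    String::new()
}

fn main() {
    let prompt = "from typing import List\n\n\ndef has_close_elements(numbers: List[float]) -> bool:\n    pass\n";
    let preamble = extract_prompt_preamble(prompt, "has_close_elements");
    // Prepending the preamble keeps `List` defined when the extracted code runs.
    let extracted = "def has_close_elements(numbers: List[float]) -> bool:\n    return False\n";
    print!("{preamble}{extracted}");
}
```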
Diagnostic surfaces
- `APR_EVAL_DEBUG=1`: per-problem JSON dump in apr eval. Diagnoses harness false-negatives. Used to localize §70 RC3 in 5 minutes on gx10.
- `APR_GPU_STAGE_DUMP=<dir>`: GPU-side per-stage F32 tensor dump in APRT format. Used to localize SHIP-007 bug to F32 GEMV via stage-by-stage stats analysis.
Methodology lessons (#16-#22)
Lessons #16-#22 captured in MEMORY.md document the falsifier-first cascade that closed §65→§75 in 2 days. The cascade went from "5-10 PR / 1-2 week" estimate (§63) to actual "13 PR / 2 days". Methodology lessons compose; each makes the next bug cheaper.
Spec versions
docs/specifications/aprender-train/ship-two-models-spec.md: 3.13.0 → 3.21.0 across §67, §68, §69, §70, §71, §72, §73, §74, §75 (9 amendments over 2 days).
Cascade arc
| Date | § | What |
|---|---|---|
| 2026-05-12 | 67-72 | SHIP-005 cascade: H4 → RC3 → LIVE-DISCHARGED at 86.59% pass@1; 5-AC LIVE evidence cascade |
| 2026-05-12 | 73 | SHIP-007 cascade scope reduced from 3 layers to 1 |
| 2026-05-13 | 74-75 | SHIP-007 bug LOCALIZED to F32 GEMV → 1-PR layout fix → MODEL-1 100% |
v0.31.2 — aprender-mcp publish fix (v0.31.1 yank re-publish)
Summary
Re-publish of v0.31.1 with the aprender-mcp `cargo install` fix.
v0.31.1 was yanked because `cargo install aprender@0.31.1` panicked at build time — aprender-mcp's build.rs read `../../contracts/apr-mcp-tool-schemas-v1.yaml`, which lives in the monorepo tree but not in the published tarball.
What's Changed
aprender-mcp publish fix (#910)
- Bundle `contracts/apr-mcp-tool-schemas-v1.yaml` inside the crate via `Cargo.toml` `include = ["contracts/*.yaml"]`.
- build.rs now reads `CARGO_MANIFEST_DIR/contracts/` — no `..` escape.
- Drift-guard test `contract_copy_matches_workspace_root` asserts in-crate and workspace-root copies stay byte-identical.
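The path change can be sketched as a `build.rs` fragment. This is a minimal illustration of resolving the contract from `CARGO_MANIFEST_DIR` with no `..` component, not the crate's actual build script; `contract_path` is a hypothetical helper, and the fallback to `.` exists only so the sketch runs outside Cargo.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: locate the bundled contract inside the crate directory.
fn contract_path(manifest_dir: &Path) -> PathBuf {
    manifest_dir
        .join("contracts")
        .join("apr-mcp-tool-schemas-v1.yaml")
}

fn main() {
    // Cargo sets CARGO_MANIFEST_DIR for build scripts; default to "." when run standalone.
    let dir = std::env::var("CARGO_MANIFEST_DIR").unwrap_or_else(|_| ".".to_string());
    let path = contract_path(Path::new(&dir));

    // Ask Cargo to re-run the build script when the contract changes.
    println!("cargo:rerun-if-changed={}", path.display());

    if !path.exists() {
        // Report instead of panicking so the failure mode is explicit.
        eprintln!("contract not found at {}", path.display());
    }
}
```

Because the path never leaves the crate directory, it resolves the same way in the monorepo and in the published tarball, provided the file is listed in `include`.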
Poka-Yoke prevention
- New `scripts/check_build_rs_paths.sh` — static check that flags any `build.rs` joining `".."` with `panic!`/`unwrap_or_else(…panic)` unless it has an `.exists()` guard or `ALLOW_ESCAPE` annotation.
- Wired into `make tier3` (pre-push) and `.github/workflows/ci.yml` `workspace-test` job.
Version bump (#911)
- 71 `Cargo.toml` files + `Cargo.lock` bumped 0.31.1 → 0.31.2.
Install
```sh
cargo install aprender --version 0.31.2
apr --version
```
Previous versions
- v0.31.1 — YANKED from crates.io (build.rs path escape broke `cargo install`).
- v0.31.0 — current max non-yanked before this release.
v0.31.1 — QA format_parity SKIP fix + MCP M5 pmcp scaffold
Patch release over v0.31.0 bundling two post-release PRs plus cargo fmt normalization.
Fixed
- `apr qa` `format_parity` gate now SKIPs when the primary model is non-GGUF (SafeTensors, APR, ONNX) instead of FAILing the overall QA run (#907). Matches the pre-existing SKIP semantics of the 5 other inference-only gates when golden-output / golden-input / reference tokenizer are unavailable. Regression tests assert `skipped=true && passed=true` for both SafeTensors and APR primaries.
Added
- MCP M5 scaffold (#908): optional `pmcp = "2.3"` dependency on `aprender-mcp` behind a new `pmcp-dispatcher` feature flag (default off). Zero behaviour change: the hand-rolled stdio dispatcher still runs by default. Unblocks the M5 migration path (pmcp::Server delegation + FALSIFY-MCP-009 byte-identical parity test + SSE/WebSocket transports).
CI
- `.github/workflows/ci.yml`: temporarily set `enable_sccache: false` to work around a missing `rustc-sccache` wrapper in the sovereign-ci runner container image. Will revert once the upstream runner image is fixed.
Install
cargo install aprender --version 0.31.1
apr --version
Full changelog: v0.31.0...v0.31.1
v0.31.0 — MCP M1–M3 + Claude Code parity epic + SHIP-TWO-001 teacher
Highlights
- MCP Server M1–M3: `apr mcp` exposes 9 apr tools over stdio JSON-RPC 2.0 with YAML-codegen'd schemas (FALSIFY-MCP-008), `notifications/cancelled` + `notifications/progress`, and Draft 7 meta-validation.
- apr code: Claude Code parity epic CLOSED (`contracts/apr-code-parity-v1.yaml` v5.1): 21 rows, 14 SHIPPED / 3 PARTIAL / 4 NONE. 10 tickets closed in one cycle (P0×4, P1×5, P2×2). Epic PMAT-CODE-PARITY-MATRIX-001 closure conditions met.
- SHIP-TWO-001 first sovereign model: `paiml/qwen2.5-coder-7b-apache-q4k-v1` (7.5 GB .apr, Apache-2.0) published to HuggingFace Hub. First artifact to pass the full `apr validate-manifest` contract.
- Decode hot-path hygiene (HP-001/002/003): 1.5B Q4_K_M: 184 → 382 tok/s (2.07×) by removing per-token /tmp writes and diagnostic eprintlns from the GPU decode path.
- Contracts harness: new `pv check-parity` SEMANTIC gate + `apr-claude-proxy-v1.yaml` DRAFT (Messages-API drop-in for `apr serve anthropic`).
Install
cargo install aprender
apr --version
What's in this release
See CHANGELOG.md [0.31.0] - 2026-04-19 for the full list — MCP server M1–M3, apr-code parity epic, SHIP-TWO-001 teacher, perf HP-001/002/003, GH-434 streaming APR→Q4K, GH-478 per-layer dequant, apr compile / apr serve plan hf:// / apr eval --task classify, --arch gemma|falcon|mamba|t5, sccache + nextest CI, and many flaky-test fixes.
Merge sequence
| PR | Title | Merged |
|---|---|---|
| #748 | release: aprender v0.31.0 (78-crate bump) | 2026-04-18 |
| #888 | docs(mcp): spec v1.2.0 + parity epic | 2026-04-19 05:48 UTC |
| #899 | release: consolidated CHANGELOG | 2026-04-19 06:07 UTC |
🤖 Tag cut with Claude Code
v0.30.0 — Monorepo Consolidation Complete
What's Changed
Full Changelog: nightly...v0.30.0