feat(crux-e-02): perplexity classifier + apr ppl CLI (5 of 6 FALSIFY FULL, 1 of 6 PARTIAL; blocked on BLOCKER-UPSTREAM-MISSING) #987

Merged
noahgift merged 24 commits into main from feat/crux-e-02-perplexity
May 13, 2026

@noahgift

Contributor

Summary

Second CRUX feature shipped under CRUX-SHIP-001 (after #986). Adds a pure-math perplexity classifier in aprender::metrics::perplexity plus apr ppl --log-probs-file FILE.json --json, mirroring the llama-perplexity convention without requiring live model compute. The live-inference half (PPL over a held-out corpus using a real GGUF/APR model) stays PARTIAL under a declared BLOCKER-UPSTREAM-MISSING until a stable per-token log-probs extraction path lands.

Surface: apr ppl --log-probs-file nll.json --json emits the keys ppl, mean_nll, num_tokens, and log_probs_path. PPL = exp(-mean(log p)), with three invariants: ppl ≥ 1.0, ppl is finite, and ppl is monotone in mean NLL.
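
The three invariants follow directly from the formula. The short derivation below is a sketch that assumes natural-log token probabilities; the symbols N (token count) and ℓ_i (the i-th log-prob) are notation introduced here for illustration, not names taken from the contract.

```latex
\[
\mathrm{PPL} \;=\; \exp\!\Bigl(-\frac{1}{N}\sum_{i=1}^{N}\ell_i\Bigr)
\;=\; \exp\bigl(\overline{\mathrm{NLL}}\bigr),
\qquad \ell_i = \log p_i,\quad p_i \in (0, 1].
\]
\[
\ell_i \le 0
\;\Longrightarrow\; \overline{\mathrm{NLL}} \ge 0
\;\Longrightarrow\; \mathrm{PPL} \ge e^{0} = 1,
\qquad
\frac{d\,\mathrm{PPL}}{d\,\overline{\mathrm{NLL}}} = \exp\bigl(\overline{\mathrm{NLL}}\bigr) > 0 .
\]
```

Finiteness holds mathematically whenever every ℓ_i is finite and N ≥ 1; in floating point, the exponential can still overflow for very large mean NLL, so an explicit finiteness check on the result is a reasonable extra guard.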

CRUX-SHIP-001 Merge Gates

| Gate | Status | Evidence |
| --- | --- | --- |
| g1_classifier_green | | 13 unit tests pass (aprender-core metrics::perplexity) |
| g2_cli_reachable | | apr ppl --help advertises --log-probs-file |
| g3_e2e_runs | | 9 tests pass in crates/apr-cli/tests/falsification_crux_e_02.rs |
| g4_contract_discharged | ✅ (partial allowed) | 5 of 6 FALSIFY-* FULL; FALSIFY-006 PARTIAL_ALGORITHM_LEVEL under declared BLOCKER-UPSTREAM-MISSING |

Contract Discharge

contracts/crux-E-02-v1.yaml is at v1.1.0 with status partial; pv validate reports 0 errors / 0 warnings:

  • FALSIFY-CRUX-E-02-001 FULL — --log-probs-file flag reachable from CLI
  • FALSIFY-CRUX-E-02-002 FULL — JSON emits ppl key with correct value
  • FALSIFY-CRUX-E-02-003 FULL — PPL ≥ 1.0 and finite
  • FALSIFY-CRUX-E-02-004 FULL — no-silent-pass on empty/NaN/±∞/positive log-prob
  • FALSIFY-CRUX-E-02-005 FULL — PPL monotone in mean NLL
  • FALSIFY-CRUX-E-02-006 PARTIAL_ALGORITHM_LEVEL — live fp16 Llama-3-8B band check needs log-probs extraction path (BLOCKER-UPSTREAM-MISSING)
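
To make the algorithm-level claims (FALSIFY 003 through 005) concrete, here is a minimal sketch of a pure-math perplexity function with no-silent-pass validation. It is not the code in aprender::metrics::perplexity: the function name `perplexity`, the `PplError` enum, and its variants are hypothetical names chosen for illustration, and natural-log inputs are assumed.

```rust
/// Hypothetical error type: reasons an input is rejected rather than silently scored.
#[derive(Debug, PartialEq)]
enum PplError {
    /// No tokens were supplied, so the mean is undefined.
    Empty,
    /// A log-prob was NaN or ±infinity.
    NonFinite { index: usize },
    /// A log-prob was > 0, which would imply a probability above 1.
    Positive { index: usize },
    /// exp(mean NLL) overflowed to infinity.
    Overflow,
}

/// Hypothetical helper: PPL = exp(-mean(log p)) over natural-log token probabilities.
fn perplexity(log_probs: &[f64]) -> Result<f64, PplError> {
    if log_probs.is_empty() {
        return Err(PplError::Empty);
    }
    let mut sum = 0.0_f64;
    for (index, &lp) in log_probs.iter().enumerate() {
        if !lp.is_finite() {
            return Err(PplError::NonFinite { index });
        }
        if lp > 0.0 {
            return Err(PplError::Positive { index });
        }
        sum += lp;
    }
    // Every term is <= 0, so the mean NLL is >= 0 and the result is >= 1.
    let mean_nll = -sum / log_probs.len() as f64;
    let ppl = mean_nll.exp();
    if ppl.is_finite() {
        Ok(ppl)
    } else {
        Err(PplError::Overflow)
    }
}

fn main() {
    // Three tokens at p ≈ 0.5 each give a perplexity close to 2.
    let ppl = perplexity(&[-0.693, -0.693, -0.693]).expect("valid input");
    println!("ppl = {ppl:.4}");

    // Invalid inputs are rejected instead of producing a number.
    assert_eq!(perplexity(&[]), Err(PplError::Empty));
    assert_eq!(perplexity(&[f64::NAN]), Err(PplError::NonFinite { index: 0 }));
    assert_eq!(perplexity(&[-0.1, 0.2]), Err(PplError::Positive { index: 1 }));

    // Monotonicity: a larger mean NLL yields a larger perplexity.
    assert!(perplexity(&[-1.0]).unwrap() < perplexity(&[-2.0]).unwrap());
}
```

Returning a `Result` rather than a bare `f64` is one way to enforce the no-silent-pass rule at the type level: a caller cannot obtain a perplexity value without handling the rejection cases.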

Research Grounding

  • llama.cpp examples/perplexity — canonical PPL CLI we mirror
  • arXiv:2402.16775 — held-out perplexity for pretraining evaluation
  • llama.cpp#7111 — user demand for stable PPL reporting

Test Plan

  • cargo test -p aprender-core --lib metrics::perplexity → 13/13 pass
  • cargo test -p apr-cli --test falsification_crux_e_02 → 9/9 pass
  • cargo run -p aprender-contracts-cli -- validate contracts/crux-E-02-v1.yaml → 0 errors, 0 warnings
  • apr ppl --help | grep -F -- '--log-probs-file' → reachable
  • echo '[-0.693, -0.693, -0.693]' > /tmp/lp.json && apr --json ppl --log-probs-file /tmp/lp.json → ppl ≈ 2.0
  • Full discharge of FALSIFY-006 once the apr eval --task perplexity --corpus <path> wiring lands (follow-up; blocker declared)
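
The expected value in the smoke test can be checked by hand. The arithmetic below is a worked sketch that assumes the three values in /tmp/lp.json are natural-log probabilities, which is consistent with -0.693 ≈ ln 0.5:

```latex
\[
\overline{\mathrm{NLL}} = -\tfrac{1}{3}\bigl(-0.693 - 0.693 - 0.693\bigr) = 0.693,
\qquad
\mathrm{PPL} = e^{0.693} \approx 1.9997 \approx 2.0,
\quad\text{since } \ln 2 \approx 0.6931 .
\]
```

A model that assigns probability 0.5 to every token is as uncertain as a fair coin flip per token, which is why the perplexity lands at about 2.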

🤖 Generated with Claude Code

@noahgift enabled auto-merge (squash) on April 21, 2026 09:06
@noahgift force-pushed the feat/crux-e-02-perplexity branch 2 times, most recently from 475a1be to 93f414d, on April 22, 2026 08:32
noahgift added a commit that referenced this pull request Apr 23, 2026
…bling (#1007)

Flake surfaced in PR #987 workspace-test run 24782269410 — f32 SIMD
rounding produced diff=0.01074 at max_val=0.854, exceeding the 1e-2
small-value tolerance.

The sibling test_vecmat_associativity already uses 2e-2 uniformly
(proptest_properties.rs:252). This aligns the matvec branch to match.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…FULL, 1 of 6 PARTIAL; blocked on BLOCKER-UPSTREAM-MISSING)

CRUX-SHIP-001 merge gates:
- g1_classifier_green: 13 unit tests pass (aprender-core metrics::perplexity)
- g2_cli_reachable: apr ppl --help advertises --log-probs-file
- g3_e2e_runs: 9 falsification tests pass (falsification_crux_e_02)
- g4_contract_discharged: 5 of 6 FALSIFY-* FULL; FALSIFY-006 PARTIAL_ALGORITHM_LEVEL
  under BLOCKER-UPSTREAM-MISSING (no stable per-token log-probs extraction
  path for arbitrary GGUF/APR models in-tree yet)

Contract: contracts/crux-E-02-v1.yaml v1.1.0 status=partial
Classifier: aprender::metrics::perplexity (pure PPL = exp(-mean(log p));
no-silent-pass on empty/NaN/Inf/positive log-prob)
CLI: apr ppl --log-probs-file FILE.json --json emits ppl, mean_nll,
num_tokens, log_probs_path keys

Competitor parity: llama.cpp examples/perplexity nearest analogue
Research: arXiv:2402.16775 (held-out PPL for pretraining evaluation)
User demand: llama.cpp#7111

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift merged commit a137aa8 into main on May 13, 2026
10 checks passed
@noahgift deleted the feat/crux-e-02-perplexity branch on May 13, 2026 14:04