contract+test(apr-cli-tokenize-import-hf-v1): v1.0 → v1.1 — bind FALSIFY-001 with real integration test by noahgift · Pull Request #1506 · paiml/aprender

noahgift · 2026-05-05T08:02:20Z

Summary

v1.0.0 stamped a vague test reference for FALSIFY-TOK-IMPORT-HF-001 ("or equivalent" — not a runnable invocation). Same drift class as PR #1504 + PR #1505 caught for sibling pretrain contracts.

What ships

Test: `tokenize_import_hf_subcommand_registered` in `crates/apr-cli/tests/cli_commands.rs` runs `apr tokenize import-hf --help` and asserts (a) exit 0, (b) `--input`, `--output`, `--include-added-tokens` flags appear. Mirrors the `pretrain_init_flag_registered` pattern.

Contract apr-cli-tokenize-import-hf-v1 v1.0.0 → v1.1.0 PARTIAL_ALGORITHM_LEVEL:

FALSIFY-TOK-IMPORT-HF-001 `test:` updated to cite the new test.
5/5 falsifiers PASS, all bound to real tests.

Verification

`cargo test -p apr-cli --test cli_commands -- tokenize_import_hf_subcommand_registered` PASS
`pv validate contracts/apr-cli-tokenize-import-hf-v1.yaml` exits 0

Five Whys

Why was the v1.0.0 reference vague? Authored alongside the subcommand impl; integration test deferred under assumption `test_no_unregistered_commands` would cover it. But that only covers top-level subcommands.
Why is sub-subcommand not covered by test_no_unregistered_commands? It walks `apr-cli-commands-v1.yaml` which only enumerates top-level.
Why bump to v1.1.0? Test-binding INVARIANT was broken; v1.1.0 restores it.
Why mirror the `pretrain_init_flag_registered` pattern? Codebase consistency; the pattern is a clean drift guard.
Why pin specific flags? Flag-level drift (e.g., `--input` → `--source`) breaks UX without breaking exit-0 check.

Net effects

Contract v1.0.0 → v1.1.0 PARTIAL_ALGORITHM_LEVEL.
1 new integration test (33 LOC).
MODEL-1 ship % unchanged at 91%; MODEL-2 ship % unchanged at 57%.

Third drift-fix PR in the same session (after PR #1504 + PR #1505), closing the test-reference drift class across the §50.4 cascade contracts.

Test plan

🤖 Generated with Claude Code

…IFY-001 with real integration test v1.0.0 stamped a vague test reference for FALSIFY-TOK-IMPORT-HF-001: "cargo test -p apr-cli --test cli_commands -- --nocapture (or equivalent) reports import-hf as a registered tokenize subcommand" This was not a runnable invocation — same drift class as PR #1504 + PR #1505 caught for sibling pretrain contracts. The contract said "or equivalent" rather than naming an actual test, leaving the falsifier unfalsifiable. ## What ships Test: - `tokenize_import_hf_subcommand_registered` in `crates/apr-cli/tests/cli_commands.rs` runs `apr tokenize import-hf --help` and asserts (a) exit 0, (b) `--input`, `--output`, `--include-added-tokens` flags appear. Pins the 3-surface drift triangle: (1) clap variant `TokenizeCommands::ImportHf` (2) unit tests `commands::tokenize::tests::import_hf_*` (3) this integration test Contract apr-cli-tokenize-import-hf-v1 v1.0.0 → v1.1.0 PARTIAL_ALGORITHM_LEVEL: - FALSIFY-TOK-IMPORT-HF-001 `test:` updated to cite the new test. - Status remains PARTIAL_ALGORITHM_LEVEL: 5/5 falsifiers PASS. ## Verification $ cargo test -p apr-cli --test cli_commands -- tokenize_import_hf_subcommand_registered test result: ok. 1 passed; ... $ pv validate contracts/apr-cli-tokenize-import-hf-v1.yaml 0 error(s), 0 warning(s) ## Five Whys 1. Why was the v1.0.0 reference vague? Authored alongside the subcommand impl + unit tests; the integration test was deferred under the assumption that "test_no_unregistered_commands" would cover it. But that test only covers TOP-LEVEL subcommands, not sub-subcommands of `apr tokenize`. 2. Why is sub-subcommand registration not covered by test_no_unregistered_commands? It walks `apr-cli-commands-v1.yaml` which only enumerates top-level subcommands; sub-subcommand surfaces are out of scope. 3. Why bump to v1.1.0 (not v1.0.1)? Same logic as PR #1504/#1505: the test-binding INVARIANT was broken in v1.0.0; v1.1.0 restores it. 4. Why mirror the `pretrain_init_flag_registered` pattern instead of authoring something new? The pattern (run `apr <subcmd> --help`, assert exit 0 + key flags appear) is a clean drift guard; mirroring it preserves codebase consistency. 5. Why pin the 3 specific flags rather than just `apr tokenize import-hf --help` exit 0? Because flag-level drift (e.g., a future PR renaming `--input` to `--source`) would silently break operator-facing UX without breaking the help-shows-up binary check; pinning the exact flag names catches that class. ## Net effects - Contract v1.0.0 → v1.1.0 PARTIAL_ALGORITHM_LEVEL. - 1 new integration test (33 LOC). - 5/5 falsifiers PASS, all bound to real tests. - MODEL-1 ship % unchanged at 91%; MODEL-2 ship % unchanged at 57%. This is hygiene work while 5g.1 (~11hr) corpus retokenize runs. Third drift-fix PR in the same session (after PR #1504 + PR #1505) closing the test-reference drift class across the §50.4 cascade contracts (apr-pretrain-from-init-v1, apr-pretrain-arch-polymorphic-v1, apr-cli-tokenize-import-hf-v1). Refs: SPEC-SHIP-TWO-001 §50.4 cascade (PRs #1473-#1505), contracts/apr-cli-tokenize-import-hf-v1.yaml v1.1.0, feedback_cli_subcommand_three_surface_drift.md Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…oughput characterization (#1508) §56 closed with 5g.1 full-corpus retokenization dispatched (PID 2767124, ~17hr wall projected). §57 records the parallel drift-sweep work that landed during the 5g.1 wait + throughput characterization of 5g.1 mid-run. ## Drift sweep (4 PRs) While 5g.1 ran in the background, a sweep of the §50.4 cascade contracts surfaced THE SAME drift class across multiple contracts: cited test names that didn't match what the impl PR actually authored. PR | Contract | v_old → v_new | Drift --- | --- | --- | --- #1502 | apr-pretrain-arch-polymorphic-v1 | v1.3 → v1.4 | CUDA-001 was REFERENCED in changelog but had no formal falsification_test entry #1504 | apr-pretrain-from-init-v1 | v1.1 → v1.2 | 7 of 8 cited test names didn't exist; re-aligned to existing tests #1505 | apr-pretrain-arch-polymorphic-v1 | v1.4 → v1.5 | FALSIFY-005/006 cited names diverged from PR #1476's actual authoring #1506 | apr-cli-tokenize-import-hf-v1 | v1.0 → v1.1 | FALSIFY-001 cited "or equivalent" — no real test name After PR #1506 lands, `pv lint contracts/` reports 0 PV-VER-001 errors across all 870+ contracts. The drift class is fully closed. ## 5g.1 throughput (real-time mid-run) Shard | Closed at | Δ from prev 0 | 07:08 | (start) 1 | 07:24 | 16 min 2 | 07:39 | 15 min 3 | 07:55 | 16 min ... 12 | 10:16 | (in progress) Mean wall: 16.3 min/shard. Linear projection: 57 shards × 16.3 min = 929 min = ~15.5 hr total → ETA ~22:30Z (slightly under §56's 17hr smoke estimate). ## Methodology takeaway When a contract is authored in PR_A alongside its impl, AND the impl's test names are stamped in the contract's `test:` field BEFORE the impl PR finalizes the names, the names diverge at the cascade boundary. Happened in 3 of 4 §50.4 cascade contracts. Prevention rule: when authoring a new contract that cites tests, EITHER reference tests that already exist on main, OR mark them `PENDING_PR_<N>:` with the impl PR ref so PV-VER-001 lint can flag dangling refs at contract-merge time. A future spec amendment could codify a `pv lint --strict-test-binding` enforcement that blocks contract merge when any `test:` field doesn't resolve to an existing test invocation. Out of §57 scope. ## Net effects - Spec v3.01.0 → v3.02.0. - Three contract bumps land cleanly (apr-pretrain-arch-polymorphic-v1 v1.3→v1.4→v1.5, apr-pretrain-from-init-v1 v1.1→v1.2, apr-cli-tokenize-import-hf-v1 v1.0→v1.1). - pv lint contracts/ 0 PV-VER-001 errors across 870+ contracts. - 5g.1 full corpus run progressing at 16.3 min/shard; ETA ~22:30Z. - MODEL-1 ship % unchanged at 91%; MODEL-2 ship % unchanged at 57% until step 5g.3 produces val_loss < 9.38. Refs: SPEC-SHIP-TWO-001 §50.4 cascade, PRs #1502/#1504/#1505/#1506 (drift sweep), apr-cookbook spec v5.1.0 (companion update — operator-facing recipe) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…-003/004/007 drift (round 2) (#1509) * contract(apr-pretrain-arch-polymorphic-v1): v1.5 → v1.6 — fix FALSIFY-003/004/007 drift (round 2) Second-round test-reference drift correction. §57's drift sweep (this contract's v1.4 → v1.5 bump in PR #1505) caught FALSIFY-005/006 but a more thorough audit (cross-referencing every `test:` field against the source-code function-name registry) surfaced three additional dangling references. ## Drift inventory (round 2) | Falsifier | v1.5.0 cited test | Exists? | Actual test | | --- | --- | --- | --- | | 003 | build_transformer_config_qwen_init_matches_constructor | ❌ | build_transformer_config_qwen_init_matches_input | | 004 | transformer::attention::tests::gqa_7_to_1_matches_full_mha | ❌ | transformer::model::tests::falsify_apr_pretrain_arch_004_* | | 007 | build_transformer_config_encoder_init_errors | ❌ | validate_pretrain_init_arch_rejects_encoder | ## Why §57 (PR #1505) didn't catch these §57's grep audited test-name SUFFIXES and FRAGMENTS, which produced false-negatives on: - `_init_matches_constructor` vs `_init_matches_input` — both end in `_matches_<word>` so a fragment grep counted the contract's name as "not dangling" - `transformer::attention::tests::` vs `transformer::model::tests::` — module-path drift not just function-name drift; only fully- qualified path comparison catches this - `_encoder_init_errors` vs `validate_pretrain_init_arch_rejects_encoder` — the contract's name was a guess at the impl name; impl PR #1479 chose a completely different convention ## How this round was found Used a stricter audit: for every `cargo test ... ::tests::<name>` in contracts, grep `fn <name>` in the actual source tree. If the fn doesn't exist, drift. This catches drift that PR #1505's fragment-based audit missed. ## Resolution Update FALSIFY-003/004/007 `test:` fields to the actual function names. No falsifier semantics change. 11 falsifiers all PASS; contract status remains FUNCTIONAL. ## Verification $ cargo test -p aprender-train --lib -- build_transformer_config_qwen_init_matches_input test result: ok. 1 passed $ cargo test -p aprender-train --lib -- falsify_apr_pretrain_arch_004_gqa_7_1_forward_pass_smoke test result: ok. 1 passed $ cargo test -p aprender-train --lib -- validate_pretrain_init_arch_rejects_encoder test result: ok. 1 passed $ pv validate contracts/apr-pretrain-arch-polymorphic-v1.yaml 0 error(s), 0 warning(s) ## Five Whys 1. Why did §57's sweep miss these? Used name-fragment grep (`::tests::[a-z_]+`) which counted false-negatives on suffix- close names like `_constructor` ↔ `_input`. 2. Why is module-path drift a separate class? Because grep against the `[a-z_]+` regex captures the FUNCTION name, not the `::module::tests::` path. A function with the right name in the wrong module passes that audit but fails actual test invocation. 3. Why fix in a separate PR rather than amending PR #1505? PR #1505 already merged. Per `feedback_falsifier_first_cascade_pattern.md` the cleanest cadence is one-bump-per-PR. 4. Why bump to v1.6.0? Same pattern as PR #1505's v1.4 → v1.5: the test-binding INVARIANT was broken in v1.5.0 (residual drift) and v1.6.0 restores it. 5. Why now (during 5g.1 wait)? Productive use of the 5g.1 (~10hr remaining) compute-bound idle time. Each drift fix is small (~30 LOC), reduces drift risk for future agents, and restores the falsifier-binding invariant. The alternative (manufacture bigger work) would risk introducing defects the contract base doesn't catch yet. ## Net effects - Contract v1.5.0 → v1.6.0 FUNCTIONAL. - 11 falsifiers, all PASS — same count, but FALSIFY-003/004/007 now reference tests that actually exist. - MODEL-1 ship % unchanged at 91%. - MODEL-2 ship % unchanged at 57% until 5g.3. This is the SECOND round of drift sweep on this contract. Together with PRs #1502/#1504/#1505/#1506 (round 1), all known test-reference drift is closed across the §50.4 cascade contracts. A future spec amendment could codify a `pv lint --strict-test-binding` enforcement that prevents drift at contract-merge time. Refs: SPEC-SHIP-TWO-001 §50.4 cascade, contracts/apr-pretrain-arch-polymorphic-v1.yaml v1.6.0, PR #1505 (round 1 partial fix), PR #1502/#1504/#1506 (sibling fixes) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * contract(apr-pretrain-arch-polymorphic-v1): also fix FALSIFY-001 (round 2.5 — surfaced by PR #1511) Round 2 (initial commit on this branch) fixed FALSIFY-003/004/007. Sub-agent PR #1511 (`pv lint --strict-test-binding`) surfaced a 4th drift in this same contract: FALSIFY-001 cited `qwen2_0_5b_matches_hf_config` → does NOT exist on main. Actual: `qwen2_0_5b_matches_hf_config_2026_05_04` (date-suffix added by impl PR #1474 / commit 9af6e71 — May 4). The earlier round-2 audit (which focused on suffix + module-path drift) didn't catch this because the test name has a DATE-SUFFIX drift class (function name + `_<date>` is a real Rust test, but the contract truncated to the prefix). Updates: - FALSIFY-001 test ref: append `_2026_05_04` suffix. - v1.6.0 changelog updated to record 4 fixes (was 3). - Verified: cargo test qwen2_0_5b_matches_hf_config_2026_05_04 PASS. - pv lint --strict-test-binding contracts/apr-pretrain-arch-polymorphic-v1.yaml: 0 PV-VER-002 (down from 4 pre-fix). This consolidates round 2 into a single commit on the same branch + PR (#1509) rather than spawning a round-3 PR for one extra fix. The lint hardening in #1511 is what made finding the 4th drift trivial; future drift will be caught at contract-merge time once #1511 lands. Refs: SPEC-SHIP-TWO-001 §50.4 cascade, PR #1511 (sub-agent's pv lint --strict-test-binding), Issue #1510 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 5, 2026 08:02

noahgift merged commit cf765a1 into main May 5, 2026
11 checks passed

noahgift deleted the fix/tok-import-hf-001-binding branch May 5, 2026 08:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

contract+test(apr-cli-tokenize-import-hf-v1): v1.0 → v1.1 — bind FALSIFY-001 with real integration test#1506

contract+test(apr-cli-tokenize-import-hf-v1): v1.0 → v1.1 — bind FALSIFY-001 with real integration test#1506
noahgift merged 1 commit into
mainfrom
fix/tok-import-hf-001-binding

noahgift commented May 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 5, 2026

Summary

What ships

Verification

Five Whys

Net effects

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant