Skip to content

contract+test(apr-cli-tokenize-import-hf-v1): v1.0 → v1.1 — bind FALSIFY-001 with real integration test#1506

Merged
noahgift merged 1 commit into
mainfrom
fix/tok-import-hf-001-binding
May 5, 2026
Merged

contract+test(apr-cli-tokenize-import-hf-v1): v1.0 → v1.1 — bind FALSIFY-001 with real integration test#1506
noahgift merged 1 commit into
mainfrom
fix/tok-import-hf-001-binding

Conversation

@noahgift

@noahgift noahgift commented May 5, 2026

Copy link
Copy Markdown
Contributor

Summary

v1.0.0 stamped a vague test reference for FALSIFY-TOK-IMPORT-HF-001 ("or equivalent" — not a runnable invocation). Same drift class as PR #1504 + PR #1505 caught for sibling pretrain contracts.

What ships

Test: `tokenize_import_hf_subcommand_registered` in `crates/apr-cli/tests/cli_commands.rs` runs `apr tokenize import-hf --help` and asserts (a) exit 0, (b) `--input`, `--output`, `--include-added-tokens` flags appear. Mirrors the `pretrain_init_flag_registered` pattern.

Contract apr-cli-tokenize-import-hf-v1 v1.0.0 → v1.1.0 PARTIAL_ALGORITHM_LEVEL:

  • FALSIFY-TOK-IMPORT-HF-001 `test:` updated to cite the new test.
  • 5/5 falsifiers PASS, all bound to real tests.

Verification

  • `cargo test -p apr-cli --test cli_commands -- tokenize_import_hf_subcommand_registered` PASS
  • `pv validate contracts/apr-cli-tokenize-import-hf-v1.yaml` exits 0

Five Whys

  1. Why was the v1.0.0 reference vague? Authored alongside the subcommand impl; integration test deferred under assumption `test_no_unregistered_commands` would cover it. But that only covers top-level subcommands.
  2. Why is sub-subcommand not covered by test_no_unregistered_commands? It walks `apr-cli-commands-v1.yaml` which only enumerates top-level.
  3. Why bump to v1.1.0? Test-binding INVARIANT was broken; v1.1.0 restores it.
  4. Why mirror the `pretrain_init_flag_registered` pattern? Codebase consistency; the pattern is a clean drift guard.
  5. Why pin specific flags? Flag-level drift (e.g., `--input` → `--source`) breaks UX without breaking exit-0 check.

Net effects

  • Contract v1.0.0 → v1.1.0 PARTIAL_ALGORITHM_LEVEL.
  • 1 new integration test (33 LOC).
  • MODEL-1 ship % unchanged at 91%; MODEL-2 ship % unchanged at 57%.

Third drift-fix PR in the same session (after PR #1504 + PR #1505), closing the test-reference drift class across the §50.4 cascade contracts.

Test plan

  • `pv validate` exits 0
  • PMAT pre-commit quality gates pass
  • New integration test passes
  • CI gate green
  • Auto-merge fires on green CI

🤖 Generated with Claude Code

…IFY-001 with real integration test

v1.0.0 stamped a vague test reference for FALSIFY-TOK-IMPORT-HF-001:
  "cargo test -p apr-cli --test cli_commands -- --nocapture
   (or equivalent) reports import-hf as a registered tokenize subcommand"

This was not a runnable invocation — same drift class as PR #1504
+ PR #1505 caught for sibling pretrain contracts. The contract said
"or equivalent" rather than naming an actual test, leaving the
falsifier unfalsifiable.

## What ships

Test:
- `tokenize_import_hf_subcommand_registered` in
  `crates/apr-cli/tests/cli_commands.rs` runs `apr tokenize
  import-hf --help` and asserts (a) exit 0, (b) `--input`,
  `--output`, `--include-added-tokens` flags appear. Pins the
  3-surface drift triangle:
    (1) clap variant `TokenizeCommands::ImportHf`
    (2) unit tests `commands::tokenize::tests::import_hf_*`
    (3) this integration test

Contract apr-cli-tokenize-import-hf-v1 v1.0.0 → v1.1.0 PARTIAL_ALGORITHM_LEVEL:
- FALSIFY-TOK-IMPORT-HF-001 `test:` updated to cite the new test.
- Status remains PARTIAL_ALGORITHM_LEVEL: 5/5 falsifiers PASS.

## Verification

  $ cargo test -p apr-cli --test cli_commands -- tokenize_import_hf_subcommand_registered
    test result: ok. 1 passed; ...
  $ pv validate contracts/apr-cli-tokenize-import-hf-v1.yaml
    0 error(s), 0 warning(s)

## Five Whys

1. Why was the v1.0.0 reference vague? Authored alongside the
   subcommand impl + unit tests; the integration test was deferred
   under the assumption that "test_no_unregistered_commands" would
   cover it. But that test only covers TOP-LEVEL subcommands, not
   sub-subcommands of `apr tokenize`.
2. Why is sub-subcommand registration not covered by
   test_no_unregistered_commands? It walks
   `apr-cli-commands-v1.yaml` which only enumerates top-level
   subcommands; sub-subcommand surfaces are out of scope.
3. Why bump to v1.1.0 (not v1.0.1)? Same logic as PR #1504/#1505:
   the test-binding INVARIANT was broken in v1.0.0; v1.1.0 restores it.
4. Why mirror the `pretrain_init_flag_registered` pattern instead
   of authoring something new? The pattern (run `apr <subcmd>
   --help`, assert exit 0 + key flags appear) is a clean drift
   guard; mirroring it preserves codebase consistency.
5. Why pin the 3 specific flags rather than just `apr tokenize
   import-hf --help` exit 0? Because flag-level drift (e.g., a
   future PR renaming `--input` to `--source`) would silently
   break operator-facing UX without breaking the help-shows-up
   binary check; pinning the exact flag names catches that class.

## Net effects

- Contract v1.0.0 → v1.1.0 PARTIAL_ALGORITHM_LEVEL.
- 1 new integration test (33 LOC).
- 5/5 falsifiers PASS, all bound to real tests.
- MODEL-1 ship % unchanged at 91%; MODEL-2 ship % unchanged at 57%.

This is hygiene work while 5g.1 (~11hr) corpus retokenize runs.
Third drift-fix PR in the same session (after PR #1504 + PR #1505)
closing the test-reference drift class across the §50.4 cascade
contracts (apr-pretrain-from-init-v1, apr-pretrain-arch-polymorphic-v1,
apr-cli-tokenize-import-hf-v1).

Refs: SPEC-SHIP-TWO-001 §50.4 cascade (PRs #1473-#1505),
      contracts/apr-cli-tokenize-import-hf-v1.yaml v1.1.0,
      feedback_cli_subcommand_three_surface_drift.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 5, 2026 08:02
@noahgift noahgift merged commit cf765a1 into main May 5, 2026
11 checks passed
@noahgift noahgift deleted the fix/tok-import-hf-001-binding branch May 5, 2026 08:26
noahgift added a commit that referenced this pull request May 5, 2026
…oughput characterization (#1508)

§56 closed with 5g.1 full-corpus retokenization dispatched (PID
2767124, ~17hr wall projected). §57 records the parallel drift-sweep
work that landed during the 5g.1 wait + throughput characterization
of 5g.1 mid-run.

## Drift sweep (4 PRs)

While 5g.1 ran in the background, a sweep of the §50.4 cascade
contracts surfaced THE SAME drift class across multiple contracts:
cited test names that didn't match what the impl PR actually authored.

  PR     | Contract                              | v_old → v_new | Drift
  ---    | ---                                   | ---           | ---
  #1502  | apr-pretrain-arch-polymorphic-v1      | v1.3 → v1.4   | CUDA-001 was REFERENCED in changelog but had no formal falsification_test entry
  #1504  | apr-pretrain-from-init-v1             | v1.1 → v1.2   | 7 of 8 cited test names didn't exist; re-aligned to existing tests
  #1505  | apr-pretrain-arch-polymorphic-v1      | v1.4 → v1.5   | FALSIFY-005/006 cited names diverged from PR #1476's actual authoring
  #1506  | apr-cli-tokenize-import-hf-v1         | v1.0 → v1.1   | FALSIFY-001 cited "or equivalent" — no real test name

After PR #1506 lands, `pv lint contracts/` reports 0 PV-VER-001
errors across all 870+ contracts. The drift class is fully closed.

## 5g.1 throughput (real-time mid-run)

  Shard | Closed at | Δ from prev
  0     | 07:08    | (start)
  1     | 07:24    | 16 min
  2     | 07:39    | 15 min
  3     | 07:55    | 16 min
  ...
  12    | 10:16    | (in progress)

Mean wall: 16.3 min/shard. Linear projection: 57 shards × 16.3 min =
929 min = ~15.5 hr total → ETA ~22:30Z (slightly under §56's 17hr
smoke estimate).

## Methodology takeaway

When a contract is authored in PR_A alongside its impl, AND the
impl's test names are stamped in the contract's `test:` field BEFORE
the impl PR finalizes the names, the names diverge at the cascade
boundary. Happened in 3 of 4 §50.4 cascade contracts.

Prevention rule: when authoring a new contract that cites tests,
EITHER reference tests that already exist on main, OR mark them
`PENDING_PR_<N>:` with the impl PR ref so PV-VER-001 lint can flag
dangling refs at contract-merge time.

A future spec amendment could codify a `pv lint --strict-test-binding`
enforcement that blocks contract merge when any `test:` field doesn't
resolve to an existing test invocation. Out of §57 scope.

## Net effects

- Spec v3.01.0 → v3.02.0.
- Three contract bumps land cleanly (apr-pretrain-arch-polymorphic-v1
  v1.3→v1.4→v1.5, apr-pretrain-from-init-v1 v1.1→v1.2,
  apr-cli-tokenize-import-hf-v1 v1.0→v1.1).
- pv lint contracts/ 0 PV-VER-001 errors across 870+ contracts.
- 5g.1 full corpus run progressing at 16.3 min/shard; ETA ~22:30Z.
- MODEL-1 ship % unchanged at 91%; MODEL-2 ship % unchanged at 57%
  until step 5g.3 produces val_loss < 9.38.

Refs: SPEC-SHIP-TWO-001 §50.4 cascade,
      PRs #1502/#1504/#1505/#1506 (drift sweep),
      apr-cookbook spec v5.1.0 (companion update — operator-facing recipe)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 5, 2026
…-003/004/007 drift (round 2) (#1509)

* contract(apr-pretrain-arch-polymorphic-v1): v1.5 → v1.6 — fix FALSIFY-003/004/007 drift (round 2)

Second-round test-reference drift correction. §57's drift sweep
(this contract's v1.4 → v1.5 bump in PR #1505) caught FALSIFY-005/006
but a more thorough audit (cross-referencing every `test:` field
against the source-code function-name registry) surfaced three
additional dangling references.

## Drift inventory (round 2)

  | Falsifier | v1.5.0 cited test                                           | Exists? | Actual test                                                  |
  | ---       | ---                                                         | ---     | ---                                                          |
  | 003       | build_transformer_config_qwen_init_matches_constructor      | ❌       | build_transformer_config_qwen_init_matches_input             |
  | 004       | transformer::attention::tests::gqa_7_to_1_matches_full_mha  | ❌       | transformer::model::tests::falsify_apr_pretrain_arch_004_*   |
  | 007       | build_transformer_config_encoder_init_errors                | ❌       | validate_pretrain_init_arch_rejects_encoder                  |

## Why §57 (PR #1505) didn't catch these

§57's grep audited test-name SUFFIXES and FRAGMENTS, which produced
false-negatives on:
  - `_init_matches_constructor` vs `_init_matches_input` — both end
    in `_matches_<word>` so a fragment grep counted the contract's
    name as "not dangling"
  - `transformer::attention::tests::` vs `transformer::model::tests::` —
    module-path drift not just function-name drift; only fully-
    qualified path comparison catches this
  - `_encoder_init_errors` vs `validate_pretrain_init_arch_rejects_encoder` —
    the contract's name was a guess at the impl name; impl PR #1479
    chose a completely different convention

## How this round was found

Used a stricter audit: for every `cargo test ... ::tests::<name>`
in contracts, grep `fn <name>` in the actual source tree. If the
fn doesn't exist, drift. This catches drift that PR #1505's
fragment-based audit missed.

## Resolution

Update FALSIFY-003/004/007 `test:` fields to the actual function
names. No falsifier semantics change. 11 falsifiers all PASS;
contract status remains FUNCTIONAL.

## Verification

  $ cargo test -p aprender-train --lib -- build_transformer_config_qwen_init_matches_input
    test result: ok. 1 passed
  $ cargo test -p aprender-train --lib -- falsify_apr_pretrain_arch_004_gqa_7_1_forward_pass_smoke
    test result: ok. 1 passed
  $ cargo test -p aprender-train --lib -- validate_pretrain_init_arch_rejects_encoder
    test result: ok. 1 passed
  $ pv validate contracts/apr-pretrain-arch-polymorphic-v1.yaml
    0 error(s), 0 warning(s)

## Five Whys

1. Why did §57's sweep miss these? Used name-fragment grep
   (`::tests::[a-z_]+`) which counted false-negatives on suffix-
   close names like `_constructor` ↔ `_input`.
2. Why is module-path drift a separate class? Because grep against
   the `[a-z_]+` regex captures the FUNCTION name, not the
   `::module::tests::` path. A function with the right name in the
   wrong module passes that audit but fails actual test invocation.
3. Why fix in a separate PR rather than amending PR #1505? PR #1505
   already merged. Per `feedback_falsifier_first_cascade_pattern.md`
   the cleanest cadence is one-bump-per-PR.
4. Why bump to v1.6.0? Same pattern as PR #1505's v1.4 → v1.5: the
   test-binding INVARIANT was broken in v1.5.0 (residual drift) and
   v1.6.0 restores it.
5. Why now (during 5g.1 wait)? Productive use of the 5g.1 (~10hr
   remaining) compute-bound idle time. Each drift fix is small
   (~30 LOC), reduces drift risk for future agents, and restores
   the falsifier-binding invariant. The alternative (manufacture
   bigger work) would risk introducing defects the contract base
   doesn't catch yet.

## Net effects

- Contract v1.5.0 → v1.6.0 FUNCTIONAL.
- 11 falsifiers, all PASS — same count, but FALSIFY-003/004/007
  now reference tests that actually exist.
- MODEL-1 ship % unchanged at 91%.
- MODEL-2 ship % unchanged at 57% until 5g.3.

This is the SECOND round of drift sweep on this contract. Together
with PRs #1502/#1504/#1505/#1506 (round 1), all known
test-reference drift is closed across the §50.4 cascade contracts.
A future spec amendment could codify a `pv lint --strict-test-binding`
enforcement that prevents drift at contract-merge time.

Refs: SPEC-SHIP-TWO-001 §50.4 cascade,
      contracts/apr-pretrain-arch-polymorphic-v1.yaml v1.6.0,
      PR #1505 (round 1 partial fix), PR #1502/#1504/#1506 (sibling fixes)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* contract(apr-pretrain-arch-polymorphic-v1): also fix FALSIFY-001 (round 2.5 — surfaced by PR #1511)

Round 2 (initial commit on this branch) fixed FALSIFY-003/004/007.
Sub-agent PR #1511 (`pv lint --strict-test-binding`) surfaced a 4th
drift in this same contract:

  FALSIFY-001 cited `qwen2_0_5b_matches_hf_config`
    → does NOT exist on main.
  Actual: `qwen2_0_5b_matches_hf_config_2026_05_04`
    (date-suffix added by impl PR #1474 / commit 9af6e71 — May 4).

The earlier round-2 audit (which focused on suffix + module-path
drift) didn't catch this because the test name has a DATE-SUFFIX
drift class (function name + `_<date>` is a real Rust test, but
the contract truncated to the prefix).

Updates:
- FALSIFY-001 test ref: append `_2026_05_04` suffix.
- v1.6.0 changelog updated to record 4 fixes (was 3).
- Verified: cargo test qwen2_0_5b_matches_hf_config_2026_05_04 PASS.
- pv lint --strict-test-binding contracts/apr-pretrain-arch-polymorphic-v1.yaml: 0 PV-VER-002 (down from 4 pre-fix).

This consolidates round 2 into a single commit on the same branch
+ PR (#1509) rather than spawning a round-3 PR for one extra fix.
The lint hardening in #1511 is what made finding the 4th drift
trivial; future drift will be caught at contract-merge time once
#1511 lands.

Refs: SPEC-SHIP-TWO-001 §50.4 cascade,
      PR #1511 (sub-agent's pv lint --strict-test-binding),
      Issue #1510

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant