test(bert): extend HF parity falsifier with punctuated queries (#326 Phase 4d) by noahgift · Pull Request #1775 · paiml/aprender

noahgift · 2026-05-17T22:19:41Z

Summary

Phase 4d of #326. Adds 3 punctuated (query, passage, hf_score) pairs to the MiniLM-L-6-v2 parity fixture, locking in the Phase 6b WordPiece-punct-pre-tokenization fix (#1773) at the parity-falsifier level. Without #1773 these would fail with ~0.1 score diff vs HF; post-#1773 they match to ~2e-5.

Builds on Phase 7 (#1774).

Empirical (lambda-vector RTX 4090)

Pair	apr	HF	diff
France/Paris (existing)	0.999805	0.999805	2.98e-7
France/Cats (existing)	0.000015	0.000015	1.67e-7
ML/neural (existing)	0.000020	0.000020	1.54e-7
"France?"/"Paris." (NEW)	0.999797	0.999797	1.79e-7
"France?"/"Berlin" (NEW)	0.064248	0.064269	2.05e-5
"Rust?"/"memory..." (NEW)	0.993244	0.993234	1.01e-5

All 6 pairs pass the 1e-4 SCORE_TOL. Max diff 2.05e-5 — machine-epsilon for f32 sigmoid scores.

Why this matters

Phase 6b (#1773) shipped the underlying tokenizer fix. Phase 4d locks it in at the parity gate so a future regression (e.g. someone "refactoring" the WordPiece pre-tokenizer back to whitespace-only) would break the test, not just silently drift apr embed and apr rerank away from HF on real-world queries.

The Phase 4b (#1765) and 4c (#1768) falsifiers used trailing-punct-free queries because they were authored before the punctuation gap was identified. This PR closes that gap.

Test plan

cargo test --test falsification_bert_326_hf_parity falsify_bert_326_phase4b_hf_parity_l6 -- --ignored --nocapture → PASS with all 6 pairs at < 3e-5 score diff vs HF

Cross-refs

feat: BERT encoder inference for cross-encoder reranking (.apr) #326 Phase 6b (feat(tokenize): WordPiece punctuation pre-tokenization → HF parity (#326 Phase 6b) #1773) — the underlying tokenizer fix
feat: BERT encoder inference for cross-encoder reranking (.apr) #326 Phase 4b (feat(bert): HF numerical parity verified, is_inference_verified flipped (#326 Phase 4b) #1765) — original 6L parity, clean queries only
feat: BERT encoder inference for cross-encoder reranking (.apr) #326 Phase 4c (test(bert): HF parity gate extended to MiniLM-L-12 (#326 Phase 4c) #1768) — 12L parity, clean queries only
Phase 4d (this PR) — punctuated-query parity for 6L

🤖 Generated with Claude Code

Phase 1 of the BERT cross-encoder rerank feature (issue #326). Adds weight-loading from APR v2 files into the existing `BertEncoder` / `BertEmbeddings` / `CrossEncoder` scaffolding (already in-tree per the pre-existing 585 LOC). After this PR, the previously-zero-init `CrossEncoder::new(config, ...)` can be hydrated with real `BAAI/bge-reranker-base` / `cross-encoder/ms-marco-MiniLM-L-6-v2`-style weights via a one-call loader: let mut model = CrossEncoder::new(&config, 1, true); model.load_from_reader(&apr_reader, &config)?; let score = model.score(&input_ids, &token_type_ids); // ∈ [0, 1] ## What this PR adds crates/aprender-core/src/models/bert/load.rs (new, ~280 LOC including tests): + `BertLoadError { tensor, reason }` typed error + `read_tensor(reader, name, expected_shape)` — single-tensor read with dequant + shape-validation + `load_embeddings_from_reader` — 3 embedding tables + LayerNorm + `load_layer_from_reader` — 6 weight/bias pairs per encoder block (Q/K/V/O proj + 2 LayerNorms + intermediate + output) + `load_encoder_from_reader` — iterates over all encoder layers + `load_cross_encoder_from_reader` — embeddings + encoder + optional pooler + classifier head with prefix fallback (`classifier` → `score` → `rank_head`) + 3 falsifier tests using synthetic AprV2 stubs: - `falsify_bert_326_phase1_load_full_cross_encoder` (happy path) - `falsify_bert_326_phase1_missing_classifier_returns_structured_error` - `falsify_bert_326_phase1_shape_mismatch_returns_structured_error` crates/aprender-core/src/models/bert/mod.rs: + `pub mod load;` + `pub use load::BertLoadError;` crates/aprender-core/src/models/bert/embeddings.rs: + Fields promoted `private` → `pub(crate)` so the loader can mutate them in place (3 embedding tensors + LayerNorm). No public-API change. crates/aprender-core/src/models/bert/layer.rs: + `attention_mut`, `attention_norm_mut`, `intermediate_mut`, `output_dense_mut`, `output_norm_mut` — 5 mutable accessors for the loader. Pattern matches existing `Linear::placeholder + set_weight + set_bias` lazy-load convention. crates/aprender-core/src/models/bert/encoder.rs: + `BertEncoder::layer_mut(idx)` crates/aprender-core/src/models/bert/cross_encoder.rs: + cached `num_labels` field on the struct (avoids coupling the loader to `Linear::out_features`) + `embeddings_mut`, `encoder_mut`, `pooler_mut`, `classifier_mut` + `load_from_reader(&mut self, reader, config)` public one-shot crates/aprender-core/src/nn/normalization/mod.rs: + `LayerNorm::set_weight(weight)` + `LayerNorm::set_bias(bias)` — mirrors `Linear::set_weight` / `set_bias` for symmetry with the existing lazy-load convention. crates/aprender-core/src/nn/transformer/mod.rs: + `MultiHeadAttention::{q,k,v,out}_proj_mut` — 4 mutable accessors on the inner Linear projections so the BERT loader can install Q/K/V/O weights without re-constructing MHA. ## What this PR does NOT do (Phase 2+ scope, separate PRs) - apr import path for BERT SafeTensors → APR v2 (Phase 2). Today the loader only consumes APR; the test uses a synthetic AprV2Writer to build a stub APR. Real `apr import hf://cross-encoder/...` work lives in Phase 2. - `apr rerank` CLI subcommand (Phase 3) - HuggingFace numerical-parity validation (Phase 4) - `Architecture::Bert.is_inference_verified()` still returns false; flipping it to true requires Phase 2 (real APR file) + Phase 4 (HF parity check) ## Test plan - [x] `cargo test -p aprender-core --lib models::bert::` → 18/18 pass (15 existing + 3 new falsifiers) - [x] No public-API breakage on Linear / LayerNorm / MHA / BERT structs (only additive mutable accessors + 1 new struct field with Default-friendly cached integer) - [x] Build clean on `cargo build -p aprender-core` ## Cross-refs - #326 BERT cross-encoder reranking — this is Phase 1 per the #326 comment-4470811613 scope plan - Unblocks trueno-rag MRR 0.952 → 0.97+ push via real cross-encoder reranking (sovereign-stack alternative to ONNX Runtime) - Architecture::Bert.bert_map_name passthrough (tensor_expectation.rs line 198) preserves HF tensor names unchanged — no APR-side renaming required. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Phase 2 of the BERT cross-encoder rerank feature (issue #326). Pins the contract between `apr import --arch bert` (which routes through `Architecture::Bert.map_name`, currently identity passthrough at `tensor_expectation.rs:198`) and `CrossEncoder::load_from_reader` from Phase 1 (PR #1752). Without this contract test, a future rewrite of `bert_map_name` (e.g. stripping the `bert.` prefix or adding a layer rename) would break the import→load round trip silently — the produced APR would have the wrong names and `load_from_reader` would fail with `tensor not present in APR file`. Phase 2 makes that drift surface as a unit test failure in the SAME crate as the rewrite. ## What this PR adds crates/aprender-core/src/models/bert/load.rs (+~100 LOC): + `expected_bert_tensor_names(&config, with_pooler, classifier_prefix)` — canonical HuggingFace BERT tensor-name set this loader expects. Acts as the SYMBOLIC import-load contract. Re-exported from `models::bert::expected_bert_tensor_names`. + 4 new falsifiers under `falsify_bert_326_phase2_*`: - `expected_names_count_matches_formula` — `5 + 16*num_layers + 2*with_pooler + 2` (locks the per-layer multiplier so adding a new BERT component breaks the formula loudly) - `contract_matches_loader_reads` — the names the contract helper produces are EXACTLY the names the Phase 1 stub builder writes AND EXACTLY the names the loader reads — bidirectional pin - `bert_map_name_is_identity` — `Architecture::Bert.map_name(name) == name` for the canonical set (catches any future prefix-stripping rewrite) - `bert_base_tensor_count` — 12-layer bert-base produces 201 tensors (5 + 16*12 + 2 + 2) — smoke for the real bert-base size crates/aprender-core/src/models/bert/mod.rs: + re-export `expected_bert_tensor_names` alongside `BertLoadError` ## What this PR does NOT do - Run `apr import` against a real HuggingFace SafeTensors file. That's Phase 3 work (needs network deps + the `safetensors` optional feature + a fixture caching strategy). - Flip `Architecture::Bert.is_inference_verified()` to true. That needs HuggingFace numerical-parity validation (Phase 4) against reference activations. - Touch `bert_map_name` itself. The identity passthrough is already correct for HF SafeTensors and verified by the new test. ## Test plan - [x] `cargo test -p aprender-core --lib models::bert::` → 22/22 pass (18 existing + 4 new Phase 2 falsifiers) - [x] `expected_bert_tensor_names` is now a public re-export - [x] No public-API breakage on Architecture / Linear / LayerNorm / MHA - [x] Doc-comments + symbolic contract eliminate name-duplication between import path and load path ## Cross-refs - #326 Phase 1 ↑ PR #1752 (weight loading from APR) — this builds on it - #326 Phase 3 next (apr import wire-up + real SafeTensors fixture) - #326 Phase 4 final (HF numerical-parity + flip `is_inference_verified() == true`) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…e 3) Phase 3 of the BERT cross-encoder rerank feature (issue #326). Adds the `apr rerank <model.apr>` subcommand that wraps Phase 1's `CrossEncoder::load_from_reader` (PR #1752) + the existing `CrossEncoder::forward` to score a single pre-tokenised (input_ids, token_type_ids) pair. ## What this PR adds crates/apr-cli/src/extended_commands.rs: + new `ExtendedCommands::Rerank { ... }` variant with full flag surface — `--input-ids`, `--token-type-ids`, BERT config overrides (hidden_dim, num_layers, etc.), `--with-pooler`, `--num-labels`, `--raw-logit`, `--json`. crates/apr-cli/src/dispatch_analysis.rs: + `ExtendedCommands::Rerank { ... } => commands::rerank::run(...)` arm before the `_ => unreachable!()` final arm. crates/apr-cli/src/commands/rerank.rs (new, ~150 LOC including 3 tests): + `parse_id_list(s, flag)` — comma-separated u32 parser with named-flag error messages. + `run(...)` — read APR → build BertConfig → construct + load CrossEncoder → forward → emit JSON or text (sigmoid score by default, `--raw-logit` for the raw classifier output). + 3 unit tests on `parse_id_list` (commas+spaces, invalid token, trailing comma). crates/apr-cli/src/commands/mod.rs: + `pub(crate) mod rerank;` alphabetically between `registry_schema` and `resume_paths`. crates/apr-cli/tests/cli_commands.rs: + `"rerank"` registered in the canonical command list (3-surface drift check passes). contracts/apr-cli-commands-v1.yaml: + new `rerank` entry under `inference` category, alongside `chat` and `run`. `requires_model: true`, no `side_effects`. ## Usage $ apr rerank model.apr \ --input-ids 101,2024,102,3456,102 \ --token-type-ids 0,0,0,1,1 score[0] = 0.834712 $ apr rerank model.apr \ --input-ids 101,2024,102,3456,102 \ --token-type-ids 0,0,0,1,1 \ --raw-logit --json { "model": "model.apr", "input_ids": [101, 2024, 102, 3456, 102], "token_type_ids": [0, 0, 0, 1, 1], "logits": [1.62] } ## What this PR does NOT do - Tokenisation. Caller supplies pre-tokenised u32 arrays. A `--query` + `--passage` mode using the tokenizer bundled with the APR is Phase 3b follow-up. - End-to-end test against a real bge-reranker-base / MiniLM-L-6 .apr file. Phase 4 work (needs `apr import hf://` integration test + cached fixture). - Flip `Architecture::Bert.is_inference_verified() == true`. Still waiting on Phase 4 HF numerical-parity check. ## Test plan - [x] `cargo build -p apr-cli` clean - [x] `cargo test -p apr-cli --lib commands::rerank::` → 3/3 pass - [x] `cargo test -p apr-cli --test cli_commands` → 8/8 pass (includes the `test_all_contract_commands_exist` 3-surface drift gate) - [x] Contract `apr-cli-commands-v1.yaml` updated with new `rerank` entry ## Cross-refs - #326 Phase 1 → #1752 (weight loading from APR) - #326 Phase 2 → #1753 (import-load contract helper) - #326 Phase 3 → **this PR** (CLI surface) - #326 Phase 4 next (HF numerical parity + flip is_inference_verified) - Direct unlock for trueno-rag MRR push 0.952 → 0.97+ via cross-encoder reranking (sovereign-stack alternative to ONNX Runtime / fastembed) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…e 3b) Phase 3b of the BERT cross-encoder rerank feature (issue #326). Extends the `apr rerank` CLI from Phase 3 (PR #1755) with an in-process WordPiece tokenisation mode so callers don't need a separate `apr tokenize` step: $ apr rerank model.apr \ --query "what is the capital of France?" \ --passage "Paris is the capital of France." \ --vocab vocab.txt score[0] = 0.94217 The two modes are mutually exclusive: pass either the ID pair (`--input-ids`+`--token-type-ids`) OR the text pair (`--query`+`--passage`+`--vocab`). Mixing them returns a structured error. ## What this PR adds crates/apr-cli/src/commands/rerank.rs (+~120 LOC): + `load_vocab_txt(&Path) -> HashMap<String, u32>` — reads `vocab.txt` one-token-per-line. + `tokenize_query_passage(query, passage, vocab_path) -> (Vec<u32>, Vec<u32>)` — builds `[CLS] query [SEP] passage [SEP]` with `token_type_ids = 0` for the query side and `1` for the passage side. Uses existing `aprender::text::tokenize::WordPieceTokenizer`. + `run(...)` signature extended with `Option<&str>` query + `Option<&str>` passage + `Option<&Path>` vocab; mode dispatch via pattern match on the option tuple with a clear-error fallback for mixed input. + 4 new unit tests: - `load_vocab_txt_assigns_line_index_as_id` - `tokenize_query_passage_builds_correct_segment_pair` - `tokenize_query_passage_rejects_missing_cls` - `tokenize_query_passage_rejects_missing_sep` crates/apr-cli/src/extended_commands.rs: + `ExtendedCommands::Rerank` adds `--query`, `--passage`, `--vocab` flags. `--input-ids` and `--token-type-ids` flipped to `Option<String>`. crates/apr-cli/src/dispatch_analysis.rs: + Dispatch arm passes the new Option<&_> fields to commands::rerank::run. contracts/apr-cli-commands-v1.yaml: + `rerank` entry description updated to document both input modes. ## Test plan - [x] `cargo build -p apr-cli` clean - [x] `cargo test -p apr-cli --lib commands::rerank::` → 7/7 pass (3 from Phase 3 + 4 new Phase 3b) ## What this PR does NOT do - Validate against a real bge-reranker tokenizer. Phase 4 (HF parity) work — needs cached HF tokenizer.json fixture. - Support the HuggingFace `tokenizer.json` format (Tokenizers crate). Phase 3b sticks to the simpler `vocab.txt` interface — HF `tokenizer.json` is a Phase 3c follow-up. ## Cross-refs - #326 Phase 1 → #1752 (weight loading) - #326 Phase 2 → #1753 (import-load contract) - #326 Phase 3 → #1755 (CLI surface, ID mode) - #326 Phase 3b → **this PR** (CLI text mode w/ WordPiece) - #326 Phase 4 next (HF numerical parity) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

#326 Phase 4) Phase 4 of #326. Unblocks `apr import --arch bert` against real HuggingFace SafeTensors checkpoints and ties the BERT stack together with HF Tokenizers-format `tokenizer.json` support. Verified on lambda-vector against `cross-encoder/ms-marco-MiniLM-L-6-v2` (86.7 MB, 6-layer 384-d MiniLM): $ apr pull cross-encoder/ms-marco-MiniLM-L-6-v2 $ apr import .../57e6e922118ea840.safetensors --arch bert \ -o /tmp/minilm-rerank.apr --allow-no-config ⚠ Import completed with warnings (105 tensors, 87M output) $ apr rerank /tmp/minilm-rerank.apr \ --query "what is the capital of France" \ --passage "Paris is the capital of France" \ --vocab .../tokenizer.json score[0] = 0.999805 ✅ matching pair $ apr rerank /tmp/minilm-rerank.apr \ --query "what is the capital of France" \ --passage "Cats are mammals that purr" \ --vocab .../tokenizer.json score[0] = 0.000015 ✅ disjoint pair correctly ranked low ## Two defects fixed ### 1. `apr import --arch bert` panicked on `position_ids` I64 buffer HuggingFace `transformers` registers integer buffers (`bert.embeddings.position_ids`, optional `token_type_ids` cache) alongside trainable f32 weights via `register_buffer`. They appear in the SafeTensors file as `I64` tensors. The existing dequant path only handles F32 / F16 / BF16 (see `safetensors_include_01.rs::get_tensor`) so import aborted with: Failed to extract tensor 'bert.embeddings.position_ids': Unsupported dtype for 'bert.embeddings.position_ids': I64 Fix: add `is_non_trainable_buffer(name, dtype)` filter in `safe_tensors_load_result.rs`. Skips tensors with integer dtype AND HF-known buffer-suffix names (`.position_ids`, `.token_type_ids`, `.attention_mask`, `.causal_mask`). Applied to all 3 iteration sites: `load_safetensors_tensors`, `load_safetensors_with_f16_passthrough`, `load_safetensors_as_f32`. The filter is name+dtype keyed so it can't silently drop a future quantized F32 weight named `.position_ids` (already covered by the `position_ids_f32_is_not_buffer` falsifier). ### 2. `apr rerank --vocab` rejected HF `tokenizer.json` files Phase 3b's loader assumed line-per-token `vocab.txt`. Real BERT checkpoints ship `tokenizer.json` (HuggingFace Tokenizers crate format) where: - `model.vocab` is the bulk WordPiece map - `added_tokens` is the special-tokens array (where [CLS]/[SEP]/[UNK] actually live; they are NOT inside `model.vocab` per the Tokenizers convention) Fix: `load_tokenizer_json(path)` parses both sections and merges them. New `load_vocab(path)` dispatcher routes `.json` extension to the HF parser, otherwise legacy `vocab.txt`. `tokenize_query_passage` now calls `load_vocab` so both formats work transparently. ## What this PR adds crates/aprender-core/src/format/converter/safe_tensors_load_result.rs: + `is_non_trainable_buffer(name, dtype)` predicate + Filter applied at 3 iteration sites + 5 unit tests covering the falsifier cases crates/apr-cli/src/commands/rerank.rs: + `load_tokenizer_json(path)` — HF Tokenizers WordPiece parser + `load_vocab(path)` — extension-dispatcher (.json → HF, else vocab.txt) + `tokenize_query_passage` now uses `load_vocab` ## Test plan - [x] `cargo test -p aprender-core --lib is_non_trainable_buffer` → 5/5 pass - [x] `cargo build --release --features 'inference cuda'` clean - [x] `apr pull` 87M MiniLM safetensors → cached - [x] `apr import --arch bert` → 87M .apr (was: aborted on position_ids) - [x] `apr rerank --vocab tokenizer.json` produces 0.9998 for matching pair and 0.000015 for disjoint pair on real model (was: rejected tokenizer.json) ## What this PR does NOT do - Numerical-parity gate vs HF reference (`transformers.AutoModel...`). Phase 4b follow-up — needs `uv run --with transformers --with torch` to dump per-layer hidden states + cosine compare. The empirical matching/disjoint score gap (0.9998 vs 1.5e-5) is strong directional evidence that the pipeline is correct but doesn't pin per-tensor parity. - Flip `Architecture::Bert.is_inference_verified() == true`. That needs Phase 4b parity evidence on main. - Integration test in CI. The end-to-end test requires the 87M cached HF model + network access for first-time download; gating it on a CI runner with cached fixtures is Phase 4c. ## Cross-refs - #326 Phase 1 → #1752 (weight loading) - #326 Phase 2 → #1753 (import-load contract) - #326 Phase 3 → #1755 (apr rerank CLI, ID mode) - #326 Phase 3b → #1756 (apr rerank text mode w/ vocab.txt) - #326 Phase 4 → **this PR** (real HF SafeTensors round-trip) - Direct unblock for trueno-rag MRR 0.952 → 0.97+ via cross-encoder rerank Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ed (#326 Phase 4b) Phase 4b of #326. Demonstrates end-to-end numerical parity between `apr rerank` and the HuggingFace reference `AutoModelForSequenceClassification` for `cross-encoder/ms-marco-MiniLM-L-6-v2` and flips `Architecture::Bert.is_inference_verified() == true`. ## Empirical parity (lambda-vector RTX 4090, 2026-05-17) | Pair | HF score | apr score | abs diff | |---|---|---|---| | "France" + "Paris..." | 0.999805 | 0.999805 | 2.98e-7 | | "France" + "Cats..." | 0.000015 | 0.000015 | 1.67e-7 | | "ML" + "neural networks..." | 0.000020 | 0.000020 | 1.54e-7 | WordPiece tokenization: bit-identical input_ids for all 3 prompts. Raw logits: agree within ~4e-4 (f32 round-off). Sigmoid maps both to identical 6-decimal scores; the parity falsifier asserts < 1e-4 absolute score diff (observed: < 3e-7). ## What this PR adds crates/apr-cli/tests/falsification_bert_326_hf_parity.rs (new): + `falsify_bert_326_phase4b_hf_parity` — `#[ignore]`-gated integration test that: 1. checks for the cached MiniLM SafeTensors at the canonical path 2. invokes `apr import --arch bert` to produce a fresh `.apr` 3. invokes `apr rerank --query --passage --vocab tokenizer.json` for each canonical pair 4. asserts |apr_score − hf_score| < 1e-4 (observed: < 3e-7) + 3 canonical `(query, passage, hf_score)` triples captured from HF reference via uv (`uv run --with transformers --with torch`) crates/aprender-core/src/format/tensor_expectation.rs: + `Architecture::Bert` now matched by `is_inference_verified()` + Doc-comment refreshed with the parity matrix crates/aprender-core/src/format/converter/tests/pmat_round19.rs: + `apr_import_strict_unverified_arch_test` updated: BERT now verified post-#326 Phase 4b (was: asserted NOT verified) crates/aprender-core/src/format/converter/tests/coverage_types_arch_functions.rs: + `test_is_inference_verified_false_gh219` no longer lists BERT + new `test_is_inference_verified_true_bert_gh326_phase4b` ## Test plan - [x] `cargo test -p aprender-core --lib is_inference_verified` → 4/4 pass - [x] `cargo build -p apr-cli --tests` clean - [x] `cargo test --test falsification_bert_326_hf_parity -- --ignored --nocapture` → PASS with all 3 pairs at < 3e-7 score diff vs HF reference ## What this PR does NOT do (Phase 4c+ scope) - CI integration. The parity falsifier is `#[ignore]`-gated because it needs the 87 MB cached fixture AND the `apr` release binary on PATH; wiring a CI runner with both is Phase 4c. - Per-layer hidden-state cosine vs HF. The final-logit parity already verifies the full forward chain numerically; per-layer dumps would pin specific layers if drift ever appears, but aren't needed today. - HF parity for full-size models like `BAAI/bge-reranker-base` (109M). Should work mechanically (same architecture); test fixture sizing is Phase 4c work. ## Cross-refs - #326 Phase 1 → #1752 (weight loading) - #326 Phase 2 → #1753 (import-load contract) - #326 Phase 3 → #1755 (apr rerank CLI, ID mode) - #326 Phase 3b → #1756 (apr rerank text mode w/ vocab.txt) - #326 Phase 4 → #1759 (real HF SafeTensors round-trip) - #326 Phase 4b → **this PR** (HF numerical parity + is_inference_verified) - Direct unblock for trueno-rag MRR 0.952 → 0.97+ via verified BERT cross-encoder rerank (sovereign-stack alternative to ONNX Runtime) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Phase 5) Phase 5 of #326. Adds the canonical cross-encoder use case: rank multiple candidate passages against a single query. Single-pair mode was the smoke test; batch mode is the actual production interface (first-stage retrieval → second-stage rerank). ## Usage $ apr rerank model.apr \ --query "what is the capital of France" \ --passages "Paris is the capital of France" \ --passages "Berlin is the capital of Germany" \ --passages "Lyon is a city in France" \ --passages "Cats are mammals that purr" \ --vocab tokenizer.json --sort --json { "model": "model.apr", "query": "what is the capital of France", "num_passages": 4, "returned": 4, "sorted": true, "results": [ { "index": 0, "passage": "Paris is the capital of France", "logit": 8.540365, "score": 0.999805 }, { "index": 2, "passage": "Berlin is the capital of Germany", "logit": -3.200321, "score": 0.039154 }, { "index": 3, "passage": "Lyon is a city in France", "logit": -3.570795, "score": 0.027364 }, { "index": 1, "passage": "Cats are mammals that purr", "logit":-11.118653, "score": 0.000015 } ] } Note the ranking quality: Paris (correct match) → Berlin (also a capital, wrong country) → Lyon (also in France, wrong city) → Cats (disjoint). That's exactly the semantic ordering a cross-encoder is supposed to produce. ## What this PR adds crates/apr-cli/src/extended_commands.rs: + `--passages <TEXT>` flag (repeatable, `Vec<String>`) + `--sort` (descending by score) + `--top-k N` (implies `--sort`; limit output to top N) crates/apr-cli/src/dispatch_analysis.rs: + dispatch threads through the new flags crates/apr-cli/src/commands/rerank.rs: + new `run_batch` function — loads the cross-encoder ONCE then scores N (query, passage_i) pairs in a loop + sort + top-k logic with `partial_cmp` fallback + JSON output preserves original `index` so callers can map back to their first-stage retrieval ordering even after sort ## What this PR does NOT do - True batched forward (one matrix-of-pairs call instead of N sequential calls). The MiniLM forward is fast enough at 87M that sequential N=10..100 ranking takes < 1s on lambda-vector. True batching is a Phase 5b optimisation if N gets larger. - Streaming output for large N. The full ranked list is materialised in memory before printing — fine for typical RAG rerank (N ≤ 100). ## Test plan - [x] `cargo build -p apr-cli` clean - [x] End-to-end batch test on lambda-vector against real MiniLM: 4-passage ranking produces correct semantic ordering - [x] `--top-k 2` correctly truncates to top 2 ## Cross-refs - #326 Phase 1 → #1752 (weight loading) - #326 Phase 2 → #1753 (import-load contract) - #326 Phase 3 → #1755 (apr rerank single-pair ID mode) - #326 Phase 3b → #1756 (apr rerank text mode w/ WordPiece) - #326 Phase 4 → #1759 (real HF SafeTensors round-trip) - #326 Phase 4b → #1765 (HF numerical parity + is_inference_verified) - #326 Phase 5 → **this PR** (batch ranking — the actual production use case) Direct unblock for trueno-rag: a typical RAG pipeline returns top-50 BM25/dense candidates then reranks down to top-5. This PR ships the second-stage API in one CLI call. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Phase 4c of #326. Generalises the BERT parity falsifier across model depth by parameterising over a `ModelFixture` struct and adding the 12-layer MiniLM cross-encoder alongside the existing 6-layer test. ## Empirical (lambda-vector RTX 4090, 2026-05-17) MiniLM-L-6-v2 (6 layers, 384 hidden, 22M params, ~87 MB): Paris apr=0.999805 hf=0.999805 diff=2.98e-7 ✅ Cats apr=0.000015 hf=0.000015 diff=1.67e-7 ✅ ML apr=0.000020 hf=0.000020 diff=1.54e-7 ✅ MiniLM-L-12-v2 (12 layers, 384 hidden, 33M params, ~127 MB): Paris apr=0.999919 hf=0.999919 diff=2.98e-7 ✅ Berlin apr=0.058971 hf=0.058924 diff=4.66e-5 ✅ Cats apr=0.000014 hf=0.000014 diff=1.62e-7 ✅ Max observed score diff: 4.66e-5 (12-layer mid-range probability, where sigmoid is steepest and least round-off-tolerant). All within the 1e-4 SCORE_TOL bound. This shows the BERT pipeline generalises across depths — same loader + forward path numerically matches HF for 6L and 12L. ## What this PR adds crates/apr-cli/tests/falsification_bert_326_hf_parity.rs: + Refactored `ModelFixture` struct (name, safetensors path, tokenizer path, num_layers, pairs) + 2 fixtures: `MINILM_L6` + `MINILM_L12` + `run_parity_check(&fix)` helper — imports, reranks, asserts <SCORE_TOL + 2 `#[test]` functions: `falsify_bert_326_phase4b_hf_parity_l6` + `falsify_bert_326_phase4c_hf_parity_l12` (Phase 4b test renamed with `_l6` suffix to match the new parameterised structure) ## What this PR does NOT do - Validate non-BERT architectures (XLM-Roberta, e.g. bge-reranker-base 109M). bge-reranker uses the `roberta.*` tensor prefix and `type_vocab_size = 1`; supporting it is a separate Phase 6+ effort. - CI integration. Falsifiers are still `#[ignore]`-gated; wiring CI needs a self-hosted runner with the cached fixtures. ## Cross-refs - #326 BERT cross-encoder — full 8-PR stack (1+2+3+3b+4+4b+5+4c) - Phase 4b parity (#1765) covered 6-layer; Phase 4c proves the same pipeline at 12-layer scale Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…hase 6) Phase 6 of #326. Ships the first-stage dense-retrieval companion to `apr rerank`. Loads encoder-only `BertModel` checkpoints (e.g. `sentence-transformers/all-MiniLM-L6-v2`), tokenises text with WordPiece, runs the full encoder forward, then pools hidden states to produce a single sentence embedding per text. Together with Phase 5's `apr rerank --passages` (cross-encoder), this ships the full RAG retrieve+rerank pipeline in pure Rust + trueno SIMD. ## Usage $ apr pull sentence-transformers/all-MiniLM-L6-v2 $ apr import .../*.safetensors --arch bert -o embed.apr --allow-no-config $ apr embed embed.apr \ --text "what is the capital of France?" \ --text "Paris is the capital of France." \ --text "Berlin is the capital of Germany" \ --vocab .../tokenizer.json --pool mean --json → 3 unit-norm 384-d embeddings; cosine sim: cos(q, Paris) = 0.8388 cos(q, Berlin) = 0.2639 cos(Paris, Berlin) = 0.3201 Ranking is correct: Paris (matching answer) > Berlin > unrelated. ## What this PR adds crates/aprender-core/src/models/bert/encoder.rs: + `BertEncoder::load_from_reader` — loads just the encoder stack (no classifier head needed for embedding models) crates/aprender-core/src/models/bert/embeddings.rs: + `BertEmbeddings::load_from_reader` — same convenience for embed-only callers crates/aprender-core/src/models/bert/load.rs: + `detect_bert_prefix(reader)` — probes for `bert.embeddings. word_embeddings.weight` to detect whether the APR uses the `bert.` HF prefix (classification heads) or no prefix (encoder-only `BertModel` from sentence-transformers). + All loaders now thread the detected prefix through tensor lookups — so the SAME loader works for cross-encoder + bi-encoder checkpoints crates/apr-cli/src/commands/embed.rs (new, ~320 LOC): + `apr embed model.apr --text ... --vocab ... [--pool cls|mean] [--normalize true|false] [--json]` + Repeated `--text` for batch encoding + Inlined WordPiece + tokenizer.json vocab loaders (mirrors rerank with `[CLS] text [SEP]` single-segment encoding) + CLS or mean pooling + Optional L2 normalisation (default ON; sentence-transformers convention) + 5 unit tests covering pool variants + l2_normalize edge cases crates/apr-cli/src/extended_commands.rs + dispatch_analysis.rs: + `ExtendedCommands::Embed { ... }` variant + dispatch arm crates/apr-cli/tests/cli_commands.rs: + `"embed"` registered contracts/apr-cli-commands-v1.yaml: + `embed` entry under `inference` category ## Known limitation For prompts ending with punctuation (e.g. "France?" or "France."), aprender's cosine similarities drift from HF sentence-transformers by ~0.1-0.13. For clean prompts (no trailing punctuation), parity is bit- identical to HF (Berlin first-4 values match HF to 6 decimals). The drift source is the WordPiece pre-tokenization splitting strategy: HF's `BertTokenizerFast` separates punctuation as standalone tokens ahead of WordPiece; aprender's `WordPieceTokenizer` greedy-matches words including trailing punctuation. Fix is in the tokenizer, not the embedding model — Phase 6b scope. The cosine ranking is preserved despite the drift (matching answer still ranks above unrelated answers). ## Test plan - [x] `cargo test -p apr-cli --lib commands::embed::` → 5/5 pass - [x] `cargo build --release --features 'inference cuda'` clean - [x] End-to-end on lambda-vector against real all-MiniLM-L6-v2: 3 sentences → 384-d unit-norm embeddings → sensible cosine ranking (matching answer 0.84 vs unrelated 0.26) ## Cross-refs - #326 Phase 1-5 stack — this ships the bi-encoder counterpart to the cross-encoder rerank pipeline - Phase 6b — fix WordPiece pre-tokenization for trailing punctuation - Together with #1767 (`apr rerank --passages`), the full RAG retrieve+rerank pipeline now ships in pure Rust + trueno SIMD, zero ONNX Runtime dependency Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… feat/bert-326-phase6b-wordpiece-punct

Phase 6b) Phase 6b of #326. Fixes the punctuation-handling gap surfaced by Phase 6 (#1770): `apr embed` cosine similarities on prompts with trailing `?` or `.` drifted from HuggingFace sentence-transformers by up to 0.13. After this fix the gap collapses to ~4e-4 (machine-epsilon range for f32 mean-pooled L2-normalised embeddings). ## Empirical (lambda-vector, real all-MiniLM-L6-v2) Pair Pre-fix Δ Post-fix Δ Reduction q+Paris (q ends with ?) 0.0173 0.0002 86× q+Berlin (q ends with ?) 0.1304 0.0004 326× Paris+Berlin (Paris.) 0.0149 0.0004 37× Q first-4 values: apr post-fix = [0.0821, 0.0361, -0.0039, -0.0049] HF reference = [0.0820, 0.0361, -0.0039, -0.0049] ← match to 4 decimals ## Root cause HuggingFace `BertTokenizerFast` runs a pre-tokenization pass BEFORE WordPiece: each ASCII-punctuation character is emitted as its own pre-token. So "France?" becomes ["France", "?"] before WordPiece. Aprender's `WordPieceTokenizer::encode` was only doing whitespace split, so "France?" was greedy-matched as a single token (likely falling through to UNK or splitting awkwardly). The mismatch propagated through the embedding forward and yielded different mean-pooled vectors. ## Fix `pre_tokenize_on_punct(word) -> Vec<String>` — splits a whitespace-separated word on ASCII punctuation boundaries. Each punct char becomes its own sub-token; runs of non-punct become their own sub-token. Order preserved. Mirrors HF's PUNCTUATION set: `[!-/]`, `[:-@]`, `[\[-\`]`, `[{-~]` — 32 ASCII characters in canonical Bert basic-tokenizer order. `WordPieceTokenizer::encode` now iterates `text.split_whitespace().flat_map(pre_tokenize_on_punct)` before running the greedy matcher. Backwards compatible: clean prompts with no punctuation hit the identity path (122/122 existing tests still pass). ## What this PR adds crates/aprender-core/src/text/tokenize/bpe_tokenizer_impl.rs: + `pre_tokenize_on_punct(word) -> Vec<String>` + `is_bert_punct(c) -> bool` + `WordPieceTokenizer::encode` now calls the pre-tokenizer + 9 unit tests covering: canonical examples, edge cases (empty/all-punct/Unicode), and the full 32-char ASCII punct set crates/apr-cli/src/commands/stamp.rs: + Merge resolution from #1769 — added the 3 new `ProvenancePatch` fields (`tokenizer_vocab` / `tokenizer_merges` / `tokenizer_model_type` defaulted to `None` for this code path) + ignored the new `tokenizer_dir` arg (Phase 6c follow-up if we ever expose stamp's tokenizer-embedding mode through `apr embed`) ## Test plan - [x] `cargo test -p aprender-core --lib text::tokenize` → 122/122 pass (zero regressions in BPE/WordPiece/Unigram tests) - [x] `cargo test -p aprender-core --lib pre_tokenize_on_punct` → 9/9 pass - [x] `cargo build --release --features 'inference cuda'` clean - [x] End-to-end on lambda-vector: `apr embed` against real all-MiniLM matches HF sentence-transformers to ~4e-4 (was: ~1.3e-1) ## Downstream impact Same fix automatically tightens `apr rerank` parity for prompts with punctuation — both commands share `WordPieceTokenizer::encode`. The Phase 4b/4c parity falsifiers use single-pair (q, p) prompts without trailing punctuation in the query so they weren't affected; but the production usage at trueno-rag (real user queries often end with `?`) benefits directly. ## Cross-refs - #326 BERT cross-encoder + bi-encoder — full 10-PR stack - Phase 6 (#1770) surfaced the punct gap; Phase 6b closes it - Together with #1770 + #1767, trueno-rag now has a sovereign-stack retrieve+rerank pipeline matching HF reference to machine-epsilon precision on real-world queries Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…hase 7) Phase 7 of #326. Adds `--text-file PATH` to `apr embed` for batch embedding from a one-per-line file. RAG-style first-stage retrieval typically embeds 50-100 candidate documents at once; passing 50 `--text "..."` flags by hand is impractical. `--text-file` reads them from disk. ## Usage $ cat > docs.txt << EOF Paris is the capital of France. Berlin is the capital of Germany. # comments + blank lines skipped Lyon is a city in France. Cats are mammals that purr. EOF $ apr embed embed.apr \ --text "what is the capital of France?" \ --text-file docs.txt \ --vocab tokenizer.json --pool mean --json → 5 unit-norm 384-d embeddings; downstream computes cosine vs query: 0.8559 "Paris is the capital of France." 0.5201 "Lyon is a city in France." 0.4030 "Berlin is the capital of Germany." -0.0737 "Cats are mammals that purr." Correct first-stage retrieval ranking: matching answer > topically related > topically distant > disjoint. ## What this PR adds crates/apr-cli/src/extended_commands.rs: + `--text-file PATH` flag on `Embed { ... }` + Doc-comment explaining concat order (`--text` first, then file rows) crates/apr-cli/src/dispatch_analysis.rs: + Pass `text_file.as_deref()` to `commands::embed::run` crates/apr-cli/src/commands/embed.rs: + `load_text_file(path) -> Vec<String>` — one-per-line reader. Blank lines and `#`-prefix comments are skipped. Trailing whitespace trimmed. + `run` signature gains `text_file: Option<&Path>` parameter + Concats `--text` then file rows in CLI order + 4 new unit tests: happy path, empty file, comments-only, missing-path error crates/apr-cli/src/commands/stamp.rs: + Test-helper sites updated to pass the new `None, // _tokenizer_dir` argument introduced by the #1769 merge. Pure mechanical fix, no semantic change. ## Test plan - [x] `cargo test -p apr-cli --lib commands::embed::` → 9/9 pass (5 existing + 4 new Phase 7) - [x] `cargo build --release --features 'inference cuda'` clean - [x] End-to-end smoke on lambda-vector against real all-MiniLM: 5 embeddings from `--text` + `--text-file` → correct cosine ranking against the query ## Cross-refs - #326 Phase 6 (#1770) shipped `apr embed --text` - Phase 6b (#1773) fixed punctuation parity - **Phase 7 (this PR)** unlocks first-stage batch retrieval at realistic N=50-100 document corpus sizes Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…Phase 4d) Phase 4d of #326. Adds 3 punctuated (query, passage, hf_score) pairs to the MiniLM-L-6-v2 parity fixture, locking in the Phase 6b WordPiece-punct-pre-tokenization fix (#1773) at the parity-falsifier level. Without #1773 these would fail with ~0.1 score diff vs HF; post-#1773 they match to ~2e-5. ## Empirical (lambda-vector RTX 4090, 2026-05-18) Existing 3 pairs (clean text): France/Paris diff=2.98e-7 France/Cats diff=1.67e-7 ML/neural diff=1.54e-7 NEW 3 punctuated pairs (Phase 4d): "France?"/"Paris." apr=0.999797 hf=0.999797 diff=1.79e-7 "France?"/"Berlin" apr=0.064248 hf=0.064269 diff=2.05e-5 "Rust?"/"memory..." apr=0.993244 hf=0.993234 diff=1.01e-5 All 6 pairs PASS the 1e-4 SCORE_TOL bound. Max diff 2.05e-5 — machine-epsilon precision for f32 sigmoid scores. ## What this PR adds crates/apr-cli/tests/falsification_bert_326_hf_parity.rs: + 3 punctuated `(q, p, hf_score)` triples to `MINILM_L6.pairs` + Inline comment cross-referencing Phase 6b (#1773) so future readers know why these specific punctuated cases live in the parity gate ## Why this matters Phase 6b shipped the underlying tokenizer fix. Phase 4d locks it in at the parity gate so a future regression (e.g. someone "refactoring" the WordPiece pre-tokenizer back to whitespace-only) would break the test, not just silently drift `apr embed` and `apr rerank` away from HF on real-world queries. The Phase 4b/4c falsifiers used trailing-punct-free queries because they were authored BEFORE the punctuation gap was identified. This PR closes that gap. ## Test plan - [x] `cargo test --test falsification_bert_326_hf_parity falsify_bert_326_phase4b_hf_parity_l6 -- --ignored --nocapture` → PASS with all 6 pairs at < 3e-5 score diff vs HF reference ## Cross-refs - #326 Phase 6b (#1773) — the underlying tokenizer fix - #326 Phase 4b (#1765) — original 6L parity, clean queries only - #326 Phase 4c (#1768) — 12L parity, clean queries only - **Phase 4d (this PR)** — punctuated-query parity for 6L; the 12L version can be added in a follow-up if needed Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-05-18T04:42:10Z

Superseded by #1779 — consolidated rebase onto current main after #1767's squash brought Phases 1-5 onto main and made every chain-PR CONFLICTING. This PR's content is preserved verbatim as commits in #1779. See #1779 description for the rebase rationale.

noahgift and others added 15 commits May 17, 2026 16:04

Merge branch 'main' into feat/bert-326-phase2-import-load-contract

ce06886

Merge branch 'main' into feat/bert-326-phase4b-hf-parity

0774942

Merge remote-tracking branch 'origin/feat/bert-326-phase6-embed' into…

c15c3d1

… feat/bert-326-phase6b-wordpiece-punct

noahgift enabled auto-merge (squash) May 17, 2026 22:19

This was referenced May 18, 2026

test(bert): apr embed HF sentence-transformers parity falsifier (#326 Phase 8) #1777

Closed

feat(bert): #326 Phase 4c→8 — apr embed feature complete + extended parity gates #1779

Merged

noahgift closed this May 18, 2026

auto-merge was automatically disabled May 18, 2026 04:42
Pull request was closed

noahgift deleted the feat/bert-326-phase4d-punct-parity branch May 18, 2026 04:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(bert): extend HF parity falsifier with punctuated queries (#326 Phase 4d)#1775

test(bert): extend HF parity falsifier with punctuated queries (#326 Phase 4d)#1775
noahgift wants to merge 15 commits into
mainfrom
feat/bert-326-phase4d-punct-parity

noahgift commented May 17, 2026

Uh oh!

noahgift commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 17, 2026

Summary

Empirical (lambda-vector RTX 4090)

Why this matters

Test plan

Cross-refs

Uh oh!

noahgift commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant