feat(bert): cross-encoder weight loading from APR v2 (#326 Phase 1) by noahgift · Pull Request #1752 · paiml/aprender

noahgift · 2026-05-17T14:04:26Z

Summary

Phase 1 of the BERT cross-encoder rerank feature (#326). Adds weight-loading from APR v2 files into the existing 585 LOC of BertEncoder / BertEmbeddings / CrossEncoder scaffolding.

After this PR, the previously-zero-init CrossEncoder::new(config, ...) can be hydrated:

let mut model = CrossEncoder::new(&config, 1, true);
model.load_from_reader(&apr_reader, &config)?;
let score = model.score(&input_ids, &token_type_ids);  // ∈ [0, 1]

Test plan

cargo test -p aprender-core --lib models::bert:: → 18/18 pass (15 existing + 3 new falsifiers)
No public-API breakage — all additive mutable accessors
cargo build -p aprender-core clean

What this PR adds

crates/aprender-core/src/models/bert/load.rs (new, ~280 LOC) — BertLoadError + per-component loaders + 3 falsifier tests using AprV2Writer-built synthetic APR stubs
Linear::set_weight/set_bias already existed; adds LayerNorm::set_weight/set_bias + MultiHeadAttention::{q,k,v,out}_proj_mut
BertLayer / BertEncoder / BertEmbeddings / CrossEncoder — mutable accessors for the loader
CrossEncoder::load_from_reader(&mut self, reader, config) — one-shot public entrypoint

Falsifier coverage

falsify_bert_326_phase1_load_full_cross_encoder          ✅ happy path
falsify_bert_326_phase1_missing_classifier_returns_error ✅ structured BertLoadError
falsify_bert_326_phase1_shape_mismatch_returns_error     ✅ shape-mismatch detection

The full-cross-encoder falsifier builds a tiny synthetic APR (vocab=32, hidden=8, 2 layers) via AprV2Writer, loads it via CrossEncoder::load_from_reader, then forwards on a 3-token input — verifies the loaded model runs end-to-end without panic.
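The two error-path falsifiers listed above (missing classifier, shape mismatch) can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in rather than the PR's code: `LoadError` mimics the `BertLoadError { tensor, reason }` shape, `Tensors` replaces the APR reader with an in-memory map, and the `<prefix>.weight` naming used to probe for a classifier head is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the PR's `BertLoadError { tensor, reason }`.
#[derive(Debug, PartialEq)]
pub struct LoadError {
    pub tensor: String,
    pub reason: String,
}

/// Hypothetical in-memory "reader": tensor name -> (shape, data).
pub type Tensors = HashMap<String, (Vec<usize>, Vec<f32>)>;

/// Read one tensor and reject it if the stored shape differs from the expected one.
pub fn read_tensor<'a>(
    tensors: &'a Tensors,
    name: &str,
    expected: &[usize],
) -> Result<&'a [f32], LoadError> {
    let (shape, data) = tensors.get(name).ok_or_else(|| LoadError {
        tensor: name.to_string(),
        reason: "tensor not present".to_string(),
    })?;
    if shape.as_slice() != expected {
        return Err(LoadError {
            tensor: name.to_string(),
            reason: format!("shape mismatch: expected {:?}, found {:?}", expected, shape),
        });
    }
    Ok(data.as_slice())
}

/// Probe classifier-head prefixes in a fixed order and return the first one present.
/// The `<prefix>.weight` key format is an assumption made for this sketch.
pub fn find_classifier_prefix(tensors: &Tensors) -> Result<&'static str, LoadError> {
    const PREFIXES: [&str; 3] = ["classifier", "score", "rank_head"];
    PREFIXES
        .iter()
        .copied()
        .find(|p| tensors.contains_key(&format!("{p}.weight")))
        .ok_or_else(|| LoadError {
            tensor: "classifier.weight".to_string(),
            reason: "no classifier head found under any known prefix".to_string(),
        })
}
```

Returning the offending tensor name alongside the reason lets a test assert on which tensor failed instead of matching on a message string.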

What this PR does NOT do (Phase 2+ scope)

Phase 2 — apr import hf://cross-encoder/... → APR v2 conversion. Today the loader consumes APR; real-checkpoint import is the next PR.
Phase 3 — apr rerank CLI subcommand
Phase 4 — HuggingFace numerical-parity validation against reference activations
Architecture::Bert.is_inference_verified() still returns false; flipping requires Phase 2 + Phase 4

Cross-refs

#326 (feat: BERT encoder inference for cross-encoder reranking (.apr)): scope plan posted as comment-4470811613
Direct unblock for trueno-rag MRR 0.952 → 0.97+ push (sovereign-stack alternative to ONNX Runtime / fastembed)
Architecture::Bert.bert_map_name (tensor_expectation.rs:198) preserves HF tensor names unchanged — no APR-side renaming required

🤖 Generated with Claude Code

Phase 1 of the BERT cross-encoder rerank feature (issue #326). Adds weight-loading from APR v2 files into the existing `BertEncoder` / `BertEmbeddings` / `CrossEncoder` scaffolding (already in-tree per the pre-existing 585 LOC).

After this PR, the previously-zero-init `CrossEncoder::new(config, ...)` can be hydrated with real `BAAI/bge-reranker-base` / `cross-encoder/ms-marco-MiniLM-L-6-v2`-style weights via a one-call loader:

    let mut model = CrossEncoder::new(&config, 1, true);
    model.load_from_reader(&apr_reader, &config)?;
    let score = model.score(&input_ids, &token_type_ids);  // ∈ [0, 1]

## What this PR adds

crates/aprender-core/src/models/bert/load.rs (new, ~280 LOC including tests):
  + `BertLoadError { tensor, reason }` typed error
  + `read_tensor(reader, name, expected_shape)`: single-tensor read with dequant + shape-validation
  + `load_embeddings_from_reader`: 3 embedding tables + LayerNorm
  + `load_layer_from_reader`: 8 weight/bias pairs per encoder block (Q/K/V/O proj + 2 LayerNorms + intermediate + output)
  + `load_encoder_from_reader`: iterates over all encoder layers
  + `load_cross_encoder_from_reader`: embeddings + encoder + optional pooler + classifier head with prefix fallback (`classifier` → `score` → `rank_head`)
  + 3 falsifier tests using synthetic AprV2 stubs:
    - `falsify_bert_326_phase1_load_full_cross_encoder` (happy path)
    - `falsify_bert_326_phase1_missing_classifier_returns_structured_error`
    - `falsify_bert_326_phase1_shape_mismatch_returns_structured_error`

crates/aprender-core/src/models/bert/mod.rs:
  + `pub mod load;` + `pub use load::BertLoadError;`

crates/aprender-core/src/models/bert/embeddings.rs:
  + Fields promoted `private` → `pub(crate)` so the loader can mutate them in place (3 embedding tensors + LayerNorm). No public-API change.

crates/aprender-core/src/models/bert/layer.rs:
  + `attention_mut`, `attention_norm_mut`, `intermediate_mut`, `output_dense_mut`, `output_norm_mut`: 5 mutable accessors for the loader. Pattern matches existing `Linear::placeholder + set_weight + set_bias` lazy-load convention.

crates/aprender-core/src/models/bert/encoder.rs:
  + `BertEncoder::layer_mut(idx)`

crates/aprender-core/src/models/bert/cross_encoder.rs:
  + cached `num_labels` field on the struct (avoids coupling the loader to `Linear::out_features`)
  + `embeddings_mut`, `encoder_mut`, `pooler_mut`, `classifier_mut`
  + `load_from_reader(&mut self, reader, config)` public one-shot

crates/aprender-core/src/nn/normalization/mod.rs:
  + `LayerNorm::set_weight(weight)` + `LayerNorm::set_bias(bias)`: mirrors `Linear::set_weight` / `set_bias` for symmetry with the existing lazy-load convention.

crates/aprender-core/src/nn/transformer/mod.rs:
  + `MultiHeadAttention::{q,k,v,out}_proj_mut`: 4 mutable accessors on the inner Linear projections so the BERT loader can install Q/K/V/O weights without re-constructing MHA.

## What this PR does NOT do (Phase 2+ scope, separate PRs)

- apr import path for BERT SafeTensors → APR v2 (Phase 2). Today the loader only consumes APR; the test uses a synthetic AprV2Writer to build a stub APR. Real `apr import hf://cross-encoder/...` work lives in Phase 2.
- `apr rerank` CLI subcommand (Phase 3)
- HuggingFace numerical-parity validation (Phase 4)
- `Architecture::Bert.is_inference_verified()` still returns false; flipping it to true requires Phase 2 (real APR file) + Phase 4 (HF parity check)

## Test plan

- [x] `cargo test -p aprender-core --lib models::bert::` → 18/18 pass (15 existing + 3 new falsifiers)
- [x] No public-API breakage on Linear / LayerNorm / MHA / BERT structs (only additive mutable accessors + 1 new struct field with Default-friendly cached integer)
- [x] Build clean on `cargo build -p aprender-core`

## Cross-refs

- #326 BERT cross-encoder reranking: this is Phase 1 per the #326 comment-4470811613 scope plan
- Unblocks trueno-rag MRR 0.952 → 0.97+ push via real cross-encoder reranking (sovereign-stack alternative to ONNX Runtime)
- Architecture::Bert.bert_map_name passthrough (tensor_expectation.rs line 198) preserves HF tensor names unchanged, so no APR-side renaming is required.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Phase 2 of the BERT cross-encoder rerank feature (issue #326). Pins the contract between `apr import --arch bert` (which routes through `Architecture::Bert.map_name`, currently identity passthrough at `tensor_expectation.rs:198`) and `CrossEncoder::load_from_reader` from Phase 1 (PR #1752).

Without this contract test, a future rewrite of `bert_map_name` (e.g. stripping the `bert.` prefix or adding a layer rename) would break the import→load round trip silently: the produced APR would have the wrong names and `load_from_reader` would fail with `tensor not present in APR file`. Phase 2 makes that drift surface as a unit test failure in the SAME crate as the rewrite.

## What this PR adds

crates/aprender-core/src/models/bert/load.rs (+~100 LOC):
  + `expected_bert_tensor_names(&config, with_pooler, classifier_prefix)`: canonical HuggingFace BERT tensor-name set this loader expects. Acts as the SYMBOLIC import-load contract. Re-exported from `models::bert::expected_bert_tensor_names`.
  + 4 new falsifiers under `falsify_bert_326_phase2_*`:
    - `expected_names_count_matches_formula`: `5 + 16*num_layers + 2*with_pooler + 2` (locks the per-layer multiplier so adding a new BERT component breaks the formula loudly)
    - `contract_matches_loader_reads`: the names the contract helper produces are EXACTLY the names the Phase 1 stub builder writes AND EXACTLY the names the loader reads (bidirectional pin)
    - `bert_map_name_is_identity`: `Architecture::Bert.map_name(name) == name` for the canonical set (catches any future prefix-stripping rewrite)
    - `bert_base_tensor_count`: 12-layer bert-base produces 201 tensors (5 + 16*12 + 2 + 2), a smoke test for the real bert-base size

crates/aprender-core/src/models/bert/mod.rs:
  + re-export `expected_bert_tensor_names` alongside `BertLoadError`

## What this PR does NOT do

- Run `apr import` against a real HuggingFace SafeTensors file. That's Phase 3 work (needs network deps + the `safetensors` optional feature + a fixture caching strategy).
- Flip `Architecture::Bert.is_inference_verified()` to true. That needs HuggingFace numerical-parity validation (Phase 4) against reference activations.
- Touch `bert_map_name` itself. The identity passthrough is already correct for HF SafeTensors and verified by the new test.

## Test plan

- [x] `cargo test -p aprender-core --lib models::bert::` → 22/22 pass (18 existing + 4 new Phase 2 falsifiers)
- [x] `expected_bert_tensor_names` is now a public re-export
- [x] No public-API breakage on Architecture / Linear / LayerNorm / MHA
- [x] Doc-comments + symbolic contract eliminate name-duplication between import path and load path

## Cross-refs

- #326 Phase 1 ↑ PR #1752 (weight loading from APR): this builds on it
- #326 Phase 3 next (apr import wire-up + real SafeTensors fixture)
- #326 Phase 4 final (HF numerical-parity + flip `is_inference_verified() == true`)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
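The tensor-count formula quoted in this commit message, `5 + 16*num_layers + 2*with_pooler + 2`, can be written out as a small function. This is a sketch for illustration only: `expected_tensor_count` is a hypothetical name, and the per-term comments reflect one reading of the component breakdown given in the Phase 1 message.

```rust
/// Expected number of tensors in a BERT cross-encoder checkpoint,
/// following the formula `5 + 16*num_layers + 2*with_pooler + 2`.
pub fn expected_tensor_count(num_layers: usize, with_pooler: bool) -> usize {
    // 3 embedding tables + embedding LayerNorm weight and bias.
    let embeddings = 5;
    // Per encoder block: Q/K/V/O projections, 2 LayerNorms, intermediate and
    // output dense layers, each contributing a weight and a bias.
    let per_layer = 16;
    // Optional pooler dense layer: weight + bias.
    let pooler = if with_pooler { 2 } else { 0 };
    // Classifier head: weight + bias.
    let classifier = 2;
    embeddings + per_layer * num_layers + pooler + classifier
}
```

For a 12-layer model with a pooler this evaluates to 5 + 192 + 2 + 2 = 201, the bert-base figure quoted above; for a 6-layer model with a pooler it gives 105.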

…e 3)

Phase 3 of the BERT cross-encoder rerank feature (issue #326). Adds the `apr rerank <model.apr>` subcommand that wraps Phase 1's `CrossEncoder::load_from_reader` (PR #1752) + the existing `CrossEncoder::forward` to score a single pre-tokenised (input_ids, token_type_ids) pair.

## What this PR adds

crates/apr-cli/src/extended_commands.rs:
  + new `ExtendedCommands::Rerank { ... }` variant with full flag surface: `--input-ids`, `--token-type-ids`, BERT config overrides (hidden_dim, num_layers, etc.), `--with-pooler`, `--num-labels`, `--raw-logit`, `--json`.

crates/apr-cli/src/dispatch_analysis.rs:
  + `ExtendedCommands::Rerank { ... } => commands::rerank::run(...)` arm before the `_ => unreachable!()` final arm.

crates/apr-cli/src/commands/rerank.rs (new, ~150 LOC including 3 tests):
  + `parse_id_list(s, flag)`: comma-separated u32 parser with named-flag error messages.
  + `run(...)`: read APR → build BertConfig → construct + load CrossEncoder → forward → emit JSON or text (sigmoid score by default, `--raw-logit` for the raw classifier output).
  + 3 unit tests on `parse_id_list` (commas+spaces, invalid token, trailing comma).

crates/apr-cli/src/commands/mod.rs:
  + `pub(crate) mod rerank;` alphabetically between `registry_schema` and `resume_paths`.

crates/apr-cli/tests/cli_commands.rs:
  + `"rerank"` registered in the canonical command list (3-surface drift check passes).

contracts/apr-cli-commands-v1.yaml:
  + new `rerank` entry under `inference` category, alongside `chat` and `run`. `requires_model: true`, no `side_effects`.

## Usage

    $ apr rerank model.apr \
        --input-ids 101,2024,102,3456,102 \
        --token-type-ids 0,0,0,1,1
    score[0] = 0.834712

    $ apr rerank model.apr \
        --input-ids 101,2024,102,3456,102 \
        --token-type-ids 0,0,0,1,1 \
        --raw-logit --json
    {
      "model": "model.apr",
      "input_ids": [101, 2024, 102, 3456, 102],
      "token_type_ids": [0, 0, 0, 1, 1],
      "logits": [1.62]
    }

## What this PR does NOT do

- Tokenisation. Caller supplies pre-tokenised u32 arrays. A `--query` + `--passage` mode using the tokenizer bundled with the APR is Phase 3b follow-up.
- End-to-end test against a real bge-reranker-base / MiniLM-L-6 .apr file. Phase 4 work (needs `apr import hf://` integration test + cached fixture).
- Flip `Architecture::Bert.is_inference_verified() == true`. Still waiting on Phase 4 HF numerical-parity check.

## Test plan

- [x] `cargo build -p apr-cli` clean
- [x] `cargo test -p apr-cli --lib commands::rerank::` → 3/3 pass
- [x] `cargo test -p apr-cli --test cli_commands` → 8/8 pass (includes the `test_all_contract_commands_exist` 3-surface drift gate)
- [x] Contract `apr-cli-commands-v1.yaml` updated with new `rerank` entry

## Cross-refs

- #326 Phase 1 → #1752 (weight loading from APR)
- #326 Phase 2 → #1753 (import-load contract helper)
- #326 Phase 3 → **this PR** (CLI surface)
- #326 Phase 4 next (HF numerical parity + flip is_inference_verified)
- Direct unlock for trueno-rag MRR push 0.952 → 0.97+ via cross-encoder reranking (sovereign-stack alternative to ONNX Runtime / fastembed)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
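A comma-separated u32 parser with named-flag error messages, as described for `parse_id_list(s, flag)`, might look roughly like the sketch below. The signature mirrors the one named in the commit message, but the body is an assumption: in particular, whether the real parser accepts or rejects a trailing comma is not stated, and this sketch chooses to tolerate it.

```rust
/// Parse a comma-separated list of token ids such as "101, 2024,102".
/// `flag` is the CLI flag name, included in error messages so the user can
/// tell which argument was malformed.
pub fn parse_id_list(s: &str, flag: &str) -> Result<Vec<u32>, String> {
    s.split(',')
        .map(str::trim)
        // Assumption: empty segments (e.g. from a trailing comma) are skipped
        // rather than reported as errors.
        .filter(|tok| !tok.is_empty())
        .map(|tok| {
            tok.parse::<u32>()
                .map_err(|e| format!("{flag}: invalid token id '{tok}': {e}"))
        })
        .collect()
}
```

Collecting an iterator of `Result` values into `Result<Vec<u32>, String>` stops at the first bad token, so the error always names the earliest offending entry.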

…e 3b)

Phase 3b of the BERT cross-encoder rerank feature (issue #326). Extends the `apr rerank` CLI from Phase 3 (PR #1755) with an in-process WordPiece tokenisation mode so callers don't need a separate `apr tokenize` step:

    $ apr rerank model.apr \
        --query "what is the capital of France?" \
        --passage "Paris is the capital of France." \
        --vocab vocab.txt
    score[0] = 0.94217

The two modes are mutually exclusive: pass either the ID pair (`--input-ids`+`--token-type-ids`) OR the text pair (`--query`+`--passage`+`--vocab`). Mixing them returns a structured error.

## What this PR adds

crates/apr-cli/src/commands/rerank.rs (+~120 LOC):
  + `load_vocab_txt(&Path) -> HashMap<String, u32>`: reads `vocab.txt` one-token-per-line.
  + `tokenize_query_passage(query, passage, vocab_path) -> (Vec<u32>, Vec<u32>)`: builds `[CLS] query [SEP] passage [SEP]` with `token_type_ids = 0` for the query side and `1` for the passage side. Uses existing `aprender::text::tokenize::WordPieceTokenizer`.
  + `run(...)` signature extended with `Option<&str>` query + `Option<&str>` passage + `Option<&Path>` vocab; mode dispatch via pattern match on the option tuple with a clear-error fallback for mixed input.
  + 4 new unit tests:
    - `load_vocab_txt_assigns_line_index_as_id`
    - `tokenize_query_passage_builds_correct_segment_pair`
    - `tokenize_query_passage_rejects_missing_cls`
    - `tokenize_query_passage_rejects_missing_sep`

crates/apr-cli/src/extended_commands.rs:
  + `ExtendedCommands::Rerank` adds `--query`, `--passage`, `--vocab` flags. `--input-ids` and `--token-type-ids` flipped to `Option<String>`.

crates/apr-cli/src/dispatch_analysis.rs:
  + Dispatch arm passes the new Option<&_> fields to commands::rerank::run.

contracts/apr-cli-commands-v1.yaml:
  + `rerank` entry description updated to document both input modes.

## Test plan

- [x] `cargo build -p apr-cli` clean
- [x] `cargo test -p apr-cli --lib commands::rerank::` → 7/7 pass (3 from Phase 3 + 4 new Phase 3b)

## What this PR does NOT do

- Validate against a real bge-reranker tokenizer. Phase 4 (HF parity) work, needs cached HF tokenizer.json fixture.
- Support the HuggingFace `tokenizer.json` format (Tokenizers crate). Phase 3b sticks to the simpler `vocab.txt` interface; HF `tokenizer.json` is a Phase 3c follow-up.

## Cross-refs

- #326 Phase 1 → #1752 (weight loading)
- #326 Phase 2 → #1753 (import-load contract)
- #326 Phase 3 → #1755 (CLI surface, ID mode)
- #326 Phase 3b → **this PR** (CLI text mode w/ WordPiece)
- #326 Phase 4 next (HF numerical parity)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
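The `[CLS] query [SEP] passage [SEP]` layout and the line-index vocabulary convention described in this commit message can be sketched as follows. This is not the PR's `tokenize_query_passage`: WordPiece tokenisation is left out, the hypothetical `build_pair` takes already-tokenised id slices, and `load_vocab_from_str` reads from a string instead of a file path. Assigning the first `[SEP]` to segment 0 follows the usual BERT convention and matches the ID-mode usage example from Phase 3.

```rust
use std::collections::HashMap;

/// Build a vocabulary where each token's id is its zero-based line index.
pub fn load_vocab_from_str(text: &str) -> HashMap<String, u32> {
    text.lines()
        .enumerate()
        .map(|(i, tok)| (tok.to_string(), i as u32))
        .collect()
}

/// Assemble `[CLS] query [SEP] passage [SEP]` plus matching token_type_ids.
/// Segment 0 covers `[CLS] query [SEP]`; segment 1 covers `passage [SEP]`.
pub fn build_pair(
    query_ids: &[u32],
    passage_ids: &[u32],
    vocab: &HashMap<String, u32>,
) -> Result<(Vec<u32>, Vec<u32>), String> {
    let cls = *vocab.get("[CLS]").ok_or("vocab is missing [CLS]")?;
    let sep = *vocab.get("[SEP]").ok_or("vocab is missing [SEP]")?;

    let mut input_ids = Vec::with_capacity(query_ids.len() + passage_ids.len() + 3);
    input_ids.push(cls);
    input_ids.extend_from_slice(query_ids);
    input_ids.push(sep);
    let first_segment_len = input_ids.len();

    input_ids.extend_from_slice(passage_ids);
    input_ids.push(sep);

    // Zeros for the first segment, then pad with ones up to the full length.
    let mut token_type_ids = vec![0u32; first_segment_len];
    token_type_ids.resize(input_ids.len(), 1);
    Ok((input_ids, token_type_ids))
}
```

Looking up `[CLS]` and `[SEP]` before building anything means a vocabulary without the special tokens fails early, which is the behaviour the two `rejects_missing_*` tests above appear to target.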

#326 Phase 4)

Phase 4 of #326. Unblocks `apr import --arch bert` against real HuggingFace SafeTensors checkpoints and ties the BERT stack together with HF Tokenizers-format `tokenizer.json` support.

Verified on lambda-vector against `cross-encoder/ms-marco-MiniLM-L-6-v2` (86.7 MB, 6-layer 384-d MiniLM):

    $ apr pull cross-encoder/ms-marco-MiniLM-L-6-v2
    $ apr import .../57e6e922118ea840.safetensors --arch bert \
        -o /tmp/minilm-rerank.apr --allow-no-config
    ⚠ Import completed with warnings (105 tensors, 87M output)

    $ apr rerank /tmp/minilm-rerank.apr \
        --query "what is the capital of France" \
        --passage "Paris is the capital of France" \
        --vocab .../tokenizer.json
    score[0] = 0.999805   ✅ matching pair

    $ apr rerank /tmp/minilm-rerank.apr \
        --query "what is the capital of France" \
        --passage "Cats are mammals that purr" \
        --vocab .../tokenizer.json
    score[0] = 0.000015   ✅ disjoint pair correctly ranked low

## Two defects fixed

### 1. `apr import --arch bert` panicked on `position_ids` I64 buffer

HuggingFace `transformers` registers integer buffers (`bert.embeddings.position_ids`, optional `token_type_ids` cache) alongside trainable f32 weights via `register_buffer`. They appear in the SafeTensors file as `I64` tensors. The existing dequant path only handles F32 / F16 / BF16 (see `safetensors_include_01.rs::get_tensor`) so import aborted with:

    Failed to extract tensor 'bert.embeddings.position_ids':
    Unsupported dtype for 'bert.embeddings.position_ids': I64

Fix: add `is_non_trainable_buffer(name, dtype)` filter in `safe_tensors_load_result.rs`. Skips tensors with integer dtype AND HF-known buffer-suffix names (`.position_ids`, `.token_type_ids`, `.attention_mask`, `.causal_mask`). Applied to all 3 iteration sites: `load_safetensors_tensors`, `load_safetensors_with_f16_passthrough`, `load_safetensors_as_f32`.

The filter is name+dtype keyed so it can't silently drop a future quantized F32 weight named `.position_ids` (already covered by the `position_ids_f32_is_not_buffer` falsifier).

### 2. `apr rerank --vocab` rejected HF `tokenizer.json` files

Phase 3b's loader assumed line-per-token `vocab.txt`. Real BERT checkpoints ship `tokenizer.json` (HuggingFace Tokenizers crate format) where:

- `model.vocab` is the bulk WordPiece map
- `added_tokens` is the special-tokens array (where [CLS]/[SEP]/[UNK] actually live; they are NOT inside `model.vocab` per the Tokenizers convention)

Fix: `load_tokenizer_json(path)` parses both sections and merges them. New `load_vocab(path)` dispatcher routes `.json` extension to the HF parser, otherwise legacy `vocab.txt`. `tokenize_query_passage` now calls `load_vocab` so both formats work transparently.

## What this PR adds

crates/aprender-core/src/format/converter/safe_tensors_load_result.rs:
  + `is_non_trainable_buffer(name, dtype)` predicate
  + Filter applied at 3 iteration sites
  + 5 unit tests covering the falsifier cases

crates/apr-cli/src/commands/rerank.rs:
  + `load_tokenizer_json(path)`: HF Tokenizers WordPiece parser
  + `load_vocab(path)`: extension-dispatcher (.json → HF, else vocab.txt)
  + `tokenize_query_passage` now uses `load_vocab`

## Test plan

- [x] `cargo test -p aprender-core --lib is_non_trainable_buffer` → 5/5 pass
- [x] `cargo build --release --features 'inference cuda'` clean
- [x] `apr pull` 87M MiniLM safetensors → cached
- [x] `apr import --arch bert` → 87M .apr (was: aborted on position_ids)
- [x] `apr rerank --vocab tokenizer.json` produces 0.9998 for matching pair and 0.000015 for disjoint pair on real model (was: rejected tokenizer.json)

## What this PR does NOT do

- Numerical-parity gate vs HF reference (`transformers.AutoModel...`). Phase 4b follow-up, needs `uv run --with transformers --with torch` to dump per-layer hidden states + cosine compare. The empirical matching/disjoint score gap (0.9998 vs 1.5e-5) is strong directional evidence that the pipeline is correct but doesn't pin per-tensor parity.
- Flip `Architecture::Bert.is_inference_verified() == true`. That needs Phase 4b parity evidence on main.
- Integration test in CI. The end-to-end test requires the 87M cached HF model + network access for first-time download; gating it on a CI runner with cached fixtures is Phase 4c.

## Cross-refs

- #326 Phase 1 → #1752 (weight loading)
- #326 Phase 2 → #1753 (import-load contract)
- #326 Phase 3 → #1755 (apr rerank CLI, ID mode)
- #326 Phase 3b → #1756 (apr rerank text mode w/ vocab.txt)
- #326 Phase 4 → **this PR** (real HF SafeTensors round-trip)
- Direct unblock for trueno-rag MRR 0.952 → 0.97+ via cross-encoder rerank

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
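The name+dtype keyed filter described under defect 1 can be sketched as a small predicate. The suffix list is taken from the commit message; everything else is an assumption for illustration, including the `Dtype` enum (a hypothetical stand-in for the SafeTensors dtype type) and the choice of which variants count as integer.

```rust
/// Hypothetical stand-in for a SafeTensors dtype; only a few variants are modelled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Dtype {
    F32,
    F16,
    BF16,
    I32,
    I64,
}

impl Dtype {
    /// Assumption: I32 and I64 are the integer dtypes relevant to this filter.
    fn is_integer(self) -> bool {
        matches!(self, Dtype::I32 | Dtype::I64)
    }
}

/// Buffer-name suffixes listed in the commit message.
const BUFFER_SUFFIXES: [&str; 4] = [
    ".position_ids",
    ".token_type_ids",
    ".attention_mask",
    ".causal_mask",
];

/// True only when BOTH conditions hold: the dtype is an integer type and the
/// name ends with a known buffer suffix. A float tensor with a buffer-like
/// name is therefore kept, as is an integer tensor with an ordinary name.
pub fn is_non_trainable_buffer(name: &str, dtype: Dtype) -> bool {
    dtype.is_integer() && BUFFER_SUFFIXES.iter().any(|suffix| name.ends_with(*suffix))
}
```

Requiring both conditions is what keeps the filter from dropping an F32 tensor that happens to be named `.position_ids`, the case the `position_ids_f32_is_not_buffer` falsifier is described as covering.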

…ed (#326 Phase 4b)

Phase 4b of #326. Demonstrates end-to-end numerical parity between `apr rerank` and the HuggingFace reference `AutoModelForSequenceClassification` for `cross-encoder/ms-marco-MiniLM-L-6-v2` and flips `Architecture::Bert.is_inference_verified() == true`.

## Empirical parity (lambda-vector RTX 4090, 2026-05-17)

| Pair | HF score | apr score | abs diff |
|---|---|---|---|
| "France" + "Paris..." | 0.999805 | 0.999805 | 2.98e-7 |
| "France" + "Cats..." | 0.000015 | 0.000015 | 1.67e-7 |
| "ML" + "neural networks..." | 0.000020 | 0.000020 | 1.54e-7 |

WordPiece tokenization: bit-identical input_ids for all 3 prompts. Raw logits: agree within ~4e-4 (f32 round-off). Sigmoid maps both to identical 6-decimal scores; the parity falsifier asserts < 1e-4 absolute score diff (observed: < 3e-7).

## What this PR adds

crates/apr-cli/tests/falsification_bert_326_hf_parity.rs (new):
  + `falsify_bert_326_phase4b_hf_parity`: `#[ignore]`-gated integration test that:
    1. checks for the cached MiniLM SafeTensors at the canonical path
    2. invokes `apr import --arch bert` to produce a fresh `.apr`
    3. invokes `apr rerank --query --passage --vocab tokenizer.json` for each canonical pair
    4. asserts |apr_score − hf_score| < 1e-4 (observed: < 3e-7)
  + 3 canonical `(query, passage, hf_score)` triples captured from HF reference via uv (`uv run --with transformers --with torch`)

crates/aprender-core/src/format/tensor_expectation.rs:
  + `Architecture::Bert` now matched by `is_inference_verified()`
  + Doc-comment refreshed with the parity matrix

crates/aprender-core/src/format/converter/tests/pmat_round19.rs:
  + `apr_import_strict_unverified_arch_test` updated: BERT now verified post-#326 Phase 4b (was: asserted NOT verified)

crates/aprender-core/src/format/converter/tests/coverage_types_arch_functions.rs:
  + `test_is_inference_verified_false_gh219` no longer lists BERT
  + new `test_is_inference_verified_true_bert_gh326_phase4b`

## Test plan

- [x] `cargo test -p aprender-core --lib is_inference_verified` → 4/4 pass
- [x] `cargo build -p apr-cli --tests` clean
- [x] `cargo test --test falsification_bert_326_hf_parity -- --ignored --nocapture` → PASS with all 3 pairs at < 3e-7 score diff vs HF reference

## What this PR does NOT do (Phase 4c+ scope)

- CI integration. The parity falsifier is `#[ignore]`-gated because it needs the 87 MB cached fixture AND the `apr` release binary on PATH; wiring a CI runner with both is Phase 4c.
- Per-layer hidden-state cosine vs HF. The final-logit parity already verifies the full forward chain numerically; per-layer dumps would pin specific layers if drift ever appears, but aren't needed today.
- HF parity for full-size models like `BAAI/bge-reranker-base` (109M). Should work mechanically (same architecture); test fixture sizing is Phase 4c work.

## Cross-refs

- #326 Phase 1 → #1752 (weight loading)
- #326 Phase 2 → #1753 (import-load contract)
- #326 Phase 3 → #1755 (apr rerank CLI, ID mode)
- #326 Phase 3b → #1756 (apr rerank text mode w/ vocab.txt)
- #326 Phase 4 → #1759 (real HF SafeTensors round-trip)
- #326 Phase 4b → **this PR** (HF numerical parity + is_inference_verified)
- Direct unblock for trueno-rag MRR 0.952 → 0.97+ via verified BERT cross-encoder rerank (sovereign-stack alternative to ONNX Runtime)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
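The parity assertion described in this commit message (sigmoid over the classifier logit, then an absolute-difference bound of 1e-4 against the reference score) can be sketched as below. The function names are hypothetical, and the real falsifier shells out to the `apr` binary rather than calling anything like this directly.

```rust
/// Logistic sigmoid, mapping a raw classifier logit to a score in (0, 1).
pub fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Largest absolute difference between sigmoid(logit) and the paired
/// reference score. Pairs are matched by position; extra entries in the
/// longer slice are ignored.
pub fn max_score_diff(logits: &[f32], reference_scores: &[f32]) -> f32 {
    logits
        .iter()
        .zip(reference_scores)
        .map(|(logit, reference)| (sigmoid(*logit) - *reference).abs())
        .fold(0.0_f32, f32::max)
}

/// Parity holds when every pair is within the given absolute tolerance.
pub fn within_tolerance(logits: &[f32], reference_scores: &[f32], tol: f32) -> bool {
    max_score_diff(logits, reference_scores) < tol
}
```

As a usage illustration, calling `within_tolerance(&[8.540365, -11.118653], &[0.999805, 0.000015], 1e-4)` with the logits printed in the Phase 5 batch output and the HF scores from the table above returns `true`. Comparing post-sigmoid scores rather than raw logits is the choice the commit message describes; it is more forgiving in the saturated regions of the sigmoid, where a logit difference of ~4e-4 barely moves the score.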

Phase 5)

Phase 5 of #326. Adds the canonical cross-encoder use case: rank multiple candidate passages against a single query. Single-pair mode was the smoke test; batch mode is the actual production interface (first-stage retrieval → second-stage rerank).

## Usage

    $ apr rerank model.apr \
        --query "what is the capital of France" \
        --passages "Paris is the capital of France" \
        --passages "Berlin is the capital of Germany" \
        --passages "Lyon is a city in France" \
        --passages "Cats are mammals that purr" \
        --vocab tokenizer.json --sort --json
    {
      "model": "model.apr",
      "query": "what is the capital of France",
      "num_passages": 4,
      "returned": 4,
      "sorted": true,
      "results": [
        { "index": 0, "passage": "Paris is the capital of France", "logit": 8.540365, "score": 0.999805 },
        { "index": 2, "passage": "Berlin is the capital of Germany", "logit": -3.200321, "score": 0.039154 },
        { "index": 3, "passage": "Lyon is a city in France", "logit": -3.570795, "score": 0.027364 },
        { "index": 1, "passage": "Cats are mammals that purr", "logit":-11.118653, "score": 0.000015 }
      ]
    }

Note the ranking quality: Paris (correct match) → Berlin (also a capital, wrong country) → Lyon (also in France, wrong city) → Cats (disjoint). That's exactly the semantic ordering a cross-encoder is supposed to produce.

## What this PR adds

crates/apr-cli/src/extended_commands.rs:
  + `--passages <TEXT>` flag (repeatable, `Vec<String>`)
  + `--sort` (descending by score)
  + `--top-k N` (implies `--sort`; limit output to top N)

crates/apr-cli/src/dispatch_analysis.rs:
  + dispatch threads through the new flags

crates/apr-cli/src/commands/rerank.rs:
  + new `run_batch` function: loads the cross-encoder ONCE then scores N (query, passage_i) pairs in a loop
  + sort + top-k logic with `partial_cmp` fallback
  + JSON output preserves original `index` so callers can map back to their first-stage retrieval ordering even after sort

## What this PR does NOT do

- True batched forward (one matrix-of-pairs call instead of N sequential calls). The MiniLM forward is fast enough at 87M that sequential N=10..100 ranking takes < 1s on lambda-vector. True batching is a Phase 5b optimisation if N gets larger.
- Streaming output for large N. The full ranked list is materialised in memory before printing, which is fine for typical RAG rerank (N ≤ 100).

## Test plan

- [x] `cargo build -p apr-cli` clean
- [x] End-to-end batch test on lambda-vector against real MiniLM: 4-passage ranking produces correct semantic ordering
- [x] `--top-k 2` correctly truncates to top 2

## Cross-refs

- #326 Phase 1 → #1752 (weight loading)
- #326 Phase 2 → #1753 (import-load contract)
- #326 Phase 3 → #1755 (apr rerank single-pair ID mode)
- #326 Phase 3b → #1756 (apr rerank text mode w/ WordPiece)
- #326 Phase 4 → #1759 (real HF SafeTensors round-trip)
- #326 Phase 4b → #1765 (HF numerical parity + is_inference_verified)
- #326 Phase 5 → **this PR** (batch ranking, the actual production use case)

Direct unblock for trueno-rag: a typical RAG pipeline returns top-50 BM25/dense candidates then reranks down to top-5. This PR ships the second-stage API in one CLI call.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
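The sort and top-k step described in this commit message (descending by score, `partial_cmp` with a fallback, original index preserved) can be sketched as follows. `Ranked` and `rank` are hypothetical names, and treating incomparable scores such as NaN as equal is an assumption about what the fallback does.

```rust
use std::cmp::Ordering;

/// One scored passage. `index` is the position in the caller's original
/// candidate list, kept so results can be mapped back after sorting.
#[derive(Debug, Clone, PartialEq)]
pub struct Ranked {
    pub index: usize,
    pub score: f32,
}

/// Sort scores in descending order and optionally keep only the top `k`.
pub fn rank(scores: &[f32], top_k: Option<usize>) -> Vec<Ranked> {
    let mut results: Vec<Ranked> = scores
        .iter()
        .enumerate()
        .map(|(index, &score)| Ranked { index, score })
        .collect();

    // Descending: compare b against a. `partial_cmp` returns None for NaN;
    // this sketch falls back to Equal in that case (an assumption).
    // `f32::total_cmp` is an alternative when a strict total order is wanted.
    results.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap_or(Ordering::Equal));

    if let Some(k) = top_k {
        results.truncate(k);
    }
    results
}
```

Carrying the original index through the sort is what lets a caller join the reranked list back to its first-stage retrieval results without comparing passage text.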

Phase 5) (#1767) * feat(bert): cross-encoder weight loading from APR v2 (#326 Phase 1) Phase 1 of the BERT cross-encoder rerank feature (issue #326). Adds weight-loading from APR v2 files into the existing `BertEncoder` / `BertEmbeddings` / `CrossEncoder` scaffolding (already in-tree per the pre-existing 585 LOC). After this PR, the previously-zero-init `CrossEncoder::new(config, ...)` can be hydrated with real `BAAI/bge-reranker-base` / `cross-encoder/ms-marco-MiniLM-L-6-v2`-style weights via a one-call loader: let mut model = CrossEncoder::new(&config, 1, true); model.load_from_reader(&apr_reader, &config)?; let score = model.score(&input_ids, &token_type_ids); // ∈ [0, 1] ## What this PR adds crates/aprender-core/src/models/bert/load.rs (new, ~280 LOC including tests): + `BertLoadError { tensor, reason }` typed error + `read_tensor(reader, name, expected_shape)` — single-tensor read with dequant + shape-validation + `load_embeddings_from_reader` — 3 embedding tables + LayerNorm + `load_layer_from_reader` — 6 weight/bias pairs per encoder block (Q/K/V/O proj + 2 LayerNorms + intermediate + output) + `load_encoder_from_reader` — iterates over all encoder layers + `load_cross_encoder_from_reader` — embeddings + encoder + optional pooler + classifier head with prefix fallback (`classifier` → `score` → `rank_head`) + 3 falsifier tests using synthetic AprV2 stubs: - `falsify_bert_326_phase1_load_full_cross_encoder` (happy path) - `falsify_bert_326_phase1_missing_classifier_returns_structured_error` - `falsify_bert_326_phase1_shape_mismatch_returns_structured_error` crates/aprender-core/src/models/bert/mod.rs: + `pub mod load;` + `pub use load::BertLoadError;` crates/aprender-core/src/models/bert/embeddings.rs: + Fields promoted `private` → `pub(crate)` so the loader can mutate them in place (3 embedding tensors + LayerNorm). No public-API change. 
crates/aprender-core/src/models/bert/layer.rs: + `attention_mut`, `attention_norm_mut`, `intermediate_mut`, `output_dense_mut`, `output_norm_mut` — 5 mutable accessors for the loader. Pattern matches existing `Linear::placeholder + set_weight + set_bias` lazy-load convention. crates/aprender-core/src/models/bert/encoder.rs: + `BertEncoder::layer_mut(idx)` crates/aprender-core/src/models/bert/cross_encoder.rs: + cached `num_labels` field on the struct (avoids coupling the loader to `Linear::out_features`) + `embeddings_mut`, `encoder_mut`, `pooler_mut`, `classifier_mut` + `load_from_reader(&mut self, reader, config)` public one-shot crates/aprender-core/src/nn/normalization/mod.rs: + `LayerNorm::set_weight(weight)` + `LayerNorm::set_bias(bias)` — mirrors `Linear::set_weight` / `set_bias` for symmetry with the existing lazy-load convention. crates/aprender-core/src/nn/transformer/mod.rs: + `MultiHeadAttention::{q,k,v,out}_proj_mut` — 4 mutable accessors on the inner Linear projections so the BERT loader can install Q/K/V/O weights without re-constructing MHA. ## What this PR does NOT do (Phase 2+ scope, separate PRs) - apr import path for BERT SafeTensors → APR v2 (Phase 2). Today the loader only consumes APR; the test uses a synthetic AprV2Writer to build a stub APR. Real `apr import hf://cross-encoder/...` work lives in Phase 2. 
- `apr rerank` CLI subcommand (Phase 3) - HuggingFace numerical-parity validation (Phase 4) - `Architecture::Bert.is_inference_verified()` still returns false; flipping it to true requires Phase 2 (real APR file) + Phase 4 (HF parity check) ## Test plan - [x] `cargo test -p aprender-core --lib models::bert::` → 18/18 pass (15 existing + 3 new falsifiers) - [x] No public-API breakage on Linear / LayerNorm / MHA / BERT structs (only additive mutable accessors + 1 new struct field with Default-friendly cached integer) - [x] Build clean on `cargo build -p aprender-core` ## Cross-refs - #326 BERT cross-encoder reranking — this is Phase 1 per the #326 comment-4470811613 scope plan - Unblocks trueno-rag MRR 0.952 → 0.97+ push via real cross-encoder reranking (sovereign-stack alternative to ONNX Runtime) - Architecture::Bert.bert_map_name passthrough (tensor_expectation.rs line 198) preserves HF tensor names unchanged — no APR-side renaming required. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(bert): import-load contract helper + 4 falsifiers (#326 Phase 2) Phase 2 of the BERT cross-encoder rerank feature (issue #326). Pins the contract between `apr import --arch bert` (which routes through `Architecture::Bert.map_name`, currently identity passthrough at `tensor_expectation.rs:198`) and `CrossEncoder::load_from_reader` from Phase 1 (PR #1752). Without this contract test, a future rewrite of `bert_map_name` (e.g. stripping the `bert.` prefix or adding a layer rename) would break the import→load round trip silently — the produced APR would have the wrong names and `load_from_reader` would fail with `tensor not present in APR file`. Phase 2 makes that drift surface as a unit test failure in the SAME crate as the rewrite. ## What this PR adds crates/aprender-core/src/models/bert/load.rs (+~100 LOC): + `expected_bert_tensor_names(&config, with_pooler, classifier_prefix)` — canonical HuggingFace BERT tensor-name set this loader expects. 
Acts as the SYMBOLIC import-load contract. Re-exported from `models::bert::expected_bert_tensor_names`. + 4 new falsifiers under `falsify_bert_326_phase2_*`: - `expected_names_count_matches_formula` — `5 + 16*num_layers + 2*with_pooler + 2` (locks the per-layer multiplier so adding a new BERT component breaks the formula loudly) - `contract_matches_loader_reads` — the names the contract helper produces are EXACTLY the names the Phase 1 stub builder writes AND EXACTLY the names the loader reads — bidirectional pin - `bert_map_name_is_identity` — `Architecture::Bert.map_name(name) == name` for the canonical set (catches any future prefix-stripping rewrite) - `bert_base_tensor_count` — 12-layer bert-base produces 201 tensors (5 + 16*12 + 2 + 2) — smoke for the real bert-base size crates/aprender-core/src/models/bert/mod.rs: + re-export `expected_bert_tensor_names` alongside `BertLoadError` ## What this PR does NOT do - Run `apr import` against a real HuggingFace SafeTensors file. That's Phase 3 work (needs network deps + the `safetensors` optional feature + a fixture caching strategy). - Flip `Architecture::Bert.is_inference_verified()` to true. That needs HuggingFace numerical-parity validation (Phase 4) against reference activations. - Touch `bert_map_name` itself. The identity passthrough is already correct for HF SafeTensors and verified by the new test. 
## Test plan

- [x] `cargo test -p aprender-core --lib models::bert::` → 22/22 pass (18 existing + 4 new Phase 2 falsifiers)
- [x] `expected_bert_tensor_names` is now a public re-export
- [x] No public-API breakage on Architecture / Linear / LayerNorm / MHA
- [x] Doc-comments + symbolic contract eliminate name-duplication between import path and load path

## Cross-refs

- #326 Phase 1 ↑ PR #1752 (weight loading from APR): this builds on it
- #326 Phase 3 next (apr import wire-up + real SafeTensors fixture)
- #326 Phase 4 final (HF numerical-parity + flip `is_inference_verified() == true`)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(apr-cli): apr rerank - BERT cross-encoder scoring CLI (#326 Phase 3)

Phase 3 of the BERT cross-encoder rerank feature (issue #326). Adds the `apr rerank <model.apr>` subcommand that wraps Phase 1's `CrossEncoder::load_from_reader` (PR #1752) + the existing `CrossEncoder::forward` to score a single pre-tokenised (input_ids, token_type_ids) pair.

## What this PR adds

crates/apr-cli/src/extended_commands.rs:
+ new `ExtendedCommands::Rerank { ... }` variant with full flag surface: `--input-ids`, `--token-type-ids`, BERT config overrides (hidden_dim, num_layers, etc.), `--with-pooler`, `--num-labels`, `--raw-logit`, `--json`.

crates/apr-cli/src/dispatch_analysis.rs:
+ `ExtendedCommands::Rerank { ... } => commands::rerank::run(...)` arm before the `_ => unreachable!()` final arm.

crates/apr-cli/src/commands/rerank.rs (new, ~150 LOC including 3 tests):
+ `parse_id_list(s, flag)`: comma-separated u32 parser with named-flag error messages.
+ `run(...)`: read APR → build BertConfig → construct + load CrossEncoder → forward → emit JSON or text (sigmoid score by default, `--raw-logit` for the raw classifier output).
+ 3 unit tests on `parse_id_list` (commas+spaces, invalid token, trailing comma).

crates/apr-cli/src/commands/mod.rs:
+ `pub(crate) mod rerank;` alphabetically between `registry_schema` and `resume_paths`.
crates/apr-cli/tests/cli_commands.rs:
+ `"rerank"` registered in the canonical command list (3-surface drift check passes).

contracts/apr-cli-commands-v1.yaml:
+ new `rerank` entry under `inference` category, alongside `chat` and `run`. `requires_model: true`, no `side_effects`.

## Usage

    $ apr rerank model.apr \
        --input-ids 101,2024,102,3456,102 \
        --token-type-ids 0,0,0,1,1
    score[0] = 0.834712

    $ apr rerank model.apr \
        --input-ids 101,2024,102,3456,102 \
        --token-type-ids 0,0,0,1,1 \
        --raw-logit --json
    {
      "model": "model.apr",
      "input_ids": [101, 2024, 102, 3456, 102],
      "token_type_ids": [0, 0, 0, 1, 1],
      "logits": [1.62]
    }

## What this PR does NOT do

- Tokenisation. Caller supplies pre-tokenised u32 arrays. A `--query` + `--passage` mode using the tokenizer bundled with the APR is Phase 3b follow-up.
- End-to-end test against a real bge-reranker-base / MiniLM-L-6 .apr file. Phase 4 work (needs `apr import hf://` integration test + cached fixture).
- Flip `Architecture::Bert.is_inference_verified() == true`. Still waiting on Phase 4 HF numerical-parity check.

## Test plan

- [x] `cargo build -p apr-cli` clean
- [x] `cargo test -p apr-cli --lib commands::rerank::` → 3/3 pass
- [x] `cargo test -p apr-cli --test cli_commands` → 8/8 pass (includes the `test_all_contract_commands_exist` 3-surface drift gate)
- [x] Contract `apr-cli-commands-v1.yaml` updated with new `rerank` entry

## Cross-refs

- #326 Phase 1 → #1752 (weight loading from APR)
- #326 Phase 2 → #1753 (import-load contract helper)
- #326 Phase 3 → **this PR** (CLI surface)
- #326 Phase 4 next (HF numerical parity + flip is_inference_verified)
- Direct unlock for trueno-rag MRR push 0.952 → 0.97+ via cross-encoder reranking (sovereign-stack alternative to ONNX Runtime / fastembed)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(apr-cli): apr rerank --query/--passage WordPiece mode (#326 Phase 3b)

Phase 3b of the BERT cross-encoder rerank feature (issue #326).
Extends the `apr rerank` CLI from Phase 3 (PR #1755) with an in-process WordPiece tokenisation mode so callers don't need a separate `apr tokenize` step:

    $ apr rerank model.apr \
        --query "what is the capital of France?" \
        --passage "Paris is the capital of France." \
        --vocab vocab.txt
    score[0] = 0.94217

The two modes are mutually exclusive: pass either the ID pair (`--input-ids`+`--token-type-ids`) OR the text pair (`--query`+`--passage`+`--vocab`). Mixing them returns a structured error.

## What this PR adds

crates/apr-cli/src/commands/rerank.rs (+~120 LOC):
+ `load_vocab_txt(&Path) -> HashMap<String, u32>`: reads `vocab.txt` one-token-per-line.
+ `tokenize_query_passage(query, passage, vocab_path) -> (Vec<u32>, Vec<u32>)`: builds `[CLS] query [SEP] passage [SEP]` with `token_type_ids = 0` for the query side and `1` for the passage side. Uses existing `aprender::text::tokenize::WordPieceTokenizer`.
+ `run(...)` signature extended with `Option<&str>` query + `Option<&str>` passage + `Option<&Path>` vocab; mode dispatch via pattern match on the option tuple with a clear-error fallback for mixed input.
+ 4 new unit tests:
  - `load_vocab_txt_assigns_line_index_as_id`
  - `tokenize_query_passage_builds_correct_segment_pair`
  - `tokenize_query_passage_rejects_missing_cls`
  - `tokenize_query_passage_rejects_missing_sep`

crates/apr-cli/src/extended_commands.rs:
+ `ExtendedCommands::Rerank` adds `--query`, `--passage`, `--vocab` flags. `--input-ids` and `--token-type-ids` flipped to `Option<String>`.

crates/apr-cli/src/dispatch_analysis.rs:
+ Dispatch arm passes the new Option<&_> fields to commands::rerank::run.

contracts/apr-cli-commands-v1.yaml:
+ `rerank` entry description updated to document both input modes.

## Test plan

- [x] `cargo build -p apr-cli` clean
- [x] `cargo test -p apr-cli --lib commands::rerank::` → 7/7 pass (3 from Phase 3 + 4 new Phase 3b)

## What this PR does NOT do

- Validate against a real bge-reranker tokenizer.
  Phase 4 (HF parity) work; needs cached HF tokenizer.json fixture.
- Support the HuggingFace `tokenizer.json` format (Tokenizers crate). Phase 3b sticks to the simpler `vocab.txt` interface; HF `tokenizer.json` is a Phase 3c follow-up.

## Cross-refs

- #326 Phase 1 → #1752 (weight loading)
- #326 Phase 2 → #1753 (import-load contract)
- #326 Phase 3 → #1755 (CLI surface, ID mode)
- #326 Phase 3b → **this PR** (CLI text mode w/ WordPiece)
- #326 Phase 4 next (HF numerical parity)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(bert): apr import + rerank work end-to-end on real HF SafeTensors (#326 Phase 4)

Phase 4 of #326. Unblocks `apr import --arch bert` against real HuggingFace SafeTensors checkpoints and ties the BERT stack together with HF Tokenizers-format `tokenizer.json` support.

Verified on lambda-vector against `cross-encoder/ms-marco-MiniLM-L-6-v2` (86.7 MB, 6-layer 384-d MiniLM):

    $ apr pull cross-encoder/ms-marco-MiniLM-L-6-v2
    $ apr import .../57e6e922118ea840.safetensors --arch bert \
        -o /tmp/minilm-rerank.apr --allow-no-config
    ⚠ Import completed with warnings (105 tensors, 87M output)

    $ apr rerank /tmp/minilm-rerank.apr \
        --query "what is the capital of France" \
        --passage "Paris is the capital of France" \
        --vocab .../tokenizer.json
    score[0] = 0.999805    ✅ matching pair

    $ apr rerank /tmp/minilm-rerank.apr \
        --query "what is the capital of France" \
        --passage "Cats are mammals that purr" \
        --vocab .../tokenizer.json
    score[0] = 0.000015    ✅ disjoint pair correctly ranked low

## Two defects fixed

### 1. `apr import --arch bert` panicked on `position_ids` I64 buffer

HuggingFace `transformers` registers integer buffers (`bert.embeddings.position_ids`, optional `token_type_ids` cache) alongside trainable f32 weights via `register_buffer`. They appear in the SafeTensors file as `I64` tensors.
The existing dequant path only handles F32 / F16 / BF16 (see `safetensors_include_01.rs::get_tensor`) so import aborted with:

    Failed to extract tensor 'bert.embeddings.position_ids': Unsupported dtype for 'bert.embeddings.position_ids': I64

Fix: add `is_non_trainable_buffer(name, dtype)` filter in `safe_tensors_load_result.rs`. Skips tensors with integer dtype AND HF-known buffer-suffix names (`.position_ids`, `.token_type_ids`, `.attention_mask`, `.causal_mask`). Applied to all 3 iteration sites: `load_safetensors_tensors`, `load_safetensors_with_f16_passthrough`, `load_safetensors_as_f32`.

The filter is name+dtype keyed so it can't silently drop a future quantized F32 weight named `.position_ids` (already covered by the `position_ids_f32_is_not_buffer` falsifier).

### 2. `apr rerank --vocab` rejected HF `tokenizer.json` files

Phase 3b's loader assumed line-per-token `vocab.txt`. Real BERT checkpoints ship `tokenizer.json` (HuggingFace Tokenizers crate format) where:

- `model.vocab` is the bulk WordPiece map
- `added_tokens` is the special-tokens array (where [CLS]/[SEP]/[UNK] actually live; they are NOT inside `model.vocab` per the Tokenizers convention)

Fix: `load_tokenizer_json(path)` parses both sections and merges them. New `load_vocab(path)` dispatcher routes `.json` extension to the HF parser, otherwise legacy `vocab.txt`. `tokenize_query_passage` now calls `load_vocab` so both formats work transparently.
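
The name+dtype keying described for defect 1 can be illustrated with a small sketch. It is not the code in `safe_tensors_load_result.rs`: the `Dtype` enum is a stand-in invented for this example (the converter's real dtype type is not shown in the commit), and treating `I32` as well as `I64` as "integer" is an assumption.

```rust
/// Stand-in dtype enum for this sketch only; the real converter type is
/// not shown in the commit message.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Dtype {
    F32,
    F16,
    BF16,
    I32,
    I64,
}

/// Buffer-name suffixes listed in the commit message.
const BUFFER_SUFFIXES: [&str; 4] = [
    ".position_ids",
    ".token_type_ids",
    ".attention_mask",
    ".causal_mask",
];

/// Returns true only when BOTH conditions hold: the dtype is an integer
/// type AND the tensor name ends with a known buffer suffix. A float
/// tensor that happens to carry one of these names is therefore kept.
pub fn is_non_trainable_buffer(name: &str, dtype: Dtype) -> bool {
    // Assumption: I32 counts as an integer buffer dtype alongside I64.
    let is_integer = matches!(dtype, Dtype::I32 | Dtype::I64);
    is_integer && BUFFER_SUFFIXES.iter().any(|suffix| name.ends_with(suffix))
}
```

Requiring both conditions is what keeps the filter from dropping an F32 tensor named `.position_ids`, the case the `position_ids_f32_is_not_buffer` falsifier is described as covering.
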
## What this PR adds

crates/aprender-core/src/format/converter/safe_tensors_load_result.rs:
+ `is_non_trainable_buffer(name, dtype)` predicate
+ Filter applied at 3 iteration sites
+ 5 unit tests covering the falsifier cases

crates/apr-cli/src/commands/rerank.rs:
+ `load_tokenizer_json(path)`: HF Tokenizers WordPiece parser
+ `load_vocab(path)`: extension-dispatcher (.json → HF, else vocab.txt)
+ `tokenize_query_passage` now uses `load_vocab`

## Test plan

- [x] `cargo test -p aprender-core --lib is_non_trainable_buffer` → 5/5 pass
- [x] `cargo build --release --features 'inference cuda'` clean
- [x] `apr pull` 87M MiniLM safetensors → cached
- [x] `apr import --arch bert` → 87M .apr (was: aborted on position_ids)
- [x] `apr rerank --vocab tokenizer.json` produces 0.9998 for matching pair and 0.000015 for disjoint pair on real model (was: rejected tokenizer.json)

## What this PR does NOT do

- Numerical-parity gate vs HF reference (`transformers.AutoModel...`). Phase 4b follow-up; needs `uv run --with transformers --with torch` to dump per-layer hidden states + cosine compare. The empirical matching/disjoint score gap (0.9998 vs 1.5e-5) is strong directional evidence that the pipeline is correct but doesn't pin per-tensor parity.
- Flip `Architecture::Bert.is_inference_verified() == true`. That needs Phase 4b parity evidence on main.
- Integration test in CI. The end-to-end test requires the 87M cached HF model + network access for first-time download; gating it on a CI runner with cached fixtures is Phase 4c.
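
The extension dispatch described for `load_vocab`, together with the line-index-as-id rule that the Phase 3b test `load_vocab_txt_assigns_line_index_as_id` names, might look roughly like the sketch below. `VocabFormat`, `detect_vocab_format` and `parse_vocab_txt` are hypothetical names for this illustration, the exact-match on a lowercase `json` extension is an assumption, and the `tokenizer.json` parsing itself is left out because it needs a JSON library.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Hypothetical tag for the two vocabulary formats named in the commit.
#[derive(Debug, PartialEq, Eq)]
pub enum VocabFormat {
    HfTokenizerJson,
    VocabTxt,
}

/// Route by file extension: `.json` goes to the HF parser, anything else
/// is treated as a legacy one-token-per-line `vocab.txt`.
pub fn detect_vocab_format(path: &Path) -> VocabFormat {
    match path.extension().and_then(|ext| ext.to_str()) {
        // Assumption: only a lowercase "json" extension selects the HF path.
        Some("json") => VocabFormat::HfTokenizerJson,
        _ => VocabFormat::VocabTxt,
    }
}

/// Parse `vocab.txt` contents: the zero-based line index is the token id.
pub fn parse_vocab_txt(contents: &str) -> HashMap<String, u32> {
    contents
        .lines()
        .enumerate()
        .map(|(index, token)| (token.to_string(), index as u32))
        .collect()
}
```

Keeping format detection separate from parsing means the dispatch rule can be unit-tested without touching the filesystem.
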
## Cross-refs

- #326 Phase 1 → #1752 (weight loading)
- #326 Phase 2 → #1753 (import-load contract)
- #326 Phase 3 → #1755 (apr rerank CLI, ID mode)
- #326 Phase 3b → #1756 (apr rerank text mode w/ vocab.txt)
- #326 Phase 4 → **this PR** (real HF SafeTensors round-trip)
- Direct unblock for trueno-rag MRR 0.952 → 0.97+ via cross-encoder rerank

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(bert): HF numerical parity verified, is_inference_verified flipped (#326 Phase 4b)

Phase 4b of #326. Demonstrates end-to-end numerical parity between `apr rerank` and the HuggingFace reference `AutoModelForSequenceClassification` for `cross-encoder/ms-marco-MiniLM-L-6-v2` and flips `Architecture::Bert.is_inference_verified() == true`.

## Empirical parity (lambda-vector RTX 4090, 2026-05-17)

| Pair | HF score | apr score | abs diff |
|---|---|---|---|
| "France" + "Paris..." | 0.999805 | 0.999805 | 2.98e-7 |
| "France" + "Cats..." | 0.000015 | 0.000015 | 1.67e-7 |
| "ML" + "neural networks..." | 0.000020 | 0.000020 | 1.54e-7 |

WordPiece tokenization: bit-identical input_ids for all 3 prompts. Raw logits: agree within ~4e-4 (f32 round-off). Sigmoid maps both to identical 6-decimal scores; the parity falsifier asserts < 1e-4 absolute score diff (observed: < 3e-7).

## What this PR adds

crates/apr-cli/tests/falsification_bert_326_hf_parity.rs (new):
+ `falsify_bert_326_phase4b_hf_parity`: `#[ignore]`-gated integration test that:
  1. checks for the cached MiniLM SafeTensors at the canonical path
  2. invokes `apr import --arch bert` to produce a fresh `.apr`
  3. invokes `apr rerank --query --passage --vocab tokenizer.json` for each canonical pair
  4. asserts |apr_score − hf_score| < 1e-4 (observed: < 3e-7)
+ 3 canonical `(query, passage, hf_score)` triples captured from HF reference via uv (`uv run --with transformers --with torch`)

crates/aprender-core/src/format/tensor_expectation.rs:
+ `Architecture::Bert` now matched by `is_inference_verified()`
+ Doc-comment refreshed with the parity matrix

crates/aprender-core/src/format/converter/tests/pmat_round19.rs:
+ `apr_import_strict_unverified_arch_test` updated: BERT now verified post-#326 Phase 4b (was: asserted NOT verified)

crates/aprender-core/src/format/converter/tests/coverage_types_arch_functions.rs:
+ `test_is_inference_verified_false_gh219` no longer lists BERT
+ new `test_is_inference_verified_true_bert_gh326_phase4b`

## Test plan

- [x] `cargo test -p aprender-core --lib is_inference_verified` → 4/4 pass
- [x] `cargo build -p apr-cli --tests` clean
- [x] `cargo test --test falsification_bert_326_hf_parity -- --ignored --nocapture` → PASS with all 3 pairs at < 3e-7 score diff vs HF reference

## What this PR does NOT do (Phase 4c+ scope)

- CI integration. The parity falsifier is `#[ignore]`-gated because it needs the 87 MB cached fixture AND the `apr` release binary on PATH; wiring a CI runner with both is Phase 4c.
- Per-layer hidden-state cosine vs HF. The final-logit parity already verifies the full forward chain numerically; per-layer dumps would pin specific layers if drift ever appears, but aren't needed today.
- HF parity for full-size models like `BAAI/bge-reranker-base` (109M). Should work mechanically (same architecture); test fixture sizing is Phase 4c work.
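
The relationship between the logit tolerance (~4e-4) and the score tolerance (< 1e-4) quoted in the parity section follows from the sigmoid: its slope never exceeds 0.25, so a logit error of δ moves the score by at most δ/4, and far less at saturated logits. A minimal sketch follows; `sigmoid` and `score_diff_for_logit_shift` are illustrative helpers, not functions from the crate.

```rust
/// Logistic sigmoid, the mapping from raw classifier logit to score.
pub fn sigmoid(logit: f64) -> f64 {
    1.0 / (1.0 + (-logit).exp())
}

/// Absolute change in score when the logit is perturbed by `shift`.
/// Illustrative helper for reasoning about parity tolerances.
pub fn score_diff_for_logit_shift(logit: f64, shift: f64) -> f64 {
    (sigmoid(logit + shift) - sigmoid(logit)).abs()
}
```

At logit 0 a 4e-4 shift moves the score by just under 1e-4, which is the worst case. At a saturated logit such as the 8.540365 reported for the matching pair in the Phase 5 output, the same shift moves the score by well under 1e-6.
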
## Cross-refs

- #326 Phase 1 → #1752 (weight loading)
- #326 Phase 2 → #1753 (import-load contract)
- #326 Phase 3 → #1755 (apr rerank CLI, ID mode)
- #326 Phase 3b → #1756 (apr rerank text mode w/ vocab.txt)
- #326 Phase 4 → #1759 (real HF SafeTensors round-trip)
- #326 Phase 4b → **this PR** (HF numerical parity + is_inference_verified)
- Direct unblock for trueno-rag MRR 0.952 → 0.97+ via verified BERT cross-encoder rerank (sovereign-stack alternative to ONNX Runtime)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(apr-cli): apr rerank --passages batch mode + --sort + --top-k (#326 Phase 5)

Phase 5 of #326. Adds the canonical cross-encoder use case: rank multiple candidate passages against a single query. Single-pair mode was the smoke test; batch mode is the actual production interface (first-stage retrieval → second-stage rerank).

## Usage

    $ apr rerank model.apr \
        --query "what is the capital of France" \
        --passages "Paris is the capital of France" \
        --passages "Berlin is the capital of Germany" \
        --passages "Lyon is a city in France" \
        --passages "Cats are mammals that purr" \
        --vocab tokenizer.json --sort --json
    {
      "model": "model.apr",
      "query": "what is the capital of France",
      "num_passages": 4,
      "returned": 4,
      "sorted": true,
      "results": [
        { "index": 0, "passage": "Paris is the capital of France", "logit": 8.540365, "score": 0.999805 },
        { "index": 2, "passage": "Berlin is the capital of Germany", "logit": -3.200321, "score": 0.039154 },
        { "index": 3, "passage": "Lyon is a city in France", "logit": -3.570795, "score": 0.027364 },
        { "index": 1, "passage": "Cats are mammals that purr", "logit": -11.118653, "score": 0.000015 }
      ]
    }

Note the ranking quality: Paris (correct match) → Berlin (also a capital, wrong country) → Lyon (also in France, wrong city) → Cats (disjoint). That's exactly the semantic ordering a cross-encoder is supposed to produce.
## What this PR adds

crates/apr-cli/src/extended_commands.rs:
+ `--passages <TEXT>` flag (repeatable, `Vec<String>`)
+ `--sort` (descending by score)
+ `--top-k N` (implies `--sort`; limit output to top N)

crates/apr-cli/src/dispatch_analysis.rs:
+ dispatch threads through the new flags

crates/apr-cli/src/commands/rerank.rs:
+ new `run_batch` function: loads the cross-encoder ONCE then scores N (query, passage_i) pairs in a loop
+ sort + top-k logic with `partial_cmp` fallback
+ JSON output preserves original `index` so callers can map back to their first-stage retrieval ordering even after sort

## What this PR does NOT do

- True batched forward (one matrix-of-pairs call instead of N sequential calls). The MiniLM forward is fast enough at 87M that sequential N=10..100 ranking takes < 1s on lambda-vector. True batching is a Phase 5b optimisation if N gets larger.
- Streaming output for large N. The full ranked list is materialised in memory before printing, which is fine for typical RAG rerank (N ≤ 100).

## Test plan

- [x] `cargo build -p apr-cli` clean
- [x] End-to-end batch test on lambda-vector against real MiniLM: 4-passage ranking produces correct semantic ordering
- [x] `--top-k 2` correctly truncates to top 2

## Cross-refs

- #326 Phase 1 → #1752 (weight loading)
- #326 Phase 2 → #1753 (import-load contract)
- #326 Phase 3 → #1755 (apr rerank single-pair ID mode)
- #326 Phase 3b → #1756 (apr rerank text mode w/ WordPiece)
- #326 Phase 4 → #1759 (real HF SafeTensors round-trip)
- #326 Phase 4b → #1765 (HF numerical parity + is_inference_verified)
- #326 Phase 5 → **this PR** (batch ranking, the actual production use case)

Direct unblock for trueno-rag: a typical RAG pipeline returns top-50 BM25/dense candidates then reranks down to top-5. This PR ships the second-stage API in one CLI call.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
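
The sort + top-k step described for Phase 5 can be sketched as follows. This is an illustration rather than the code in `rerank.rs`: `Ranked` and `rank` are hypothetical names, and reading "`partial_cmp` fallback" as `unwrap_or(Ordering::Equal)` is an assumption. The original position of each passage is carried along so callers can map results back to their first-stage ordering.

```rust
use std::cmp::Ordering;

/// One scored passage; `index` is its position in the caller's input list.
#[derive(Debug, Clone, PartialEq)]
pub struct Ranked {
    pub index: usize,
    pub score: f32,
}

/// Sort scores in descending order and optionally keep only the top `k`.
/// Assumption: incomparable scores (NaN) fall back to `Ordering::Equal`.
pub fn rank(scores: &[f32], top_k: Option<usize>) -> Vec<Ranked> {
    let mut results: Vec<Ranked> = scores
        .iter()
        .copied()
        .enumerate()
        .map(|(index, score)| Ranked { index, score })
        .collect();

    // Descending: compare b against a.
    results.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap_or(Ordering::Equal));

    if let Some(k) = top_k {
        // `truncate` is a no-op when k exceeds the number of results.
        results.truncate(k);
    }
    results
}
```

If NaN scores were possible, the `Equal` fallback would leave their position unspecified; `f32::total_cmp` is one alternative that gives a total order.
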

noahgift · 2026-05-18T04:31:40Z

Subsumed by #1767 (Phase 5) squash-merge — Phase 5's squash included this PR's commits as ancestors when it landed on main. Verified by comparing 'crates/aprender-core/src/models/bert/load.rs' on main vs this branch (757 lines on main, includes all Phase 1-5 content). Closing per squash-merge-post-verify protocol; no rebase needed since content is already on main.

noahgift enabled auto-merge (squash) May 17, 2026 14:04

This was referenced May 17, 2026

feat: BERT encoder inference for cross-encoder reranking (.apr) #326

Closed

feat(bert): import-load contract helper + 4 falsifiers (#326 Phase 2) #1753

Closed

Merge branch 'main' into feat/bert-326-phase1-weight-loading

64b0ffa

noahgift added 2 commits May 17, 2026 18:04

Merge branch 'main' into feat/bert-326-phase1-weight-loading

d6f26db

Merge branch 'main' into feat/bert-326-phase1-weight-loading

727c8fb

noahgift closed this May 18, 2026

auto-merge was automatically disabled May 18, 2026 04:31
Pull request was closed

noahgift deleted the feat/bert-326-phase1-weight-loading branch May 18, 2026 04:31

noahgift mentioned this pull request May 18, 2026

feat(bert): #326 Phase 4c→8 — apr embed feature complete + extended parity gates #1779

Merged

4 tasks

noahgift wants to merge 4 commits into main from feat/bert-326-phase1-weight-loading
