feat(redact): vocab-pruned text PII model v45_phase5 (266MB → 149MB) #3909
Merged
Swaps the local text redactor from v45_phase4 (266MB INT8) to v45_phase5_pruned (149MB). The 250k multilingual embedding, the dominant slice of the model, is vocab-pruned to the ~81k tokens used across the broad training corpus. Tokenizer + config are byte-identical; a shipped remap.json maps full-vocab token ids to the sliced embedding rows, applied in run_window (offsets/decoding unaffected).
Validated: identical spans to v45_phase4 on names/emails/phones/SSNs/persons (python parity + rust e2e).
~44% smaller text model => proportional cut to the redact worker's RSS (ort arena scales with weights). The kept vocab is corpus + frequency-buffer driven to limit real-world recall loss on the long tail; the deterministic detector layer backstops structured PII regardless.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
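The id remap described above can be sketched roughly as follows. This is a minimal illustration, not the code in run_window: the `Remap` type, its fields, and the use of a `HashMap` are assumptions made for the example, and the fallback to an UNK row follows the tradeoff note in the PR description rather than anything visible in the diff.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory form of remap.json:
/// full-vocab token id -> row index in the pruned embedding.
pub struct Remap {
    table: HashMap<i64, i64>,
    /// Row used for any token that was pruned away (assumed to be UNK).
    unk_row: i64,
}

impl Remap {
    pub fn new(table: HashMap<i64, i64>, unk_row: i64) -> Self {
        Self { table, unk_row }
    }

    /// Rewrites tokenizer output ids in place so they index the pruned
    /// embedding. Only the ids change; the slice length and order stay
    /// the same, so per-token byte offsets still line up.
    pub fn apply(&self, input_ids: &mut [i64]) {
        for id in input_ids.iter_mut() {
            let mapped = self.table.get(&*id).copied().unwrap_or(self.unk_row);
            *id = mapped;
        }
    }
}
```

Because the remap touches only the ids fed to the model and leaves the token count and order alone, anything keyed by token position (offsets, BIO tags) can be decoded exactly as before.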
louis030195 pushed a commit that referenced this pull request Jun 8, 2026
Ships the vocab-pruned text PII model (v45_phase5_pruned, 266MB -> 149MB INT8) from #3909, cutting the local text-redactor model's size by ~44% with a corresponding drop in its RAM use. Tokenizer/config unchanged; input-id remap applied in run_window. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
louis030195 pushed a commit that referenced this pull request Jun 8, 2026
Collapse load_remap signature to one line to satisfy cargo fmt --check (regressed in #3909, breaking Code Quality on main). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
what
Swaps the local text PII model from v45_phase4_onnx (266 MB INT8) to v45_phase5_pruned (149 MB, −44%).
The 250k multilingual embedding is the dominant slice; it's pruned to the ~81k tokens that actually appear across the broad training corpus (+ a frequency buffer). Tokenizer + config are byte-identical to phase4; a shipped remap.json maps full-vocab token ids → the sliced embedding rows, applied once in run_window (byte offsets / BIO decode are unaffected).
validation
Jane Doe → Person, john.smith@acme.com → Email, 415-555-0192 → Phone, 123-45-6789 → Id, Maria Garcia → Person. ✅
screenpipe/pii-redactor/v45_phase5_pruned), SHA-256 pinned in FILES.
tradeoff (honest)
Vocab pruning maps the dropped long-tail tokens to UNK, so recall on rare / non-Latin-script PII can dip vs the full model (the in-distribution bench can't fully measure this since its corpus seeds the kept vocab). Mitigations: the kept vocab is corpus+frequency driven (covers the common case), and the deterministic detector layer catches structured PII regardless.
Fully revertable: bump ONNX_REDACTOR_VERSION back to 4 + restore the phase4 FILES.
🤖 Generated with Claude Code
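For readers wondering how a "corpus + frequency buffer" kept vocab might turn into a remap table, here is one possible sketch. Everything in it is an assumption for illustration: the PR does not show the pruning script, and `build_remap`, its parameters, the reading of "frequency buffer" as "the top-N ids from a broader frequency ranking", and the ascending-id row order are all hypothetical.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical helper: builds a full-vocab-id -> pruned-row map.
///
/// * `corpus_counts`: occurrences of each full-vocab id in the training corpus.
/// * `global_rank`: full-vocab ids ordered most frequent first in some
///   broader frequency list (assumed source of the "frequency buffer").
/// * `buffer`: how many top-ranked ids to keep even if the corpus never used them.
/// * `special_ids`: ids that must always survive (for example PAD and UNK).
pub fn build_remap(
    corpus_counts: &HashMap<u32, u64>,
    global_rank: &[u32],
    buffer: usize,
    special_ids: &[u32],
) -> HashMap<u32, u32> {
    // BTreeSet keeps ids sorted, so row assignment is deterministic.
    let mut kept: BTreeSet<u32> = special_ids.iter().copied().collect();

    // Keep every token that actually appeared in the corpus.
    for (&id, &n) in corpus_counts {
        if n > 0 {
            kept.insert(id);
        }
    }

    // Add the frequency buffer on top of the corpus-driven set.
    kept.extend(global_rank.iter().take(buffer).copied());

    // Assign pruned rows in ascending full-id order.
    kept.into_iter()
        .enumerate()
        .map(|(row, id)| (id, row as u32))
        .collect()
}
```

Any id missing from the returned map is one that would resolve to UNK at inference time, which is the long-tail recall risk the tradeoff section describes; a larger `buffer` trades model size for fewer such misses.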