feat(redact): vocab-pruned text PII model v45_phase5 (266MB → 149MB)#3909

Merged: louis030195 merged 1 commit into main from feat/pruned-text-pii-model on Jun 8, 2026

@louis030195

Collaborator

what

Swaps the local text PII model from v45_phase4_onnx (266 MB INT8) to v45_phase5_pruned (149 MB, −44%).

v45_phase4 (266MB)                    v45_phase5_pruned (149MB)
─────────────────────                 ─────────────────────────
embedding  250k×768  ~192MB    ──►    embedding  81k×768   ~75MB   (vocab-pruned)
transformer          ~74MB            transformer          ~74MB   (unchanged)
                                      + remap.json (1.2MB) full-id → sliced row
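
The figures in the diagram can be sanity-checked with a few lines of arithmetic. This is only a back-of-envelope sketch: it assumes INT8 stores one byte per embedding weight, treats "250k" as exactly 250,000 rows, and takes the per-component sizes straight from the diagram rather than from the actual model files.

```python
# Back-of-envelope check of the sizes in the diagram (MB = 10**6 bytes).
# Assumption: one byte per INT8 embedding weight, 250k = 250_000 rows.
full_embedding_mb = 250_000 * 768 / 1e6

# Component sizes copied from the diagram (embedding + transformer).
phase4_mb = 192 + 74
phase5_mb = 75 + 74
reduction = (phase4_mb - phase5_mb) / phase4_mb

print(f"full embedding ~ {full_embedding_mb:.0f} MB")
print(f"phase4 {phase4_mb} MB -> phase5 {phase5_mb} MB ({reduction:.0%} smaller)")
```

The totals come out to 266 MB and 149 MB, a reduction of about 44%, which matches the headline numbers.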

The 250k multilingual embedding is the dominant slice; it's pruned to the ~81k tokens that actually appear across the broad training corpus (+ a frequency buffer). Tokenizer + config are byte-identical to phase4; a shipped remap.json maps full-vocab token ids → the sliced embedding rows, applied once in run_window (byte offsets / BIO decode are unaffected).
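
The remap idea described above can be illustrated with a toy vocabulary. This is a minimal sketch, not the shipped code: `build_remap` and `apply_remap` are hypothetical helpers, the JSON layout (full id as key, sliced row as value) is an assumption about remap.json, and the frequency buffer is left out for brevity.

```python
import json

def build_remap(corpus_ids, special_ids):
    """Map each kept full-vocab id to its row in the sliced embedding."""
    kept = sorted(set(corpus_ids) | set(special_ids))
    return {full_id: row for row, full_id in enumerate(kept)}

def apply_remap(input_ids, remap, unk_id):
    """Translate full-vocab ids to sliced rows; dropped ids fall back to UNK."""
    unk_row = remap[unk_id]
    return [remap.get(i, unk_row) for i in input_ids]

# Toy example: id 3 plays the role of UNK, and the corpus used ids 5, 7, 9.
remap = build_remap(corpus_ids=[5, 7, 9, 7], special_ids=[0, 3])

# JSON object keys are strings, so ids are converted back to int after loading.
loaded = {int(k): v for k, v in json.loads(json.dumps(remap)).items()}

rows = apply_remap([0, 5, 8, 9], loaded, unk_id=3)
print(rows)  # id 8 was never seen in the corpus, so it lands on the UNK row
```

The output list has the same length and order as the input ids. Only the id values change, which is consistent with byte offsets and BIO decoding being unaffected.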

validation

  • Python parity: identical spans vs full phase4 on names / emails / phones / SSNs / channels.
  • Rust e2e: ran the redactor on the pruned model — Jane Doe→Person, john.smith@acme.com→Email, 415-555-0192→Phone, 123-45-6789→Id, Maria Garcia→Person. ✅
  • Model live on HF (screenpipe/pii-redactor/v45_phase5_pruned), SHA-256 pinned in FILES.
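
Pinning a SHA-256 means the downloaded file is hashed and compared against a known digest before it is used. A minimal sketch of that check follows; `sha256_of` and `verify` are hypothetical names, and the temporary file stands in for a real model file (the actual digests in FILES are not reproduced here).

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_hex):
    """Return True only if the file's digest matches the pinned value."""
    return sha256_of(path) == expected_hex.lower()

# Demo with a stand-in file; a real check would point at the downloaded model.
payload = b"stand-in for model bytes"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

pinned = hashlib.sha256(payload).hexdigest()
ok = verify(demo_path, pinned)
tampered = verify(demo_path, "0" * 64)
os.remove(demo_path)
print(ok, tampered)
```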

tradeoff (honest)

Vocab pruning maps the dropped long-tail tokens to UNK, so recall on rare / non-Latin-script PII can dip vs the full model (the in-distribution bench can't fully measure this since its corpus seeds the kept vocab). Mitigations: the kept vocab is corpus+frequency driven (covers the common case), and the deterministic detector layer catches structured PII regardless. Fully revertible: bump ONNX_REDACTOR_VERSION back to 4 and restore the phase4 FILES.
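
One way to get a rough sense of the long-tail exposure is to measure how many tokens of a held-out text fall outside the kept vocab. The sketch below assumes the text was not used to build the kept vocab. `unk_rate` is a hypothetical helper, and the ids are toy values rather than real tokenizer output.

```python
def unk_rate(token_ids, kept_ids, special_ids=()):
    """Fraction of non-special tokens whose id is absent from the kept vocab."""
    content = [t for t in token_ids if t not in special_ids]
    if not content:
        return 0.0
    dropped = sum(1 for t in content if t not in kept_ids)
    return dropped / len(content)

kept = {0, 2, 5, 7, 9}
sample = [0, 5, 7, 11, 9, 12, 2]  # 0 and 2 play the role of BOS/EOS here
rate = unk_rate(sample, kept, special_ids={0, 2})
print(f"{rate:.0%} of tokens would become UNK")
```

A high rate on, say, non-Latin-script samples would point to where recall is most likely to dip. It measures token coverage only, not span-level recall.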

🤖 Generated with Claude Code

Swaps the local text redactor from v45_phase4 (266MB INT8) to v45_phase5_pruned
(149MB): the 250k multilingual embedding is vocab-pruned to the ~81k tokens used
across the broad training corpus, the dominant slice of the model. Tokenizer +
config are byte-identical; a shipped remap.json maps full-vocab token ids to the
sliced embedding rows, applied in run_window (offsets/decoding unaffected).

Validated: identical spans to v45_phase4 on names/emails/phones/SSNs/persons
(python parity + rust e2e). ~44% smaller text model => proportional cut to the
redact worker's RSS (ort arena scales with weights). The kept vocab is corpus +
frequency-buffer driven to limit real-world recall loss on the long tail; the
deterministic detector layer backstops structured PII regardless.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@louis030195 louis030195 merged commit 1b98d53 into main Jun 8, 2026
18 of 20 checks passed
louis030195 pushed a commit that referenced this pull request Jun 8, 2026
Ships the vocab-pruned text PII model (v45_phase5_pruned, 266MB -> 149MB INT8)
from #3909, cutting the local text-redactor model's size by ~44% (RAM should
drop roughly in proportion). Tokenizer/config unchanged; input-id remap applied
in run_window.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
louis030195 pushed a commit that referenced this pull request Jun 8, 2026
Collapse load_remap signature to one line to satisfy cargo fmt --check
(regressed in #3909, breaking Code Quality on main).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>