feat(redact): vocab-pruned text PII model v45_phase5 (266MB → 149MB) by louis030195 · Pull Request #3909 · screenpipe/screenpipe

louis030195 · 2026-06-08T16:18:09Z

what

Swaps the local text PII model from v45_phase4_onnx (266 MB INT8) to v45_phase5_pruned (149 MB, −44%).

v45_phase4 (266MB)                    v45_phase5_pruned (149MB)
─────────────────────                 ─────────────────────────
embedding  250k×768  ~192MB    ──►    embedding  81k×768   ~75MB   (vocab-pruned)
transformer          ~74MB            transformer          ~74MB   (unchanged)
                                      + remap.json (1.2MB) full-id → sliced row
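
The size figures above can be sanity-checked with a little arithmetic. This is only a sketch: it assumes the embedding is stored at one byte per weight (INT8) and uses decimal megabytes, and both helper names are hypothetical. The ~75MB pruned-embedding figure is taken from the table as stated rather than derived.

```rust
/// Size in decimal MB of an embedding table stored at 1 byte per weight.
/// Assumption: INT8 storage, no per-row scales or other overhead.
pub fn int8_embedding_mb(rows: u64, hidden: u64) -> f64 {
    (rows * hidden) as f64 / 1e6
}

/// Percentage size reduction going from `before_mb` to `after_mb`.
pub fn percent_reduction(before_mb: f64, after_mb: f64) -> f64 {
    (before_mb - after_mb) / before_mb * 100.0
}
```

With these, `int8_embedding_mb(250_000, 768)` gives 192.0, matching the ~192MB slice, and `percent_reduction(266.0, 149.0)` rounds to 44, matching the stated −44%.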

The 250k multilingual embedding is the dominant slice; it's pruned to the ~81k tokens that actually appear across the broad training corpus (+ a frequency buffer). Tokenizer + config are byte-identical to phase4; a shipped remap.json maps full-vocab token ids → the sliced embedding rows, applied once in run_window (byte offsets / BIO decode are unaffected).
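
A minimal sketch of what the remap step amounts to, assuming remap.json has been loaded into a map from full-vocab id to sliced row. The `Remap` type, its fields, and the explicit UNK row are hypothetical and not taken from the actual `run_window` code; the real file may already list every id explicitly.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory form of remap.json:
/// full-vocab token id -> row index in the sliced embedding.
pub struct Remap {
    rows: HashMap<i64, i64>,
    /// Sliced row used for any id that was pruned away (assumed to be UNK).
    unk_row: i64,
}

impl Remap {
    pub fn new(rows: HashMap<i64, i64>, unk_row: i64) -> Self {
        Remap { rows, unk_row }
    }

    /// Rewrite token ids in place just before they are fed to the model.
    /// Only the ids change; the tokenizer's byte offsets are untouched,
    /// which is why span decoding is unaffected.
    pub fn apply(&self, input_ids: &mut [i64]) {
        for id in input_ids.iter_mut() {
            *id = self.rows.get(&*id).copied().unwrap_or(self.unk_row);
        }
    }
}
```

Because the tokenizer is unchanged, the remap is a pure id-to-id lookup: one pass over the window's ids, with no effect on sequence length or alignment.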

validation

Python parity: identical spans vs full phase4 on names / emails / phones / SSNs / channels.
Rust e2e: ran the redactor on the pruned model — Jane Doe→Person, john.smith@acme.com→Email, 415-555-0192→Phone, 123-45-6789→Id, Maria Garcia→Person. ✅
Model live on HF (screenpipe/pii-redactor/v45_phase5_pruned), SHA-256 pinned in FILES.
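
The parity check described above boils down to comparing the span sets produced by the two models on the same inputs. A sketch under assumptions: the `Span` shape (byte offsets plus a label string) and `spans_match` are hypothetical and not the project's actual types.

```rust
/// Hypothetical detected span: byte range in the input plus an entity label.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Span {
    pub start: usize,
    pub end: usize,
    pub label: String,
}

impl Span {
    pub fn new(start: usize, end: usize, label: &str) -> Self {
        Span { start, end, label: label.to_string() }
    }
}

/// True when both models produced exactly the same spans,
/// regardless of the order in which they were emitted.
pub fn spans_match(full: &[Span], pruned: &[Span]) -> bool {
    let mut a = full.to_vec();
    let mut b = pruned.to_vec();
    a.sort();
    b.sort();
    a == b
}
```

Comparing offsets as well as labels matters here: it checks that the remap changed neither what was detected nor where.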

tradeoff (honest)

Vocab pruning maps the dropped long-tail tokens to UNK, so recall on rare / non-Latin-script PII can dip vs the full model (the in-distribution bench can't fully measure this since its corpus seeds the kept vocab). Mitigations: the kept vocab is corpus+frequency driven (covers the common case), and the deterministic detector layer catches structured PII regardless. Fully revertible: set ONNX_REDACTOR_VERSION back to 4 and restore the phase4 FILES.
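
For illustration, here is one way a corpus-driven kept vocab and its remap could be built. This is a sketch, not the pipeline used for this model: the function names are hypothetical, and the "frequency buffer" is modelled simply as an extra list of ids supplied by the caller.

```rust
use std::collections::{BTreeSet, HashMap};

/// Collect the token ids to keep: special tokens, every id seen in the
/// tokenized corpus, and an extra buffer of ids (assumed to come from a
/// broader frequency list). The result is sorted, and the position in the
/// returned Vec is the row in the sliced embedding.
pub fn build_kept_vocab(
    corpus: &[Vec<u32>],
    special_ids: &[u32],
    buffer_ids: &[u32],
) -> Vec<u32> {
    let mut kept: BTreeSet<u32> = BTreeSet::new();
    kept.extend(special_ids.iter().copied());
    kept.extend(buffer_ids.iter().copied());
    for seq in corpus {
        kept.extend(seq.iter().copied());
    }
    kept.into_iter().collect()
}

/// Map each kept full-vocab id to its sliced row. Ids absent from the map
/// were pruned and would fall back to UNK at inference time.
pub fn build_remap(kept: &[u32]) -> HashMap<u32, u32> {
    kept.iter()
        .enumerate()
        .map(|(row, &id)| (id, row as u32))
        .collect()
}
```

Any token that never appears in the corpus or the buffer gets no row, which is exactly where the long-tail recall risk described above comes from.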

🤖 Generated with Claude Code

Swaps the local text redactor from v45_phase4 (266MB INT8) to v45_phase5_pruned (149MB): the 250k multilingual embedding is vocab-pruned to the ~81k tokens used across the broad training corpus, the dominant slice of the model. Tokenizer + config are byte-identical; a shipped remap.json maps full-vocab token ids to the sliced embedding rows, applied in run_window (offsets/decoding unaffected).

Validated: identical spans to v45_phase4 on names/emails/phones/SSNs/persons (python parity + rust e2e). ~44% smaller text model => proportional cut to the redact worker's RSS (ort arena scales with weights). The kept vocab is corpus + frequency-buffer driven to limit real-world recall loss on the long tail; the deterministic detector layer backstops structured PII regardless.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Ships the vocab-pruned text PII model (v45_phase5_pruned, 266MB -> 149MB INT8) from #3909, cutting the local text-redactor model's RAM by roughly 44%. Tokenizer/config unchanged; input-id remap applied in run_window.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Collapse load_remap signature to one line to satisfy cargo fmt --check (regressed in #3909, breaking Code Quality on main).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

louis030195 merged commit 1b98d53 into main Jun 8, 2026
18 of 20 checks passed

louis030195 merged 1 commit into main from feat/pruned-text-pii-model
