
[EPIC][SECURITY][PLUGINS]: PII Advanced filter (Presidio + pattern library) #2553

@crivetimihai

🛡️ Epic: PII Advanced Filter Plugin (Presidio + Pattern Library + Compliance)

Goal

Deliver a production‑grade PII detection and anonymization plugin using Microsoft Presidio, with:

  • high‑confidence detection for common and regulated entities
  • deterministic masking/redaction strategies
  • configurable thresholds and allow/deny lists
  • strong test coverage and real‑world examples

The plugin should operate as both a native gateway plugin and a standalone MCP server (stdio/HTTP), and be safe to run by default.

Why Now?

PII handling is a top compliance and security requirement for customers, and ContextForge already provides hooks ideal for this:

  1. Regulatory pressure: GDPR, HIPAA, and PCI‑DSS require enforceable masking controls
  2. Enterprise adoption: Large tenants demand consistent PII handling across prompts, tools, and resources
  3. A2A + Federation: PII needs to be scrubbed before leaving trust boundaries
  4. Operational safety: The baseline plugin must be reliable and easy to enable, without false positives or odd replacements
  5. Developer velocity: A first‑class PII plugin reduces custom per‑deployment work

📖 User Stories

US‑1: Platform Admin - Enable PII protection globally

As a Platform Administrator
I want to enable a single PII plugin that covers prompts, tools, and resources
So that all traffic is scrubbed automatically before leaving the gateway

Acceptance Criteria:

Given plugins are enabled in the gateway
When I configure PIIAdvancedPlugin with hooks:
  - prompt_post_fetch
  - tool_post_invoke
  - resource_post_fetch
Then PII is detected and anonymized consistently
And the response contains no raw PII

US‑2: Security Engineer - Reliable SSN + Date Handling

As a Security Engineer
I want SSNs to be detected reliably, and dates to be masked clearly
So that users never see odd or misleading replacements

Acceptance Criteria:

Given use_pattern_library=true and US_SSN enabled
When input contains "My SSN is 123-45-6789"
Then SSN is detected and masked to ***-**-6789

Given DATE_TIME masking is configured as [DATE]
When input contains "2024-01-01"
Then it is replaced with [DATE]

US‑3: Compliance Officer - Entity Policies + Audit Trail

As a Compliance Officer
I want entity‑specific strategies and thresholds with audit logging
So that I can prove compliance and tune sensitivity

Acceptance Criteria:

Given entity_thresholds and anonymization_strategies are configured
When the plugin runs
Then only entities above threshold are returned
And each entity uses its configured strategy
And audit logging records decisions

US‑4: Developer - Extend with Custom Recognizers

As a Developer
I want to add custom recognizers via config
So that I can detect organization‑specific identifiers

Acceptance Criteria:

Given a custom recognizer for EMPLOYEE_ID
When input contains EMP-123456
Then EMPLOYEE_ID is detected and masked
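
A minimal sketch of how the US‑4 criteria could be exercised, using only the standard library as a stand‑in for the real detector. The config shape (`CUSTOM_RECOGNIZERS`), the `Finding` type, and the function names are hypothetical; the issue does not define the schema. In a Presidio‑backed implementation each entry would presumably map onto a pattern‑based recognizer.

```python
import re
from dataclasses import dataclass

# Hypothetical config shape (assumption): one dict per custom recognizer.
CUSTOM_RECOGNIZERS = [
    {
        "entity": "EMPLOYEE_ID",
        "regex": r"\bEMP-\d{6}\b",
        "score": 0.9,
        "mask": "[EMPLOYEE_ID]",
    },
]


@dataclass(frozen=True)
class Finding:
    entity: str
    start: int
    end: int
    score: float


def detect_custom(text, recognizers=CUSTOM_RECOGNIZERS):
    """Return one Finding per regex match of each configured recognizer."""
    findings = []
    for rec in recognizers:
        for match in re.finditer(rec["regex"], text):
            findings.append(
                Finding(rec["entity"], match.start(), match.end(), rec["score"])
            )
    return findings


def mask_custom(text, recognizers=CUSTOM_RECOGNIZERS):
    """Replace every detected span with its configured placeholder."""
    placeholders = {rec["entity"]: rec["mask"] for rec in recognizers}
    # Replace right to left so earlier offsets stay valid.
    ordered = sorted(detect_custom(text, recognizers), key=lambda f: f.start, reverse=True)
    for finding in ordered:
        text = text[: finding.start] + placeholders[finding.entity] + text[finding.end :]
    return text
```

With this sketch, `mask_custom("Badge EMP-123456 issued")` yields `Badge [EMPLOYEE_ID] issued`, while an identifier with the wrong digit count is left untouched.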

✅ Acceptance Criteria (Epic)

  • Presidio‑based detection with spaCy NLP integration
  • Pattern library for regex‑based detection (SSN, phone, etc.)
  • Entity‑specific thresholds and allow/deny lists
  • Deterministic masking strategies (incl. DATE_TIME placeholder)
  • Configurable anonymization strategies per entity
  • Works for prompt/tool/resource hooks
  • Standalone usage supported (CLI + MCP server)
  • Tests for core detection and edge cases
  • Documentation with sample patterns and troubleshooting

🧠 Design Notes

Key Config Options

use_pattern_library: true
anonymization_strategies:
  EMAIL_ADDRESS: "mask"
  PHONE_NUMBER: "mask"
  US_SSN: "mask"
  DATE_TIME: "mask"
masking_patterns:
  DATE_TIME: "[DATE]"
entity_thresholds:
  PHONE_NUMBER: 0.4
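
To illustrate how these options might interact, here is a hedged Python sketch that mirrors the config above as a dict and applies it to detector output. Several things are assumptions, not part of this issue: findings arrive as `(entity, start, end, score)` tuples, entities without an explicit threshold fall back to `DEFAULT_THRESHOLD`, and entities without a `masking_patterns` entry are replaced by `[ENTITY_NAME]`.

```python
CONFIG = {
    "use_pattern_library": True,
    "anonymization_strategies": {
        "EMAIL_ADDRESS": "mask",
        "PHONE_NUMBER": "mask",
        "US_SSN": "mask",
        "DATE_TIME": "mask",
    },
    "masking_patterns": {"DATE_TIME": "[DATE]"},
    "entity_thresholds": {"PHONE_NUMBER": 0.4},
}

DEFAULT_THRESHOLD = 0.5  # assumption: the issue does not state a default


def filter_by_threshold(findings, config=CONFIG, default=DEFAULT_THRESHOLD):
    """Keep findings whose score meets the per-entity (or default) threshold."""
    thresholds = config.get("entity_thresholds", {})
    return [f for f in findings if f[3] >= thresholds.get(f[0], default)]


def placeholder_for(entity, config=CONFIG):
    """Use the configured masking pattern, else a generic [ENTITY] tag (assumption)."""
    return config.get("masking_patterns", {}).get(entity, f"[{entity}]")


def anonymize(text, findings, config=CONFIG):
    """Apply the 'mask' strategy to every finding that passes its threshold."""
    strategies = config.get("anonymization_strategies", {})
    kept = filter_by_threshold(findings, config)
    # Replace right to left so earlier offsets stay valid.
    for entity, start, end, _score in sorted(kept, key=lambda f: f[1], reverse=True):
        if strategies.get(entity) != "mask":
            continue  # this sketch only covers the "mask" strategy
        text = text[:start] + placeholder_for(entity, config) + text[end:]
    return text
```

A phone number scored at 0.45 passes its lowered 0.4 threshold here, whereas a date scored at 0.3 falls under the assumed default and is left in place.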

Behavior Principles

  • No odd replacements: avoid static fake values for dates by default
  • Deterministic placeholders: allow explicit [DATE], [SSN], etc.
  • Safe defaults: pattern library on, key entities enabled
  • Extensible: custom recognizers via config
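
The deterministic behavior described above, together with the US‑2 examples, can be sketched with plain regular expressions. This is a standard‑library illustration, not the plugin's pattern library: the regexes, the ISO‑only date format, and the function names are assumptions chosen to reproduce the `***-**-6789` and `[DATE]` outputs from the acceptance criteria.

```python
import re

# Assumed patterns for illustration only; a real pattern library would be broader.
SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")
ISO_DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def mask_ssn(text):
    """Mask the first five SSN digits and keep the last four."""
    return SSN_RE.sub(lambda m: "***-**-" + m.group(3), text)


def mask_dates(text, placeholder="[DATE]"):
    """Replace ISO-formatted dates with a fixed placeholder, never a fake date."""
    return ISO_DATE_RE.sub(placeholder, text)


def scrub(text):
    """Apply both maskers; the same input always yields the same output."""
    return mask_dates(mask_ssn(text))
```

Because the replacement is a fixed placeholder rather than a generated value, repeated runs over the same input produce identical output, which keeps tests and audit logs stable.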

🧰 THE WORKS! (Implementation Checklist)

  • Core detector: Presidio analyzer + spaCy NLP
  • Pattern library with common + regulated patterns
  • Per‑entity thresholds + allow/deny lists
  • Anonymization strategies (mask/redact/hash/encrypt)
  • DATE_TIME placeholder support
  • Gateway config defaults updated
  • Standalone usage (CLI + MCP server integration)
  • Test suite expanded (SSN, phone, credit card, date)
  • Documentation: README + TESTING guide
  • Benchmarks and validation script
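
As a rough illustration of the anonymization strategies item, the sketch below dispatches on a strategy name. The semantics are assumptions, since this issue does not define them: `mask` substitutes an `[ENTITY]` placeholder, `redact` removes the value, and `hash` returns a truncated keyed SHA‑256 digest. `encrypt` is deliberately left out because it needs key management that is out of scope for a short example; `apply_strategy` and the demo key are hypothetical.

```python
import hashlib
import hmac


def apply_strategy(value, entity, strategy, key=b"demo-key"):
    """Return the replacement text for one detected value.

    The key default is a placeholder for illustration; a real deployment
    would load it from secure configuration.
    """
    if strategy == "mask":
        return f"[{entity}]"
    if strategy == "redact":
        return ""
    if strategy == "hash":
        digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
        return digest[:16]  # truncated for readability (assumption)
    raise ValueError(f"unsupported strategy: {strategy}")
```

A keyed hash keeps replacements stable across requests, so the same email maps to the same token, without exposing the raw value.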

🔗 Related

  • External plugins via MCP (stdio/HTTP)
  • Plugin framework hooks
  • PII basic filter

Labels

  • SHOULD: P2: Important but not vital; high-value items that are not crucial for the immediate release
  • enhancement: New feature or request
  • epic: Large feature spanning multiple issues
  • plugins
  • python: Python / backend development (FastAPI)
  • security: Improves security
