Research implementation of a stateful, risk-aware de-identification architecture for streaming multimodal systems.
This project demonstrates an alternative to static, document-level anonymization. Instead of treating privacy protection as a one-time preprocessing step, the system models cumulative identity exposure over time and dynamically adjusts masking strength in response to quantified re-identification risk.
Most de-identification pipelines operate per document:
detect PHI -> remove PHI -> store result
This approach assumes that risk is isolated within individual records. In practice, re-identification risk accumulates across events, modalities, and time.
A name fragment, identifier token, or cross-modal linkage that appears harmless in isolation may become identifying when repeated or combined with other signals.
This repository implements a stateful exposure-aware controller that:
- Maintains subject-level exposure state
- Computes rolling re-identification risk
- Incorporates recency and cross-modal linkage signals
- Dynamically selects masking strength
- Supports pseudonym versioning upon risk escalation
- Produces structured, reproducible audit logs
De-identification becomes a longitudinal control problem rather than a static transformation.
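As a rough illustration of this control loop, the sketch below models subject-level exposure with recency decay and maps it to a bounded risk score. All names, weights, and the half-life decay are illustrative assumptions, not the repository's actual implementation.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExposureState:
    """Per-subject cumulative identity exposure (illustrative model)."""
    score: float = 0.0
    last_update: Optional[float] = None

    def observe(self, weight: float, now: float, half_life: float = 3600.0) -> float:
        """Decay prior exposure by recency, then add the new signal's weight."""
        if self.last_update is not None:
            elapsed = now - self.last_update
            self.score *= 0.5 ** (elapsed / half_life)
        self.score += weight
        self.last_update = now
        return self.score

def risk(score: float, scale: float = 5.0) -> float:
    """Map unbounded cumulative exposure onto a [0, 1) re-identification risk."""
    return 1.0 - math.exp(-score / scale)

state = ExposureState()
state.observe(1.0, now=0.0)    # first identifier mention
state.observe(2.0, now=600.0)  # second mention ten minutes later: mild decay, then growth
print(f"risk={risk(state.score):.3f}")
```

The key property is that risk is a function of accumulated state, not of the current event alone, so two individually harmless events can still push a subject over an escalation threshold.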
The system differs from conventional masking pipelines in several concrete ways:
Longitudinal Exposure Tracking: Identity exposure is accumulated and tracked over time at the subject level.
Risk-Governed Policy Selection: Masking strength is selected dynamically based on quantified risk thresholds.
Cross-Modal Linkage Modeling: Signals from text, ASR transcripts, image proxies, waveform headers, and audio metadata are aggregated to evaluate identity-level exposure.
Localized Retokenization: When risk increases, pseudonym tokens can be versioned forward, maintaining linkage continuity without global reprocessing.
Auditability: All masking decisions are logged with structured metadata and can be reproduced deterministically from exposure state.
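Pseudonym versioning as described above can be sketched as a keyed registry, where escalation rolls a subject's version forward so only future events use the new token while past tokens remain linkable to their version. The class, secret, and token format are hypothetical, not the repository's API.

```python
import hashlib

class PseudonymRegistry:
    """Keyed, versioned pseudonyms; risk escalation rolls the version forward."""

    def __init__(self, secret: str):
        self.secret = secret
        self.versions: dict = {}

    def token(self, subject_id: str) -> str:
        """Deterministic token for the subject's current pseudonym version."""
        v = self.versions.setdefault(subject_id, 1)
        digest = hashlib.sha256(f"{self.secret}:{subject_id}:{v}".encode()).hexdigest()
        return f"SUBJ_{digest[:8]}_v{v}"

    def escalate(self, subject_id: str) -> str:
        """Version forward on risk escalation; no global reprocessing is required."""
        self.versions[subject_id] = self.versions.get(subject_id, 1) + 1
        return self.token(subject_id)

reg = PseudonymRegistry(secret="demo-key")
t1 = reg.token("patient-007")
t2 = reg.escalate("patient-007")
assert t1 != t2 and t1.endswith("_v1") and t2.endswith("_v2")
```

Because tokens are derived deterministically from (secret, subject, version), the same exposure state reproduces the same audit trail, which is what makes the decisions replayable.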
The repository includes a fully synthetic streaming simulation.
Five policies are evaluated:
- raw
- weak
- pseudo
- redact
- adaptive
The adaptive controller escalates masking strength only when cumulative exposure justifies it.
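One simple way to realize such an adaptive controller is a threshold ladder over the quantified risk score, choosing among the static policies. The cut-points below are illustrative assumptions; the repository's actual thresholds may differ.

```python
# Illustrative risk thresholds, checked from strongest policy to weakest.
POLICY_LADDER = [
    (0.8, "redact"),  # full suppression of identifying spans
    (0.5, "pseudo"),  # replace identifiers with pseudonym tokens
    (0.2, "weak"),    # light masking, e.g. partial redaction
    (0.0, "raw"),     # pass through unmodified
]

def select_policy(risk: float) -> str:
    """Return the strongest policy whose threshold the current risk meets."""
    for threshold, policy in POLICY_LADDER:
        if risk >= threshold:
            return policy
    return "raw"

assert select_policy(0.05) == "raw"
assert select_policy(0.35) == "weak"
assert select_policy(0.95) == "redact"
```

Under this scheme, masking strength only increases when cumulative exposure crosses a boundary, which is exactly the "escalate only when justified" behavior the demo evaluates.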
Outputs include:
- policy_metrics.csv
- latency_summary.csv
- audit_log.jsonl
- EXPERIMENT_REPORT.md
- privacy_utility_curve.png
- sample_dag.png
All experiments are reproducible from source using synthetic data generated within the repository.
Run:
python -m amphi_rl_dpgraph.run_demo
Results are written to the results/ directory.
Run the test suite with verbose output:
pytest -vv
For explicit installation (recommended for notebooks/Colab):
pip install -e .
pytest -vv
To generate a machine-readable report plus a markdown summary:
pytest -vv --junitxml .artifacts/pytest.xml
python scripts/generate_test_report.py .artifacts/pytest.xml TEST_RESULTS.md
The latest checked-in summary is in TEST_RESULTS.md.
This repository does not contain real clinical data, personal information, or protected health information.
All experiments operate on synthetically generated streams designed to simulate longitudinal healthcare data workflows. The synthetic data includes structured representations of:
- Clinical note text
- Speech transcription output
- Image proxy signals
- Waveform and monitoring features
The streams are constructed to model realistic structural properties relevant to privacy evaluation, including:
- Repeated subject mentions over time
- Identifier recurrence
- Variable disclosure frequency
- Cross-modal co-occurrence patterns
These properties allow controlled evaluation of cumulative identity exposure and adaptive masking behavior without exposing real individuals.
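A minimal generator with those structural properties might look like the following. The event schema and modality names are hypothetical stand-ins for the repository's actual synthetic data model.

```python
import random

def synthetic_stream(n_events: int, n_subjects: int, seed: int = 0):
    """Yield synthetic multimodal events exhibiting repeated subject mentions."""
    rng = random.Random(seed)  # seeded for reproducibility
    modalities = ["note_text", "asr_transcript", "image_proxy", "waveform_header"]
    for t in range(n_events):
        yield {
            "t": t,
            "subject": f"subj_{rng.randrange(n_subjects)}",  # subjects recur over time
            "modality": rng.choice(modalities),              # cross-modal co-occurrence
            "identifier_mentions": rng.randint(0, 2),        # variable disclosure frequency
        }

events = list(synthetic_stream(20, n_subjects=3, seed=42))
```

Fixing the seed makes every run of the stream identical, which is what lets the experiments be replayed deterministically.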
Synthetic data is used to ensure reproducibility, transparency, and safe public distribution of the research implementation.
The demo evaluates:
- Residual PHI leakage
- Utility proxy metrics
- Latency distribution
- Adaptive escalation behavior
The objective is not to eliminate utility through maximal redaction, but to demonstrate controlled escalation based on exposure accumulation.
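As a concrete sense of what a leakage metric can look like, the sketch below computes the fraction of masked outputs that still contain a known synthetic identifier. This is an illustrative proxy, not the repository's actual metric definition.

```python
def residual_leakage(masked_texts, phi_terms):
    """Fraction of masked outputs still containing a known PHI term (proxy metric)."""
    if not masked_texts:
        return 0.0
    leaked = sum(
        any(term.lower() in text.lower() for term in phi_terms)
        for text in masked_texts
    )
    return leaked / len(masked_texts)

outputs = [
    "Patient SUBJ_1a2b_v1 admitted",  # masked successfully
    "John Doe discharged",            # identifier survived masking
    "vitals stable",
]
print(residual_leakage(outputs, phi_terms=["John Doe"]))
```

Stronger policies drive this number toward zero at the cost of the utility proxies, which is the tradeoff the privacy-utility curve visualizes.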
This repository is intended for:
- Research in privacy-preserving machine learning
- Streaming system design
- Exposure-aware masking strategies
- Longitudinal risk modeling
- Reproducible evaluation of privacy–utility tradeoffs
It is not a production-ready compliance system.
This repository must not contain real patient data, protected health information (PHI), or identifiable personal data.
All demonstrations and experiments run exclusively on synthetic data generated within the repository or on publicly permitted datasets.
The following must never be uploaded:
- Clinical notes derived from real individuals
- Hospital records or EHR exports
- Medical images associated with identifiable persons
- Audio recordings of patients
- Any dataset containing direct or indirect identifiers
If sensitive data is discovered, do not open a public issue. Contact the maintainer directly for immediate removal.
This project studies adaptive privacy control mechanisms for streaming and multimodal systems.
It does not collect, process, or distribute real clinical data.
The methods demonstrated here are intended to strengthen privacy protection. They are not designed to weaken safeguards or enable re-identification.
When adapting this code to real-world systems, implementers must ensure:
- Institutional and regulatory compliance
- Independent security controls
- Data governance review
- Validation under applicable legal frameworks
Privacy protection in regulated domains requires layered safeguards. This repository addresses one technical layer: exposure-aware masking.
It should not be treated as a substitute for comprehensive compliance infrastructure.
If you use this software in academic or technical work, please cite it via the included CITATION.cff file.
Title:
Stateful Exposure-Aware De-Identification for Multimodal Streaming Data
This repository is associated with a U.S. provisional patent application filed on 2025-07-05.
Public release (GitHub): 2026-03-02.
MIT License. See LICENSE.