Add support for schema-driven redaction #2383
Introduces a new redaction feature that allows data to be redacted from JSON documents based on `redact` schema annotations. The implementation supports two strategies:
- Block: Removes the annotated node entirely from the document
- Sha256: Replaces string values with their SHA-256 hash digest
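
A minimal sketch of how the two strategies might behave on a single object property. The `Node` type, `redact_property`, and the caller-supplied `digest` function are hypothetical names for illustration and are not the crate's API; a real implementation would compute a SHA-256 digest where the sketch calls `digest`.

```rust
use std::collections::BTreeMap;

/// Minimal JSON-like value: a hypothetical stand-in for a real document type.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Node>),
    Object(BTreeMap<String, Node>),
}

/// The two strategies named in the description above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Strategy {
    Block,
    Sha256,
}

/// Redact property `key` of `doc`, if `doc` is an object.
/// `digest` is supplied by the caller (assumption: it returns a hex digest).
fn redact_property(
    doc: &mut Node,
    key: &str,
    strategy: Strategy,
    digest: &dyn Fn(&str) -> String,
) {
    let Node::Object(fields) = doc else { return };
    match strategy {
        // Block: the annotated node is removed entirely.
        Strategy::Block => {
            fields.remove(key);
        }
        // Sha256: only string values are replaced by their digest.
        // Non-strings are left unchanged in this sketch.
        Strategy::Sha256 => {
            if let Some(Node::String(s)) = fields.get_mut(key) {
                let hashed = digest(s.as_str());
                *s = hashed;
            }
        }
    }
}
```

Passing the digest as a parameter keeps the sketch free of external crates and leaves open how the salt is mixed in, which this description does not specify.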
Inference now tracks `redact` annotations, similarly to its handling for `reduce` annotations. Add two new inspection errors:
- When blocking a document location that must exist
- When sha256-hashing a location that cannot be a string

Also update `validation` to ensure that redacted locations are not used as keys.
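
The two inspection errors can be sketched as a check over inferred locations. `Location`, `InspectError`, and `inspect` below are hypothetical names chosen for illustration; the actual inference and inspection types in the codebase will differ.

```rust
/// Redaction strategies, as named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Strategy {
    Block,
    Sha256,
}

/// Hypothetical summary of what inference knows about one document location.
struct Location {
    ptr: &'static str,        // JSON pointer of the location
    must_exist: bool,         // inferred: the location is always present
    may_be_string: bool,      // inferred: the location can hold a string
    redact: Option<Strategy>, // `redact` annotation, if any
}

#[derive(Debug, PartialEq)]
enum InspectError {
    /// `block` applied to a location that must exist.
    BlockedMustExist(String),
    /// `sha256` applied to a location that cannot be a string.
    Sha256NotString(String),
}

/// Collect every inspection error rather than stopping at the first one.
fn inspect(locations: &[Location]) -> Vec<InspectError> {
    let mut errors = Vec::new();
    for loc in locations {
        match loc.redact {
            Some(Strategy::Block) if loc.must_exist => {
                errors.push(InspectError::BlockedMustExist(loc.ptr.to_string()))
            }
            Some(Strategy::Sha256) if !loc.may_be_string => {
                errors.push(InspectError::Sha256NotString(loc.ptr.to_string()))
            }
            _ => {}
        }
    }
    errors
}
```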
Apply redaction to all documents before they exit MemTable through any path (drain, spill, or validation error). This ensures sensitive data marked with redact annotations is properly blocked / hashed before being exposed in logs or error messages.
- Refactor validation API to split Valid/Invalid types and remove document from Validation struct
- Add redaction during spill operations for non-front documents
- Add redaction during drain operations after reduction
- Apply redaction before surfacing validation errors
- Test cases for redaction integration in MemTable
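
The "every exit path redacts" idea can be sketched as follows. This `MemTable` is a hypothetical, heavily simplified stand-in (documents are plain strings, there is no reduction, and spill does not distinguish front from non-front documents); it only illustrates routing drain, spill, and validation errors through one redacting helper.

```rust
/// Hypothetical in-memory table. Every path that hands a document to the
/// outside world goes through `exit`, which applies redaction first.
struct MemTable<R: Fn(&mut String)> {
    docs: Vec<String>,
    redact: R,
}

impl<R: Fn(&mut String)> MemTable<R> {
    /// Single choke point: redact, then release the document.
    fn exit(&self, mut doc: String) -> String {
        (self.redact)(&mut doc);
        doc
    }

    /// Drain: return all held documents, redacted.
    fn drain(&mut self) -> Vec<String> {
        let docs = std::mem::take(&mut self.docs);
        docs.into_iter().map(|d| self.exit(d)).collect()
    }

    /// Spill: move held documents to `out` (standing in for disk), redacted.
    fn spill(&mut self, out: &mut Vec<String>) {
        let docs = std::mem::take(&mut self.docs);
        out.extend(docs.into_iter().map(|d| self.exit(d)));
    }

    /// Validation error: the message only ever quotes the redacted document.
    fn validation_error(&self, doc: String, reason: &str) -> String {
        format!("{reason}: {}", self.exit(doc))
    }
}
```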
The salt generation follows this precedence:
1. Manual: Users can specify redact_salt in capture/derivation models as base64-encoded values
2. Existing: Preserves existing salts from live specs during updates
3. Generated: Creates deterministic salts using xxhash of init_vector + task name

Runtime tasks (capture/derive) now pass redact_salt to their combiners.

Also rename build::validate() => build::local, to clarify its purpose as a close stand-in for "production" validations that can be run locally. It uses a constant initialization vector for stable snapshots.
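
The precedence above can be sketched as a small selection function. `choose_salt` and `generate_salt` are hypothetical names; the sketch takes already-decoded bytes (base64 decoding is omitted) and uses the standard library's `DefaultHasher` purely as a stand-in for xxhash, so the generated bytes will not match what the real implementation produces.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for the xxhash-based generation (assumption: any deterministic
/// hash of init_vector + task name illustrates the idea).
fn generate_salt(init_vector: &[u8], task_name: &str) -> Vec<u8> {
    let mut hasher = DefaultHasher::new();
    init_vector.hash(&mut hasher);
    task_name.hash(&mut hasher);
    hasher.finish().to_be_bytes().to_vec()
}

/// Precedence: manual salt, then the live spec's salt, then a generated one.
fn choose_salt(
    manual: Option<Vec<u8>>,
    live: Option<Vec<u8>>,
    init_vector: &[u8],
    task_name: &str,
) -> Vec<u8> {
    manual
        .or(live)
        .unwrap_or_else(|| generate_salt(init_vector, task_name))
}
```

Because generation depends only on the initialization vector and the task name, a constant vector yields the same salt on every run, which is consistent with the stable snapshots mentioned for `build::local`.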
c0a7cea to ae895d3
No changes to crate APIs, but build in support for new Projection.Inference fields as well as the new `redact` annotation.
.map_err(|err| {
    Error::FailedValidation(self.spec.names[rhs.meta.binding()].clone(), err)
.map_err(|invalid| {
    // Best-effort redaction using available outcomes, prior to generating an error.
This was a great idea to redact the documents in the validation errors. I'm wondering under what circumstances we wouldn't be able to redact a field here. Given a document with {"redactMe": ...}, it's hard to think of a scenario where the validation outcome wouldn't have the necessary information for redacting redactMe, unless you were using conditionals in the schema. I suppose what makes this "best effort" is primarily just that the incoming document could put the sensitive data in a location other than /redactMe?
Yea, the kind of case I was thinking of is a redact annotation that applies to (for example) pattern: .*@.*.com, but it turns out you have a .dev. Or perhaps the redactMe subschema itself fails to validate (wrong type or format perhaps). It will apply as broadly as possible without failing-fast, but there may be errors of expression that violate the user's intention, etc.
}
let inactive_bindings = live_bindings_spec.values().map(|v| (*v).clone()).collect();

// Use manual salt if provided, otherwise the live salt, otherwise generate a new one.
What do you think about requiring a backfill if the salt changes? Kinda seems like if you care enough to set the salt manually, then you'd want to backfill if it changes. Definitely doesn't seem necessary, especially in the short term, though.
It's a fairly advanced feature so I don't have a handle on when or why people will use it. Rotation? Alignment with another capture? Matching a migrated table? 🤷
## What's Changed

This release introduces support for the [new `redact` annotation](estuary/flow#2383), which enables blocking or hashing portions of documents very early in the capture process. It also adds a new [`flowctl discover` subcommand](estuary/flow#2404), which enables CLI-driven capture creation workflows.

## New Contributors

* @danielnelson made their first contribution in estuary/flow#2403

**Full Changelog**: estuary/flow@v0.5.21...v0.5.22
Description:
This PR introduces schema-driven data redaction for Flow, allowing sensitive data to be automatically redacted from documents based on JSON Schema `redact` annotations. The feature supports two redaction strategies:
- `block`: removes the annotated node entirely from the document
- `sha256`: replaces string values with their SHA-256 hash digest

Redaction is applied before documents leave the runtime's MemTable, which prevents unredacted values from being written to disk or surfaced in error messages. A salt is generated and managed for capture and derivation tasks to ensure deterministic hashing, and a manual salt value can be specified instead.
Workflow steps:
1. Add `redact` annotations to the collection schema / writeSchema:

    {
      "$defs": { },
      "$ref": "flow://connector-schema",
      "properties": {
        "ssn": { "redact": {"strategy": "sha256"} },
        "internal_id": { "redact": {"strategy": "block"} }
      }
    }

2. Optionally specify a manual `redactSalt` (base64). An existing salt is passed through from a live spec, and (on creation) a salt is generated deterministically via `xxhash(init_vector + task_name)`.

Documentation links affected:
(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)
Notes for reviewers:
- [...] `redactSalt`'s in production, to reduce the blast radius of issues upon release.
- [...] `redact` or `reduce` wired in, to avoid breaking the current UI.
- [...] `flow-web` version to prevent `redact` annotations from breaking the UI.