
Add support for schema-driven redaction#2383

Merged
jgraettinger merged 5 commits into master from johnny/redact
Sep 10, 2025

@jgraettinger jgraettinger commented Sep 4, 2025

Description:

This PR introduces schema-driven data redaction for Flow, allowing sensitive data to be automatically redacted from documents based on JSON Schema redact annotations. The feature supports two redaction strategies:

  • block: Removes annotated properties/items entirely from documents
  • sha256: Replaces values with salted SHA-256 hash digests

Redaction is applied before documents leave the runtime's MemTable, which keeps sensitive values from being written to disk or surfaced in error messages. Salts are generated and managed for capture and derivation tasks so that hashing is deterministic, and a salt value can also be specified manually.
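
As a rough illustration of the two strategies, here is a minimal sketch over a flat map of top-level string properties. It is not the PR's implementation: the `redact` function, the flat `BTreeMap` document, and the caller-supplied `digest` closure (a stand-in for the salted SHA-256 step, which the standard library does not provide) are all assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Redaction strategy named by a `redact` annotation.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Strategy {
    Block,
    Sha256,
}

/// Applies `rules` (top-level property name -> strategy) to a flat document.
/// `digest` is a hypothetical stand-in for salted SHA-256: it receives the
/// salt and the original value, and returns the replacement string.
fn redact(
    doc: &mut BTreeMap<String, String>,
    rules: &BTreeMap<String, Strategy>,
    salt: &[u8],
    digest: &dyn Fn(&[u8], &str) -> String,
) {
    for (property, strategy) in rules {
        match strategy {
            // Block: the property is removed entirely.
            Strategy::Block => {
                doc.remove(property);
            }
            // Sha256: the value is replaced in place; absent properties are left alone.
            Strategy::Sha256 => {
                if let Some(value) = doc.get_mut(property) {
                    let hashed = digest(salt, value.as_str());
                    *value = hashed;
                }
            }
        }
    }
}
```

With rules for `ssn` (sha256) and `internal_id` (block), the first keeps its key but carries whatever `digest` returns, the second disappears from the document, and unannotated properties are untouched.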

Workflow steps:

  1. Add top-level redact annotations to the collection schema / writeSchema:
 {
   "$defs": { },
   "$ref": "flow://connector-schema",
   "properties": {
     "ssn": {
       "redact": {"strategy": "sha256"}
     },
     "internal_id": {
       "redact": {"strategy": "block"}
     }
   }
 }
  2. Salt configuration (optional): For captures/derivations, a salt can be specified as redactSalt (base64). Otherwise an existing salt is passed through from the live spec, or (on creation) a salt is generated deterministically via xxhash(init_vector + task_name).

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

  • Validation prevents redaction of document locations used as keys (would break collection semantics)
  • New fuzz test on redaction logic
  • Redaction also uses the new tape-length optimization to avoid traversing un-redacted portions of the document
  • Existing live specs do NOT initialize salt values.
    • This was an implementation trade-off to slow the rate at which redactSalt values are introduced in production, reducing the blast radius of any issues upon release.
  • Similarly, Projection.Inference does not yet have redact or reduce wired in, to avoid breaking the current UI.
  • The UI must be updated to the next flow-web version to prevent redact annotations from breaking it.


Introduces a new redaction feature that allows data to be
redacted from JSON documents based on `redact` schema annotations.

The implementation supports two strategies:
- Block: Removes the annotated node entirely from the document
- Sha256: Replaces string values with their SHA-256 hash digest

Inference now tracks `redact` annotations, similarly to its handling for
`reduce` annotations. Add two new inspection errors:
- When blocking a document location that must exist
- When sha256-hashing a location that cannot be a string
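
The two inspection errors can be pictured with a small sketch. The `Location` struct, the `Exists` enum, and the error variant names below are hypothetical simplifications for illustration, not the types used in the PR.

```rust
/// Whether a document location is guaranteed to exist, per schema inference.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Exists {
    Must,
    May,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Strategy {
    Block,
    Sha256,
}

#[derive(Debug, PartialEq)]
enum InspectError {
    /// A `block` annotation on a location that must exist.
    BlockedLocationMustExist(String),
    /// A `sha256` annotation on a location that can never be a string.
    Sha256LocationNotString(String),
}

/// Hypothetical, simplified view of what inference knows about one location.
struct Location {
    ptr: String,
    exists: Exists,
    may_be_string: bool,
    redact: Option<Strategy>,
}

/// Collects one error per location whose annotation contradicts its inferred shape.
fn inspect(locations: &[Location]) -> Vec<InspectError> {
    let mut errors = Vec::new();
    for loc in locations {
        match loc.redact {
            Some(Strategy::Block) if loc.exists == Exists::Must => {
                errors.push(InspectError::BlockedLocationMustExist(loc.ptr.clone()));
            }
            Some(Strategy::Sha256) if !loc.may_be_string => {
                errors.push(InspectError::Sha256LocationNotString(loc.ptr.clone()));
            }
            _ => {}
        }
    }
    errors
}
```

In this sketch an optional location may be blocked, and a required string location may be hashed, without error; only the two contradictory combinations are reported.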

Also update `validation` to ensure that redacted locations are not used
as keys.
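
A minimal sketch of the key check, assuming keys and redacted locations are both expressed as JSON pointers and compared by exact match. The `redacted_keys` helper is hypothetical, and whether the real validation also considers parent or child locations is not stated here.

```rust
use std::collections::BTreeSet;

/// Returns the collection key pointers that are also redacted locations.
/// Only exact pointer matches are considered in this sketch.
fn redacted_keys<'a>(keys: &[&'a str], redacted: &BTreeSet<String>) -> Vec<&'a str> {
    keys.iter()
        .copied()
        .filter(|key| redacted.contains(*key))
        .collect()
}
```

A non-empty result would correspond to a validation failure, since redacting a key location would break collection semantics.
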

Apply redaction to all documents before they exit MemTable through any
path (drain, spill, or validation error). This ensures sensitive data
marked with redact annotations is properly blocked / hashed before being
exposed in logs or error messages.

- Refactor validation API to split Valid/Invalid types and remove
  document from Validation struct
- Add redaction during spill operations for non-front documents
- Add redaction during drain operations after reduction
- Apply redaction before surfacing validation errors
- Test cases for redaction integration in MemTable
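
The "redact on every exit path" idea can be sketched with a toy table. This is not the runtime's MemTable: documents are plain strings, `redact` is a caller-supplied closure, and the sketch ignores reduction and the distinction the commit draws for non-front documents during spill.

```rust
/// Toy in-memory table: every path by which a document leaves applies `redact` first.
struct MemTable<R: Fn(&mut String)> {
    docs: Vec<String>,
    redact: R,
}

impl<R: Fn(&mut String)> MemTable<R> {
    /// Drain path: hand redacted documents to the caller and empty the table.
    fn drain(&mut self) -> Vec<String> {
        let mut out = std::mem::take(&mut self.docs);
        for doc in out.iter_mut() {
            (self.redact)(doc);
        }
        out
    }

    /// Spill path: move redacted documents into a stand-in "disk" buffer.
    fn spill(&mut self, disk: &mut Vec<String>) {
        let drained = self.drain();
        disk.extend(drained);
    }

    /// Error path: redact the offending document before it is quoted in a message.
    fn validation_error(&self, mut doc: String, reason: &str) -> String {
        (self.redact)(&mut doc);
        format!("validation failed ({}): {}", reason, doc)
    }
}
```

Because the raw `docs` field is only read through these three methods in the sketch, no caller observes an unredacted document.
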

The salt generation follows this precedence:
1. Manual: Users can specify redact_salt in capture/derivation models as base64-encoded values
2. Existing: Preserves existing salts from live specs during updates
3. Generated: Creates deterministic salts using xxhash of init_vector + task name
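
The precedence above can be sketched as follows. The `resolve_salt` function is hypothetical, salts are treated as already-decoded bytes, and FNV-1a is used only as a deterministic stand-in because xxhash is not in the standard library. The 8-byte big-endian output is likewise an assumption of the sketch.

```rust
/// FNV-1a (64-bit), used here only as a deterministic stand-in for xxhash.
fn stand_in_hash(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for byte in bytes {
        hash ^= *byte as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Picks the salt: manual first, then the live spec's salt, then a generated one.
fn resolve_salt(
    manual: Option<Vec<u8>>,
    live: Option<Vec<u8>>,
    init_vector: &[u8],
    task_name: &str,
) -> Vec<u8> {
    manual.or(live).unwrap_or_else(|| {
        // Generated: hash of init_vector followed by the task name.
        let mut input = init_vector.to_vec();
        input.extend_from_slice(task_name.as_bytes());
        stand_in_hash(&input).to_be_bytes().to_vec()
    })
}
```

Since the generated branch depends only on the initialization vector and task name, repeated builds of the same task yield the same salt, which is what keeps hashed values stable.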

Runtime tasks (capture/derive) now pass redact_salt to their combiners.

Also rename build::validate() => build::local, to clarify its purpose as
a close stand-in for "production" validations that can be run locally.
It uses a constant initialization vector for stable snapshots.
@jgraettinger jgraettinger changed the title implement redaction Add support for schema-driven redaction Sep 5, 2025
@jgraettinger jgraettinger marked this pull request as ready for review September 5, 2025 03:10
@jgraettinger jgraettinger requested a review from psFried September 5, 2025 03:10
No changes to crate APIs, but build in support for new
Projection.Inference fields as well as the new `redact` annotation.
Contributor

@psFried psFried left a comment

LGTM

.map_err(|err| {
Error::FailedValidation(self.spec.names[rhs.meta.binding()].clone(), err)
.map_err(|invalid| {
// Best-effort redaction using available outcomes, prior to generating an error.
Contributor

This was a great idea to redact the documents in the validation errors. I'm wondering under what circumstances we wouldn't be able to redact a field here. Given a document with {"redactMe": ...}, it's hard to think of a scenario where the validation outcome wouldn't have the necessary information for redacting redactMe, unless you were using conditionals in the schema. I suppose what makes this "best effort" is primarily just that the incoming document could put the sensitive data in a location other than /redactMe?

Member Author

Yea, the kind of case I was thinking of is a redact annotation that applies to (for example) pattern: .*@.*.com, but it turns out you have a .dev. Or perhaps the redactMe subschema itself fails to validate (wrong type or format perhaps). It will apply as broadly as possible without failing fast, but there may be errors of expression that violate the user's intention, etc.

}
let inactive_bindings = live_bindings_spec.values().map(|v| (*v).clone()).collect();

// Use manual salt if provided, otherwise the live salt, otherwise generate a new one.
Contributor

What do you think about requiring a backfill if the salt changes? Kinda seems like if you care enough to set the salt manually, then you'd want to backfill if it changes. Definitely doesn't seem necessary, especially in the short term, though.

Member Author

It's a fairly advanced feature so I don't have a handle on when or why people will use it. Rotation? Alignment with another capture? Matching a migrated table? 🤷

@jgraettinger jgraettinger merged commit e7cfcdc into master Sep 10, 2025
13 checks passed
@jgraettinger jgraettinger deleted the johnny/redact branch September 10, 2025 15:09
github-actions bot pushed a commit to estuary/homebrew-flowctl that referenced this pull request Sep 18, 2025
## What's Changed

This release introduces support for the [new `redact` annotation](estuary/flow#2383), which enables blocking or hashing portions of documents very early in the capture process.

It also adds a new [`flowctl discover` subcommand](estuary/flow#2404), which enables CLI-driven capture creation workflows.

## New Contributors
* @danielnelson made their first contribution in estuary/flow#2403

**Full Changelog**: estuary/flow@v0.5.21...v0.5.22
jgraettinger added a commit to estuary/homebrew-flowctl that referenced this pull request Sep 18, 2025