
validation: enforce data plane alignment with storage mappings and add flowctl discover #2404

Merged
jgraettinger merged 6 commits into master from johnny/validate-prefix
Sep 18, 2025

@jgraettinger
Member

@jgraettinger jgraettinger commented Sep 15, 2025

Description:

This PR implements enforcement of our emergent "Prefix" concept - a jointly verified tuple of (catalog-prefix, storage-buckets, admissible-data-planes). The validation layer now enforces that data planes must be drawn from the data_planes field of the best-covering storage mapping, with additional consistency checks to ensure storage mapping alignment.

Changes:

  • Tasks automatically use the first data plane from their covering storage mapping as default
  • Users no longer need to specify a data plane for control-plane validation unless overriding the default
  • flowctl now fetches actual storage mappings and data planes, enabling local verification of data plane assignments
  • New flowctl discover command submits discovery jobs to the control plane for CLI-driven discovery

See individual commits for more detail.

Workflow steps:

  • In the garden-path case, users no longer need to pass a data plane during validation - the system automatically selects the first data plane from the covering storage mapping
  • Users can still explicitly override the data plane selection when needed (validated against the storage mapping's allowed list)
  • flowctl discover submits discovery jobs to the control plane's discovers table and fetches the resultant draft for local development
  • flowctl no longer sends a data plane parameter unless explicitly specified by the user via --init-data-plane, which is renamed from --default-data-plane.
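
The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `StorageMapping`, `ResolveError`, `best_covering`, and `resolve_data_plane` are hypothetical names, and "best-covering" is assumed here to mean the longest catalog prefix that matches the task name.

```rust
/// Hypothetical, simplified stand-in for a storage mapping row.
#[derive(Debug, Clone)]
pub struct StorageMapping {
    /// Catalog prefix covered by this mapping, e.g. "acmeCo/".
    pub catalog_prefix: String,
    /// Admissible data-plane names; the first entry is the default.
    pub data_planes: Vec<String>,
}

/// Hypothetical error cases for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum ResolveError {
    /// No storage mapping prefix covers the task name.
    NoCoveringMapping,
    /// The covering mapping lists no data planes.
    NoDataPlanes,
    /// The explicit override is not in the covering mapping's list.
    NotAdmissible(String),
}

/// Return the mapping with the longest prefix covering `task_name`
/// (assumption: longest prefix is what "best-covering" means).
pub fn best_covering<'m>(
    mappings: &'m [StorageMapping],
    task_name: &str,
) -> Option<&'m StorageMapping> {
    mappings
        .iter()
        .filter(|m| task_name.starts_with(m.catalog_prefix.as_str()))
        .max_by_key(|m| m.catalog_prefix.len())
}

/// Pick the data plane for a task: the explicit override if it is
/// admissible, otherwise the first data plane of the covering mapping.
pub fn resolve_data_plane<'m>(
    mappings: &'m [StorageMapping],
    task_name: &str,
    explicit_plane: Option<&str>,
) -> Result<&'m str, ResolveError> {
    let mapping =
        best_covering(mappings, task_name).ok_or(ResolveError::NoCoveringMapping)?;
    match explicit_plane {
        Some(name) => mapping
            .data_planes
            .iter()
            .find(|p| p.as_str() == name)
            .map(String::as_str)
            .ok_or_else(|| ResolveError::NotAdmissible(name.to_string())),
        None => mapping
            .data_planes
            .first()
            .map(String::as_str)
            .ok_or(ResolveError::NoDataPlanes),
    }
}
```

With no override, a task named `acmeCo/sales/source-x` under a mapping for `acmeCo/` would resolve to that mapping's first data plane; an override naming a plane outside the list would be rejected rather than silently accepted.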

Example of flowctl discover:

$ RUST_LOG=info flowctl discover --source flow.yaml                                                                                                                                                                               
2025-09-15T02:32:59.855084Z  INFO flowctl::discover: using data-plane for discovery data_plane_name=ops/dp/public/gcp-us-central1-c2                                                                                                                          
2025-09-15T02:33:00.813874Z  INFO flowctl::draft::encrypt: successfully encrypted endpoint configuration task_name=johnny/new-thing/source-google-sheets-native task_type=capture                                                                             
2025-09-15T02:33:00.988897Z  INFO flowctl::draft: created draft draft_id=1261e4b794348000                                                                                                                                                                     
2025-09-15T02:33:01.046392Z  INFO flowctl::discover: created draft for discovery draft_id=1261e4b794348000                                                                                                                                                    
2025-09-15T02:33:01.122151Z  INFO flowctl::discover: submitted discovery job discover_id=1261e4b7ceb48800 logs_token=8b092972-6a42-46ef-a8d4-4510f6da4fe6                                                                                                     
2025-09-15T02:33:01.122175Z  INFO flowctl::poll: Waiting for discovers job id=1261e4b7ceb48800 logs_token=8b092972-6a42-46ef-a8d4-4510f6da4fe6                                                                                                                
2025-09-15T02:33:05Z discover>  INFO: started connector container    image="ghcr.io/estuary/source-google-sheets-native:v1" image_inspection={"id":"edbcbfbb2203609279aed67bfdaed23d650873a2c23be13664c55d8193484387","image_created_at":"2025-09-12T22:04:46.083244997Z","network_ports":[],"runtime_protocol":"capture","usage_rate":0.0,"usage_rate_source":"dev.estuary.usage-rate"} init_address="http://10.2.249.201:49092" ip_addr="10.2.249.201" mapped_host_ports={} module="runtime::container" name="fc_ac68526f" task_name="connector-proxy-1757903581845864627" task_type="Capture"
2025-09-15T02:33:06Z discover>  INFO: started connector container    image="ghcr.io/estuary/source-google-sheets-native:v1" image_inspection={"id":"edbcbfbb2203609279aed67bfdaed23d650873a2c23be13664c55d8193484387","image_created_at":"2025-09-12T22:04:46.083244997Z","network_ports":[],"runtime_protocol":"capture","usage_rate":0.0,"usage_rate_source":"dev.estuary.usage-rate"} init_address="http://10.2.249.202:49092" ip_addr="10.2.249.202" mapped_host_ports={} module="runtime::container" name="fc_7c716362" task_name="connector-proxy-1757903581845864627" task_type="Capture"
2025-09-15T02:33:13.079123Z  INFO build: wrote file path=/home/johnny/tmp2/flow.yaml                                                                                                                                                                          
2025-09-15T02:33:13.079250Z  INFO build: wrote file path=/home/johnny/tmp2/johnny/flow.yaml                                                                                                                                                                   
2025-09-15T02:33:13.079334Z  INFO build: wrote file path=/home/johnny/tmp2/johnny/new-thing/AnotherSheet.write.schema.yaml                                                                                                                                    
2025-09-15T02:33:13.079389Z  INFO build: wrote file path=/home/johnny/tmp2/johnny/new-thing/Sheet1.write.schema.yaml
2025-09-15T02:33:13.079442Z  INFO build: wrote file path=/home/johnny/tmp2/johnny/new-thing/flow.yaml     
2025-09-15T02:33:13.079531Z  INFO build: wrote file path=/home/johnny/tmp2/source-google-sheets-native.config.yaml                                                                                                                                            
Wrote 3 specifications under file:///home/johnny/tmp2/flow.yaml.                                                                                                                                                                                              
2025-09-15T02:33:13.248413Z  INFO flowctl::discover: discovery completed successfully discover_id=1261e4b7ceb48800
2025-09-15T02:33:13.314716Z  INFO flowctl::draft: deleted draft draft_id=1261e4b794348000

Storage mappings were historically included in snapshots because they were
part of models::Catalog, and were thus covered by `validation` as part of
integrated snapshot testing with the `sources` crate.

That's no longer true: storage mappings are injected into the live
catalog and are not part of the draft. So, remove them from snapshots, as
the original rationale for including them no longer holds.
…gnment

Storage mappings already included a data_planes field that wasn't being used.
This change activates that functionality - tasks are now assigned to data planes
based on their storage mapping's data_planes list, with the first entry as default.
Users can still explicitly override with a specific data plane if needed.

Key changes:
- Remove `is_default` field from data planes table
- Enforce that tasks use data planes from their storage mapping's list
- Use first data plane in storage mapping as default for task initialization
- Add explicit_plane parameter to allow user overrides (validated against mapping)
- Validate alignment between partition and recovery storage mappings
- Add better error messages when data planes are missing or mismatched

Previously, flowctl's local_specs::Resolver used NoOpCatalogResolver which
provided a placeholder storage mapping and data plane. This change updates it
to query actual storage mappings and data planes from the control plane via
PostgREST APIs.

This brings local validation into closer alignment with that of production
and makes it possible for `flowctl` to understand the data-plane that a
new specification should be submitted to.

`--default-data-plane` is renamed to `--init-data-plane` and is now optional,
defaulting to None. When None, new specifications are placed in the first
data-plane of their covering storage mapping.

Implement `flowctl discover` command that submits discovery jobs to the
control plane rather than running connectors locally. The command:

- Loads and validates source specifications
- Creates a draft with encrypted endpoint configurations
- Submits discovery job to the discovers table with appropriate data-plane
- Polls the job while streaming logs until completion
- Downloads the updated draft to local files

This provides a similar UX to `flowctl raw discover` but leverages the
control plane's discovery infrastructure rather than local connector execution.
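
The "poll while streaming logs" step could look something like the sketch below. It is only an illustration under stated assumptions: `JobStatus` and `poll_job` are hypothetical names, the fetch and log callbacks stand in for whatever control-plane queries flowctl actually issues, and the real command is not claimed to be structured this way.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical job states for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum JobStatus {
    Queued,
    Success,
    Failed(String),
}

/// Poll a job until it leaves the queued state, forwarding any new log
/// lines to `on_log` after each poll. `fetch` is an assumed callback that
/// returns the current status plus log lines produced since the last call.
pub fn poll_job<F, L>(
    mut fetch: F,
    mut on_log: L,
    interval: Duration,
    max_attempts: usize,
) -> Result<(), String>
where
    F: FnMut() -> (JobStatus, Vec<String>),
    L: FnMut(&str),
{
    for _ in 0..max_attempts {
        let (status, logs) = fetch();
        // Stream logs before acting on the status, so that the final
        // lines of a finished job are not dropped.
        for line in &logs {
            on_log(line.as_str());
        }
        match status {
            JobStatus::Queued => sleep(interval),
            JobStatus::Success => return Ok(()),
            JobStatus::Failed(reason) => return Err(reason),
        }
    }
    Err(format!("job still queued after {} polls", max_attempts))
}
```

Passing the fetch step as a closure keeps the loop independent of any particular API client, which also makes it easy to exercise with canned responses.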
@jgraettinger
Member Author

@psFried what's the workflow to update versioned sqlx queries?

~/estuary/flow$ DATABASE_URL=postgresql://postgres:postgres@127.0.0.1:5432/postgres cargo sqlx prepare --workspace
... (trim) ...
warning: no queries found

And no changes to .sqlx/

@jgraettinger jgraettinger changed the title from "verify storage mapping data_planes and add flowctl discover" to "validation: enforce data plane alignment with storage mappings and add flowctl discover" Sep 15, 2025
@psFried
Contributor

psFried commented Sep 15, 2025

@jgraettinger I think your cargo sqlx might be too recent. I'm using 0.6.3, and there's no --workspace option. I'm running cargo sqlx prepare --merged.

Contributor

@psFried psFried left a comment

LGTM

D::ModelDef, // Model to validate.
models::Id, // Live control-plane ID.
models::Id, // Assigned data-plane.
&'a tables::DataPlane, // Assigned data-plane.
Contributor

This return type is getting pretty unwieldy, and I'm thinking that at some point it'll probably be better to declare a TransitionOk<'a, D, L, B> struct. Doesn't need to be now, though.
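
As a rough idea of what such a struct might look like, here is a reduced sketch. Everything in it is an assumption for illustration: `Id` and `DataPlane` are stand-ins for `models::Id` and `tables::DataPlane`, only the three tuple elements quoted above are represented, and the single type parameter `M` replaces the `D`, `L`, `B` parameters, whose roles are not shown in this thread.

```rust
/// Stand-in for `models::Id` (assumption: an opaque 64-bit identifier).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Id(pub u64);

/// Stand-in for `tables::DataPlane`, reduced to its name.
#[derive(Debug, PartialEq, Eq)]
pub struct DataPlane {
    pub name: String,
}

/// Hypothetical named replacement for the positional return tuple.
#[derive(Debug)]
pub struct TransitionOk<'a, M> {
    /// Model to validate.
    pub model: M,
    /// Live control-plane ID.
    pub control_id: Id,
    /// Assigned data-plane.
    pub data_plane: &'a DataPlane,
}

/// Callers can bind fields by name instead of by tuple position.
pub fn summarize<M>(ok: &TransitionOk<'_, M>) -> String {
    let TransitionOk { control_id, data_plane, .. } = ok;
    format!("{:016x} -> {}", control_id.0, data_plane.name)
}
```

Named fields would mostly help at call sites, where two adjacent `models::Id` values in a tuple are easy to swap by accident.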

auto_approve: bool,
/// Data-plane into which created specifications will be placed.
#[clap(long, default_value = "ops/dp/public/gcp-us-central1-c2")]
default_data_plane: String,
Contributor

There are docs in site/docs/guides/flowctl/ci-cd.md that refer to this argument and will need to be updated.

May or may not be worth it, but we could also leave this argument here, but hidden, and print an error message if it gets passed. Just thinking about users upgrading, and wanting to make sure we communicate the breaking change effectively. Would also be good if we remember to include this in the release notes.

Member Author

Updated docs, and I added back --default-data-plane as an alias for --init-data-plane.

Small update to flowctl documentation as well.
@github-actions

github-actions bot commented Sep 18, 2025

PR Preview Action v1.6.2
Preview removed because the pull request was closed.
2025-09-18 19:45 UTC

@jgraettinger jgraettinger merged commit e713023 into master Sep 18, 2025
13 checks passed
@jgraettinger jgraettinger deleted the johnny/validate-prefix branch September 18, 2025 19:43
github-actions bot pushed a commit to estuary/homebrew-flowctl that referenced this pull request Sep 18, 2025
## What's Changed

This release introduces support for the [new `redact` annotation](estuary/flow#2383), which enables blocking or hashing portions of documents very early in the capture process.

It also adds a new [`flowctl discover` subcommand](estuary/flow#2404), which enables CLI-driven capture creation workflows.

## New Contributors
* @danielnelson made their first contribution in estuary/flow#2403

**Full Changelog**: estuary/flow@v0.5.21...v0.5.22
jgraettinger added a commit to estuary/homebrew-flowctl that referenced this pull request Sep 18, 2025