Skip to content

feat: design out-of-process providers#5544

Open
moolen wants to merge 10 commits intomainfrom
mj-design-oopp
Open

feat: design out-of-process providers#5544
moolen wants to merge 10 commits intomainfrom
mj-design-oopp

Conversation

@moolen
Copy link
Copy Markdown
Member

@moolen moolen commented Nov 5, 2025

PoC implementation over here: #5527

Overview

Adds comprehensive design documentation for out-of-process providers—a feature enabling providers to run as separate gRPC server processes rather than in-process within the External Secrets Operator core.

Design Documents Added (7 files, 1,662 lines)

README.md — Framework and scope

  • Defines problem statements (provider isolation, RBAC separation, reducing core dependencies)
  • Sets requirements for plugin architecture

API.md — Provider referencing architecture

  • Presents two options: Provider Indirection Layer (centralized common fields via intermediate Provider/ClusterProvider CRD) vs Direct Reference (simpler model with provider-specific CRDs)
  • Covers service discovery approaches and trade-offs

ADAPTER.md — gRPC adapter pattern

  • Client-side adapter: wraps gRPC client to implement SecretsClient interface
  • Server-side adapter: translates gRPC requests to v1 provider calls
  • Introduces global connection pooling for gRPC clients

mTLS.md — Security and service discovery

  • Push-based mTLS model: ESO as CA, server certs for providers
  • Label-based service discovery on Kubernetes Services
  • Hot certificate reload capability with file watchers
  • 90-day validity for server/client certs; 1-year for CA

TESTING.md — Testing strategy

  • E2E test migration from v1 to v2 providers
  • Certificate management and mTLS lifecycle tests
  • Provider Conformance Suite (Go-based, versioned testing library)
  • Performance/load testing with baseline comparisons and connection pool metrics

THREAT_MODEL.md — Security analysis

  • Defines assets, trust boundaries, and threat actors
  • Covers 10 threat scenarios (T1–T10) with mitigations and residual risk assessment

PROVIDER_AUTOGEN.md — Code generation for provider boilerplate

  • Template-based generator for provider main.go and Dockerfile
  • Schema-driven validation via provider.yaml
  • Reduces duplicative startup code across providers

Key Decisions

  • In-cluster focus initially; external provider support deferred
  • Custom SANs out of scope for initial implementation
  • Single unified codebase for both in-process and out-of-process providers with transparent adapter layer

Signed-off-by: Moritz Johner <beller.moritz@googlemail.com>
@@ -0,0 +1,175 @@
# Provider Code Generation
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really a goal for this ADR? It feels this is already touching how to build / maintain out-of-tree providers as opposed to making everything work.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is part of it. Solving that complexity is one problem we need to solve in order to make the providers we have to day manageable. And i don't think a) there are that many options and 2) it's not something set in stone. This can evolve over time, as it is not a contract that (external) providers must implement.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I don't like about this is that it shifts the maintenance responsibility of every existing provider back to external-secrets org 😰. More things to maintain.

Copy link
Copy Markdown
Member Author

@moolen moolen Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, absolutely not. This is just for making the out-of-process providers work with the providers we have today.
If we decide to change the responsibility/ownership/whatever model people are totally free to get rid of this autogen stuff. It is optional, it is not an API contract.

Signed-off-by: Moritz Johner <beller.moritz@googlemail.com>
- Which v1 provider implementations to wrap
- GVK (Group/Version/Kind) mappings for stores and generators
- Package imports for API types and implementations
- Spec mapper logic to convert provider resources to `SecretStoreSpec`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to go back to SecretStoreSpec? To avoid changing the GenericStore interface?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that was the Intention. Though If there's a better way to do it i'm open to it :)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about leveraging GenericStore being an interface itself, we could add a GenericSpec Interface with a From/To method that every provider Spec implements - but every single provider (v1) would need to be updated to use that as opposed to handcrank the storeSpec. 🤔

It would make the generation process with minimal manual steps though, with the cost of a slight v1 change on the provider level.

No preferences really - my original question was for clarification.

Copy link
Copy Markdown
Contributor

@evrardjp evrardjp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First set of comments, only covering ADAPTERS.

- Reduced controller binary size and memory footprint
- Runtime provider discovery and configuration

This architectural shift introduces a network hop between the controller and secret providers. A critical requirement is maintaining a single codebase for provider implementations—we cannot fork provider implementations into separate "in-process" and "out-of-process" versions.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this deserves a rewording for clarity. I am sure I won't be able to remember why in 6 months time.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, what in particular is unclear? can you suggest something?


1. **Interface Compatibility:** The controller expects all providers to implement `SecretsClient`, but out-of-process providers communicate via gRPC rather than direct method calls.

2. **Code Reuse:** Provider implementations must work both as standalone gRPC servers and as libraries usable by in-process controllers without maintaining duplicate codebases.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's good to be DRY, but is avoiding duplicate codebases really a bad thing if they are easy to maintain?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can not have two separate versions of an AWS provider, a Vault provider etc. Features will come in, bugs need to be squashed etc. We can not afford to implement things twice - once in each codebase.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: some of the providers are not easy to maintain, they are a mess.

Copy link
Copy Markdown
Contributor

@evrardjp evrardjp Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realise why I was misunderstood :) I need to clarify with an user oriented example. I will write that down.
And if they are a mess, that's maybe something we need to clean up to improve the maintainability? Do you want me to tackle that?

Signed-off-by: Moritz Johner <beller.moritz@googlemail.com>
Signed-off-by: Moritz Johner <beller.moritz@googlemail.>
Signed-off-by: Moritz Johner <beller.moritz@googlemail.com>
Signed-off-by: Moritz Johner <beller.moritz@googlemail.com>
Signed-off-by: Moritz Johner <beller.moritz@googlemail.com>
Signed-off-by: Moritz Johner <beller.moritz@googlemail.com>
Signed-off-by: Moritz Johner <beller.moritz@googlemail.com>
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud bot commented Nov 5, 2025

@moolen moolen marked this pull request as ready for review November 5, 2025 23:10
Copy link
Copy Markdown
Contributor

@Skarlso Skarlso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some initial questions :D I'm half-way through or so :)

1. How do we establish a secure connection with a provider?
2. How do we discover providers within a Kubernetes cluster?
3. Do we support providers outside of a Kubernetes cluster? How?
4. How do we handle certificate rotation?
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the certificate manager handle that?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. right now we don't have anything in place which creates mTLS credentials and distributes them.
  2. Also in detail: how do we handle rotation? leaf certificates is easy, but how about the CA cert? do we generate it with 10y expiry and pray? if it is somewhat short-lived (1y) how do we rotate it in ESO core and in the provider? This is not easy.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the trust manager be able to do that? I mean isn't that the literal purpose of trust manager? To distribute bundles and watch for things that change and slowly rolling them out?

@moolen
Copy link
Copy Markdown
Member Author

moolen commented Nov 8, 2025

@Skarlso i think we should do a dedicated session to clarify your questions around discovery and authentication/registration.

Maybe next week, Wednesday?

@Skarlso
Copy link
Copy Markdown
Contributor

Skarlso commented Nov 8, 2025

Yep. Sounds good. 😌


## Problem Description

ESO Core needs to establish a secure and authenticated communication channel with out-of-process providers. It has to be encrypted in transit because we transmit secrets over the network. It must to be authenticated, otherwise a malicious actor could call GetSecret() to retrieve secrets and take advantage over the out-of-process provider's service account to read secrets or other resources within the cluster.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that a lot of users of ESO will also have an existing mTLS solution as part of their service mesh, so this risks to be redundant.

On the other hand, additional configuration will be required to allow communication with providers at the mesh level.

I am not sure this is ESO responsibility, but we are looking at a significant potential user experience hit given the popularity of service meshes these days.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to add more chaos juice here, some service mesh setups out there depend on external-secrets to run 😆

Copy link
Copy Markdown
Member

@gusfcarvalho gusfcarvalho Nov 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also confirming: user experience impact means adding the appropriate config / exceptions as opposed to just follow the standard mesh config, - correct?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO: Service Meshes are supposed to be transparent, there's nothing we need to take into consideration design-wise.

@github-actions
Copy link
Copy Markdown

This pr is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the Stale This issue/Pull Request is stale and will be automatically closed label Feb 18, 2026
@moolen moolen removed the Stale This issue/Pull Request is stale and will be automatically closed label Feb 18, 2026
@evrardjp evrardjp added the priority/important-longterm These is to tag issues that should require attention from maintainers label Feb 20, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 9, 2026

Walkthrough

The PR introduces comprehensive design documentation for an out-of-process provider architecture, covering API patterns, adapter mechanisms, mTLS security, code generation templates, testing strategies, and threat modeling. No code implementation changes are included.

Changes

Cohort / File(s) Summary
Core Architecture & Design
design/014/README.md, design/014/API.md, design/014/ADAPTER.md, design/014/mTLS.md
Establishes out-of-process provider foundation: overview and scope; two API options for ExternalSecret/PushSecret provider references (indirection layer vs. direct reference); bidirectional adapter pattern wrapping gRPC clients/servers; and mTLS-based secure communication with certificate lifecycle, hot reload, and RBAC considerations.
Implementation, Testing & Security
design/014/PROVIDER_AUTOGEN.md, design/014/TESTING.md, design/014/THREAT_MODEL.md
Covers supporting implementation and operational concerns: template-driven code generation to reduce provider boilerplate; multi-layered testing strategy (E2E migration, certificate management, conformance suite, performance/load, chaos testing); and detailed threat model identifying 10 attack scenarios with mitigations, assets, and residual risks.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud bot commented Mar 9, 2026

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@design/014/ADAPTER.md`:
- Around line 98-100: The design currently implies constructing a v1
SecretsClient on every RPC (e.g., "constructs v1 provider client" and calls
client.GetSecret(ctx, ref)), which can repeat expensive auth/session setup and
ignores the NewClient()/Close(ctx) lifecycle; update ADAPTER.md to explicitly
define lifecycle semantics: state that the server-side adapter SHOULD cache a
SecretsClient per provider reference (keyed by provider ID/credentials), reuse
it across RPCs, call NewClient() when creating the cached client, and invoke
Close(ctx) when evicting or shutting down (describe eviction policy or adapter
shutdown hook). Also update the text around lines 105–107 to reflect caching and
closure behavior so callers know when sessions are created and torn down.

In `@design/014/API.md`:
- Around line 20-34: The document presents Option 1 and Option 2 as if they can
be adopted without API changes, but they require breaking/backward-compatible
modifications to SecretStoreRef (e.g., allowing secretStoreRef.kind: Provider vs
current enum SecretStore|ClusterSecretStore) and adding an apiVersion field used
in Option 2; update the design text around ExternalSecret, secretStoreRef.kind,
Provider/ClusterProvider, and apiVersion to explicitly call out each required
API change and compatibility impact (what must be added/relaxed in
SecretStoreRef, how Provider/ClusterProvider types are introduced, and
migration/compatibility considerations) so readers know these are not drop-in
options.
- Around line 132-134: Replace the placeholder under the "## Recommendation"
heading with a concrete, single-line decision: state which option from the
Alternatives section you are choosing (or a conditional decision rule), a short
rationale tied to the documented tradeoffs, explicit acceptance criteria that
define when this ADR is satisfied, a brief migration/implementation plan with
any phased steps, and an owner for the follow-up work; reference the ADR's
Alternatives and Tradeoffs sections to ensure consistency and ensure the
recommendation is actionable rather than descriptive.

In `@design/014/mTLS.md`:
- Around line 290-307: Update the trust-boundary wording that currently states
"ESO does not transmit secrets over gRPC - providers fetch and return them":
make it explicit that secret material can and does transit the gRPC channel in
both directions depending on flow (read flows: provider -> ESO; push flows: ESO
-> provider), and then note the protections (mTLS for in-transit
confidentiality/authentication, strict RBAC for providers, avoid storing
plaintext secrets on either side, and audit/logging controls). Edit the
paragraph containing that sentence (the one referencing "ESO does not transmit
secrets over gRPC - providers fetch and return them") to accurately describe the
two flow types and the mitigations (mTLS, no distribution of client certs,
minimal exposure/storage, and audit), so consumers of the threat model
understand the channel sensitivity and applied controls.
- Around line 311-326: The design text must call out that the current RBAC
(cluster-wide Service watch + cross-namespace Secret writes for resource
"secrets" and the fixed secret name external-secrets-provider-tls) breaks scoped
Role-based installs; update the mTLS section to explicitly state that scoped
installs are unsupported for the auto-managed provider cert flow or provide an
alternative approach (e.g., per-namespace discovery using a namespaced Service
watch + a controller-per-namespace, or a secret-sync mechanism to copy
external-secrets-provider-tls into scoped namespaces) and document which
controller behaviors (Service watch, Secret create/update/patch) would change
under that alternative.

In `@design/014/README.md`:
- Line 24: The subsection headings like "Risk of Supply-Chain Attacks"
(currently written as "#### Risk of Supply-Chain Attacks") and the other
affected heading at the indicated location should be demoted one level to "###"
so they are proper subheadings under the current `##` sections; update all
occurrences of `#### ` to `### ` for those subsections (e.g., the "Risk of
Supply-Chain Attacks" heading and the other instance called out) to fix the
markdown hierarchy and satisfy markdownlint.
- Line 22: The ADR currently embeds a meeting URL with an embedded password
token (the URL containing "password="), which leaks credentials; edit the README
entry to remove the password query parameter (or replace the entire sensitive
Zoom link) and instead include a sanitized reference such as a non-sensitive
public summary, the meeting date/time and participant list, a link to the
project's private meeting notes location, or a short summary of key decisions;
locate the string in the README paragraph that contains the meeting URL (the
link with "password=") and replace it accordingly so no sensitive token remains
in the committed text.

In `@design/014/THREAT_MODEL.md`:
- Around line 167-175: T3's current "Low" impact understates a rogue Provider:
if ESO is pointed at an attacker-controlled Provider it can observe request
metadata and receive secret payloads on push paths—this should be reclassified
to High (or High/Confidentiality impact) and mitigations tightened; update the
T3 entry to reflect data-exfiltration risk, require enforcement controls on
Provider resources including a validating admission webhook for Provider
resources, stronger RBAC (deny create/modify except explicit principals),
provider allowlist/cryptographic attestation or signed manifests for Providers,
require end-to-end encryption/signing of secret payloads or explicit operator
approval for push paths, restrict service labels and egress policies, and add
audit/alerting for Provider changes and unexpected service labels (reference T3,
ESO, Provider resources, admission webhook, RBAC, service labels).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 460851c5-9459-4c64-8e90-23f8b7509730

📥 Commits

Reviewing files that changed from the base of the PR and between 81078c9 and bafc2a4.

⛔ Files ignored due to path filters (2)
  • design/014/out-of-process-api-direct-ref.png is excluded by !**/*.png
  • design/014/out-of-process-api-provider.png is excluded by !**/*.png
📒 Files selected for processing (7)
  • design/014/ADAPTER.md
  • design/014/API.md
  • design/014/PROVIDER_AUTOGEN.md
  • design/014/README.md
  • design/014/TESTING.md
  • design/014/THREAT_MODEL.md
  • design/014/mTLS.md

Comment on lines +98 to +100
7. Server-side adapter receives request
8. Server-side adapter constructs v1 provider client
9. Server-side adapter calls `client.GetSecret(ctx, ref)` on v1 implementation
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Define the server-side SecretsClient lifecycle explicitly.

Right now the data flow reads as “construct a v1 client per RPC”. That would forfeit the existing NewClient()/Close(ctx) lifecycle and can repeat expensive provider auth/session setup on every call. The design should state whether the adapter caches SecretsClients per provider reference and when they are closed.

Also applies to: 105-107

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/014/ADAPTER.md` around lines 98 - 100, The design currently implies
constructing a v1 SecretsClient on every RPC (e.g., "constructs v1 provider
client" and calls client.GetSecret(ctx, ref)), which can repeat expensive
auth/session setup and ignores the NewClient()/Close(ctx) lifecycle; update
ADAPTER.md to explicitly define lifecycle semantics: state that the server-side
adapter SHOULD cache a SecretsClient per provider reference (keyed by provider
ID/credentials), reuse it across RPCs, call NewClient() when creating the cached
client, and invoke Close(ctx) when evicting or shutting down (describe eviction
policy or adapter shutdown hook). Also update the text around lines 105–107 to
reflect caching and closure behavior so callers know when sessions are created
and torn down.

Comment on lines +20 to +34
### Option 1: Provider Indirection Layer

ExternalSecret references a `Provider` or `ClusterProvider` resource, which then references a provider-specific custom resource.

**Architecture:**
```
ExternalSecret
├─> secretStoreRef.kind: Provider
└─> secretStoreRef.name: my-provider
└─> Provider
├─> spec.address: grpc://provider-service:8080
└─> spec.providerRef
├─> kind: SecretsManager
└─> name: team-blue-eu-west-2
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Both options assume SecretStoreRef changes that are not representable today.

secretStoreRef.kind: Provider conflicts with the current enum (SecretStore|ClusterSecretStore), and Option 2 adds an apiVersion field that does not exist on the current type. The doc should call out the required API/backward-compatibility changes explicitly instead of comparing these as drop-in options.

Also applies to: 65-75

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/014/API.md` around lines 20 - 34, The document presents Option 1 and
Option 2 as if they can be adopted without API changes, but they require
breaking/backward-compatible modifications to SecretStoreRef (e.g., allowing
secretStoreRef.kind: Provider vs current enum SecretStore|ClusterSecretStore)
and adding an apiVersion field used in Option 2; update the design text around
ExternalSecret, secretStoreRef.kind, Provider/ClusterProvider, and apiVersion to
explicitly call out each required API change and compatibility impact (what must
be added/relaxed in SecretStoreRef, how Provider/ClusterProvider types are
introduced, and migration/compatibility considerations) so readers know these
are not drop-in options.

Comment on lines +132 to +134
## Recommendation

[To be filled in after discussion]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

The ADR needs an actual recommendation before merge.

Without selecting an option—or at least documenting concrete decision criteria—this stays descriptive rather than actionable and leaves the follow-up implementation direction ambiguous.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/014/API.md` around lines 132 - 134, Replace the placeholder under the
"## Recommendation" heading with a concrete, single-line decision: state which
option from the Alternatives section you are choosing (or a conditional decision
rule), a short rationale tied to the documented tradeoffs, explicit acceptance
criteria that define when this ADR is satisfied, a brief
migration/implementation plan with any phased steps, and an owner for the
follow-up work; reference the ADR's Alternatives and Tradeoffs sections to
ensure consistency and ensure the recommendation is actionable rather than
descriptive.

Comment on lines +290 to +307
1. ESO acts as Certificate Authority for provider communication only. The CA certificate for out-of-process Providers must be distinct from any other certificates used by ESO (e.g. conversion/validating webhook)
2. Providers present server certificates to ESO (ESO is the client)
3. ESO presents client certificates to providers
4. Mutual authentication is enforced (mTLS required)
5. Providers retrieve secrets from external APIs using their own RBAC
6. ESO does not transmit secrets over gRPC - providers fetch and return them

### Why ESO Can Trust Providers

ESO can trust any provider. ESO can trust anyone who is able to create a service object with the appropriate labels.

- Providers do not have access to ESO's secrets
- Providers only return data they fetch from external APIs
- Provider RBAC is separate (AWS IAM roles, Kubernetes service accounts, etc.)
- ESO signing **server** certificates for providers doesn't grant additional privileges
- mTLS ensures only authenticated providers can communicate

**However:** We must not distribute client certificates anyhwere, as this will allow anyone with access to a client certificate + key to fetch secrets from a provider.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

The trust-boundary section is wrong about secret transit.

Line 295 says ESO “does not transmit secrets over gRPC”, but read flows return secret values from the provider to ESO over gRPC, and push flows send secret material the other direction. That sentence understates the channel’s sensitivity and should be corrected before others build on this threat model.

🧰 Tools
🪛 LanguageTool

[style] ~299-~299: As a shorter alternative for ‘able to’, consider using “can”.
Context: ... any provider. ESO can trust anyone who is able to create a service object with the approp...

(BE_ABLE_TO)


[grammar] ~307-~307: Ensure spelling is correct
Context: ...must not distribute client certificates anyhwere, as this will allow anyone with access ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/014/mTLS.md` around lines 290 - 307, Update the trust-boundary wording
that currently states "ESO does not transmit secrets over gRPC - providers fetch
and return them": make it explicit that secret material can and does transit the
gRPC channel in both directions depending on flow (read flows: provider -> ESO;
push flows: ESO -> provider), and then note the protections (mTLS for in-transit
confidentiality/authentication, strict RBAC for providers, avoid storing
plaintext secrets on either side, and audit/logging controls). Edit the
paragraph containing that sentence (the one referencing "ESO does not transmit
secrets over gRPC - providers fetch and return them") to accurately describe the
two flow types and the mitigations (mTLS, no distribution of client certs,
minimal exposure/storage, and audit), so consumers of the threat model
understand the channel sensitivity and applied controls.

Comment on lines +311 to +326
ESO controller requires:

```yaml
# Watch services cluster-wide
- apiGroups: [""]
resources: ["services"]
verbs: ["get", "list", "watch"]

# Manage secrets in any namespace
- apiGroups: [""]
resources: ["secrets"]
verbs: ["create", "update", "patch", "get", "list", "watch"]
```

**Security consideration:** Cross-namespace secret write is privileged. ESO only writes to the fixed secret name `external-secrets-provider-tls` and only in namespaces with labeled services.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Call out that auto-managed provider certs break scoped installs.

This RBAC model requires cluster-wide Service watch plus cross-namespace Secret writes, which is incompatible with the repo’s existing scoped Role-based deployment mode. The design should explicitly mark scoped installs unsupported for this feature, or define an alternative discovery/distribution path for them.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/014/mTLS.md` around lines 311 - 326, The design text must call out
that the current RBAC (cluster-wide Service watch + cross-namespace Secret
writes for resource "secrets" and the fixed secret name
external-secrets-provider-tls) breaks scoped Role-based installs; update the
mTLS section to explicitly state that scoped installs are unsupported for the
auto-managed provider cert flow or provide an alternative approach (e.g.,
per-namespace discovery using a namespaced Service watch + a
controller-per-namespace, or a secret-sync mechanism to copy
external-secrets-provider-tls into scoped namespaces) and document which
controller behaviors (Service watch, Secret create/update/patch) would change
under that alternative.


## Problem Statements

For detailed discussion, see: [recording](https://zoom.us/rec/share/fioR9a-blopYRqALdvphBU2rXN-To3BAuMOZK1FdpxUIIe31Qam3oWDxGWFHKTT9.UmkUa-3W5abZww8-), [meeting notes](https://zoom-lfx.platform.linuxfoundation.org/meeting/95703312466-1761822000000/summaries?password=47fac69c-506a-408d-a424-ff3c1c69a0dc), [drawings](https://excalidraw.com/#room=bf94104fddcd05c32c20,BTAv5naTXFc-6m0275D0_g), and [this previous PR](https://github.com/external-secrets/external-secrets/pull/4792). There's also an [old issue](https://github.com/external-secrets/external-secrets/issues/696) with more research and history, although outdated.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Remove password-bearing meeting URLs from the ADR.

Committing a meeting link with an embedded password= token leaks access credentials into the repository history and will age poorly as long-term design documentation. Please replace these with a sanitized summary or non-sensitive references.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/014/README.md` at line 22, The ADR currently embeds a meeting URL with
an embedded password token (the URL containing "password="), which leaks
credentials; edit the README entry to remove the password query parameter (or
replace the entire sensitive Zoom link) and instead include a sanitized
reference such as a non-sensitive public summary, the meeting date/time and
participant list, a link to the project's private meeting notes location, or a
short summary of key decisions; locate the string in the README paragraph that
contains the meeting URL (the link with "password=") and replace it accordingly
so no sensitive token remains in the committed text.


For detailed discussion, see: [recording](https://zoom.us/rec/share/fioR9a-blopYRqALdvphBU2rXN-To3BAuMOZK1FdpxUIIe31Qam3oWDxGWFHKTT9.UmkUa-3W5abZww8-), [meeting notes](https://zoom-lfx.platform.linuxfoundation.org/meeting/95703312466-1761822000000/summaries?password=47fac69c-506a-408d-a424-ff3c1c69a0dc), [drawings](https://excalidraw.com/#room=bf94104fddcd05c32c20,BTAv5naTXFc-6m0275D0_g), and [this previous PR](https://github.com/external-secrets/external-secrets/pull/4792). There's also an [old issue](https://github.com/external-secrets/external-secrets/issues/696) with more research and history, although outdated.

#### Risk of Supply-Chain Attacks
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix the heading hierarchy.

These subsections jump from ## to ####, which keeps markdownlint failing. They should be ### under the current structure.

Also applies to: 42-42

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 24-24: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4

(MD001, heading-increment)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/014/README.md` at line 24, The subsection headings like "Risk of
Supply-Chain Attacks" (currently written as "#### Risk of Supply-Chain Attacks")
and the other affected heading at the indicated location should be demoted one
level to "###" so they are proper subheadings under the current `##` sections;
update all occurrences of `#### ` to `### ` for those subsections (e.g., the
"Risk of Supply-Chain Attacks" heading and the other instance called out) to fix
the markdown hierarchy and satisfy markdownlint.

Comment on lines +167 to +175
**Mitigation:**
- ✅ Provider-specific configuration still required (can't serve AWS secrets without AWS credentials)
- ✅ RBAC controls who can create Provider resources
- ⚠️ Consider admission webhook to validate Provider resources
- ⚠️ Monitor for unexpected service labels

**Impact:** Low - attacker can't access external secrets without proper credentials. Provider impersonation may be a DOS vector.

**Residual Risk:** Low
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

T3 understates the impact of a rogue provider.

This is not only a DoS vector. Once ESO is pointed at an attacker-controlled provider, that service can observe request metadata and, on push paths, receive secret payloads from ESO without needing any external-cloud credentials. Please reclassify the impact and tighten the mitigations accordingly.

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 172-172: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/014/THREAT_MODEL.md` around lines 167 - 175, T3's current "Low" impact
understates a rogue Provider: if ESO is pointed at an attacker-controlled
Provider it can observe request metadata and receive secret payloads on push
paths—this should be reclassified to High (or High/Confidentiality impact) and
mitigations tightened; update the T3 entry to reflect data-exfiltration risk,
require enforcement controls on Provider resources including a validating
admission webhook for Provider resources, stronger RBAC (deny create/modify
except explicit principals), provider allowlist/cryptographic attestation or
signed manifests for Providers, require end-to-end encryption/signing of secret
payloads or explicit operator approval for push paths, restrict service labels
and egress policies, and add audit/alerting for Provider changes and unexpected
service labels (reference T3, ESO, Provider resources, admission webhook, RBAC,
service labels).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm These is to tag issues that should require attention from maintainers size/l

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

5 participants