Skip to content

xds: restructure CertificateProvider fields#17201

Merged
lizan merged 9 commits intoenvoyproxy:mainfrom
markdroth:xds_tls_cert_provider_api_fixes
Aug 4, 2021
Merged

xds: restructure CertificateProvider fields#17201
lizan merged 9 commits intoenvoyproxy:mainfrom
markdroth:xds_tls_cert_provider_api_fixes

Conversation

@markdroth
Copy link
Copy Markdown
Contributor

@markdroth markdroth commented Jun 30, 2021

Signed-off-by: Mark D. Roth roth@google.com

Commit Message: xds: restructure CertificateProvider fields
Additional Description: When these fields were originally added, we didn't fully understand how the CertificateValidationContext configuration worked. A CertificateProvider cannot provide a full CertificateValidationContext proto; it provides only the CA certs (i.e., the trusted_ca field within the CertificateValidationContext proto). So this PR makes the following changes:

  • Duplicates the CertificateProvider and CertificateProviderInstance messages from CommonTlsContext as CertificateProviderPlugin and CertificateProviderPluginInstance outside of CommonTlsContext (in common.proto instead of tls.proto). (Note that this also paves the way for reusing these messages for SPIFFE validation in the future.)
  • Adds new CertificateProviderPlugin and CertificateProviderPluginInstance fields to CertificateValidationContext, as an alternative to the existing trusted_ca field. (Added annotations to make this a oneof in the future.)
  • In CommonTlsContext, I have deprecated the validation_context_certificate_provider and validation_context_certificate_provider_instance fields both in the validation_context_type oneof and in CombinedValidationContext.
  • In CommonTlsContext, I have replaced the tls_certificate_certificate_provider and tls_certificate_certificate_provider_instance fields with new fields that use the new messages outside of CommonTlsContext.

Risk Level: Low
Testing: N/A
Docs Changes: Included in PR
Release Notes: N/A
Platform Specific Features: N/A

CC @sanjaypujare @ejona86 @dfawley @yashykt @easwars @jaychenatr

Signed-off-by: Mark D. Roth <roth@google.com>
@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
envoyproxy/api-shepherds assignee is @lizan
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #17201 was opened by markdroth.

see: more, trace.

Signed-off-by: Mark D. Roth <roth@google.com>
@sanjaypujare
Copy link
Copy Markdown
Contributor

Some more context from my memory.

About the comment:

A CertificateProvider cannot provide a full CertificateValidationContext proto; it provides only the CA certs (i.e., the trusted_ca field within the CertificateValidationContext proto).

It is true but it also happens to be true for the field validation_context_sds_secret_config (either the one inside combined_validation_context or the one inside oneof validation_context_type). When I looked at how it was used by the Istio Node agent for example (https://github.com/istio/istio/blob/master/security/pkg/nodeagent/sds/sdsservice.go#L208) the Node agent just populates the trusted_ca of the validation_context and nothing else. So even the SDS server is just returning the certificate part of the validation_context. So ideally even the field validation_context_sds_secret_config should really have been named ca_sds_secret_config and should not have been part of the validation_context_type oneof proto and documented to say that it just provides the root certificate and no other validation_context info. But may be there is some more context behind this that I am not aware of.

In any case this change might still be desirable for the reasons stated in the PR.

@sanjaypujare
Copy link
Copy Markdown
Contributor

After this change is made, and deprecated fields removed, the control plane (Traffic Director for example) will then have to send the CertificateProviderInstance info for gRPC in ca_certificate_provider_instance field of CommonTlsContext but the equivalent certificate info for Envoy (validation_context_sds_secret_config) it will continue to send inside combined_validation_context or it can also just use the outer validation_context_sds_secret_config.

So tagging @karthikbox ...

@markdroth
Copy link
Copy Markdown
Contributor Author

It is true but it also happens to be true for the field validation_context_sds_secret_config (either the one inside combined_validation_context or the one inside oneof validation_context_type). When I looked at how it was used by the Istio Node agent for example (https://github.com/istio/istio/blob/master/security/pkg/nodeagent/sds/sdsservice.go#L208) the Node agent just populates the trusted_ca of the validation_context and nothing else. So even the SDS server is just returning the certificate part of the validation_context.

The fact that Istio's SDS server may happen to populate only the trusted_ca field does not mean that's the only field that it could populate. The SDS resource does actually contain a CertificateValidationContext message when used in these fields, so the server could populate any field it wanted to, and Envoy would use whatever it returned. That's the difference between SDS and a CertificateProvider.

After this change is made, and deprecated fields removed, the control plane (Traffic Director for example) will then have to send the CertificateProviderInstance info for gRPC in ca_certificate_provider_instance field of CommonTlsContext but the equivalent certificate info for Envoy (validation_context_sds_secret_config) it will continue to send inside combined_validation_context or it can also just use the outer validation_context_sds_secret_config.

This PR does not change anything at all for Envoy. It was already the case that Envoy had not implemented any of the CertificateProvider fields, and control planes were already populating the SDS fields.

}

// Certificate provider for fetching CA certs. This will populate the
// :ref:`CertificateValidationContext.trusted_ca field
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not put this in CertificateValidationContext to be a sibling of trusted_ca?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the same reason tls_certificate_certificate_provider is not part of TlsCertificate to be a sibling of certificate_chain field?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Eric, it makes a lot more sense to put it in CertificateValidationContext. There it will effectively be a oneof, rather than being something we need to merge in later.

Sanjay, a CertificateProvider is an alternative to TlsCertificate; it provides the same information in a different way. It would not make sense to have it inside of TlsCertificate.

I've gone ahead and made this change.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...CertificateProvider is an alternative to TlsCertificate...

CertificateValidationContext does not use TlsCertificate but instead uses trusted_ca (which is of type DataSource) so the field trusted_ca in CertificateValidationContext serves the exact same purpose as certificate_chain in TlsCertificate for sending the actual certificate.

Your comment also says "Only one of trusted_ca, ca_certificate_provider, and ca_certificate_provider_instance".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TlsCertificate contains both the certs and the private keys. A CertificateProvider provdes both, so it's an alternative to the entire TlsCertificate message.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, I forgot the private key part. So the TlsCertificate and CertificateValidationContext are clearly differentiated that way but a CertificateProvider is used to get a cert + key for the identity certificate (like a TlsCertificate) or just the trust/root certificate (like the trusted_ca of CertificateValidationContext). Makes sense

@htuch
Copy link
Copy Markdown
Member

htuch commented Jul 1, 2021

Since these are "not implemented hide", you can avoid deprecation if gRPC isn't already using this in the wild FWIW.

Signed-off-by: Mark D. Roth <roth@google.com>
@markdroth
Copy link
Copy Markdown
Contributor Author

Since these are "not implemented hide", you can avoid deprecation if gRPC isn't already using this in the wild FWIW.

My initial commit actually did that, but it turns out that all of the fields are actually in use, so we need to deprecate them.

@sanjaypujare
Copy link
Copy Markdown
Contributor

...
This PR does not change anything at all for Envoy. It was already the case that Envoy had not implemented any of the CertificateProvider fields, and control planes were already populating the SDS fields.

It doesn't change anything for Envoy but it changes a few things for the control plane that is driving both Envoy and gRPC. So the current config sent by a control plane looks like this:

          "commonTlsContext": {
            "tlsCertificateSdsSecretConfigs": [{
              "name": "tls_sds",
              "sdsConfig": {
                "path": "/etc/envoy/tls_certificate_context_sds_secret.yaml"
              }
            }],
            "combinedValidationContext": {
              "defaultValidationContext": {
              },
              "validationContextSdsSecretConfig": {
                "name": "validation_context_sds",
                "sdsConfig": {
                  "path": "/etc/envoy/validation_context_sds_secret.yaml"
                }
              },
              "validationContextCertificateProviderInstance": {
                "instanceName": "foobar",
                "certificateName": "ROOTCA"
              }
            },
            "tlsCertificateCertificateProviderInstance": {
              "instanceName": "foobar",
              "certificateName": "DEFAULT"
            }
          }
        }

But with the changes in this PR and the deprecated fields actually removed the config will have to be:

          "commonTlsContext": {
            "tlsCertificateSdsSecretConfigs": [{
              "name": "tls_sds",
              "sdsConfig": {
                "path": "/etc/envoy/tls_certificate_context_sds_secret.yaml"
              }
            }],
            "combinedValidationContext": {
              "defaultValidationContext": {
              },
              "validationContextSdsSecretConfig": {
                "name": "validation_context_sds",
                "sdsConfig": {
                  "path": "/etc/envoy/validation_context_sds_secret.yaml"
                }
              }
            },
            "tlsCertificateCertificateProviderInstance": {
              "instanceName": "foobar",
              "certificateName": "DEFAULT"
            },
            "caCertificateProviderInstance": {
               "instanceName": "foobar",
               "certificateName": "ROOTCA"
            }
          }
        }

So something for them to be aware of.

@karthikbox
Copy link
Copy Markdown

I'm fine with the api changes overall. What i'm worried about is that how does the mgmt server decides which API to use for a xDS client? (NOTE: This is not a new problem introduced by this PR.) Seems like there's no single API which would work for both grpc and envoy.
Will there ever be a common API that works for all clients?
Or should we consider a mechanism where the client can signal to the mgmt server on which API it supports (i'm thinking via resource names) ?
Such a mechanism would solve the following:

  1. mgmt server can correctly program the based on what the client actually supports.
  2. it will work well when we introduce new API in the future.
  3. Makes migration between API easier (see below).
    WDYT @markdroth @htuch

Since these are "not implemented hide", you can avoid deprecation if gRPC isn't already using this in the wild FWIW.

My initial commit actually did that, but it turns out that all of the fields are actually in use, so we need to deprecate them.

yes, these are in use.
How do we solve the problem of migrating to the new CertificateProviderPlugin API? Does the mgmt server populate them both and then later stops programming the deprecated CertificateProvider API ?

@markdroth
Copy link
Copy Markdown
Contributor Author

@sanjaypujare The exact change will be a little different than that, but yes, the change does affect management servers that support gRPC (regardless of whether they also support Envoy). So what you're pointing out is basically that management servers that use the fields that are being deprecated here need to be changed to use the new fields instead... which is the whole point of this PR. :)

@karthikbox As you note, this concern is not actually related to this PR, so I don't think we need to hash it out here, but we have had a bunch of discussions about this. In the long term, I think the goal should be to migrate Envoy to the CertificateProviderInstance model, because everything else can be supported via that API. An individual Envoy can have a CertificateProvider plugin that speaks SDS if that's what they want to use, or it can have a plugin that explicitly specifies a static set of certs, or reads it from files on local disk, or whatever. The management server should not actually need to know or care how the client is getting its certs, as long as it can tell the client which cert to get via which logical mechanism. This eliminates the need for any sort of negotiation between the client and server.

@markdroth
Copy link
Copy Markdown
Contributor Author

How do we solve the problem of migrating to the new CertificateProviderPlugin API? Does the mgmt server populate them both and then later stops programming the deprecated CertificateProvider API ?

Yes, a control plane like TD would need to temporarily populate both. But since the gRPC feature is only in beta right now, we would not need to maintain the code for the old fields for very long.

markdroth added 3 commits July 1, 2021 12:03
Signed-off-by: Mark D. Roth <roth@google.com>
Signed-off-by: Mark D. Roth <roth@google.com>
Signed-off-by: Mark D. Roth <roth@google.com>
@lizan
Copy link
Copy Markdown
Member

lizan commented Jul 2, 2021

Since these are "not implemented hide", you can avoid deprecation if gRPC isn't already using this in the wild FWIW.

My initial commit actually did that, but it turns out that all of the fields are actually in use, so we need to deprecate them.

How are they used? Since the message moved out is keeping same wire format, and as they are "not implemented hide" so if you don't need generated API compatibility we can still avoid deprecation.

@markdroth
Copy link
Copy Markdown
Contributor Author

How are they used? Since the message moved out is keeping same wire format, and as they are "not implemented hide" so if you don't need generated API compatibility we can still avoid deprecation.

We do need generated API compatibility, unfortunately.

message CertificateProviderPlugin {
// opaque name used to specify certificate instances or types. For example, "ROOTCA" to specify
// a root-certificate (validation context) or "TLS" to specify a new tls-certificate.
string name = 1 [(validate.rules).string = {min_len: 1}];
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference of this name and the one in typed_config?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one in typed_config is simply a name or an opaque identifier for the extension. So it is purely config associated with the plugin. Whereas this name is meant to serve as an identification of the certificate that is meant to be provided by the certificate provider. Similar to the certificate_name on line 275. May be we should call this certificate_name too?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about this for a bit, I've elected to just remove CertificateProviderPlugin completely. We added that before we added CertificateProviderPluginInstance, but it's not actually being used anywhere, since all of our code uses the newer type. We can always add this later if we actually need it for something.

Signed-off-by: Mark D. Roth <roth@google.com>
@markdroth
Copy link
Copy Markdown
Contributor Author

Let's hold off on this for a bit. We're having some discussions about how we want to structure this for SPIFFE.

@markdroth
Copy link
Copy Markdown
Contributor Author

It looks like the SPIFFE discussion is getting kicked down the road a bit, so let's proceed with this PR as-is.

// delivered via SDS.
config.core.v3.DataSource trusted_ca = 1;
//
// Only one of *trusted_ca*, *ca_certificate_provider*, and *ca_certificate_provider_instance*
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see ca_certificate_provider defined in this struct. Either that needs to be defined here or the comment should be updated to reflect it. IIRC @coolnsu wanted to keep it for future use.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I forgot to update this comment when I decided not to include that field. (We can always add it later if we need it.)

Fixed.


// Certificate provider instance for fetching TLS certificates.
//
// Only one of *trusted_ca*, *ca_certificate_provider*, and *ca_certificate_provider_instance*
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above about the reference to ca_certificate_provider.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

// The plugin allows certificates to be fetched/refreshed over the network asynchronously with
// respect to the TLS handshake.
// [#not-implemented-hide:]
message CertificateProviderPluginInstance {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from moving from tls.proto it seems we have also renamed it from CertificateProviderInstance. Are we forcing all certificate providers to be plugins thru this nomenclature? Also I think "extension" is preferred over "plugin" IIRC from a previous PR/comment but I don't share that preference.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have preferred to keep the existing name, but I was forced to rename it due to #17216. There is no intention to establish a new nomenclature convention here.

// fetched/refreshed over the network asynchronously with respect to the TLS handshake.
//
// DEPRECATED: This message was moved outside of CommonTlsContext
// and now lives in common.proto.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see that in common.proto. So we have decided to drop it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I've updated the comment.

// Only one of *tls_certificates*, *tls_certificate_sds_secret_configs*,
// and *tls_certificate_provider_instance* may be used.
// [#not-implemented-hide:]
CertificateProviderPluginInstance tls_certificate_provider_instance = 14;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it uses the new type (which is fair since that's the intent) but the field is now called tls_certificate_provider_instance and the original field was tls_certificate_certificate_provider_instance which combines tls_certificate (indicating this is for the identity certificate) and certificate_provider_instance (indicating this is a certificate provider instance). To be really consistent with other fields providing the identity certificate the field would be called tls_certificate_certificate_provider_plugin_instance . But I am okay with the name.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think longer names are useful when they avoid ambiguity, but I don't see any ambiguity in the shorter name in this case.

@sanjaypujare
Copy link
Copy Markdown
Contributor

sanjaypujare commented Jul 28, 2021

Let's say a control is sending the following config (CommonTlsContext) to cater to both Envoy and gRPC:

"commonTlsContext": {
    "tlsCertificateSdsSecretConfigs": [{
        "name": "tls_sds",
        "sdsConfig": {
            "path": "/etc/envoy/tls_certificate_context_sds_secret.yaml"
        }
    }],
    "tlsCertificateCertificateProviderInstance": {
        "instanceName": "grpc-cert-instance",
        "certificateName": "DEFAULT"
    },
    "combinedValidationContext": {
        "defaultValidationContext": {
	    "matchSubjectAltNames": [{
                "exact": "spiffe://domain-A/ns/example-grpc-server/sa/example-grpc-server"
            }]
        },
	"validationContextSdsSecretConfig": {
            "name": "validation_context_sds",
            "sdsConfig": {
                "path": "/etc/envoy/validation_context_sds_secret.yaml"
            }
        },
        "validationContextCertificateProviderInstance": {
            "instanceName": "grpc-cert-instance",
            "certificateName": "ROOTCA"
        }
    }
}

Note the ...SdsSecretConfig fields are there for Envoy compatibility.

After merging the PR then the same control plane will need to change the config as follows to conform to the new structure and also be Envoy compatible:

"commonTlsContext": {
    "tlsCertificateSdsSecretConfigs": [{
        "name": "tls_sds",
        "sdsConfig": {
            "path": "/etc/envoy/tls_certificate_context_sds_secret.yaml"
        }
    }],
    "tlsCertificateProviderInstance": {
        "instanceName": "grpc-cert-instance",
        "certificateName": "DEFAULT"
    },
    "combinedValidationContext": {
        "defaultValidationContext": {
	    "matchSubjectAltNames": [{
                "exact": "spiffe://domain-A/ns/example-grpc-server/sa/example-grpc-server"
	    }],
	    "caCertificateProviderInstance": {
		"instanceName": "grpc-cert-instance",
		"certificateName": "ROOTCA"
            }
        },
	"validationContextSdsSecretConfig": {
            "name": "validation_context_sds",
            "sdsConfig": {
                "path": "/etc/envoy/validation_context_sds_secret.yaml"
            }
        }
    }
}

The only 2 changes I see are:

  • change from deprecated field tlsCertificateCertificateProviderInstance to the new field tlsCertificateProviderInstance
  • stop using validationContextCertificateProviderInstance inside combinedValidationContext instead send the same config in caCertificateProviderInstance inside defaultValidationContext inside combinedValidationContext.

CC @karthikbox

Signed-off-by: Mark D. Roth <roth@google.com>
@markdroth
Copy link
Copy Markdown
Contributor Author

@lizan, I think this PR is ready to be merged. What do you think?

Signed-off-by: Mark D. Roth <roth@google.com>
@repokitteh-read-only repokitteh-read-only bot removed the api label Aug 4, 2021
@lizan lizan merged commit b29d654 into envoyproxy:main Aug 4, 2021
baojr added a commit to baojr/envoy that referenced this pull request Aug 4, 2021
…bridge-stream

* upstream/main: (32 commits)
  tls: move ssl connection info into SocketAddressProvider (envoyproxy#17334)
  conn pool: default enable runtime feature `conn_pool_delete_when_idle` (envoyproxy#17577)
  api: LEDS api introduction (envoyproxy#17419)
  kafka: add support for api versions request in mesh-filter (envoyproxy#17475)
  ext_proc: Implement BUFFERED_PARTIAL processing mode (envoyproxy#17531)
  tooling: Async/pathlib/mypy cleanups and utils (envoyproxy#17505)
  xds: restructure CertificateProvider fields (envoyproxy#17201)
  Refactor OverloadIntegrationTest breaking out a test base, and the fake resource monitors. (envoyproxy#17530)
  listener: move active connection collection out of active tcp listener (envoyproxy#16947)
  tools: format checks for backticks (envoyproxy#17566)
  coverage: set lower limit for common/quic and common (envoyproxy#17573)
  v2: final source removal (envoyproxy#17565)
  test: bumping coverage (envoyproxy#17564)
  quic: enforcing header size and contents (envoyproxy#17520)
  Support for canonicalizing URI properly for AWS SigV4 signer (envoyproxy#17137)
  listener: add a stat for transport socket connect timeout (envoyproxy#17458)
  listener: add listen() error handling (envoyproxy#17427)
  http: return per route config when direct response is set (envoyproxy#17449)
  removing most v2 references from source/ (envoyproxy#17415)
  bug fix: return bootstrap when validating config (envoyproxy#17499)
  ...

Signed-off-by: Garrett Bourg <bourg@squareup.com>
leyao-daily pushed a commit to leyao-daily/envoy that referenced this pull request Sep 30, 2021
When these fields were originally added, we didn't fully understand how the `CertificateValidationContext` configuration worked.  A `CertificateProvider` cannot provide a full `CertificateValidationContext` proto; it provides only the CA certs (i.e., the `trusted_ca` field within the `CertificateValidationContext` proto).  So this PR makes the following changes:
- Duplicates the `CertificateProvider` and `CertificateProviderInstance` messages from `CommonTlsContext` as `CertificateProviderPlugin` and `CertificateProviderPluginInstance` outside of `CommonTlsContext` (in common.proto instead of tls.proto).  (Note that this also paves the way for reusing these messages for SPIFFE validation in the future.)
- Adds new `CertificateProviderPlugin` and `CertificateProviderPluginInstance` fields to `CertificateValidationContext`, as an alternative to the existing `trusted_ca` field.  (Added annotations to make this a `oneof` in the future.)
- In `CommonTlsContext`, I have deprecated the `validation_context_certificate_provider` and `validation_context_certificate_provider_instance` fields both in the `validation_context_type` oneof and in `CombinedValidationContext`.
- In `CommonTlsContext`, I have replaced the `tls_certificate_certificate_provider` and `tls_certificate_certificate_provider_instance` fields with new fields that use the new messages outside of `CommonTlsContext`.

Risk Level: Low
Testing: N/A
Docs Changes: Included in PR
Release Notes: N/A
Platform Specific Features: N/A

Signed-off-by: Mark D. Roth <roth@google.com>
@markdroth markdroth deleted the xds_tls_cert_provider_api_fixes branch January 4, 2022 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants