Draft: supporting API minor versions (Do not submit)#14943
Draft: supporting API minor versions (Do not submit)#14943adisuissa wants to merge 10 commits intoenvoyproxy:mainfrom
Conversation
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
|
@envoyproxy/api-shepherds PTAL |
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
|
I'll take a first pass. |
htuch
left a comment
There was a problem hiding this comment.
Awesome, this is great to see coming to fruition. I've left some high level design comments. @markdroth probably should also take a pass (in particular around transport-level changes) as a next round.
/wait
| @@ -0,0 +1 @@ | |||
| 3.7.0 | |||
There was a problem hiding this comment.
FYI at the beginning of Q2 I think we will be bumping to minor version 3.1.0. We will tag all deprecated fields as deprecated at 3.0.0 prior to that. Then in 2022 Q2, we will bump to 3.2.0 and remove support in Envoy for everything deprecated 3.0.0. This gives 2 years of support for minor version 3.0.0 and any field that is removed in 2022 will have had 1-2 years life as deprecated.
There was a problem hiding this comment.
I agree that the first minor version will be 3.1.0, and all deprecated fields will have the deprecated_at annotation set to 3.0.
At the moment it is set to 3.7.0 to "comply" with the api/envoy/config/core/v3/deprecaetd_example.proto example contents.
| The minor version enables API stability and support, and to safely deprecate old | ||
| features and reduce the tech debt. A client can support a range of minor API | ||
| versions, and can request xDS resources that match this range, using the resource | ||
| [context parameters](https://github.com/cncf/udpa/blob/master/xds/core/v3/context_params.proto). |
There was a problem hiding this comment.
Also Node will indicate for simple string names. I'd generally lead in with the old way of doing it and then point out the new way.
There was a problem hiding this comment.
Added user_agent_latest_api_version and user_agent_oldest_api_version to Node.
There was a problem hiding this comment.
Looks good. Suggest also updating the docs here to cover both Node and xdstp://
| // from a resource name to its status. A resource that doesn't appear in the map | ||
| // implies the resource response is valid. No status map implies that all the | ||
| // resources in the response are valid. | ||
| map<string, DiscoveryResponseStatus> resources_status = 7; |
There was a problem hiding this comment.
As discussed IRL, we should probably move this to a separate PR for this feature in general. @markdroth would this help get rid of the "15 second rule"?
There was a problem hiding this comment.
Yeah, I think something like this is a good idea. But I do think it should probably be a separate PR.
Do we actually want this to be a separate map? An alternative would be to embed the DiscoveryResponseStatus inside of the Resource message in the normal resources field. That way, clients just need to look at one field for every resource, and some of those resources might be a special type that indicates that the server didn't have the requested resource.
Also, what is the expected semantic here when a subscribed resource does not exist and multiple responses are sent during the time it does not exist? For LDS and CDS, is the server obligated to send every non-existent resource in every response, just like it normally does? For other resource types, can the server send the response status just once, and the client assumes that the status doesn't change until/unless it receives a subsequent response that either populates the resource or changes its response status? (The semantic here might be a bit more obvious if the DiscoveryResponseStatus was inside of the resources field instead of being a separate map.)
There was a problem hiding this comment.
I agree that it makes more sense to add the DiscoveryResponseStatus to each specific resource (in Resource).
I'm probably missing something here, but IIUC the resources field in a SotW response is of type google.protobuf.Any, and the response is not required to return the resources in a wrapped Resource. If this implies that the Resource wrapper is required how do we plan to the gradual introduction of this to existing xDS services?
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
…rsion_v5 Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
…rsion_v5 Signed-off-by: Adi Suissa-Peleg <adip@google.com>
|
A description of the tools needed for minor/patch versioning verifications and updates (without the Manifest tools): |
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
htuch
left a comment
There was a problem hiding this comment.
Looks great, I've left another round of feedback. LMK when to take another pass. Thanks.
/wait
| The minor version enables API stability and support, and to safely deprecate old | ||
| features and reduce the tech debt. A client can support a range of minor API | ||
| versions, and can request xDS resources that match this range, using the resource | ||
| [context parameters](https://github.com/cncf/udpa/blob/master/xds/core/v3/context_params.proto). |
There was a problem hiding this comment.
Looks good. Suggest also updating the docs here to cover both Node and xdstp://
api/API_VERSIONING.md
Outdated
| versions, and can request xDS resources that match this range, using the resource | ||
| [context parameters](https://github.com/cncf/udpa/blob/master/xds/core/v3/context_params.proto). | ||
| Deprecated features, from previous minor or major versions (and their code), can be safely removed | ||
| from the client. In general, within a major and minor API version, we do not allow any breaking |
There was a problem hiding this comment.
within a major version (including all minor versions within that major version), we do not allow ..
| The minor version will be reset to 0 upon a release of a new major version. | ||
|
|
||
| The patch version tracks incremental changes to the API, such as, adding a new | ||
| package, adding a new field, deprecating a field, etc. |
There was a problem hiding this comment.
Suggest making this a list and enumerating explicitly all things that affect path version, since this is basically the specification doc for versioning.
api/API_VERSIONING.md
Outdated
| In general, within a package's major API version, we do not allow any breaking changes. The guiding | ||
| principle is that neither the wire format nor protobuf compiler generated language bindings should | ||
| experience a backward compatible break on a change. Specifically: | ||
| In general, within a major and minor API version, we do not allow any breaking changes the protobuf |
There was a problem hiding this comment.
within a major version (including all minor versions within that major version), we do not allow (seems a repeat of above)
api/API_VERSIONING.md
Outdated
| ## Minor version lifecycle | ||
|
|
||
| A new API minor version is expected to be released once a year, at the Q2 release. This | ||
| allows for gradual introduction of new capabilities and features that will be supported |
There was a problem hiding this comment.
allows for introduction of new capabilities and features that will be supported by Envoy and gradual deprecation of older feature support in Envoy.
api/API_VERSIONING.md
Outdated
| (7 to 8), and features deprecated in minor version 6 will be removed. | ||
| When the API version becomes v3.9.0, `field1` will be removed from the API. | ||
|
|
||
| Assume the [xDS API shepherds](https://github.com/orgs/envoyproxy/teams/api-shepherds) decide that |
There was a problem hiding this comment.
Actually, delete this paragraph, I agree it is confusing to keep when reading the context around it.
api/API_VERSIONING.md
Outdated
| When the minor version is cut (starting with minor version `Y+1`), the protoxform tool will first | ||
| verify that all deprecated fields have a `minor_version` annotations and remove any field or enum | ||
| value with minor version `Y-1`. Note that the API should not contain features of | ||
| version smaller than or equal to `Y-1` after the a new minor version is introduced. |
There was a problem hiding this comment.
No API field removal :) These will be human changes to the Envoy where we delete stuff that is not trully unsupported at minor version clock ticks.
| // the minor and patch versions are updated upon incremental changes to the API. | ||
| message ApiVersionNumber { | ||
| // SemVer version of the API. | ||
| type.v3.SemanticVersion version = 1; |
There was a problem hiding this comment.
Curious if you anticipate any other fields in this message?
|
|
||
| // [#not-implemented-hide:] | ||
| // An optional API patch version of the requested resource that is supported by | ||
| // the client. This is useful for use-cases where the client and server support new |
There was a problem hiding this comment.
I'd phrase this as Latest xDS API patch version supported by the client.
| envoy::config::core::v3::ApiVersionNumber ApiVersionInfo::makeApiVersion(const char* version) { | ||
| envoy::config::core::v3::ApiVersionNumber result; | ||
| // Split API_VERSION_NUMBER into version | ||
| std::regex ver_regex("([\\d]+)\\.([\\d]+)\\.([\\d]+)"); |
There was a problem hiding this comment.
Nit: maybe just absl::StrSplit on .
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
markdroth
left a comment
There was a problem hiding this comment.
Sorry for taking so long to review this; I've been pretty busy lately.
It might be a good idea for us to sync up in person to decide the best way to proceed with this. I'd still prefer to consider handling this via flexible range-based context parameter matching if possible.
Please let me know if you have any questions. Thanks!
| @@ -0,0 +1,39 @@ | |||
| syntax = "proto3"; | |||
There was a problem hiding this comment.
The name of this file has a typo in it.
| // from a resource name to its status. A resource that doesn't appear in the map | ||
| // implies the resource response is valid. No status map implies that all the | ||
| // resources in the response are valid. | ||
| map<string, DiscoveryResponseStatus> resources_status = 7; |
There was a problem hiding this comment.
Yeah, I think something like this is a good idea. But I do think it should probably be a separate PR.
Do we actually want this to be a separate map? An alternative would be to embed the DiscoveryResponseStatus inside of the Resource message in the normal resources field. That way, clients just need to look at one field for every resource, and some of those resources might be a special type that indicates that the server didn't have the requested resource.
Also, what is the expected semantic here when a subscribed resource does not exist and multiple responses are sent during the time it does not exist? For LDS and CDS, is the server obligated to send every non-existent resource in every response, just like it normally does? For other resource types, can the server send the response status just once, and the client assumes that the status doesn't change until/unless it receives a subsequent response that either populates the resource or changes its response status? (The semantic here might be a bit more obvious if the DiscoveryResponseStatus was inside of the resources field instead of being a separate map.)
| The minor version enables API stability and support, and to safely deprecate old | ||
| features and reduce the tech debt. A client can support a range of minor API | ||
| versions, and can request xDS resources that match this range, using the resource | ||
| [context parameters](https://github.com/cncf/udpa/blob/master/xds/core/v3/context_params.proto). |
There was a problem hiding this comment.
I would still like to consider adding a flexible range-based matching semantic for minor versions. Can we do that as part of this effort, or do we want to decouple that from this?
Hmm... Actually, now that we've decided to stick with string representations for new-style resource names, doesn't that mean that we don't actually have a way to add more flexible matching syntax later if we don't do it now? When we were going to be using ResourceLocator on the wire, it would have been easy to add new matching types, because they'd be expressed in different fields. But with the string form, it's basically just "key=value&key=value&...". So if we're interpreting the values literally, then any new special syntax we add there may break existing users.
It seems like we need to either add the more flexible matching syntax now or carve out some sort of escape hatch in the URI syntax to allow for future changes. (Actually, the latter would probably be a good idea even if we do the former, since we may want to add other flexible matcher types in the future.)
There was a problem hiding this comment.
I'm on the fence on this one. I would like to future proof with some notion of escape hatch in the URI syntax, that's a no brainer. I think I'd like to understand the full set of flexible match objectives before deciding on a concrete syntax and what do use here instead of min/max version. How do you think is best to go forward on that @markdroth?
There was a problem hiding this comment.
I agree that range-based matching semantics parameters should be supported, but I'm not certain I understand many of the use-cases that it will be used for.
I think that eventually we can replace this with a parameter similar to this: xds.api.range.minor_version={"start":5,"end":6}.
There was a problem hiding this comment.
Is the concern here that we don't want magic params that have prior agree to semantics? Meaning, if we have an oldest value and a newest value it seems not hard to just do the range matching manually.
There was a problem hiding this comment.
IIUC the issue of exact-matching is that the returned resource must have the same parameters.
Range-parameters enable explicit semantics. For example, a cache (or xds-relay) will be able to return the same resource with minor version 5 for a request with the range [4,5] and for a different request with the range [5,6].
There was a problem hiding this comment.
Right, the issue is that the current exact-match semantic means that the context params in the result must exactly match the ones in the request. So if you have one client requesting xds.api.range.minor_version=[0, 1] and another requesting xds.api.range.minor_version=[1, 2], you can provide the contents at version 1 to both of them, but you need to return each one with a different context parameter. This will basically lead to cache pollution, where any caching proxy will need an increasing number of copies of each identical resource under different names, and the pollution will increase combinatorically.
What I am proposing is a range-based matching mechanism where the clients can ask for something like xds.api.range.minor_version=SPECIAL_RANGE_SYNTAX[0, 1] and xds.api.range.minor_version=SPECIAL_RANGE_SYNTAX[1, 2], and both can get back the same resource with context parameter xds.api.range.minor_version=1. In order to make that work, we need to define the syntax for range-based matching, make sure it doesn't conflict with any "flat" context parameter values (e.g., by defining some sort of escaping mechanism), and make sure that xDS servers, clients, and proxies understand it.
This is a bit more work, but I think it's also a cleaner solution. And I think this same mechanism will be able to be used for other use-cases in the future. (We already have one use-case internally for which we've been discussing something like this.)
@htuch, assuming that we have a reasonable way to future-proof the syntax in a way that would allow us to add more flexible matching types if we need to do so in the future, I think I'd advocate for restricting our initial changes to just the cases that we need immediately. For minor version, I think that's range-based matching. And if we want to handle the patch version the same way, we might also want a simple wildcard match for that. I think anything else can wait for the future.
I would be happy to write up a proposal for how this would work, but I need something to build off of. Where did we leave things on plans for the initial xRFC for the new naming scheme? Are you still planning to do that at some point?
There was a problem hiding this comment.
OK thanks this makes sense to me. I agree we should do the work to implement range syntax. It doesn't seem too terribly difficult.
There was a problem hiding this comment.
Yes, a proposal here for a syntax which is future proof and can initially accommodate simple range would help unblock work here I think. Thanks Mark.
| // API features, before the next API minor version is rolled out. | ||
| // If not included, the patch version defaults to 0, stating that the client | ||
| // supports API features up to the client's minor version. | ||
| uint32 patch_version = 7; |
There was a problem hiding this comment.
Why is this here instead of being in the Node message? As currently designed, it's by definition something that works only when directly talking to an authoritative server, because it's not part of the resource name, so caching proxies can't handle it.
Note that if we do add a mechanism for flexible range-based context param matching, then we can make this part of the resource name, so it would work basically the same way as the minor version. I think that would actually make this mechanism easier for people to understand.
There was a problem hiding this comment.
If it's not federated, it can live in the Node message agreed.
There was a problem hiding this comment.
I'm probably missing something but why isn't it effectively part of the resource name? Meaning, if we assume the patch level for a specific message is a recursive walk of all sub-messages, don't we know it on a per message/resource basis and then it might be different and should be set here? I guess maybe I'm then also confused why we don't need major/minor here also? Or we assume that those are sent in context params?
There was a problem hiding this comment.
I think it's related to the semantics of exact-match and of an optional patch support.
If this is part of the context-params, caches should only return resources with the patch, and caches will need to cache the same resource with different patch versions.
There was a problem hiding this comment.
OK thanks, this makes sense. Both this convo and the range convo does make me wonder a bit if we are thinking about context params wrong and the server should actually be able to return a result with a subset of matching params? Does that make sense?
There was a problem hiding this comment.
Yes, let's do range and move things to patch, thanks.
There was a problem hiding this comment.
What if there was a special wildcard param value? (Just thinking out loud here). This would then allow intermediate caches to be quite intelligent about mappings. Taking the patch value this would allow the server to respond with patch_value=* or whatever.
There was a problem hiding this comment.
I agree with moving patch into Node. Actually Node already has a field ApiVersionNumber user_agent_latest_api_version where a client can add the patch version it supports.
Just wanted to note that at the moment there's no support for patch version in the xdstp naming scheme version negotiation, and that should be fine because patch version support is optional.
What if there was a special wildcard param value? (Just thinking out loud here). This would then allow intermediate caches to be quite intelligent about mappings. Taking the patch value this would allow the server to respond with
patch_value=*or whatever.
IMO the patch field should actually be understood as maximal supported patch version, so if the server responds with patch_value=*, the intermediate servers will not be able to know if the resource is supported in max_patch_version=3 or in max_patch_version=6, and they'll need that to process future requests.
There was a problem hiding this comment.
, the intermediate servers will not be able to know if the resource is supported in max_patch_version=3 or in max_patch_version=6, and they'll need that to process future requests.
But isn't that effectively by design? The management server shouldn't send * unless it really means it. I think what I'm getting at is it would give a way for the management server to effectively ignore patch that is cache compatible.
There was a problem hiding this comment.
I think it's possible to introduce a notion of cache-ignored context params. This is analogous to the ability in CDNs to ignore a list of query params. We've discussed adding this a few times to the spec.. I think the point becomes moot if we have range params though, since instead of max supported version N we can just say [0..N] with the same semantics and caches basically ignore it.
| // from a resource name to its status. A resource that doesn't appear in the map | ||
| // implies the resource response is valid. No status map implies that all the | ||
| // resources in the response are valid. | ||
| map<string, DiscoveryResponseStatus> resources_status = 8; |
There was a problem hiding this comment.
Note that unlike the SotW protocol, which currently doesn't have a way to explicitly indicate that a subscribed resource does not exist, the delta protocol does have such a mechanism (the removed_resources field). So in this case, the only value of this field is to support the version-mismatch case. And if we handle the version via flexible range-based context parameter matching, then we wouldn't need this field at all.
There was a problem hiding this comment.
And if we handle the version via flexible range-based context parameter matching, then we wouldn't need this field at all.
How does this relate? The management server can still be missing a version in the specified range.
There was a problem hiding this comment.
Right, but if the version was just a context parameter, then it would effectively be part of the name, so NO_MATCHING_RESOURCE and NO_MATCHING_VERSION_RESOURCE would be the same thing. And at that point, we would not need this field, because the existing removed_resources field would do what we need.
There was a problem hiding this comment.
In this case, we've actually strictly lost information, but I agree it's a more consistent and logical way of viewing things. I think @mattklein123 was interested in a specific "no matching version" feedback signal, but maybe this would be sufficient.
There was a problem hiding this comment.
I think the requirement here is that we have a clear way of providing status to the client about why a resource was not returned. The DiscoveryResponseStatus message seems to do that nicely. I don't have a strong preference on how we do it exactly but I agree with @htuch that we should not lose information on why something happened and should probably have it look similar for both SoTW and delta, whatever the best way of accomplishing that is.
There was a problem hiding this comment.
I'm fine with whatever solution you all agree to on this as long as we maintain the data fidelity.
There was a problem hiding this comment.
Right, but if the version was just a context parameter, then it would effectively be part of the name, so
NO_MATCHING_RESOURCEandNO_MATCHING_VERSION_RESOURCEwould be the same thing. And at that point, we would not need this field, because the existingremoved_resourcesfield would do what we need.
Maybe I got it wrong, but I thought of the response status as a complimentary field to the removed resources, providing the reason why were they removed. It essentially allows the following differentiation: a requested resource name (no matter what version) doesn't exist in the management server; and a requested resource in a specific version range doesn't exist in the management server (the server only handles older vs newer versions).
At the moment, the client doesn't use this information (other than logging) but it does allow the flexibility to request the resource from other servers, or to notify the operator that there's an issue that needs to be addressed (the client needs to be updated).
Also, could we consider using
google.rpc.Statusinstead of a customDiscoveryResponseStatusenum? That seems a bit more flexible in the long run.
I agree it should be used, and stating that the code field in it will be one of the DiscoveryResponseStatusTypes that we define.
There was a problem hiding this comment.
I don't feel strongly about it, but I tend to agree with Mark that resources_status could be a superset of removed_resources and we can just build it in a way that would allow differentiating the two cases?
There was a problem hiding this comment.
At the moment, the client doesn't use this information (other than logging) but it does allow the flexibility to request the resource from other servers, or to notify the operator that there's an issue that needs to be addressed (the client needs to be updated).
I suspect that there are no cases where the client will need to programmatically differentiate between these two cases. Knowing the difference is useful for humans but not really helpful for client implementations. Most clients will simply treat the resource as non-existing in both cases; I could imagine a client deciding to try another management server instead, but it seems unlikely that a client would choose to do so in one of these two cases but not the other.
That's why I favor simply deprecating the existing removed_resources field and replacing it with map<string, google.rpc.Status> resources_not_available, where the map value would be an error message indicating why the specified resource was not available. This approach would allow us to handle any arbitrary reason why the resource is not available, in case we run into cases other than the two we are currently considering.
I agree it should be used, and stating that the code field in it will be one of the
DiscoveryResponseStatusTypesthat we define.
I think the code in that proto needs to be one of the standard status code enum values. It would be very surprising to users for the code to be a new error space.
There was a problem hiding this comment.
Yeah, the primary requirement is debuggability; as long as we can include the version failure in the google.rpc.Status string it's all good.
mattklein123
left a comment
There was a problem hiding this comment.
Very exciting to see this happening. A few high level comments/questions.
| The minor version enables API stability and support, and to safely deprecate old | ||
| features and reduce the tech debt. A client can support a range of minor API | ||
| versions, and can request xDS resources that match this range, using the resource | ||
| [context parameters](https://github.com/cncf/udpa/blob/master/xds/core/v3/context_params.proto). |
There was a problem hiding this comment.
Is the concern here that we don't want magic params that have prior agree to semantics? Meaning, if we have an oldest value and a newest value it seems not hard to just do the range matching manually.
| // API features, before the next API minor version is rolled out. | ||
| // If not included, the patch version defaults to 0, stating that the client | ||
| // supports API features up to the client's minor version. | ||
| uint32 patch_version = 7; |
There was a problem hiding this comment.
I'm probably missing something but why isn't it effectively part of the resource name? Meaning, if we assume the patch level for a specific message is a recursive walk of all sub-messages, don't we know it on a per message/resource basis and then it might be different and should be set here? I guess maybe I'm then also confused why we don't need major/minor here also? Or we assume that those are sent in context params?
| // from a resource name to its status. A resource that doesn't appear in the map | ||
| // implies the resource response is valid. No status map implies that all the | ||
| // resources in the response are valid. | ||
| map<string, DiscoveryResponseStatus> resources_status = 8; |
There was a problem hiding this comment.
I think the requirement here is that we have a clear way of providing status to the client about why a resource was not returned. The DiscoveryResponseStatus message seems to do that nicely. I don't have a strong preference on how we do it exactly but I agree with @htuch that we should not lose information on why something happened and should probably have it look similar for both SoTW and delta, whatever the best way of accomplishing that is.
|
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
|
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
Signed-off-by: Adi Suissa-Peleg adip@google.com
This is a draft PR based on the proposal in #8416, allowing to gather comments on whether we are heading in the right direction.
The plan is to gather some feedback, and introduce incremental PRs.
Roadmap: