| ### GL3. Redundancy is a Useful Tool
|
| While it can be aesthetically displeasing, redundantly encoding similar (or
| identical) information can allow slow migration from one behavior
It must be stripped when copying 😄
|
| Many compatibility problems can been addressed be avoiding mixing components.
| For example, if each Kata release includes both the agent and the runtime, the
| agent can be
|
| ### GL2. The Scope of Compatibility is Only Releases that may Mix
|
| Many compatibility problems can been addressed be avoiding mixing components.
| the backward compatibility window.
|
| Exceptions to this rule can include converting enum types to int32 or int64,
| provided no values other than the original enum definition are used
|
| TODO: Find a concrete example from Kata rather than from an external source.
|
| ### R3. Enums are Problematic
This needs to be made into a rule: "Do Not Use Enums Unless There Is No Other Alternative," or similar.
I agree. I think we need a strict description to forbid developers from doing this, or programmers will tend to ignore the warning (warnings mean OK to lots of developers 😄 )
I've not looked at gRPC enums (and I'm presuming this is only relating to gRPC enums) - but is there a certain class of enums we are talking about here. If one had a linear/sequential set of enums that only ever grew on the end for instance, is that problem free?
| the strictest sense this means that new features cannot be "enabled" until one
| release after they are deemed "working". As a concrete example, when Google
| Compute Engine introduced multi-queue networking we did so one release later
| than when it was "fully supported" in GCE production. This allowed us to safely
release after it was "fully...
| int32 size = 4; // <-- The first available field number is 4.
| ```
|
| ### R2. Avoid Introducing and Enable a Feature in the same Release
Enabling a Feature in the Same Release
| [encoded](https://developers.google.com/protocol-buffers/docs/encoding) on the
| wire, this number and the field's type are the only information that is actually
| passed. The result is that it is _never_ safe to re-use a field number. Protocol
| buffers provides the `reserved` keyword allowing field numbers to be marked as
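To illustrate the `reserved` mechanism the hunk above describes, here is a hypothetical sketch (the message and field names are invented, not taken from the Kata `.proto` files):

```proto
syntax = "proto3";

message ExampleRequest {
  string container_id = 1;
  // Field 2 (say, "string exec_id") was removed in a later revision.
  // Parking both the number and the name prevents either from ever
  // being redefined with a different meaning or type.
  reserved 2;
  reserved "exec_id";
  int32 size = 3;
}
```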
jodh-intel
left a comment
Thanks for raising @jon! This is going to be a useful doc which should probably be referenced from https://github.com/kata-containers/community/blob/master/PR-Review-Guide.md so that those responsible for ack'ing *.proto changes can refer to it carefully.
| behavior both in future releases and past releases. Every change to a protobuf
| definition should include an analysis of how future releases will cope with the
| field if it is not present as well as how past releases will behave given that
| they will be blind to the new field.
We currently require two additional pullapprove acks for any protocol buffer changes:
Also the agent is implicitly guaranteed to be upgraded before any other component using the protocol since the *.proto files live in the agent repo and other repos need to re-vendor at a later stage to obtain details of the new protocol version.
In terms of Kata, I think we should document that we require the analysis you mention to be added as a comment by those who are acking the protocol buffer changes.
As such, it might be useful to expand this doc to include advice specific for Kata after the general gRPC advice section.
| reserved and prevent them from being re-used. Any field number that has been
| used as part of publicly available code (not _just_ releases; this rule applies
| to anything checked into an official repo) should be considered "spent" and
| marked as reserved if the field is later removed.
I wonder if we could contrive an additional automated check in https://github.com/kata-containers/tests/blob/master/.ci/static-checks.sh#L111 to look at the diff for all *.proto changes and ensure that any removed members have a corresponding `reserved N` addition?
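As a rough illustration of such a check (purely a sketch — the function name, file arguments, and regexes are assumptions, and it ignores corner cases like `reserved 2 to 5;` ranges), something along these lines could flag removals:

```shell
#!/bin/sh
# Hypothetical sketch of the suggested static check: given the old and new
# versions of a .proto file, complain about any field number that was
# removed without a matching `reserved` entry.
check_removed_fields() {
    old="$1"
    new="$2"
    # Collect the field numbers declared in the old version (digits in "= N;").
    old_nums=$(grep -oE '= *[0-9]+;' "$old" | tr -dc '0-9\n')
    for n in $old_nums; do
        if ! grep -qE "= *$n;" "$new"; then
            # The field is gone; insist on a corresponding reserved entry.
            grep -qE "reserved .*$n" "$new" \
                || echo "field number $n removed without a reserved entry"
        fi
    done
}
```

A real implementation would want to parse the proto grammar properly rather than grep it, but this is the shape of the check.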
| Compute Engine introduced multi-queue networking we did so one release later
| than when it was "fully supported" in GCE production. This allowed us to safely
| roll-back the release when a bug elsewhere was encountered while not bricking
| VMs that had booted after multi-queue had been enabled.
This sounds like good advice but is going to be difficult to ensure. I don't think any of the gRPC features have corresponding runtime configuration file options so we'd be faced with having to update the *.proto files, re-vendor into the other components and make the feature a NOP in the agent for "a release". We also need to decide on (and document) what we mean by "a release" here - presumably a major or minor x.Y.z release.
| ### R3. Enums are Problematic
|
| Related to R2, enum types are problematic as adding additional values in new
| releases can create undefined behavior in prior releases.
- s/can/will/ ?
- Ugh. Maybe we should convert all enums to ints or strings :) I guess that at least for strings that would not be as efficient and would place a burden on the agent rather than the gRPC library though.
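To make the enum hazard concrete, here is a hedged Go sketch (the `NetMode` type and its values are invented, not generated Kata code): proto3 delivers an unrecognized enum value to an old binary as its raw number, which then falls through whatever branch the old code happened to ship.

```go
package main

import "fmt"

// NetMode mirrors the shape of a generated protobuf enum; proto3 hands
// unknown values to old binaries as their raw numeric value.
type NetMode int32

const (
	NetModeLegacy NetMode = 0
	NetModeBridge NetMode = 1
	// A v2 release later appends: NetModeMultiQueue NetMode = 2
)

// describe is the kind of exhaustive-looking switch a v1 release might ship.
func describe(m NetMode) string {
	switch m {
	case NetModeLegacy:
		return "legacy"
	case NetModeBridge:
		return "bridge"
	}
	// A v1 binary receiving the v2-only value 2 lands here; the resulting
	// behavior is whatever this (often unplanned) branch does.
	return "unknown"
}

func main() {
	fmt.Println(describe(NetMode(2))) // value sent by a newer peer
}
```

The danger is that many real switches have no deliberate fallthrough branch at all, so the "unknown" path is effectively undefined behavior for the older release.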
| policy.
|
| It is critical to have a clear policy on what upgrade and downgrade paths are
| supported. One policy that has been generally successful for other large
We're going to need a careful testing matrix for this too. And these compatibility tests are going to have to block a release.
/cc @chavafg.
The Kata agent works as the gRPC server and kata-runtime works as the gRPC client, so I think an example matrix could be:

|  | Runtime v1 | Runtime v2 | Runtime v3 | Runtime v4 |
|---|---|---|---|---|
| Agent v1 | √ | √ | √ | √ |
| Agent v2 | √ | √ | √ | √ |
| Agent v3 | × | √ | √ | √ |
| Agent v4 | × | × | √ | √ |

Do I understand it right? @jon
|
| ### GL2. The Scope of Compatibility is Only Releases that may Mix
|
| Many compatibility problems can been addressed be avoiding mixing components.
s/be avoiding mixing components/by avoiding mixing component versions/ ?
| ### GL3. Redundancy is a Useful Tool
|
| While it can be aesthetically displeasing, redundantly encoding similar (or
| identical) information can allow slow migration from one behavior
And another. I'm rather intrigued to read the rest of this sentence though :)
| As per R2 above, new fields cannot generally be introduced and used in the same
| release. More broadly, new behaviors (typically new features) should not become
| the default until the _oldest_ release not supporting that behavior is no longer
| in the downgrade compatibility window.
|
| ## Rules
|
| ### R1. Never, _EVER_ Re-use Protocol Buffer Field Numbers
|
| Maintaining downgrade compatibility when launching a new release requires that
| the release not include any features not supported by the previous release. In
| the strictest sense this means that new features cannot be "enabled" until one
WeiZhang555
left a comment
This docs looks so great, I love it! Thanks for sharing with us! @jon
| wire format (either explicitly or implicitly). For example, the Google Compute
| Engine API is still on "version 1" despite dozens of calls and arguments being
| added since its inception. This is possible because new features are rolled out
| as "additions" to existing APIs, and by following R2 above.
As kata-agent keeps running after kata-runtime is upgraded, I think we need a mechanism to let kata-runtime know which fields aren't supported by this version of kata-agent, so that kata-runtime can downgrade its gRPC messages to a version the agent can handle. So the question is: without on-wire versioning support, how could kata-runtime do this?
One way would be to give the on-disk storage config of a pod a version number representing the kata-agent version; kata-runtime would then send messages appropriate to the on-disk version (which equals the agent version). What do you think, do you have a better practice to share?
I think that @jon is suggesting we should not perform any version checks.
A possible solution which was mentioned briefly yesterday is that we could introduce a new RestartAgent gRPC service. When we can guarantee that all agent versions support that service (and assuming that adding new services to the protocol will not stop RestartAgent from working), when the runtime is upgraded, it can call RestartAgent for running pods. That will cause the agent to re-exec itself, reloading a newer binary version (which supports the newer gRPC protocol version). Once done, the runtime is guaranteed that it can use the full (new) gRPC protocol version when talking to the agent. This would potentially require the agent to pass state between the original instance and the newly-exec'd instance, but that should be achievable.
Alternatively, rather than re-execing the agent, we could apparently squirt bytes down the wire via gRPC from the runtime to the agent so that the agent could read the byte stream and create a new agent instance out of it (write the bytes to disk and re-exec?). I think @jon mentioned that would be slower though?
I'd vote for the re-exec option as it's proven and used by init systems such as systemd and upstart.
One issue with both approaches is being able to rewrite the agent binary. This might not be possible or desirable. We could share the agent into all pod images via 9p and then the runtime would only need to update the (host) agent binary file and ask all the running agent instances to re-exec to use it. Again, performance might take a hit here due to 9p.
Another concern is of course security - arbitrarily rewriting files isn't ideal. We'd need to think carefully about checksums, security tokens, gpg signatures, etc.
How long it takes to get a new agent binary into the VM, and how, I'm not so worried about. This will be a very rare sequence (it's not like something we would be doing every minute, or even daily - this is probably a once every 3 month or more type scenario I would think).
How we hand over the state from the old agent to the new I think will be an interesting challenge - but, as you say, should be achievable.
For Upstart, we used a pipe - see https://wiki.ubuntu.com/FoundationsTeam/Specs/QuantalUpstartStatefulReexec#State-Passing for the logic.
This sounds achievable; also, as @jodh-intel mentioned:
Another concern is of course security - arbitrarily rewriting files isn't ideal. We'd need to think carefully about checksums, security tokens, gpg signatures, etc.
This is also the biggest concern. Thanks @grahamwhaley @jodh-intel for the explanation!
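As a hedged sketch of the state-passing idea discussed above (the `AgentState` fields are invented, and there is no actual `exec` here — the pipe round-trip stands in for the fd a re-exec'd binary would inherit, as in Upstart's stateful re-exec):

```go
package main

import (
	"encoding/gob"
	"fmt"
	"os"
)

// AgentState is a hypothetical stand-in for whatever the old agent
// instance would need to hand to its replacement; none of these fields
// come from the real kata-agent.
type AgentState struct {
	PodID      string
	ProtoMinor int
}

// handover round-trips state through a pipe, the way an old instance
// could pass it to a re-exec'd binary that inherits the read end.
func handover(st AgentState) (AgentState, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return AgentState{}, err
	}
	// "Old instance": serialize into the pipe. In a real re-exec the pipe
	// fd would survive the exec and be advertised to the new binary, for
	// example via an environment variable holding the fd number.
	go func() {
		defer w.Close()
		gob.NewEncoder(w).Encode(st)
	}()
	// "New instance": recover state from the inherited fd before it
	// starts serving gRPC again.
	var out AgentState
	err = gob.NewDecoder(r).Decode(&out)
	return out, err
}

func main() {
	st, err := handover(AgentState{PodID: "pod-1", ProtoMinor: 2})
	if err != nil {
		panic(err)
	}
	fmt.Println(st.PodID, st.ProtoMinor) // prints "pod-1 2"
}
```

A real handover would also need to carry open fds and in-flight request state, which is where the interesting challenges mentioned above live.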
grahamwhaley
left a comment
Great info @jon. Minor comments added. Looking forward to this getting fleshed out and settled as we define the Kata flow specifics.
| @@ -0,0 +1,105 @@
| # Evolving gRPC APIs — Best Practices
Most of our md docs have a toc at the top or just below the first level title - consider if it is worth adding one to this doc as well.
| the release not include any features not supported by the previous release. In
| the strictest sense this means that new features cannot be "enabled" until one
| release after they are deemed "working". As a concrete example, when Google
| Compute Engine introduced multi-queue networking we did so one release later
| roll-back the release when a bug elsewhere was encountered while not bricking
| VMs that had booted after multi-queue had been enabled.
|
| TODO: Find a concrete example from Kata rather than from an external source.
In case it helps - I think there is some non-backwards compat change in one of the components between the current head, and the v1.1.0 tags. If I roll the runtime back to 1.1.0 for instance I can not launch docker containers. I've not figured out where the break was though - if anybody knows (and I suspect an agent/proxy relationship maybe), please speak up...
| ### R4. Rarely Change Field Types
|
| Related to R1, while not always strictly unsafe it is usually unwise to
| change the type of an existing field. In most cases it is better to inroduce a
s/inroduce/introduce/
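A hypothetical illustration of R4 (the message and field names are invented): rather than changing an existing field's type in place, introduce a new field under a fresh number and deprecate the old one.

```proto
syntax = "proto3";

message Device {
  // This cannot safely become int64 in place: old peers still decode
  // field 1 as int32, so large values would be silently mangled.
  int32 size = 1 [deprecated = true];
  // Introduce the wider type under a fresh field number instead.
  int64 size64 = 2;
}
```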
Hi @jon, Please could you update your commit as, if you look at the Travis log:

@jon ping, any updates? Thx!

Hi @jon - could you take a look at this when you get a chance? 😄

Hi @jon - time for a nudge then. Any thoughts on plan or ETA here? Is it worth doing a quick parse and landing this, and then re-visiting to add to or spruce up more later?

Yeah, like that plan -- I'll take a stab at a quick set of edits to this over the next few days to cover the feedback and the issues that I recall coming up in Arch Council meetings. Perhaps future addenda can be crowdsourced from there :)
Jon
…On Thu, Oct 4, 2018 at 5:25 AM Graham Whaley ***@***.***> wrote:
Hi @jon <https://github.com/jon> - time for a nudge then. Any thoughts on plan or ETA here? Is it worth doing a quick parse and landing this, and then re-visiting to add to or spruce up more later?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#196 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAAG_6B-jpfvuTDGfjTRkcPamxTUHVX3ks5uhf5UgaJpZM4VbKuD>.

@jon ping from the kata herder :)

@jon what's the status here?

Yep - I'm afraid this PR has the dubious honour of being the second oldest (behind the vintage kata-containers/proxy#74). If anyone has cycles, we could invoke the assisted process on this one?

@jon ping (from your weekly Kata herder)

Hi @jon - this is your friendly periodic ping for this ol' 🌰 (<--

@jon any chance to take a look at this?

🦗's 😄

Nice! Slack shows this emoji as I intended it! :)

Unless we have any volunteers to update this using the assisted workflow, I propose we close this.

@jon any updates? 😄

Hi @jon and @kata-containers/architecture-committee. This PR has now been open for 291 days. Since the requested feedback still hasn't been applied and since nobody has come forward to handle this (I think you are in fact the only one who could do this given your knowledge of gRPC), I'm going to close it. Of course "close" does not mean "delete", so it can be reopened at any time if anyone has spare cycles at some point.