WIP: gRPC Best practices. by jon · Pull Request #196 · kata-containers/documentation

jon · 2018-07-23T15:06:56Z

No description provided.

intelkevinputnam · 2018-07-23T17:18:32Z

best-practices/grpc-apis.md

+### GL3. Redundancy is a Useful Tool
+
+While it can be aesthetically displeasing, redundantly encoding similar (or
+identical) information can allow slow migration from one behavior


behavior to another.

It must be stripped when copy 😄

intelkevinputnam · 2018-07-23T17:18:34Z

best-practices/grpc-apis.md

+
+Many compatibility problems can been addressed be avoiding mixing components.
+For example, if each Kata release includes both the agent and the runtime, the
+agent can be


can be ?????

intelkevinputnam · 2018-07-23T17:18:38Z

best-practices/grpc-apis.md

+
+### GL2. The Scope of Compatibility is Only Releases that may Mix
+
+Many compatibility problems can been addressed be avoiding mixing components.


can be addressed

intelkevinputnam · 2018-07-23T17:18:40Z

best-practices/grpc-apis.md

+the backward compatibility window.
+
+Exceptions to this rule can include converting enum types to int32 or int64,
+provided no values other than the original enum definition are used


intelkevinputnam · 2018-07-23T17:18:42Z

best-practices/grpc-apis.md

+
+TODO: Find a concrete example from Kata rather than from an external source.
+
+### R3. Enums are Problematic


This needs to be made into a rule: "Do Not Use Enums Unless There Is No Other Alternative," or similar.

I agree. I think we need strict wording that forbids developers from doing this, or programmers will tend to ignore the warning (a warning means "OK" to lots of developers 😄).

+1. We could have an automated check to forbid enums too of course. Thoughts @sboeuf, @bergwolf?

I've not looked at gRPC enums (and I'm presuming this only relates to gRPC enums), but is there a certain class of enums we are talking about here? If one had a linear/sequential set of enums that only ever grew at the end, for instance, would that be problem-free?

intelkevinputnam · 2018-07-23T17:18:44Z

best-practices/grpc-apis.md

+the strictest sense this means that new features cannot be "enabled" until one
+release after they are deemed "working". As a concrete example, when Google
+Compute Engine introduced multi-queue networking we did so one release later
+than when it was "fully supported" in GCE production. This allowed us to safely


release after it was "fully...

intelkevinputnam · 2018-07-23T17:18:46Z

best-practices/grpc-apis.md

+int32 size = 4;  // <-- The first available field number is 4.
+```
+
+### R2. Avoid Introducing and Enable a Feature in the same Release


Enabling a Feature in the Same Release

intelkevinputnam · 2018-07-23T17:18:48Z

best-practices/grpc-apis.md

+[encoded](https://developers.google.com/protocol-buffers/docs/encoding) on the
+wire, this number and the field's type are the only information that is actually
+passed. The result is that it is _never_ safe to re-use a field number. Protocol
+buffers provides the `reserved` keyword allowing field numbers to be marked as


buffers provide the ...
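
To make the point in R1 concrete, here is a minimal sketch of how a varint field is laid out on the wire: a tag built from the field number and wire type, followed by the value. The helper names are illustrative only and are not part of any protobuf library; the field names in the comments are hypothetical.

```python
WIRE_TYPE_VARINT = 0  # wire type used for int32, int64, bool and enum fields


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int_field(field_number: int, value: int) -> bytes:
    """Encode one varint field: tag (number and wire type), then the value."""
    tag = (field_number << 3) | WIRE_TYPE_VARINT
    return encode_varint(tag) + encode_varint(value)


# The field name never reaches the wire, so a removed `int32 size = 4;` and a
# hypothetical later `int32 timeout = 4;` produce identical bytes.
old_size = encode_int_field(4, 150)
reused_number = encode_int_field(4, 150)
```

Because both encodings are byte-for-byte identical, an older peer has no way to tell that the meaning of field 4 has changed.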

jodh-intel

Thanks for raising @jon! This is going to be a useful doc which should probably be referenced from https://github.com/kata-containers/community/blob/master/PR-Review-Guide.md so that those responsible for ack'ing *.proto changes can refer to it carefully.

jodh-intel · 2018-07-23T16:34:50Z

best-practices/grpc-apis.md

+behavior both in future releases and past releases. Every change to a protobuf
+defintion should include an analysis of how future releases will cope with the
+field if it is not present as well as how past releases will behave given that
+they will be blind to the new field.


We currently require two additional pullapprove acks for any protocol buffer changes:

https://github.com/kata-containers/agent/blob/master/.pullapprove.yml#L45..L56

Also the agent is implicitly guaranteed to be upgraded before any other component using the protocol since the *.proto files live in the agent repo and other repos need to re-vendor at a later stage to obtain details of the new protocol version.

In terms of Kata, I think we should document that we require the analysis you mention to be added as a comment by those who are acking the protocol buffer changes.

As such, it might be useful to expand this doc to include advice specific for Kata after the general gRPC advice section.

jodh-intel · 2018-07-23T16:40:22Z

best-practices/grpc-apis.md

+reserved and prevent them from being re-used. Any field number that has been
+used as part of publicly available code (not _just_ releases; this rule applies
+to anything checked into an official repo) should be considered "spent" and
+marked as reserved if the field is later removed.


I wonder if we could contrive an additional automated check in https://github.com/kata-containers/tests/blob/master/.ci/static-checks.sh#L111 to look at the diff for all *.proto changes and ensure that any removed members have a corresponding `reserved N` addition?

I love automated checks @jodh-intel 😆
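
A rough sketch of what such a check might look like, comparing the old and new text of a `.proto` file. This is not the existing static check; it assumes, for simplicity, that the text holds a single message with no nested messages or enums, and it ignores reserved names and `to max` ranges. All function names are hypothetical.

```python
import re

# Matches lines such as `repeated string names = 3;` and captures the number.
FIELD_RE = re.compile(
    r"^\s*(?:repeated\s+|optional\s+)?(?:map<[^>]+>|[\w.]+)\s+\w+\s*=\s*(\d+)",
    re.M,
)
RESERVED_RE = re.compile(r"^\s*reserved\s+([^;]+);", re.M)
COMMENT_RE = re.compile(r"//[^\n]*")


def field_numbers(proto_text: str) -> set:
    """Return the field numbers declared in the (single-message) text."""
    text = COMMENT_RE.sub("", proto_text)
    return {int(n) for n in FIELD_RE.findall(text)}


def reserved_numbers(proto_text: str) -> set:
    """Return numbers covered by `reserved N;` or `reserved A to B;`."""
    text = COMMENT_RE.sub("", proto_text)
    numbers = set()
    for body in RESERVED_RE.findall(text):
        for item in body.split(","):
            match = re.fullmatch(r"(\d+)(?:\s+to\s+(\d+))?", item.strip())
            if match:  # quoted names and "to max" are skipped in this sketch
                low = int(match.group(1))
                high = int(match.group(2) or low)
                numbers.update(range(low, high + 1))
    return numbers


def unreserved_removals(old_text: str, new_text: str) -> list:
    """Field numbers dropped between versions without being reserved."""
    removed = field_numbers(old_text) - field_numbers(new_text)
    return sorted(removed - reserved_numbers(new_text))
```

A CI job could fail whenever `unreserved_removals` returns a non-empty list for a changed file. A production check would more likely work from a real protobuf parser than from regular expressions.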

jodh-intel · 2018-07-23T16:44:16Z

best-practices/grpc-apis.md

+Compute Engine introduced multi-queue networking we did so one release later
+than when it was "fully supported" in GCE production. This allowed us to safely
+roll-back the release when a bug elsewhere was encountered while not bricking
+VMs that had booted after multi-queue had been enabled.


This sounds like good advice but is going to be difficult to ensure. I don't think any of the gRPC features have corresponding runtime configuration file options so we'd be faced with having to update the *.proto files, re-vendor into the other components and make the feature a NOP in the agent for "a release". We also need to decide on (and document) what we mean by "a release" here - presumably a major or minor x.Y.z release.

jodh-intel · 2018-07-23T16:49:26Z

best-practices/grpc-apis.md

+### R3. Enums are Problematic
+
+Related to R2, enum types are problematic as adding additional values in a new
+releases can create undefined behavior in prior releases.


s/can/will/ ?

Ugh. Maybe we should convert all enums to ints or strings :) I guess that, at least for strings, that would not be as efficient and would place a burden on the agent rather than the gRPC library though.
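
As an illustration of the concern in R3, the sketch below models an older release that only knows three values of a hypothetical enum and then receives a value added by a newer release. It uses Python's standard `enum` module purely as a stand-in; how a generated protobuf stub surfaces an unrecognized value varies by language and protobuf version, so this is not a statement about any particular library.

```python
from enum import IntEnum


class ProtocolV1(IntEnum):
    """Hypothetical enum as compiled into an older release."""
    UNSPECIFIED = 0
    TCP = 1
    UDP = 2


def describe(raw: int) -> str:
    """Map a raw wire value to a name, making the unknown case explicit."""
    try:
        return ProtocolV1(raw).name
    except ValueError:
        # A newer sender used a value this release has never heard of.
        return f"unknown({raw})"
```

Treating the field as a plain integer with an explicit "unknown" branch, as suggested in the comment above, forces every receiver to decide up front what happens when a new value arrives.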

jodh-intel · 2018-07-23T17:08:17Z

best-practices/grpc-apis.md

+policy.
+
+It is critical to have a clear policy on what upgrade and downgrade paths are
+supported. One policy that has been generally successful for other large


We're going to need a careful testing matrix for this too. And these compatibility tests are going to have to block a release.

/cc @chavafg.

The Kata agent works as the gRPC server and kata-runtime works as the gRPC client, so I think an example matrix could be:

| | Runtime v1 | Runtime v2 | Runtime v3 | Runtime v4 |
|---|---|---|---|---|
| Agent v1 | √ | √ | √ | √ |
| Agent v2 | √ | √ | √ | √ |
| Agent v3 | × | √ | √ | √ |
| Agent v4 | × | × | √ | √ |

Do I understand it right? @jon
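
One way to read the proposed matrix is as a single rule: a pairing is supported unless the agent is more than one version ahead of the runtime. That rule is inferred from the table above and is not stated anywhere in the thread, so treat this sketch as an assumption to be confirmed; the function names are hypothetical.

```python
def compatible(agent: int, runtime: int) -> bool:
    """Assumed rule: the agent may be at most one version ahead of the runtime."""
    return agent - runtime <= 1


def matrix(versions):
    """Build {agent_version: [supported? for each runtime version]}."""
    versions = list(versions)
    return {a: [compatible(a, r) for r in versions] for a in versions}


if __name__ == "__main__":
    for agent, row in matrix(range(1, 5)).items():
        cells = " ".join("√" if ok else "×" for ok in row)
        print(f"Agent v{agent}: {cells}")
```

Expressing the policy as a function like this would let a release-blocking test generate the pairings to exercise instead of maintaining the table by hand.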

jodh-intel · 2018-07-23T17:08:38Z

best-practices/grpc-apis.md

+
+### GL2. The Scope of Compatibility is Only Releases that may Mix
+
+Many compatibility problems can been addressed be avoiding mixing components.


s/be avoiding mixing components/by avoiding mixing component versions/ ?

jodh-intel · 2018-07-23T17:08:48Z

best-practices/grpc-apis.md

+
+Many compatibility problems can been addressed be avoiding mixing components.
+For example, if each Kata release includes both the agent and the runtime, the
+agent can be


Incomplete sentence :)

jodh-intel · 2018-07-23T17:09:33Z

best-practices/grpc-apis.md

+### GL3. Redundancy is a Useful Tool
+
+While it can be aesthetically displeasing, redundantly encoding similar (or
+identical) information can allow slow migration from one behavior


And another. I'm rather intrigued to read the rest of this sentence though :)

jodh-intel · 2018-07-23T17:09:56Z

best-practices/grpc-apis.md

+As per R2 above, new fields cannot generally be introduced and used in the same
+release. More broadly, new behaviors (typically new features) should not become
+the default until the _oldest_ release not supporting that behavior is no longer
+in the downgrade compatibiltiy window.


raravena80 · 2018-07-23T17:29:56Z

best-practices/grpc-apis.md

+
+## Rules
+
+### R1. Never, _EVER_ Re-use Protcol Buffer Field Numbers


Typo => Protocol

raravena80 · 2018-07-23T17:31:02Z

best-practices/grpc-apis.md

+
+Maintaining downgrade compatibility when launching a new release requires that
+the release not include any features not supported by the previous release. In
+the strictest sense this means that new features cannot be "enabled" until one


strictest sense, <- (comma)

raravena80 · 2018-07-23T17:31:36Z

best-practices/grpc-apis.md

+### R3. Enums are Problematic
+
+Related to R2, enum types are problematic as adding additional values in a new
+releases can create undefined behavior in prior releases.


typo: in a new release

WeiZhang555

This doc looks great, I love it! Thanks for sharing it with us, @jon!

WeiZhang555 · 2018-07-24T09:28:13Z

best-practices/grpc-apis.md

+
+TODO: Find a concrete example from Kata rather than from an external source.
+
+### R3. Enums are Problematic


I agree. I think we need strict wording that forbids developers from doing this, or programmers will tend to ignore the warning (a warning means "OK" to lots of developers 😄).

WeiZhang555 · 2018-07-24T09:30:01Z

best-practices/grpc-apis.md

+reserved and prevent them from being re-used. Any field number that has been
+used as part of publicly available code (not _just_ releases; this rule applies
+to anything checked into an official repo) should be considered "spent" and
+marked as reserved if the field is later removed.


I love automated checks @jodh-intel 😆

WeiZhang555 · 2018-07-24T10:50:47Z

best-practices/grpc-apis.md

+policy.
+
+It is critical to have a clear policy on what upgrade and downgrade paths are
+supported. One policy that has been generally successful for other large


The Kata agent works as the gRPC server and kata-runtime works as the gRPC client, so I think an example matrix could be:

| | Runtime v1 | Runtime v2 | Runtime v3 | Runtime v4 |
|---|---|---|---|---|
| Agent v1 | √ | √ | √ | √ |
| Agent v2 | √ | √ | √ | √ |
| Agent v3 | × | √ | √ | √ |
| Agent v4 | × | × | √ | √ |

Do I understand it right? @jon

WeiZhang555 · 2018-07-24T10:52:14Z

best-practices/grpc-apis.md

+### GL3. Redundancy is a Useful Tool
+
+While it can be aesthetically displeasing, redundantly encoding similar (or
+identical) information can allow slow migration from one behavior


It must be stripped when copy 😄

WeiZhang555 · 2018-07-24T10:58:03Z

Related to:

WeiZhang555 · 2018-07-24T11:03:18Z

best-practices/grpc-apis.md

+wire format (either explicitly or implicitly). For example, the Google Compute
+Engine API is still on "version 1" despite dozens of calls and arguments being
+added since its inception. This is possible because new features are rolled out
+as "additions" to existing APIs, and by following R2 above.


As kata-agent will keep running after kata-runtime is upgraded, I think we need a mechanism to let kata-runtime know which fields aren't supported by this version of kata-agent, so that kata-runtime can downgrade its gRPC messages to a version the agent understands. So the question is: without on-wire versioning support, how could kata-runtime do this?

One way, I think, is to give the on-disk storage config of a pod a version number that represents the kata-agent version; kata-runtime would then send the appropriate messages according to the on-disk version (which equals the agent version). What do you think? Do you have a better practice to share?

I think that @jon is suggesting we should not perform any version checks.

A possible solution which was mentioned briefly yesterday is that we could introduce a new RestartAgent gRPC service. When we can guarantee that all agent versions support that service (and assuming that adding new services to the protocol will not stop RestartAgent from working), when the runtime is upgraded, it can call RestartAgent for running pods. That will cause the agent to re-exec itself, reloading a newer binary version (which supports the newer gRPC protocol version). Once done, the runtime is guaranteed that it can use the full (new) gRPC protocol version when talking to the agent. This would potentially require the agent to pass state between the original instance and the newly-exec'd instance but that should be achievable.

Alternatively, rather than re-execing the agent, we could apparently squirt bytes down the wire via gRPC from the runtime to the agent so that the agent could read the byte stream and create a new agent instance out of it (write the bytes to disk and re-exec?). I think @jon mentioned that would be slower though?

I'd vote for the re-exec option as it's proven and used by init systems such as systemd and upstart.

One issue with both approaches is being able to rewrite the agent binary. This might not be possible or desirable. We could share the agent into all pods images via 9p and then the runtime would only need to update the (host) agent binary file and ask all the running agent instances to re-exec to use it. Again, performance might take a hit here due to 9p.

Another concern is of course security - arbitrarily rewriting files isn't ideal. We'd need to think carefully about checksums, security tokens, gpg signatures, etc.

How long it takes to get a new agent binary into the VM, and how, I'm not so worried about. This will be a very rare sequence (it's not something we would be doing every minute, or even daily; this is probably a once every 3 months or more type of scenario, I would think).
How we hand over the state from the old agent to the new one will, I think, be an interesting challenge but, as you say, should be achievable.

For Upstart, we used a pipe - see https://wiki.ubuntu.com/FoundationsTeam/Specs/QuantalUpstartStatefulReexec#State-Passing for the logic.
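
A minimal sketch of the pipe-based hand-over idea, assuming a POSIX system. It is not Upstart's or the Kata agent's implementation: a child Python process stands in for the newly exec'd agent (a real agent would replace itself with an `exec` call instead), the state is serialized as JSON, and the names `hand_over` and `NEW_INSTANCE` are hypothetical.

```python
import json
import os
import subprocess
import sys

# Stand-in for the new agent binary: read the inherited pipe fd to EOF,
# parse the JSON state, and echo it back so the caller can verify it.
NEW_INSTANCE = (
    "import json, os, sys\n"
    "with os.fdopen(int(sys.argv[1])) as pipe:\n"
    "    state = json.load(pipe)\n"
    "print(json.dumps(state))\n"
)


def hand_over(state: dict) -> dict:
    """Pass `state` to a fresh process through a pipe and return what it read."""
    read_fd, write_fd = os.pipe()
    # A small state fits in the pipe buffer, so it can be written before the
    # reader starts; a large state would need a concurrent writer.
    with os.fdopen(write_fd, "w") as pipe:
        json.dump(state, pipe)
    try:
        result = subprocess.run(
            [sys.executable, "-c", NEW_INSTANCE, str(read_fd)],
            pass_fds=(read_fd,),  # keep the read end open in the new process
            capture_output=True,
            text=True,
            check=True,
        )
    finally:
        os.close(read_fd)
    return json.loads(result.stdout)
```

The serialized state is itself a format that must stay readable across versions, so the same compatibility rules discussed for the gRPC messages would apply to it.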

This sounds achievable. Also, as @jodh-intel mentioned:

Another concern is of course security - arbitrarily rewriting files isn't ideal. We'd need to think carefully about checksums, security tokens, gpg signatures, etc.

This is also the biggest concern. Thanks @grahamwhaley and @jodh-intel for the explanation!

grahamwhaley

Great info @jon . Minor comments added. Looking forward to this getting fleshed out and settled as we define the Kata flow specifics.

grahamwhaley · 2018-07-25T16:45:37Z

best-practices/grpc-apis.md

@@ -0,0 +1,105 @@
+# Evolving gRPC APIs &mdash; Best Practices


Most of our md docs have a toc at the top or just below the first level title - consider if it is worth adding one to this doc as well.

grahamwhaley · 2018-07-25T16:47:09Z

best-practices/grpc-apis.md

+the release not include any features not supported by the previous release. In
+the strictest sense this means that new features cannot be "enabled" until one
+release after they are deemed "working". As a concrete example, when Google
+Compute Engine introduced multi-queue networking we did so one release later


grahamwhaley · 2018-07-25T16:49:34Z

best-practices/grpc-apis.md

+roll-back the release when a bug elsewhere was encountered while not bricking
+VMs that had booted after multi-queue had been enabled.
+
+TODO: Find a concrete example from Kata rather than from an external source.


In case it helps - I think there is some non-backwards compat change in one of the components between the current head, and the v1.1.0 tags. If I roll the runtime back to 1.1.0 for instance I can not launch docker containers. I've not figured out where the break was though - if anybody knows (and I suspect an agent/proxy relationship maybe), please speak up...

grahamwhaley · 2018-07-25T16:51:23Z

best-practices/grpc-apis.md

+
+TODO: Find a concrete example from Kata rather than from an external source.
+
+### R3. Enums are Problematic


I've not looked at gRPC enums (and I'm presuming this only relates to gRPC enums), but is there a certain class of enums we are talking about here? If one had a linear/sequential set of enums that only ever grew at the end, for instance, would that be problem-free?

grahamwhaley · 2018-07-25T16:52:05Z

best-practices/grpc-apis.md

+### R4. Rarely Change Field Types
+
+Related to the R1, while not always strictly unsafe it is usually unwise to
+change the type of an existing field. In most cases it is better to inroduce a


s/inroduce/introduce/

jodh-intel · 2018-07-30T10:00:50Z

Hi @jon,

Please could you update your commit as if you look at the Travis log:

Found 1 commit between commit 71a506c739f69ea6a431ba9f3bfe6e2bff7816f3 and branch master
ERROR: Commit 71a506c739f69ea6a431ba9f3bfe6e2bff7816f3: pure whitespace body
ERROR: checkcommits failed. See the document below for help on formatting
commits for the project.
https://github.com/kata-containers/community/blob/master/CONTRIBUTING.md#patch-format

raravena80 · 2018-08-28T16:51:24Z

@jon ping, any updates? Thx!

jodh-intel · 2018-09-24T13:31:18Z

Hi @jon - could you take a look at this when you get a chance? 😄

grahamwhaley · 2018-10-04T12:25:54Z

Hi @jon - time for a nudge then. Any thoughts on plan or ETA here? Is it worth doing a quick parse and landing this, and then re-visiting to add to or spruce up more later?

jon · 2018-10-04T22:18:13Z

Yeah, like that plan -- I'll take a stab at a quick set of edits to this over the next few days to cover the feedback and the issues that I recall coming up in Arch Council meetings. Perhaps future addenda can be crowdsourced from there :) Jon


raravena80 · 2018-10-22T16:35:14Z

@jon ping from the kata herder :)

sboeuf · 2018-11-05T17:14:53Z

@jon what's the status here?

jodh-intel · 2018-11-16T16:02:24Z

Yep - I'm afraid this PR has the dubious honour of being the second oldest (behind the vintage kata-containers/proxy#74).

If anyone has cycles, we could invoke the assisted process on this one?

raravena80 · 2018-11-21T04:58:07Z

@jon ping (from your weekly Kata herder)

jodh-intel · 2018-12-03T13:37:39Z

Hi @jon - this is your friendly periodic ping for this ol' 🌰 (<-- :chestnut: 😄)

raravena80 · 2018-12-20T22:41:18Z

@jon any chance to take a look at this?

jodh-intel · 2019-01-07T11:50:46Z

🦗's 😄

jodh-intel · 2019-01-07T11:52:06Z

Nice! Slack shows this emoji as I intended it! :)

jodh-intel · 2019-02-26T09:59:51Z

Unless we have any volunteers to update this using the assisted workflow, I propose we close this.

grahamwhaley · 2019-02-26T10:05:14Z

let's hear from @jon - is this good to merge as is (better than nothing), has it rotted, or does it need a mild refresh. Guide us @jon on what to do with this pls :-)

raravena80 · 2019-03-29T17:19:45Z

@jon any updates? 😄

grahamwhaley · 2019-05-01T14:03:46Z

@jon - heeelllooowweeee.... :-) @kata-containers/architecture-committee - can we make a decision, to either:

- have some volunteer(s) take this through assisted cleanup (pretty much probably apply the feedback as is and merge)
- or maybe @jon can fix this??
- or, can/shall we close this down?

jodh-intel · 2019-05-10T10:32:35Z

Hi @jon and @kata-containers/architecture-committee. This PR has now been open for 291 days. Since the requested feedback still hasn't been applied and since nobody has come forward to handle this (I think you are in fact the only one who could do this given your knowledge of gRPC), I'm going to close it.

Of course "close" does not mean "delete", so it can be reopened at any time if anyone has spare cycles at some point.

Add s390x architecture

egernst · 2019-10-30T03:24:20Z

/cc @bergwolf @gnawux @sameo

WIP: gRPC Best practices.

71a506c

intelkevinputnam reviewed Jul 23, 2018

jodh-intel reviewed Jul 23, 2018

raravena80 reviewed Jul 23, 2018

WeiZhang555 reviewed Jul 24, 2018

grahamwhaley reviewed Jul 25, 2018

This was referenced Sep 5, 2018

Shall we separate onlineCPUMem to onlineCPU and onlineMem kata-containers/agent#357

Closed

agent: add GetGuestDetails gRPC function kata-containers/agent#358

Merged

WeiZhang555 mentioned this pull request Nov 2, 2018

[WIP][RFC]persist: baseline persist data format kata-containers/runtime#874

Closed

3 tasks

jodh-intel mentioned this pull request Feb 8, 2019

network: Remove grpc AddInterface and RemoveInterface api kata-containers/agent#457

Merged

jodh-intel closed this May 10, 2019

devimc pushed a commit to devimc/kata-documentation that referenced this pull request Sep 2, 2019

Merge pull request kata-containers#196 from alicefr/add_s390x

9d74134

Add s390x architecture


