Skip to content
This repository was archived by the owner on May 12, 2021. It is now read-only.

WIP: gRPC Best practices.#196

Closed
jon wants to merge 1 commit intokata-containers:masterfrom
jon:master
Closed

WIP: gRPC Best practices.#196
jon wants to merge 1 commit intokata-containers:masterfrom
jon:master

Conversation

@jon
Copy link
Copy Markdown
Contributor

@jon jon commented Jul 23, 2018

No description provided.

### GL3. Redundancy is a Useful Tool

While it can be aesthetically displeasing, redundantly encoding similar (or
identical) information can allow slow migration from one behavior
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

behavior to another.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It must be stripped when copy 😄


Many compatibility problems can been addressed be avoiding mixing components.
For example, if each Kata release includes both the agent and the runtime, the
agent can be
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be ?????


### GL2. The Scope of Compatibility is Only Releases that may Mix

Many compatibility problems can been addressed be avoiding mixing components.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be addressed

the backward compatibility window.

Exceptions to this rule can include converting enum types to int32 or int64,
provided no values other than the original enum definition are used
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are used.


TODO: Find a concrete example from Kata rather than from an external source.

### R3. Enums are Problematic
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be made into a rule: "Do Not Use Enums Unless There Is No Other Alternative," or similar.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I think we need a strict description to forbid developer from doing this, or the programmer will tend to ignore this warning (warnings means OK to lots of developers 😄 )

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. We could have an automated check to forbid enums too of course. Thoughts @sboeuf, @bergwolf?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not looked at gRPC enums (and I'm presuming this is only relating to gRPC enums) - but is there a certain class of enums we are talking about here. If one had a linear/sequential set of enums that only ever grew on the end for instance, is that problem free?

the strictest sense this means that new features cannot be "enabled" until one
release after they are deemed "working". As a concrete example, when Google
Compute Engine introduced multi-queue networking we did so one release later
than when it was "fully supported" in GCE production. This allowed us to safely
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

release after it was "fully...

int32 size = 4; // <-- The first available field number is 4.
```

### R2. Avoid Introducing and Enable a Feature in the same Release
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enabling a Feature in the Same Release

[encoded](https://developers.google.com/protocol-buffers/docs/encoding) on the
wire, this number and the field's type are the only information that is actually
passed. The result is that it is _never_ safe to re-use a field number. Protocol
buffers provides the `reserved` keyword allowing field numbers to be marked as
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buffers provide the ...

Copy link
Copy Markdown

@jodh-intel jodh-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising @jon! This is going to be a useful doc which should probably be referenced from https://github.com/kata-containers/community/blob/master/PR-Review-Guide.md so that those responsible for ack'ing *.proto changes can refer to it carefully.

behavior both in future releases and past releases. Every change to a protobuf
defintion should include an analysis of how future releases will cope with the
field if it is not present as well as how past releases will behave given that
they will be blind to the new field.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently require two additional pullapprove acks for any protocol buffer changes:

Also the agent is implicitly guaranteed to be upgraded before any other component using the protocol since the *.proto files live in the agent repo and other repos need to re-vendor at a later stage to obtain details of the new protocol version.

In terms of Kata, I think we should document that we require the analysis you mention to be added as a comment by those who are acking the protocol buffer changes.

As such, it might be useful to expand this doc to include advice specific for Kata after the general gRPC advice section.

reserved and prevent them from being re-used. Any field number that has been
used as part of publicly available code (not _just_ releases; this rule applies
to anything checked into an official repo) should be considered "spent" and
marked as reserved if the field is later removed.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could contrive an additional automated check in https://github.com/kata-containers/tests/blob/master/.ci/static-checks.sh#L111 to look at the diff for all *.proto changes and ensure that any dropped that removed members should have a corresponding reserved N addition?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love automated check @jodh-intel 😆

Compute Engine introduced multi-queue networking we did so one release later
than when it was "fully supported" in GCE production. This allowed us to safely
roll-back the release when a bug elsewhere was encountered while not bricking
VMs that had booted after multi-queue had been enabled.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like good advice but is going to be difficult to ensure. I don't think any of the gRPC features have corresponding runtime configuration file options so we'd be faced with having to update the *.proto files, re-vendor into the other components and make the feature a NOP in the agent for "a release". We also need to decide on (and document) what we mean by "a release" here - presumably a major or minor x.Y.z release.

### R3. Enums are Problematic

Related to R2, enum types are problematic as adding additional values in a new
releases can create undefined behavior in prior releases.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • s/can/will/ ?
  • Ugh. Maybe we should convert all enums to ints or strings :) I guess that atleast for strings that would not be as efficient and would place a burden on the agent rather than the gRPC library though.

policy.

It is critical to have a clear policy on what upgrade and downgrade paths are
supported. One policy that has been generally successful for other large
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're going to need a careful testing matrix for this too. And these compatibility tests are going to have to block a release.

/cc @chavafg.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kata agent works as grpc server, kata-runtime works as grpc client, so I think an example matrix can be:

  Runtime v1 Runtime v2 Runtime v3 Runtime v4
Agent v1
Agent v2
Agent v3 ×
Agent v4 × ×

Do I understand it right? @jon


### GL2. The Scope of Compatibility is Only Releases that may Mix

Many compatibility problems can been addressed be avoiding mixing components.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/be avoiding mixing components/by avoiding mixing component versions/ ?


Many compatibility problems can been addressed be avoiding mixing components.
For example, if each Kata release includes both the agent and the runtime, the
agent can be
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete sentence :)

### GL3. Redundancy is a Useful Tool

While it can be aesthetically displeasing, redundantly encoding similar (or
identical) information can allow slow migration from one behavior
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And another. I'm rather intrigued to read the rest of this sentence though :)

As per R2 above, new fields cannot generally be introduced and used in the same
release. More broadly, new behaviors (typically new features) should not become
the default until the _oldest_ release not supporting that behavior is no longer
in the downgrade compatibiltiy window.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo.


## Rules

### R1. Never, _EVER_ Re-use Protcol Buffer Field Numbers
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo => Protocol


Maintaining downgrade compatibility when launching a new release requires that
the release not include any features not supported by the previous release. In
the strictest sense this means that new features cannot be "enabled" until one
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strictest sense, <- (comma)

### R3. Enums are Problematic

Related to R2, enum types are problematic as adding additional values in a new
releases can create undefined behavior in prior releases.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: in a new release

Copy link
Copy Markdown
Member

@WeiZhang555 WeiZhang555 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docs looks so great, I love it! Thanks for sharing with us! @jon


TODO: Find a concrete example from Kata rather than from an external source.

### R3. Enums are Problematic
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I think we need a strict description to forbid developer from doing this, or the programmer will tend to ignore this warning (warnings means OK to lots of developers 😄 )

reserved and prevent them from being re-used. Any field number that has been
used as part of publicly available code (not _just_ releases; this rule applies
to anything checked into an official repo) should be considered "spent" and
marked as reserved if the field is later removed.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love automated check @jodh-intel 😆

policy.

It is critical to have a clear policy on what upgrade and downgrade paths are
supported. One policy that has been generally successful for other large
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kata agent works as grpc server, kata-runtime works as grpc client, so I think an example matrix can be:

  Runtime v1 Runtime v2 Runtime v3 Runtime v4
Agent v1
Agent v2
Agent v3 ×
Agent v4 × ×

Do I understand it right? @jon

### GL3. Redundancy is a Useful Tool

While it can be aesthetically displeasing, redundantly encoding similar (or
identical) information can allow slow migration from one behavior
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It must be stripped when copy 😄

@WeiZhang555
Copy link
Copy Markdown
Member

wire format (either explicitly or implicitly). For example, the Google Compute
Engine API is still on "version 1" despite dozens of calls and arguments being
added since its inception. This is possible because new features are rolled out
as "additions" to existing APIs, and by following R2 above.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As kata-agent will keeps running after kata-runtime is upgraded, I think we need a mechanism to let kata-runtime know which field isn't supported by this version of kata-agent, and kata-runtime should downgrade it's grpc message version to appropriate version for better adapt to the agent. so question is, without on wire versioning support, how could kata-runtime do this?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way I think is, we give the on-disk storage config of a POD a version number, and this version number will represent the kata-agent version, kata-runtime will send appropriate message according to the on-disk version(which equals to agent version). What do you think, do you have better practice to share?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that @jon is suggesting we should not perform any version checks.

A possible solution which was mentioned briefly yesterday is that we could introduce a new RestartAgent gRPC service. When we can guarantee that all agent versions support that service (and assuming that adding new services to the protocol will not stop RestartAgent from working), when the runtime is upgraded, it can call RestartService for running pods. That will cause the agent to re-exec itself, reloading a newer binary version (which supports the newer gRPC protocol version). Once done, the runtime is guaranteed that it can use the full (new) gRPC protocol version when talking to the agent. This would potentially require the agent to pass state between the original instance and the newly-exec'd instance but that should be achievable.

Alternatively, rather than re-execing the agent, we could apparently squirt bytes down the wire via gRPC from the runtime to the agent so that the agent could read the byte stream and create a new agent instance out of it (write the bytes to disk and re-exec?). I think @jon mentioned that would be slower though?

I'd vote for the re-exec option as it's proven and used by init systems such as systemd and upstart.

One issue with both approaches is being able to rewrite the agent binary. This might not be possible or desirable. We could share the agent into all pods images via 9p and then the runtime would only need to update the (host) agent binary file and ask all the running agent instances to re-exec to use it. Again, performance might take a hit here due to 9p.

Another concern is of course security - arbitrarily rewriting files isn't ideal. We'd need to think carefully about checksums, security tokens, gpg signatures, etc.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long it takes to get a new agent binary into the VM, and how, I'm not so worried about. This will be a very rare sequence (it's not like something we would be doing every minute, or even daily - this is probably a once every 3 month or more type scenario I would think).
How we hand over the state from the old agent to the new I think will be an interesting challenge - but, as you say, should be achievable.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sound achievable, also as @jodh-intel mentioned:

Another concern is of course security - arbitrarily rewriting files isn't ideal. We'd need to think carefully about checksums, security tokens, gpg signatures, etc.

This is also the biggest concern. Thanks @grahamwhaley @jodh-intel for explanation !

Copy link
Copy Markdown
Contributor

@grahamwhaley grahamwhaley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great info @jon . Minor comments added. Looking forward to this getting fleshed out and settled as we define the Kata flow specifics.

@@ -0,0 +1,105 @@
# Evolving gRPC APIs &mdash; Best Practices
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of our md docs have a toc at the top or just below the first level title - consider if it is worth adding one to this doc as well.

the release not include any features not supported by the previous release. In
the strictest sense this means that new features cannot be "enabled" until one
release after they are deemed "working". As a concrete example, when Google
Compute Engine introduced multi-queue networking we did so one release later
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(tm) ?

roll-back the release when a bug elsewhere was encountered while not bricking
VMs that had booted after multi-queue had been enabled.

TODO: Find a concrete example from Kata rather than from an external source.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case it helps - I think there is some non-backwards compat change in one of the components between the current head, and the v1.1.0 tags. If I roll the runtime back to 1.1.0 for instance I can not launch docker containers. I've not figured out where the break was though - if anybody knows (and I suspect an agent/proxy relationship maybe), please speak up...


TODO: Find a concrete example from Kata rather than from an external source.

### R3. Enums are Problematic
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not looked at gRPC enums (and I'm presuming this is only relating to gRPC enums) - but is there a certain class of enums we are talking about here. If one had a linear/sequential set of enums that only ever grew on the end for instance, is that problem free?

### R4. Rarely Change Field Types

Related to the R1, while not always strictly unsafe it is usually unwise to
change the type of an existing field. In most cases it is better to inroduce a
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/inroduce/introduce/

@jodh-intel
Copy link
Copy Markdown

Hi @jon,

Please could you update your commit as if you look at the Travis log:

Found 1 commit between commit 71a506c739f69ea6a431ba9f3bfe6e2bff7816f3 and branch master
ERROR: Commit 71a506c739f69ea6a431ba9f3bfe6e2bff7816f3: pure whitespace body
ERROR: checkcommits failed. See the document below for help on formatting
commits for the project.
https://github.com/kata-containers/community/blob/master/CONTRIBUTING.md#patch-format

@raravena80
Copy link
Copy Markdown
Member

@jon ping, any updates? Thx!

@jodh-intel
Copy link
Copy Markdown

Hi @jon - could you tal at this when you get a chance? 😄

@grahamwhaley
Copy link
Copy Markdown
Contributor

Hi @jon - time for a nudge then. Any thoughts on plan or ETA here? Is it worth doing a quick parse and landing this, and then re-visiting to add to or spruce up more later?

@jon
Copy link
Copy Markdown
Contributor Author

jon commented Oct 4, 2018 via email

@raravena80
Copy link
Copy Markdown
Member

@jon ping from the kata herder :)

@sboeuf
Copy link
Copy Markdown

sboeuf commented Nov 5, 2018

@jon what's the status here?

@jodh-intel
Copy link
Copy Markdown

Yep - I'm afraid this PR has the dubious honour of being the second oldest (behind the vintage kata-containers/proxy#74).

If anyone has cycles, we could invoke the assisted process on this one?

@raravena80
Copy link
Copy Markdown
Member

@jon ping (from your weekly Kata herder)

@jodh-intel
Copy link
Copy Markdown

Hi @jon - this is your friendly periodic ping for this ol' 🌰 (<-- :chestnut: 😄)

@raravena80
Copy link
Copy Markdown
Member

@jon any chance to take a look at this?

@jodh-intel
Copy link
Copy Markdown

🦗's 😄

@jodh-intel
Copy link
Copy Markdown

Nice! Slack shows this emoji as I intended it! :)

@jodh-intel
Copy link
Copy Markdown

Unless we have any volunteers to update this using the assisted workflow, I propose we close this.

@grahamwhaley
Copy link
Copy Markdown
Contributor

let's hear from @jon - is this good to merge as is (better than nothing), has it rotted, or does it need a mild refresh. Guide us @jon on what to do with this pls :-)

@raravena80
Copy link
Copy Markdown
Member

@jon any updates? 😄

@grahamwhaley
Copy link
Copy Markdown
Contributor

@jon - heeelllooowweeee.... :-) @kata-containers/architecture-committee - can we make a decision, to either:

  • have some volunteer(s) take this through assisted cleanup (pretty much probably apply the feedback as is and merge)
  • or maybe @jon can fix this??
  • or, can/shall we close this down?

@jodh-intel
Copy link
Copy Markdown

Hi @jon and @kata-containers/architecture-committee. This PR has now been open for 291 days. Since the requested feedback still hasn't been applied and since nobody has come forward to handle this (I think you are in fact the only one who could do this given your knowledge of gRPC), I'm going to close it.

Of course "close" does not mean "delete", so it can be reopened at any time if anyone has spare cycles at some point.

@jodh-intel jodh-intel closed this May 10, 2019
devimc pushed a commit to devimc/kata-documentation that referenced this pull request Sep 2, 2019
@egernst
Copy link
Copy Markdown
Member

egernst commented Oct 30, 2019

/cc @bergwolf @gnawux @sameo

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants