Kata Components should support "Live-Upgrade" #492
Description
This is from the mailing list earlier; I think it's better to sync it up to a GitHub issue for easier tracking.
====================
I also mentioned this in Vancouver: in my opinion, a breakage between kata-agent and kata-runtime should always be considered a backward-compatibility breakage.
This kind of breakage is a "gap" between "project" and "product" for Kata Containers, and I'll elaborate on why here.
Starting from our requirements for a mature cloud product built on Kata: we have SLAs with our customers, which means we can't shut down customers' services while we are updating Kata components. This capability is called "live-upgrade", and it means that running a kata-runtime and a kata-agent of different versions is very likely to happen:
- 1) New runtime + old agent: kata-runtime is updated while a VM with the old agent is still running; kata-runtime must not issue a command that would crash the old agent.
- 2) Old runtime + new agent: we roll back because the new Kata version has issues; by then some services may already have been started with the new agent, so the new agent must always handle commands from the old runtime.
So what happens if we miss 1) and 2)? We have to shut down users' running workloads whenever we want to upgrade or downgrade the Kata components, which makes our SLA a joke.
(Of course, we could also notify users and let them shut down their workloads themselves, but we definitely hope to do better and go further.)
So to guarantee the "live-upgrade" ability of the Kata components (meaning we can install Kata RPM packages while workloads are still running), what we need from each component is:
1) kata-runtime:
A. Issue "versioned" commands to kata-agent, so it can always communicate correctly with an old kata-agent. (MUST)
B. Persisted on-disk data must be "versioned", so kata-runtime can always handle an old "version" of the persisted data when restoring sandbox/container structs from disk to memory. (MUST)
2) kata-agent:
The protocol needs to be versioned so the agent can always handle commands from an old kata-runtime. ("Versioned" may be achievable by leveraging protobuf.) (MUST)
3) kata-shim/kata-proxy:
These are daemon processes and don't need to be shut down while the Kata RPM package is being updated, so I don't see a problem currently; we just need to guarantee the interaction between kata-runtime and the shim/proxy keeps working. (MUST)
4) qemu:
A. Current status: there is NO WAY to upgrade now; running workloads must be shut down before installing a newer version of the qemu RPM package. (IMPOSSIBLE)
B. In the future: qemu live-migration, live-replacement, live-patching, etc. (BETTER HAVE)
5) guest kernel:
A. Current status: after installing a Kata RPM package with a newer VM image, old workloads keep running on the old kernel, and newly started workloads use the new VM kernel. That's fine. (ALREADY HAVE)
B. In the future: live patching. (BETTER HAVE)
Summary
- We already break backward compatibility, and we will definitely break it a lot more in the near future. In Vancouver, the participants all agreed that we can't promise the API won't break and that the current API isn't a stable version.
- Before we claim that Kata supports "live upgrade" and is truly production-ready, I'm fine with the breakage, and fine with either 1.0.1 or 1.1.0; the latter probably looks better.
- After we claim that Kata supports "live upgrade", we should reject any modification that would break running workloads unless it is truly inevitable; in that case, we need to bump the Kata version from x.0.0 to y.0.0.
But I hope our Kata developers understand what a disaster this could be for a cloud provider like us :-(, and I hope it never happens.
- Until then, better to document that we don't support "live upgrade" yet, and tell users that if they want to upgrade to a new kata-containers version, they must stop all their running Kata containers first, or they should expect issues.