E2E Zero DownTime Test when upgrading Envoy Proxy Versions #2610

@arkodg

I executed a naive test:

  • Environment: kind, metallb, EG quickstart.yaml
  • envoy proxy replicas: 2
  • upgrade: 0.6.0 => 0.0.0-latest using helm upgrade
  • load simulation during upgrade: hey -c 100 -q 10 -z 300s -host www.example.com http://172.18.255.200/

The upgrade caused some client-facing failures during the test:

Error distribution:
  [8]	Get "http://172.18.255.200/": EOF
  [32]	Get "http://172.18.255.200/": dial tcp 172.18.255.200:80: connect: connection refused
  [1]	Get "http://172.18.255.200/": read tcp 172.18.0.1:55220->172.18.255.200:80: read: connection reset by peer
  [1]	Get "http://172.18.255.200/": read tcp 172.18.0.1:55260->172.18.255.200:80: read: connection reset by peer
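
For a rough sense of scale, the counts above can be totalled by failure kind and compared with the most requests this hey invocation could have sent. This is a minimal Python sketch, assuming hey's `-c` is the number of concurrent workers, `-q` is a per-worker rate limit in queries per second, and `-z` is the run duration. The number of requests actually sent was not recorded here, so the resulting ratio is only a floor on the failure rate. The helper names are illustrative.

```python
import re
from collections import Counter

# Error distribution copied from the hey output above.
ERROR_DISTRIBUTION = """\
  [8]  Get "http://172.18.255.200/": EOF
  [32] Get "http://172.18.255.200/": dial tcp 172.18.255.200:80: connect: connection refused
  [1]  Get "http://172.18.255.200/": read tcp 172.18.0.1:55220->172.18.255.200:80: read: connection reset by peer
  [1]  Get "http://172.18.255.200/": read tcp 172.18.0.1:55260->172.18.255.200:80: read: connection reset by peer
"""


def count_by_kind(text):
    """Sum the bracketed counts, keyed by the trailing error reason."""
    totals = Counter()
    for line in text.splitlines():
        match = re.match(r"\s*\[(\d+)\]\s+(.*)", line)
        if not match:
            continue
        count, message = int(match.group(1)), match.group(2)
        # The reason is whatever follows the last ": " (e.g. "connection refused").
        reason = message.rsplit(": ", 1)[-1]
        totals[reason] += count
    return dict(totals)


def max_requests(workers, qps_per_worker, duration_s):
    """Upper bound on requests sent if every worker sustains its rate limit."""
    return workers * qps_per_worker * duration_s


by_kind = count_by_kind(ERROR_DISTRIBUTION)
total_failures = sum(by_kind.values())
# Values taken from: hey -c 100 -q 10 -z 300s
request_ceiling = max_requests(workers=100, qps_per_worker=10, duration_s=300)
min_failure_ratio = total_failures / request_ceiling

if __name__ == "__main__":
    print(by_kind)
    print(total_failures, request_ceiling, min_failure_ratio)
```

Under those assumptions, the 42 failures are spread over at most 300,000 requests, so the failure rate is at least roughly 0.014%. Most of the failures are refused connections rather than resets or EOFs.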

It's probably possible to tune some of the parameters mentioned in my previous comment to achieve a hitless upgrade under certain test conditions (RPS, connection reuse, HTTP version, ...). However, I'm not sure we can claim a hitless upgrade in general based on such a test.

So, for the GA scope, I propose we focus on an upgrade test that ensures requests converge to successful execution after the upgrade. A limited hitless upgrade test can be a stretch goal.
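
One way to read "request convergence" is that failures during the rollout are tolerated, but once the upgrade finishes, requests must succeed consistently. Below is a minimal Python sketch of such a check. The function names, the consecutive-success threshold, and the timeout are hypothetical and are not part of Envoy Gateway's test suite. The probe is injected so the logic can be exercised without a cluster.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_probe(url, host_header, timeout=2.0):
    """Return True if a GET to url with the given Host header returns 2xx."""
    request = urllib.request.Request(url, headers={"Host": host_header})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, resets, EOFs, and non-2xx responses.
        return False


def wait_for_convergence(probe, required_successes=10, timeout_s=60.0,
                         interval_s=0.5, clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it succeeds required_successes times in a row.

    Returns True on convergence, False if timeout_s elapses first.
    """
    deadline = clock() + timeout_s
    streak = 0
    while clock() < deadline:
        if probe():
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0  # any failure resets the streak
        sleep(interval_s)
    return False
```

In an upgrade test this might be called once `helm upgrade` returns, for example `wait_for_convergence(lambda: http_probe("http://172.18.255.200/", "www.example.com"))`, with the test failing if it returns False. Requiring a streak rather than a single success avoids passing on a lucky request while old proxy pods are still terminating.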

In the future, we can explore:

  • Implementing a graceful Envoy shutdown feature and providing guidance on configuring Envoy for hitless in-place upgrades
  • Supporting canary deployments

WDYT?

Originally posted by @guydc in #1712 (comment)
