E2E Zero DownTime Test when upgrading Envoy Proxy Versions #2610
Closed
Description
I executed a naive test:
- Environment: kind, metallb, EG quickstart.yaml
- envoy proxy replicas: 2
- upgrade: 0.6.0 => 0.0.0-latest using helm upgrade
- load simulation during upgrade:
hey -c 100 -q 10 -z 300s -host www.example.com http://172.18.255.200/
The upgrade caused some client-facing failures during the test:
Error distribution:
[8] Get "http://172.18.255.200/": EOF
[32] Get "http://172.18.255.200/": dial tcp 172.18.255.200:80: connect: connection refused
[1] Get "http://172.18.255.200/": read tcp 172.18.0.1:55220->172.18.255.200:80: read: connection reset by peer
[1] Get "http://172.18.255.200/": read tcp 172.18.0.1:55260->172.18.255.200:80: read: connection reset by peer
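To put the error counts above in perspective, here is a rough tally in Python. It assumes that hey's -q flag is a per-worker rate limit, so the command above issues at most 100 * 10 * 300 requests. The real request count was not captured in this test, so the resulting failure fraction is only a lower bound, not a measured rate.

```python
# Error counts copied from the "Error distribution" output above.
# The two "connection reset by peer" lines are merged into one bucket.
errors = {
    "EOF": 8,
    "connection refused": 32,
    "connection reset by peer": 2,
}

total_failures = sum(errors.values())

# Assumption: hey -c 100 -q 10 -z 300s means 100 workers, each capped at
# 10 requests per second, for 300 seconds. This is an upper bound on the
# number of requests sent, not the number actually observed.
workers, qps_per_worker, duration_s = 100, 10, 300
max_requests = workers * qps_per_worker * duration_s

# Because max_requests is an upper bound, this fraction is a lower bound
# on the real failure rate.
min_failure_fraction = total_failures / max_requests

print(f"failures: {total_failures}")
print(f"max requests: {max_requests}")
print(f"failure fraction (lower bound): {min_failure_fraction:.3%}")
```

Most of the failures are connection refused, which points at new connections arriving while no proxy was accepting them, rather than at in-flight requests being cut off.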
It's probably possible to tune some of the parameters mentioned in my previous comment to achieve a hitless upgrade under certain test conditions (RPS, connection reuse, HTTP version, ...). But I'm not sure that we can claim to have a hitless upgrade in general based on such a test.
So, I propose that for the GA scope, we focus on an upgrade test that ensures requests converge to successful execution after the upgrade. A limited hitless upgrade test can be a stretch goal.
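A minimal sketch of what a "convergence" check could look like, as opposed to a hitless check. The function names and the window parameter are hypothetical and are not part of the Envoy Gateway test suite; the idea is only that failures during the upgrade are tolerated as long as the most recent requests all succeed.

```python
from typing import Sequence


def has_converged(outcomes: Sequence[bool], window: int) -> bool:
    """Return True if the final `window` requests all succeeded.

    `outcomes` is an ordered record of requests sent during and after the
    upgrade: True for a successful response, False for any failure.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(outcomes) < window:
        # Not enough samples yet to claim convergence.
        return False
    return all(outcomes[-window:])


def is_hitless(outcomes: Sequence[bool]) -> bool:
    """Stricter check: no request failed at any point."""
    return all(outcomes)


# Example: two failures during the upgrade, then steady success.
sample = [True, True, False, False, True, True, True, True]
print(has_converged(sample, window=4))
print(is_hitless(sample))
```

With this split, the GA test would assert has_converged after the upgrade finishes, while the stretch-goal test would assert is_hitless over the whole run.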
In the future, we can explore:
- Implementing a graceful envoy shutdown feature and providing guidance on configuring envoy for hitless in-place upgrades
- Supporting canary deployments
WDYT?
Originally posted by @guydc in #1712 (comment)