E2E Zero DownTime Test when upgrading Envoy Proxy Versions


I executed a naive test:
- Environment: kind, MetalLB, EG `quickstart.yaml`
- Envoy Proxy replicas: 2
- Upgrade: 0.6.0 => 0.0.0-latest using `helm upgrade`
- Load simulation during upgrade: `hey -c 100 -q 10 -z 300s -host www.example.com http://172.18.255.200/`

The upgrade caused some client-facing failures during the test:

```
Error distribution:
  [8]	Get "http://172.18.255.200/": EOF
  [32]	Get "http://172.18.255.200/": dial tcp 172.18.255.200:80: connect: connection refused
  [1]	Get "http://172.18.255.200/": read tcp 172.18.0.1:55220->172.18.255.200:80: read: connection reset by peer
  [1]	Get "http://172.18.255.200/": read tcp 172.18.0.1:55260->172.18.255.200:80: read: connection reset by peer
```
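
To put the error counts in perspective, here is a rough sketch that totals them and compares them with an upper bound on the number of requests `hey` could have sent. The bound assumes every one of the 100 workers sustained its 10 QPS rate limit for the full 300 s. The real request total is not included in the output above, so the true failure fraction may be higher than this estimate.

```python
import re

# Error distribution copied from the hey output above.
HEY_ERRORS = """\
  [8] Get "http://172.18.255.200/": EOF
  [32] Get "http://172.18.255.200/": dial tcp 172.18.255.200:80: connect: connection refused
  [1] Get "http://172.18.255.200/": read tcp 172.18.0.1:55220->172.18.255.200:80: read: connection reset by peer
  [1] Get "http://172.18.255.200/": read tcp 172.18.0.1:55260->172.18.255.200:80: read: connection reset by peer
"""


def count_errors(text: str) -> int:
    """Sum the bracketed counts at the start of each error line."""
    return sum(int(m.group(1)) for m in re.finditer(r"^\s*\[(\d+)\]", text, re.MULTILINE))


# hey flags from the test: -c 100 (workers), -q 10 (QPS limit per worker), -z 300s (duration).
WORKERS, QPS_PER_WORKER, DURATION_S = 100, 10, 300

# Upper bound only: assumes no worker ever fell below its rate limit.
max_requests = WORKERS * QPS_PER_WORKER * DURATION_S

failed = count_errors(HEY_ERRORS)

# Lower bound on the failure fraction, because max_requests is an upper bound.
min_failure_fraction = failed / max_requests

print(f"failed requests: {failed}")
print(f"requests sent (upper bound): {max_requests}")
print(f"failure fraction (lower bound): {min_failure_fraction:.5%}")
```

Most of the failures are `connection refused`, which would be consistent with new connections arriving while no listener was accepting them. That reading is an inference from the error text and not something the test measured directly.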
It's probably possible to tune some of the parameters mentioned in my previous comment to achieve a hitless upgrade under certain test conditions (RPS, connection reuse, HTTP version, ...). However, I'm not sure we can claim to have a hitless upgrade in general based on such a test.

So I propose that, for the GA scope, we focus on an upgrade test that verifies requests converge to succeeding once the upgrade has completed. A limited hitless upgrade test can be a stretch goal.
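
As a minimal sketch of what "convergence" could mean in such a test, the check below passes if every probe sent after the upgrade finished (plus an optional grace period) succeeded. Failures during the upgrade are tolerated. The `Probe` type, the `converged` function, and the `grace_s` parameter are hypothetical names for illustration and are not part of the existing e2e suite.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Probe:
    """One request sent by the load generator (hypothetical record type)."""
    t: float   # seconds since the test started
    ok: bool   # True if the request got a successful response


def converged(probes: Iterable[Probe], upgrade_done_at: float, grace_s: float = 0.0) -> bool:
    """Return True if all probes sent after the upgrade (plus grace) succeeded.

    Requires at least one probe in that window, so an empty tail does not
    count as a pass.
    """
    tail = [p for p in probes if p.t > upgrade_done_at + grace_s]
    return bool(tail) and all(p.ok for p in tail)


# Example: one failure during the upgrade, then only successes.
example = [Probe(0.0, True), Probe(1.0, False), Probe(2.0, True), Probe(3.0, True)]
print(converged(example, upgrade_done_at=1.5))
```

A hitless variant of the same check would simply require `all(p.ok for p in probes)` over the whole run, which is the stricter stretch goal.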

In the future, we can explore:
- Implementing a graceful Envoy shutdown feature and providing guidance on configuring Envoy for hitless in-place upgrades
- Supporting canary deployments

WDYT?

_Originally posted by @guydc in https://github.com/envoyproxy/gateway/issues/1712#issuecomment-1944561296_
            

E2E Zero DownTime Test when upgrading Envoy Proxy Versions #2610
