-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Description
Hi,
At Atlassian we are a pretty big user of envoy (a few thousand services). We recently upgraded from 1.12.3 to 1.14.1 and we discovered that the circuit breaker behavior, given our existing configuration, had changed.
Our config up to that point had intended to disable circuit breakers on the ingress by default, and so we had:
circuit_breakers:
thresholds:
- priority: "DEFAULT"
max_connections: 0xffffffff
We discovered that as of 1.14.1 there was a circuit breaker that was limiting the number of concurrent connections through the ingress. We believe this came about as a result of this change:
upstream: combined HTTP/1 and HTTP/2 connection pool code. This means that circuit breaker limits for both requests and connections apply to both pool types. Also, HTTP/2 now has the option to limit concurrent requests on a connection, and allow multiple draining connections. The old behavior is deprecated, but can be used during the deprecation period by disabling runtime feature envoy.reloadable_features.new_http1_connection_pool_behavior or envoy.reloadable_features.new_http2_connection_pool_behavior and then re-configure your clusters or restart Envoy. The behavior will not switch until the connection pools are recreated. The new circuit breaker behavior is described here.
In order for us to restore the same behavior we needed to change our config to:
circuit_breakers:
thresholds:
- priority: "DEFAULT"
max_connections: 0xffffffff
max_requests: 0xffffffff
We got lucky on this, because one of our teams happened to be doing load testing just when this change was undergoing soaking in a development environment. They reported that during their test the success rate of the calling service went from 98% to 4% or so. As a result, the version bump never got rolled out all the way, but it was a close call for us.
We reviewed the release notes at the time, but the full implications of the change did not occur to us. We'd like to propose that in the future backwards incompatible changes be spelled out more clearly in the release notes, and when changes to config are required to preserve existing behavior those instructions are given explicitly as well.