fix regression in router that can cause a crash on response timeout#7192
Conversation
Prior to the recent retry hedging change, upstream requests were always reset as they should be when a response timeout is triggered. The retry hedging change inadvertently changed the logic so that the upstream request is only reset if it hadn't seen headers. However it's possible to hit the timeout after seeing headers and before end_stream which would result in a crash as the body or trailers are attempted to be decoded using a null pointer. Signed-off-by: Michael Puncel <mpuncel@squareup.com>
| timeSystem().sleep(std::chrono::milliseconds(200)); | ||
|
|
||
| // Respond with body. | ||
| upstream_request_->encodeData(100, true); |
There was a problem hiding this comment.
this crashed before the fix was applied
| {":authority", "host"}, | ||
| {"x-forwarded-for", "10.0.0.1"}, | ||
| {"x-envoy-upstream-rq-timeout-ms", "100"}}; | ||
| auto encoder_decoder = codec_client_->startRequest(request_headers); |
There was a problem hiding this comment.
use auto only if the type is obvious (e.g. cast, make_unique/shared, assigning literals).
mattklein123
left a comment
There was a problem hiding this comment.
Nice, thanks. LGTM modulo @lizan comment.
alyssawilk
left a comment
There was a problem hiding this comment.
Thanks both for the fix and the regression test!
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
|
Testing this with traffic that triggered the crash for us and thus far, no more crashes 👍 |
|
/retest |
|
🔨 rebuilding |
|
Unfortunately I can't seem to reproduce that in gdb or figure out line numbers from stack_decode.py which means I'm kind of stuck. It does reproduce about 20 times in 1000 though on my machine |
Sending a body causes the ipv6 test to segfault for some reason, perhaps that's not a sane thing to try in the integration tests (send data after reset). Asserting a reset is just as good from a testing standpoint since it verifies the correct router behavior. Signed-off-by: Michael Puncel <mpuncel@squareup.com>
|
OK think I lucked out with a guess on what was going on. Now it passes after 1000 runs. Maybe the way the fake upstreams work trying to write a body after a reset is a recipe for badness |
Prior to the recent retry hedging change, upstream requests were always
reset as they should be when a response timeout is triggered. The retry
hedging change inadvertently changed the logic so that the upstream
request is only reset if it hadn't seen headers. However it's possible
to hit the timeout after seeing headers and before end_stream which
would result in a crash as the body or trailers are attempted to be
decoded using a null pointer.
Signed-off-by: Michael Puncel mpuncel@squareup.com
For an explanation of how to fill out the fields, please see the relevant section
in PULL_REQUESTS.md
Description: Fixes a regression in the router that can cause a crash when an upstream request has seen headers but not end_stream when a response timeout is hit
Risk Level: low
Testing: integration test
Docs Changes: N/A
Release Notes: N/A
Fixes #7154