Skip to content

fix regression in router that can cause a crash on response timeout#7192

Merged
mattklein123 merged 4 commits intoenvoyproxy:masterfrom
mpuncel:mpuncel/fix-timeout-regression
Jun 7, 2019
Merged

fix regression in router that can cause a crash on response timeout#7192
mattklein123 merged 4 commits intoenvoyproxy:masterfrom
mpuncel:mpuncel/fix-timeout-regression

Conversation

@mpuncel
Copy link
Copy Markdown
Contributor

@mpuncel mpuncel commented Jun 6, 2019

Prior to the recent retry hedging change, upstream requests were always
reset as they should be when a response timeout is triggered. The retry
hedging change inadvertently changed the logic so that the upstream
request is only reset if it hadn't seen headers. However it's possible
to hit the timeout after seeing headers and before end_stream which
would result in a crash as the body or trailers are attempted to be
decoded using a null pointer.

Signed-off-by: Michael Puncel mpuncel@squareup.com

For an explanation of how to fill out the fields, please see the relevant section
in PULL_REQUESTS.md

Description: Fixes a regression in the router that can cause a crash when an upstream request has seen headers but not end_stream when a response timeout is hit
Risk Level: low
Testing: integration test
Docs Changes: N/A
Release Notes: N/A
Fixes #7154

Prior to the recent retry hedging change, upstream requests were always
reset as they should be when a response timeout is triggered. The retry
hedging change inadvertently changed the logic so that the upstream
request is only reset if it hadn't seen headers. However it's possible
to hit the timeout after seeing headers and before end_stream which
would result in a crash as the body or trailers are attempted to be
decoded using a null pointer.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
timeSystem().sleep(std::chrono::milliseconds(200));

// Respond with body.
upstream_request_->encodeData(100, true);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this crashed before the fix was applied

Copy link
Copy Markdown
Member

@lizan lizan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, just one nit.

{":authority", "host"},
{"x-forwarded-for", "10.0.0.1"},
{"x-envoy-upstream-rq-timeout-ms", "100"}};
auto encoder_decoder = codec_client_->startRequest(request_headers);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use auto only if the type is obvious (e.g. cast, make_unique/shared, assigning literals).

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks. LGTM modulo @lizan comment.

alyssawilk
alyssawilk previously approved these changes Jun 6, 2019
Copy link
Copy Markdown
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks both for the fix and the regression test!

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
lizan
lizan previously approved these changes Jun 6, 2019
@rgs1
Copy link
Copy Markdown
Member

rgs1 commented Jun 6, 2019

Testing this with traffic that triggered the crash for us and thus far, no more crashes 👍

@lizan
Copy link
Copy Markdown
Member

lizan commented Jun 6, 2019

/retest

@repokitteh-read-only
Copy link
Copy Markdown

🔨 rebuilding ci/circleci: ipv6_tests (failed build)

🐱

Caused by: a #7192 (comment) was created by @lizan.

see: more, trace.

@lizan
Copy link
Copy Markdown
Member

lizan commented Jun 6, 2019

@mpuncel
Copy link
Copy Markdown
Contributor Author

mpuncel commented Jun 7, 2019

Unfortunately I can't seem to reproduce that in gdb or figure out line numbers from stack_decode.py which means I'm kind of stuck. It does reproduce about 20 times in 1000 though on my machine

Sending a body causes the ipv6 test to segfault for some reason, perhaps
that's not a sane thing to try in the integration tests (send data after
reset). Asserting a reset is just as good from a testing standpoint
since it verifies the correct router behavior.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@mpuncel
Copy link
Copy Markdown
Contributor Author

mpuncel commented Jun 7, 2019

OK think I lucked out with a guess on what was going on. Now it passes after 1000 runs. Maybe the way the fake upstreams work trying to write a body after a reset is a recipe for badness

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@mattklein123 mattklein123 merged commit 9d47062 into envoyproxy:master Jun 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

seeing crash starting after syncing with 8d1ad35aa

5 participants