http/2: better handling for messaging error resets#3140
Conversation
This patch stems from an operational issue at Lyft that took a really long time to debug. The issue was a client sending a content-length along with a 204 response which is against spec and nghttp2 does not like. There are a few different changes here: 1) Treat all messaging errors as stream level errors vs. connection level errors. 2) Increment a specific stat so its easier to debug these cases. 3) Charge resets that stem from these issues as local resets vs. remote resets which is what was happening previously. Signed-off-by: Matt Klein <mklein@lyft.com>
|
@alyssawilk PTAL when you have time. Not urgent. |
alyssawilk
left a comment
There was a problem hiding this comment.
+100 to not always killing the connection on bad individual requests. I'm a bit fuzzy on the details so mostly questions here.
| if (error_code == NGHTTP2_ERR_HTTP_HEADER) { | ||
| // The stream is about to be closed due to an invalid header or messaging. Don't kill the | ||
| // entire connection if one stream has bad headers or messaging. | ||
| if (error_code == NGHTTP2_ERR_HTTP_HEADER || error_code == NGHTTP2_ERR_HTTP_MESSAGING) { |
There was a problem hiding this comment.
Does nghttp2 save us from incomplete hpack state? If the stream is closed based on an error in an initial headers frame but there's a contination, does it do the right thing or can we get wedged? Worth adding a test for multi-frame request headers?
There was a problem hiding this comment.
I'm fairly certain it will do the right thing. Let me see about adding a test case for that. I'm not sure how hard it will be as nghttp2 generally hides continuation frames from us.
There was a problem hiding this comment.
I think a really large header frames should work, worst case.
There was a problem hiding this comment.
HTTP semantics error occurs after successful HPACK decoding and its state is completely fine even with those errors.
There was a problem hiding this comment.
Great to hear, thank you for chiming in.
Curious, does that mean if the request headers are broken into multiple frames, we'll not send the stream cancellation until processing all of the headers, or we kill the stream off right away but continue to update HPACK state for internal consistency?
There was a problem hiding this comment.
Yes. Unless we get unrecoverable error during HPACK decoding, we will feed incoming bits into HPACK even if its output is just thrown away.
| ? StreamResetReason::RemoteRefusedStreamReset | ||
| : StreamResetReason::RemoteReset); | ||
| StreamResetReason reason; | ||
| if (stream->reset_due_to_messaging_error_) { |
There was a problem hiding this comment.
https://nghttp2.org/documentation/types.html#c.nghttp2_on_stream_close_callback indicates that error_code is generally nghttp2_error_code here.
Are there cases where we'll get onInvalidFrame(NGHTTP2_ERR_HTTP_MESSAGING) and onStreamClose(SomeOtherError)? I'm not terribly familiar with the internals of NGHTTP2_ERR_HTTP_MESSAGING but I wouldn't have assumed all messaging errors triggered invalidFrame so I would assume setting reset_due_to_messaging_error_ where we reset the stream would more often be correct.
There was a problem hiding this comment.
Unfortunately this is the issue that I was commenting on re: gymnastics of doing this change. By the time we get to onStreamClose(), we only get one of these errors: https://nghttp2.org/documentation/enums.html#c.nghttp2_error_code which are RFC wire errors. So all lib errors (https://nghttp2.org/documentation/enums.html#c.nghttp2_error) have been collapsed into a generic error. It would be nice if nghttp2 also gave us the lib error that caused the wire error but I don't think there is any way to get that. What I did here is I think the "cleanest" way of moving from lib error to reset reason. AFAIK, all messaging errors will trigger invalid frame, because a messaging error came from some frame (headers, data, etc.) so that frame will be rejected before it is dispatched.
@tatsuhiro-t in case you have a few moments to review this change, is there any cleaner way of accomplishing what I'm trying to do?
There was a problem hiding this comment.
Argh, I was conflating nghttp2_error and nghttp2_error_code
Totally PEBKAC but I'm going to claim it's an understandable mistake :-P
There was a problem hiding this comment.
I think this PR is best thing I can think of if you want to treat HTTP messaging error differently.
alyssawilk
left a comment
There was a problem hiding this comment.
(though I still wouldn't mind the unit test being added just in case =P)
Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
|
@alyssawilk I modified one of the tests to force a continuation frame. PTAL. |
alyssawilk
left a comment
There was a problem hiding this comment.
Awesome, thank you for doing the extra work there :-)
This reverts commit e293ffb. Signed-off-by: Matt Klein <mklein@lyft.com>
This patch stems from an operational issue at Lyft that took a really long time to debug. The issue was a client sending a content-length along with a 204 response which is against spec and nghttp2 does not like. There are a few different changes here: 1) Treat all messaging errors as stream level errors vs. connection level errors. 2) Increment a specific stat so its easier to debug these cases. 3) Charge resets that stem from these issues as local resets vs. remote resets which is what was happening previously. Signed-off-by: Matt Klein <mklein@lyft.com> Signed-off-by: Rama <rama.rao@salesforce.com>
…y#3140)" (envoyproxy#3213) This reverts commit e293ffb. Signed-off-by: Matt Klein <mklein@lyft.com> Signed-off-by: Rama <rama.rao@salesforce.com>
…nvoyproxy#3140)" (envoyproxy#3213)" (envoyproxy#3233) This reverts commit 3b4badd. Signed-off-by: Matt Klein <mklein@lyft.com> Signed-off-by: Rama <rama.rao@salesforce.com>
This patch stems from an operational issue at Lyft that took a really long
time to debug. The issue was a client sending a content-length along with a
204 response which is against spec and nghttp2 does not like.
There are a few different changes here:
errors.
resets which is what was happening previously.
Risk Level: Low/Medium (I think the risk is low, but changing this code is always
a bit scary).
Testing: Additional/fixed unit tests.
Docs Changes: New stat added.
Release Notes: N/A