Skip to content

common: http2 codec avoids accessing decoder_ (ActiveStream) if nghttp2 returns failure#6842

Closed
soya3129 wants to merge 7 commits intoenvoyproxy:masterfrom
soya3129:decoder
Closed

common: http2 codec avoids accessing decoder_ (ActiveStream) if nghttp2 returns failure#6842
soya3129 wants to merge 7 commits intoenvoyproxy:masterfrom
soya3129:decoder

Conversation

@soya3129
Copy link
Copy Markdown
Contributor

@soya3129 soya3129 commented May 7, 2019

Signed-off-by: Yang Song yasong@google.com

Description: When nghttp2 callback function fails, HCM calls resetAllStreams(here) to delete ActiveStreams. If a client keep pushing data, before the connection is closed, code_->dispatch() will keep processing data, which may refer to decoder_ (aka: ActiveStream). This PR checks if ActiveStream has been deleted before accessing. The problem is discovered in metadata change #5656
Risk Level: Low. No behavior change.
Testing: In #5656, test ProxyMultipleMetadataReachSizeLimit shows without the protection, the test can crash. The test will be submitted with #5656 instead of this PR since it needs the metadata change.

@alyssawilk
Copy link
Copy Markdown
Contributor

Thanks for splitting this out! I'll take a look tomorrow but in the meantime do we have coverage of the error paths?

@soya3129
Copy link
Copy Markdown
Contributor Author

soya3129 commented May 7, 2019

The test ProxyMultipleMetadataReachSizeLimit in #5656 covers the error path. Since the test needs the metadata change, I didn't include it in the PR.

@htuch
Copy link
Copy Markdown
Member

htuch commented May 8, 2019

@soya3129 arguably #6852 is the fix here; HCM shouldn't be having any data pushed to it after the error.

@alyssawilk
Copy link
Copy Markdown
Contributor

I'm not sure it's a case where the data is getting pushed from the connection to the codec, it could be that the codec got data including multiple frames. I think if we we hit this nghttp2 is processing after getting an error response from a callback and we should look into fixing that "upstream", but in the interim I'm more than happy to have our code provide defense in depth against nghttp2 bugs. WDYT?

@soya3129
Copy link
Copy Markdown
Contributor Author

soya3129 commented May 8, 2019

Correction: the test covers the error path is RequestMetadataReachSizeLimit instead of ProxyMultipleMetadataReachSizeLimit.

The log from the test shows: after resetAllStreams is called, the connection still sets "read ready" and calls ConnectionImpl::onRead(). Does that mean the connection is still pushing data to HCM?

53 [2019-05-08 15:37:54.414][29][error][connection] [source/common/network/connection_impl.cc:489] [C7] read ready
54 [2019-05-08 15:37:54.414][29][error][connection] [source/common/network/connection_impl.cc:247] [C7] ConnectionImpl::onRead called
55 [2019-05-08 15:37:54.414][29][error][misc] [source/common/http/conn_manager_impl.cc:258] ConnectionManagerImpl::onData is called, HCM: 0x55f952a1a100
56 [2019-05-08 15:37:54.421][29][error][misc] [source/common/http/conn_manager_impl.cc:286] ConnectionManagerImpl::onData calls resetAllStreams, HCM: 0x55f952a1a100
57 [2019-05-08 15:37:54.422][29][error][connection] [source/common/network/connection_impl.cc:489] [C7] read ready
58 [2019-05-08 15:37:54.422][29][error][connection] [source/common/network/connection_impl.cc:247] [C7] ConnectionImpl::onRead called
59 [2019-05-08 15:37:54.422][29][error][misc] [source/common/http/conn_manager_impl.cc:258] ConnectionManagerImpl::onData is called, HCM: 0x55f952a1a100
60 [2019-05-08 15:37:54.422][29][critical][assert] [source/common/http/http2/codec_impl.cc:355] assert failure: false.

@alyssawilk
Copy link
Copy Markdown
Contributor

Yeah, if that's the use case, then Harvey's fix will fix. Right after the resetAllStreams there's a connection close(FlushWrite) which currently doesn't disable reads but will in future.

I still prefer we do error handling in more places rather than only one place, so still think this PR is worth landing. I tagged @mattklein123 for a second opinion.

@mattklein123
Copy link
Copy Markdown
Member

I still prefer we do error handling in more places rather than only one place, so still think this PR is worth landing. I tagged @mattklein123 for a second opinion.

My general feeling is that we shouldn't add extra error handling if we should never hit it, it just makes the code harder to reason about. @soya3129 do you want to apply #6852 locally and make sure it fixes your issue and then we can discuss further?

@soya3129
Copy link
Copy Markdown
Contributor Author

soya3129 commented May 9, 2019

I still prefer we do error handling in more places rather than only one place, so still think this PR is worth landing. I tagged @mattklein123 for a second opinion.

My general feeling is that we shouldn't add extra error handling if we should never hit it, it just makes the code harder to reason about. @soya3129 do you want to apply #6852 locally and make sure it fixes your issue and then we can discuss further?

Confirmed the latest version fixed the crash.

@mattklein123
Copy link
Copy Markdown
Member

My preference would be to close this then but if @alyssawilk @htuch feel strongly about it one way or the other will defer to them.

@alyssawilk
Copy link
Copy Markdown
Contributor

Yeah, if you're not dead set against I'd prefer to have the checks in two places, especially given the debate of how to not make the Network::Connection fixes fragile.. I'd be fine with ASSERTS of !nullptr "this should never happen" if you think that makes it more clear?

@mattklein123
Copy link
Copy Markdown
Member

I would strongly prefer ASSERTs if you are OK with that.

Copy link
Copy Markdown
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I meant both nullptr checks and asserts. Doable?

return 0;
}

if (stream != nullptr) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we don't have a test for this one. Can we?

@mattklein123
Copy link
Copy Markdown
Member

To be clear, I meant both nullptr checks and asserts. Doable?

This would be a violation of the error handling guidelines that we use everywhere else, so I'm not sure why we feel this code is any different than anywhere else, and I'm not sure how we are going to test this logic to make sure that things still "work" if things are null. So I'm still against adding this code but I'm going to fight it if you feel strongly about it. @htuch any thoughts here?

@htuch
Copy link
Copy Markdown
Member

htuch commented May 9, 2019

@mattklein123 isn't it fine to have ASSERT(foo != nullptr) providing we're not immediately dereferencing foo in the next line? It's not a redundant check in this situation but a potentially useful statement of expected internal state.

@mattklein123
Copy link
Copy Markdown
Member

mattklein123 commented May 9, 2019

@mattklein123 isn't it fine to have ASSERT(foo != nullptr) providing we're not immediately dereferencing foo in the next line? It's not a redundant check in this situation but a potentially useful statement of expected internal state.

IMO the fact that we are asserting something can't happen, and then doing an if check in case it happens, is a bad code smell. It adds complexity for state that we are saying shouldn't ever happen, right? What if it does happen somehow, are we sure things will "work?" Shouldn't we crash?

@alyssawilk
Copy link
Copy Markdown
Contributor

Given in this case we had a fuzzer which could repro this crash, to me it says that a determined attacker could create a global outage for companies running Envoy from when the bug was discovered until it was fixed.

I don't think we can afford to maintain a crash first, debug later mentality for Envoy. I think it's worth aiming for an intermediate "this should never happen and if it does it is a serious bug" without actually causing a global outage, which is why I keep pushing for defense in depth even if it looks like cruft.

@mattklein123
Copy link
Copy Markdown
Member

I don't think we can afford to maintain a crash first, debug later mentality for Envoy. I think it's worth aiming for an intermediate "this should never happen and if it does it is a serious bug" without actually causing a global outage, which is why I keep pushing for defense in depth even if it looks like cruft.

How do you know that with these if statements, things will actually work and not crash later? I don't see what this adds if we don't have tests that cover it. I also don't see how this is any different then any of the other existing bugs that we don't know about (I'm sure there are a bunch!).

Like I said, I don't feel this change adds any real value and adds complexity for no reason. But, if you all want to add it, I will stop there.

@alyssawilk
Copy link
Copy Markdown
Contributor

I'm 100% fine adding tests which cover rogue network::connections, and make sure we don't propagate calls after encountering an nghttp2 error. The connection class today (well when #6852 lands) won't do this, but we can totally reproduce the buggy behavior in unit tests since the connection class can be forced to send spurious data there. I'm all for regression testing that the hardening is useful.

@htuch
Copy link
Copy Markdown
Member

htuch commented May 9, 2019

Maybe let's see how the #6852 tests play out. I agree we want to be pretty paranoid here, it's a question of whether the tests we add for #6852 are enough to convince that the H2 codec isn't going to see bad behavior.

Philosophically, I agree with both the points that we can't crash if avoidable and also that we shouldn't have code paths that should never be executed. I think being able to have a connection level release assert, which allows us to limit blast radius of asserts is kind of an interesting middle ground. It might look in implementation similar to what exists there today, but be more generally useful in other parts of HCM and the codecs.

@mattklein123
Copy link
Copy Markdown
Member

I think being able to have a connection level release assert, which allows us to limit blast radius of asserts is kind of an interesting middle ground. It might look in implementation similar to what exists there today, but be more generally useful in other parts of HCM and the codecs.

Interesting idea. Would love to explore this.

In general, as long as we have test coverage for the paranoid paths, no objection from me.

@alyssawilk
Copy link
Copy Markdown
Contributor

Cool. Yang, can you add a test of the connection sending data after an error?

@soya3129
Copy link
Copy Markdown
Contributor Author

Sure!

Yang Song added 2 commits May 19, 2019 16:05
Signed-off-by: Yang Song <yasong@google.com>
Signed-off-by: Yang Song <yasong@google.com>
Signed-off-by: Yang Song <yasong@google.com>
@soya3129
Copy link
Copy Markdown
Contributor Author

I added a unit test to cover the connection sending data after a general nghttp2 callback returns error. There is an end-to-end test in #5656 covers the possible crash when handling metadata (RequestMetadataReachSizeLimit), but I am really struggling with creating a similar end-to-end test for onInvalidFrame(). The reason is, envoy sets up frames for nghttp2 to create, and nghttp2 doesn’t create invalid frames. I also tried to corrupt the frames at connection level. It does trigger onInvalidFrames(), but it doesn’t hit the needed return value (here). It returns here. Another reason that makes this hard is, in the current test frame work, we can easily send headers, data and metadata. But we can't easily send setting, window_update, etc, and those frame types may be easier to generate the needed return value. We also can't invalidate headers, because we need a valid stream. That leaves us with only data frames to mess with. The corruption I tried can't produce the needed return value.

Another way is to introduce a module allowing test code to reach into the production code, and forces onInvalidFrames() to return the needed value only when the test is running. I think that may be a general useful feature. But I don't think it complies with envoy's style.

Please let me know if you have any suggestions/ideas to create an e2e test with onInvalidFrames(). But I think the unit test covers the crash case with general nghttp2 callback functions. What do you think?

Signed-off-by: Yang Song <yasong@google.com>
@mattklein123
Copy link
Copy Markdown
Member

I think if it's this hard to add test coverage for code that should never happen it's a good indication we shouldn't do it. :)

@soya3129
Copy link
Copy Markdown
Contributor Author

Maybe we can revert the change in onInvalidFrames() and keep the change in metadata since they have test coverage? Either way, I feel the unit test for conn manager is useful since no existing test covers resetAllStream() path. What do you think?

@mattklein123
Copy link
Copy Markdown
Member

Maybe we can revert the change in onInvalidFrames() and keep the change in metadata since they have test coverage? Either way, I feel the unit test for conn manager is useful since no existing test covers resetAllStream() path. What do you think?

Sure. I would keep any checks you can reasonably cover and remove those you can't.

/wait

Signed-off-by: Yang Song <yasong@google.com>
Signed-off-by: Yang Song <yasong@google.com>
@soya3129
Copy link
Copy Markdown
Contributor Author

Reverted the change in onInvalidFrame(), but added a warning message in case we hit the case. I will have to leave the e2e metadata test in #5656, since it requires some change in that PR. PTAL. Thanks!

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating on this. A few comments.

/wait

}

if (stream != nullptr) {
ENVOY_CONN_LOG(warn, "Referring the stream later may cause crash. See #6842. id: {}, error: {}",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I would do a warn here. If you can't test this, let's just do ASSERT(stream != nullptr); ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the stream can have a value. If after onInvalidFrame() returns error, the connection is closed, or stops processing, the crash won't happen.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then I don't think we should warn here if this can happen? This could spam the lags and I'm not sure who would look at this and how they would action on it?

Buffer::OwnedImpl fake_data("hello");
// ASSERT_DEATH doesn't work in MacOS.
#ifndef __APPLE__
ASSERT_DEATH(decoder->decodeData(fake_data, false), "");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we don't crash anymore and that was the point of this change with the tests? What's the reasoning behind having a test that doesn't pass on OSX? I'm pretty sure death tests work there, so there is there some other reason it doesn't work?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. This test is to show after nghttp2 callback functions return error, decoder is destroyed. If the connection manager keeps processing data, envoy can crash. The fix is to set decoder to be nullptr when callbacks return error. I agree the test doesn't cover the fix, and it just shows the root cause. I am OK to revert the test.

How about we merge the metadata change first (I will revert the crash fix in that PR as well as the metadata crash test), and after that, I will submit the crash fix and the crash test as a followup PR so it's easy to track what problem it fixes? Thanks!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@stale
Copy link
Copy Markdown

stale bot commented May 28, 2019

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label May 28, 2019
@stale
Copy link
Copy Markdown

stale bot commented Jun 4, 2019

This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot closed this Jun 4, 2019
alyssawilk pushed a commit that referenced this pull request Aug 1, 2019
Description: Add a test to cover the case where nghttp2 returns process error because metadata exceed size limit. The PR also adds a few asserts to verify when nghttp2 returns error, no more data is passed to filters (a bug fixed by#6852). This PR is a followup for #6842.
Risk Level: Low.
Testing: N/A
Docs Changes: N/A
Release Notes: N/A
Signed-off-by: Yang Song <yasong@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stale stalebot believes this issue/PR has not been touched recently waiting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants