common: http2 codec avoids accessing decoder_ (ActiveStream) if nghttp2 returns failure by soya3129 · Pull Request #6842 · envoyproxy/envoy

soya3129 · 2019-05-07T19:29:35Z

Signed-off-by: Yang Song yasong@google.com

Description: When nghttp2 callback function fails, HCM calls resetAllStreams(here) to delete ActiveStreams. If a client keep pushing data, before the connection is closed, code_->dispatch() will keep processing data, which may refer to decoder_ (aka: ActiveStream). This PR checks if ActiveStream has been deleted before accessing. The problem is discovered in metadata change #5656
Risk Level: Low. No behavior change.
Testing: In #5656, test ProxyMultipleMetadataReachSizeLimit shows without the protection, the test can crash. The test will be submitted with #5656 instead of this PR since it needs the metadata change.

Signed-off-by: Yang Song <yasong@google.com>

alyssawilk · 2019-05-07T20:57:08Z

Thanks for splitting this out! I'll take a look tomorrow but in the meantime do we have coverage of the error paths?

soya3129 · 2019-05-07T21:10:45Z

The test ProxyMultipleMetadataReachSizeLimit in #5656 covers the error path. Since the test needs the metadata change, I didn't include it in the PR.

htuch · 2019-05-08T12:19:13Z

@soya3129 arguably #6852 is the fix here; HCM shouldn't be having any data pushed to it after the error.

alyssawilk · 2019-05-08T14:30:09Z

I'm not sure it's a case where the data is getting pushed from the connection to the codec, it could be that the codec got data including multiple frames. I think if we we hit this nghttp2 is processing after getting an error response from a callback and we should look into fixing that "upstream", but in the interim I'm more than happy to have our code provide defense in depth against nghttp2 bugs. WDYT?

soya3129 · 2019-05-08T15:43:22Z

Correction: the test covers the error path is RequestMetadataReachSizeLimit instead of ProxyMultipleMetadataReachSizeLimit.

The log from the test shows: after resetAllStreams is called, the connection still sets "read ready" and calls ConnectionImpl::onRead(). Does that mean the connection is still pushing data to HCM?

53 [2019-05-08 15:37:54.414][29][error][connection] [source/common/network/connection_impl.cc:489] [C7] read ready
54 [2019-05-08 15:37:54.414][29][error][connection] [source/common/network/connection_impl.cc:247] [C7] ConnectionImpl::onRead called
55 [2019-05-08 15:37:54.414][29][error][misc] [source/common/http/conn_manager_impl.cc:258] ConnectionManagerImpl::onData is called, HCM: 0x55f952a1a100
56 [2019-05-08 15:37:54.421][29][error][misc] [source/common/http/conn_manager_impl.cc:286] ConnectionManagerImpl::onData calls resetAllStreams, HCM: 0x55f952a1a100
57 [2019-05-08 15:37:54.422][29][error][connection] [source/common/network/connection_impl.cc:489] [C7] read ready
58 [2019-05-08 15:37:54.422][29][error][connection] [source/common/network/connection_impl.cc:247] [C7] ConnectionImpl::onRead called
59 [2019-05-08 15:37:54.422][29][error][misc] [source/common/http/conn_manager_impl.cc:258] ConnectionManagerImpl::onData is called, HCM: 0x55f952a1a100
60 [2019-05-08 15:37:54.422][29][critical][assert] [source/common/http/http2/codec_impl.cc:355] assert failure: false.

alyssawilk · 2019-05-08T15:46:56Z

Yeah, if that's the use case, then Harvey's fix will fix. Right after the resetAllStreams there's a connection close(FlushWrite) which currently doesn't disable reads but will in future.

I still prefer we do error handling in more places rather than only one place, so still think this PR is worth landing. I tagged @mattklein123 for a second opinion.

mattklein123 · 2019-05-08T23:05:19Z

I still prefer we do error handling in more places rather than only one place, so still think this PR is worth landing. I tagged @mattklein123 for a second opinion.

My general feeling is that we shouldn't add extra error handling if we should never hit it, it just makes the code harder to reason about. @soya3129 do you want to apply #6852 locally and make sure it fixes your issue and then we can discuss further?

soya3129 · 2019-05-09T14:46:24Z

I still prefer we do error handling in more places rather than only one place, so still think this PR is worth landing. I tagged @mattklein123 for a second opinion.

My general feeling is that we shouldn't add extra error handling if we should never hit it, it just makes the code harder to reason about. @soya3129 do you want to apply #6852 locally and make sure it fixes your issue and then we can discuss further?

Confirmed the latest version fixed the crash.

mattklein123 · 2019-05-09T16:44:24Z

My preference would be to close this then but if @alyssawilk @htuch feel strongly about it one way or the other will defer to them.

alyssawilk · 2019-05-09T17:13:57Z

Yeah, if you're not dead set against I'd prefer to have the checks in two places, especially given the debate of how to not make the Network::Connection fixes fragile.. I'd be fine with ASSERTS of !nullptr "this should never happen" if you think that makes it more clear?

mattklein123 · 2019-05-09T17:23:37Z

I would strongly prefer ASSERTs if you are OK with that.

alyssawilk

To be clear, I meant both nullptr checks and asserts. Doable?

alyssawilk · 2019-05-08T14:22:12Z

source/common/http/http2/codec_impl.cc

    return 0;
  }

+  if (stream != nullptr) {


Looks like we don't have a test for this one. Can we?

mattklein123 · 2019-05-09T17:34:45Z

To be clear, I meant both nullptr checks and asserts. Doable?

This would be a violation of the error handling guidelines that we use everywhere else, so I'm not sure why we feel this code is any different than anywhere else, and I'm not sure how we are going to test this logic to make sure that things still "work" if things are null. So I'm still against adding this code but I'm going to fight it if you feel strongly about it. @htuch any thoughts here?

htuch · 2019-05-09T18:39:46Z

@mattklein123 isn't it fine to have ASSERT(foo != nullptr) providing we're not immediately dereferencing foo in the next line? It's not a redundant check in this situation but a potentially useful statement of expected internal state.

mattklein123 · 2019-05-09T20:42:41Z

@mattklein123 isn't it fine to have ASSERT(foo != nullptr) providing we're not immediately dereferencing foo in the next line? It's not a redundant check in this situation but a potentially useful statement of expected internal state.

IMO the fact that we are asserting something can't happen, and then doing an if check in case it happens, is a bad code smell. It adds complexity for state that we are saying shouldn't ever happen, right? What if it does happen somehow, are we sure things will "work?" Shouldn't we crash?

alyssawilk · 2019-05-09T20:57:46Z

Given in this case we had a fuzzer which could repro this crash, to me it says that a determined attacker could create a global outage for companies running Envoy from when the bug was discovered until it was fixed.

I don't think we can afford to maintain a crash first, debug later mentality for Envoy. I think it's worth aiming for an intermediate "this should never happen and if it does it is a serious bug" without actually causing a global outage, which is why I keep pushing for defense in depth even if it looks like cruft.

mattklein123 · 2019-05-09T21:00:01Z

I don't think we can afford to maintain a crash first, debug later mentality for Envoy. I think it's worth aiming for an intermediate "this should never happen and if it does it is a serious bug" without actually causing a global outage, which is why I keep pushing for defense in depth even if it looks like cruft.

How do you know that with these if statements, things will actually work and not crash later? I don't see what this adds if we don't have tests that cover it. I also don't see how this is any different then any of the other existing bugs that we don't know about (I'm sure there are a bunch!).

Like I said, I don't feel this change adds any real value and adds complexity for no reason. But, if you all want to add it, I will stop there.

alyssawilk · 2019-05-09T21:02:49Z

I'm 100% fine adding tests which cover rogue network::connections, and make sure we don't propagate calls after encountering an nghttp2 error. The connection class today (well when #6852 lands) won't do this, but we can totally reproduce the buggy behavior in unit tests since the connection class can be forced to send spurious data there. I'm all for regression testing that the hardening is useful.

htuch · 2019-05-09T21:14:40Z

Maybe let's see how the #6852 tests play out. I agree we want to be pretty paranoid here, it's a question of whether the tests we add for #6852 are enough to convince that the H2 codec isn't going to see bad behavior.

Philosophically, I agree with both the points that we can't crash if avoidable and also that we shouldn't have code paths that should never be executed. I think being able to have a connection level release assert, which allows us to limit blast radius of asserts is kind of an interesting middle ground. It might look in implementation similar to what exists there today, but be more generally useful in other parts of HCM and the codecs.

mattklein123 · 2019-05-09T22:04:08Z

I think being able to have a connection level release assert, which allows us to limit blast radius of asserts is kind of an interesting middle ground. It might look in implementation similar to what exists there today, but be more generally useful in other parts of HCM and the codecs.

Interesting idea. Would love to explore this.

In general, as long as we have test coverage for the paranoid paths, no objection from me.

alyssawilk · 2019-05-13T17:35:50Z

Cool. Yang, can you add a test of the connection sending data after an error?

soya3129 · 2019-05-13T17:49:08Z

Sure!

Signed-off-by: Yang Song <yasong@google.com>

soya3129 · 2019-05-19T21:54:42Z

I added a unit test to cover the connection sending data after a general nghttp2 callback returns error. There is an end-to-end test in #5656 covers the possible crash when handling metadata (RequestMetadataReachSizeLimit), but I am really struggling with creating a similar end-to-end test for onInvalidFrame(). The reason is, envoy sets up frames for nghttp2 to create, and nghttp2 doesn’t create invalid frames. I also tried to corrupt the frames at connection level. It does trigger onInvalidFrames(), but it doesn’t hit the needed return value (here). It returns here. Another reason that makes this hard is, in the current test frame work, we can easily send headers, data and metadata. But we can't easily send setting, window_update, etc, and those frame types may be easier to generate the needed return value. We also can't invalidate headers, because we need a valid stream. That leaves us with only data frames to mess with. The corruption I tried can't produce the needed return value.

Another way is to introduce a module allowing test code to reach into the production code, and forces onInvalidFrames() to return the needed value only when the test is running. I think that may be a general useful feature. But I don't think it complies with envoy's style.

Please let me know if you have any suggestions/ideas to create an e2e test with onInvalidFrames(). But I think the unit test covers the crash case with general nghttp2 callback functions. What do you think?

Signed-off-by: Yang Song <yasong@google.com>

mattklein123 · 2019-05-20T14:03:26Z

I think if it's this hard to add test coverage for code that should never happen it's a good indication we shouldn't do it. :)

soya3129 · 2019-05-20T15:02:07Z

Maybe we can revert the change in onInvalidFrames() and keep the change in metadata since they have test coverage? Either way, I feel the unit test for conn manager is useful since no existing test covers resetAllStream() path. What do you think?

mattklein123 · 2019-05-20T21:56:58Z

Maybe we can revert the change in onInvalidFrames() and keep the change in metadata since they have test coverage? Either way, I feel the unit test for conn manager is useful since no existing test covers resetAllStream() path. What do you think?

Sure. I would keep any checks you can reasonably cover and remove those you can't.

/wait

Signed-off-by: Yang Song <yasong@google.com>

soya3129 · 2019-05-20T22:49:52Z

Reverted the change in onInvalidFrame(), but added a warning message in case we hit the case. I will have to leave the e2e metadata test in #5656, since it requires some change in that PR. PTAL. Thanks!

mattklein123

Thanks for iterating on this. A few comments.

/wait

mattklein123 · 2019-05-21T02:52:33Z

source/common/http/http2/codec_impl.cc

  }

+  if (stream != nullptr) {
+    ENVOY_CONN_LOG(warn, "Referring the stream later may cause crash. See #6842. id: {}, error: {}",


I don't think I would do a warn here. If you can't test this, let's just do ASSERT(stream != nullptr); ?

I think the stream can have a value. If after onInvalidFrame() returns error, the connection is closed, or stops processing, the crash won't happen.

But then I don't think we should warn here if this can happen? This could spam the lags and I'm not sure who would look at this and how they would action on it?

mattklein123 · 2019-05-21T02:55:20Z

test/common/http/conn_manager_impl_test.cc

+          Buffer::OwnedImpl fake_data("hello");
+      // ASSERT_DEATH doesn't work in MacOS.
+#ifndef __APPLE__
+          ASSERT_DEATH(decoder->decodeData(fake_data, false), "");


I thought we don't crash anymore and that was the point of this change with the tests? What's the reasoning behind having a test that doesn't pass on OSX? I'm pretty sure death tests work there, so there is there some other reason it doesn't work?

I see your point. This test is to show after nghttp2 callback functions return error, decoder is destroyed. If the connection manager keeps processing data, envoy can crash. The fix is to set decoder to be nullptr when callbacks return error. I agree the test doesn't cover the fix, and it just shows the root cause. I am OK to revert the test.

How about we merge the metadata change first (I will revert the crash fix in that PR as well as the metadata crash test), and after that, I will submit the crash fix and the crash test as a followup PR so it's easy to track what problem it fixes? Thanks!

Works for me!

Sounds good.

stale · 2019-05-28T20:02:55Z

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

stale · 2019-06-04T22:28:10Z

This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

Description: Add a test to cover the case where nghttp2 returns process error because metadata exceed size limit. The PR also adds a few asserts to verify when nghttp2 returns error, no more data is passed to filters (a bug fixed by#6852). This PR is a followup for #6842. Risk Level: Low. Testing: N/A Docs Changes: N/A Release Notes: N/A Signed-off-by: Yang Song <yasong@google.com>

avoid codec to access decoder_ (ActiveStream) if nghttp2 returns failure

52a666d

Signed-off-by: Yang Song <yasong@google.com>

soya3129 mentioned this pull request May 7, 2019

Common: Consume, proxy and insert metadata from downstream to upstream #5656

Merged

alyssawilk assigned mattklein123 and alyssawilk May 7, 2019

alyssawilk added the waiting:any label May 8, 2019

repokitteh-read-only bot removed the waiting:any label May 8, 2019

mattklein123 added the waiting:any label May 8, 2019

repokitteh-read-only bot removed the waiting:any label May 9, 2019

mattklein123 added the waiting:any label May 9, 2019

repokitteh-read-only bot removed the waiting:any label May 9, 2019

alyssawilk reviewed May 9, 2019

View reviewed changes

mattklein123 added the waiting label May 9, 2019

Yang Song added 2 commits May 19, 2019 16:05

add a test for the crashing case

f03cd0a

Signed-off-by: Yang Song <yasong@google.com>

format change

6235438

Signed-off-by: Yang Song <yasong@google.com>

repokitteh-read-only bot removed the waiting label May 19, 2019

merge master

8d1dd1e

Signed-off-by: Yang Song <yasong@google.com>

try to fix the macos failure

26b83b8

Signed-off-by: Yang Song <yasong@google.com>

repokitteh-read-only bot added the waiting label May 20, 2019

revert onInvalidFrame change and avoid assert_death if running with mac

92ff2b8

Signed-off-by: Yang Song <yasong@google.com>

repokitteh-read-only bot removed the waiting label May 20, 2019

fix format

b54205f

Signed-off-by: Yang Song <yasong@google.com>

mattklein123 requested changes May 21, 2019

View reviewed changes

repokitteh-read-only bot added the waiting label May 21, 2019

stale bot added the stale stalebot believes this issue/PR has not been touched recently label May 28, 2019

stale bot closed this Jun 4, 2019

soya3129 mentioned this pull request Jul 29, 2019

http: add a test case where metadata exceed size limit #7747

Merged

Conversation

soya3129 commented May 7, 2019

Uh oh!

alyssawilk commented May 7, 2019

Uh oh!

soya3129 commented May 7, 2019

Uh oh!

htuch commented May 8, 2019

Uh oh!

alyssawilk commented May 8, 2019

Uh oh!

soya3129 commented May 8, 2019

Uh oh!

alyssawilk commented May 8, 2019

Uh oh!

mattklein123 commented May 8, 2019

Uh oh!

soya3129 commented May 9, 2019

Uh oh!

mattklein123 commented May 9, 2019

Uh oh!

alyssawilk commented May 9, 2019

Uh oh!

mattklein123 commented May 9, 2019

Uh oh!

alyssawilk left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mattklein123 commented May 9, 2019

Uh oh!

htuch commented May 9, 2019

Uh oh!

mattklein123 commented May 9, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alyssawilk commented May 9, 2019

Uh oh!

mattklein123 commented May 9, 2019

Uh oh!

alyssawilk commented May 9, 2019

Uh oh!

htuch commented May 9, 2019

Uh oh!

mattklein123 commented May 9, 2019

Uh oh!

alyssawilk commented May 13, 2019

Uh oh!

soya3129 commented May 13, 2019

Uh oh!

soya3129 commented May 19, 2019

Uh oh!

mattklein123 commented May 20, 2019

Uh oh!

soya3129 commented May 20, 2019

Uh oh!

mattklein123 commented May 20, 2019

Uh oh!

soya3129 commented May 20, 2019

Uh oh!

mattklein123 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stale bot commented May 28, 2019

Uh oh!

mattklein123 commented May 9, 2019 •

edited

Loading