http: fixing a resumption bug in pipelining #8352

Merged
mattklein123 merged 9 commits into envoyproxy:master from alyssawilk:pipeline
Sep 26, 2019

Conversation

@alyssawilk (Contributor) commented Sep 24, 2019

The fundamental problem is the way we handle pipelining nowadays: we do a readDisable(true) when processing a given request and a readDisable(false) when we are done. If we've already read the next request, we won't ever resume. Kicking off readDisable(false) creates a fake event; onReadReady does a read, but if no further data is read, that doesn't result in a call to onRead(), so the buffered data is not passed up the stack to the codec.

We now track that "kick" and make sure we call onRead() when force-kicked, even if no new data is ready from the socket.
I'm intentionally leaving setReadBufferReady as-is, since it's supposed to kick off a socket read and does not currently need to interact with buffered data.

Risk Level: High (changes to network::connection)
Testing: new unit tests, integration test
Docs Changes: n/a
Release Notes: not added

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@mattklein123 (Member)

Yikes. @alyssawilk do we know when this regressed? I'm positive this worked at some point.

@alyssawilk (Contributor, Author)

I don't think this bug likely causes problems in the wild, because almost every client that pipelines requests:

  • sends a request
  • waits for a response
  • sends another request

This totally works for Envoy, because in that case we've enabled reads and we actually get a real kernel read rather than a synthetic "kick off an event" read.

Looking at the connection class, I'm not sure when it worked. When we did the half-close refactor two years ago, the "do not call onRead if there are no bytes read" behavior already existed. We've got codec-level tests "verifying" that the readEnable works, but because they used a mocked-out connection class they didn't catch the bug. I picked this up turning up more internal integration tests which (unlike most browsers) ship both requests together for exactly this case. Integration tests FTW! :-)

@mattklein123 (Member)

Interesting, great find. I swear this used to work as I remember writing the code to handle this case, but I'm probably not remembering correctly!

@mattklein123 (Member)

> Now if we get a FIN we send the data an extra time before closing the connection, as shown by the unit tests.

Sorry, could you expand on this? I'm having a hard time seeing the doubling. Wouldn't the first time generally drain? Or is the point that we didn't drain everything, and then on close we re-dispatch because there is remaining data?

> We could also latch "this was a faked event" and send up buffered data any time we try a read when there's a faked event?

This sounds like the right idea; agreed the other stuff is scary. I will look at this in more detail in the morning when I am fresher to see if I have any better ideas.

@alyssawilk (Contributor, Author)

> > Now if we get a FIN we send the data an extra time before closing the connection, as shown by the unit tests.
>
> Sorry, could you expand on this? I'm having a hard time seeing the doubling. Wouldn't the first time generally drain? Or is the point that we didn't drain everything, and then on close we re-dispatch because there is remaining data?

Yeah, the second is exactly it. EmptyReadOnCloseTest verifies that if we get a read event which is "close the connection" we don't do a spurious onData. I had to change that test to drain the buffer, because with this current patch, if we get the FIN we don't disambiguate between it being a real or fake kick, and so we send the buffered data up.

> > We could also latch "this was a faked event" and send up buffered data any time we try a read when there's a faked event?
>
> This sounds like the right idea; agreed the other stuff is scary. I will look at this in more detail in the morning when I am fresher to see if I have any better ideas.

SGTM, I'll plan on tackling that after you take a look with the extra context. It felt like a bit of a hacky one-off and I was hoping for a more elegant solution, but I'm happy to have the extra state to avoid the "onData on close" weirdness.

@mattklein123 (Member)

In looking at it more this is the code I was thinking of:

if (codec_->protocol() != Protocol::Http2) {
  if (read_callbacks_->connection().state() == Network::Connection::State::Open &&
      data.length() > 0 && streams_.empty()) {
    redispatch = true;
  }
}
I don't recall the history now, but yeah clearly that won't work in the case in which we finish the stream not during dispatch. Whatever we end up doing here though, I feel like we should remove ^ because it would now be redundant assuming readDisable() correctly redispatches, right?

So I guess then my next question would be: can we just have readDisable() inline dispatch data instead of scheduling a fake event at all? I'm guessing the answer is no because we might get into a situation in which we recursively call onData() (the reason the code above was there in the first place), but I will throw that out for consideration, as I think it would simplify things.

Assuming we can't do direct dispatch, I think your idea to keep track of the fake event and handle it differently is the right one. I don't think it would be too hard to implement. As an aside, if we add another bool state field, I might consider making all of those bools into a bit field at this point. There would probably be 64+ bytes of savings per-connection which is probably worthwhile.

WDYT?

@alyssawilk (Contributor, Author)

Yeah, I think we should be able to clean up the redispatch logic if this works cleanly (though I think I'd prefer to TODO it and land one dangerous PR at a time!)

I agree with avoiding a codec dispatch from onMessageComplete. It feels like asking for an opportunity to blow the stack if a local reply can happen inline, and while unwinding from multiple requests is probably safe if unwinding from one is, it would still make me twitchy.

I'll boolean and bitfield when I get a chance, unless someone has any better suggestions.

-    // gets processed regardless.
+    // gets processed regardless and ensure that we push it up via onRead.
     if (read_buffer_.length() > 0) {
       force_on_read_ = true;
@alyssawilk (Contributor, Author):

So we call this on the connection in two places, here and setReadBufferReady()

Looking at call sites of setReadBufferReady it looks like we're doing that where we actually want to resume reading from the socket (and don't want to cause a spurious onRead if there's no further data). I think I'm inclined to just leave this as-is, but if we think future users of the network connection might not fully read data and use setReadBufferReady to trigger the dispatch (rather than readEnable/Disable as we do here), we may want to add a boolean to force on read there as well.

}
}

// TODO(alyssawilk) clean this up after #8352 is well vetted.
@alyssawilk (Contributor, Author):

I also added this, but now I'm wondering whether we can land the connection fix as-is, or whether we have to clean this up inline or something weird will happen. I'll take a look tomorrow.

@alyssawilk (Contributor, Author)

Ok, I think separate clean up is fine now.
Assuming pipelined messages A, B, when we finish with message A we'll:

  • do the "kick",
  • redispatch
    -- if we consume all the data, the kick will not have buffered data and will (now) be a no-op
    -- if we don't consume all the data, I believe we'll still be processing message B and will readDisable, so I think either way we avoid extra dispatch

@mattklein123 (Member) left a comment:

Thanks this looks great! Small test question/comment.

/wait-any


// The HTTP/1 codec handles pipelined connections by relying on readDisable(false) resulting in the
// subsequent request being dispatched. Regression test this behavior.
TEST_P(ConnectionImplTest, ReadEnableDispatches) {
@mattklein123 (Member):

Is it possible to have an explicit test for the disconnect case to make sure we don't dispatch on close?

@alyssawilk (Contributor, Author):

We have one! If you check out a16cdee, I had to change TEST_P(ConnectionImplTest, EmptyReadOnCloseTest) to drain the buffer (or it got an extra dispatch), and added TEST_P(ConnectionImplTest, ReadEventIfBufferedDataOnClose) to show, with that draft (which I didn't like), how we got the event when we had buffered data.

Now that we have a fix which tests for an explicit kick, ConnectionImplTest.EmptyReadOnCloseTest shows that when we close while there's buffered data, we don't push it up.

The one exception is if we explicitly ask for a kick, and then do a read and get the FIN: we will push the data before we push the close. I think that's the correct behavior, as it means the behavior when we have buffered data and ask for dispatch matches when there's kernel-buffered data, which is to say we always pass up the data before closing.

@mattklein123 (Member):

OK, awesome, thanks for clarifying.

@mattklein123 (Member) left a comment:

Awesome!

@mattklein123 (Member)

/azp run envoy-macos

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mattklein123 mattklein123 merged commit b0ab9fb into envoyproxy:master Sep 26, 2019
danzh2010 pushed a commit to danzh2010/envoy that referenced this pull request Oct 4, 2019
@alyssawilk alyssawilk deleted the pipeline branch April 20, 2020 13:29