http: fixing a resumption bug in pipelining #8352

Merged
mattklein123 merged 9 commits into envoyproxy:master from alyssawilk:pipeline
Sep 26, 2019

Conversation

@alyssawilk (Contributor) commented Sep 24, 2019

The fundamental problem is the way we handle pipelining nowadays: we do a readDisable(true) when processing a given request and a readDisable(false) when we are done. If we've already read the next request, we won't ever resume. Kicking off readDisable(false) creates a fake event; onReadReady does a read, but if no further data is read, that doesn't result in a call to onRead(), so the buffered data is not passed up the stack to the codec.

We now track that "kick" and make sure we call onRead() when force-kicked, even if no new data is ready from the socket.
I'm intentionally leaving setReadBufferReady as-is, since it's supposed to kick off a socket read and does not currently need to interact with buffered data.

Risk Level: High (changes to network::connection)
Testing: new unit tests, integration test
Docs Changes: n/a
Release Notes: not added

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@mattklein123 (Member)

Yikes. @alyssawilk do we know when this regressed? I'm positive this worked at some point.

@alyssawilk (Contributor, Author)

I don't think this bug likely causes problems in the wild, because almost every client that pipelines requests:

  • sends a request
  • waits for a response
  • sends another request

This totally works for Envoy, because in that case we've enabled reads and we actually get a real kernel read rather than a synthetic "kick off an event" read.

Looking at the connection class, I'm not sure when it worked. When we did the half-close refactor two years ago, the "do not call onRead if there are no bytes read" behavior already existed. We've got codec-level tests "verifying" that the readEnable works, but because they used a mocked-out connection class they didn't catch the bug. I picked this up turning up more internal integration tests which (unlike most browsers) ship both requests together for exactly this case. Integration tests FTW! :-)

@mattklein123 (Member)

Interesting, great find. I swear this used to work as I remember writing the code to handle this case, but I'm probably not remembering correctly!

@mattklein123 (Member)

> Now if we get a FIN we send the data an extra time before closing the connection, as shown by the unit tests.

Sorry, could you expand on this? I'm having a hard time seeing the doubling. Wouldn't the first time generally drain? Or is the point that we didn't drain everything, and then on close we re-dispatch because there is remaining data?

> We could also latch "this was a faked event" and send up buffered data any time we try a read when there's a faked event?

This sounds like the right idea; agreed the other stuff is scary. I will look at this in more detail in the morning when I am fresher to see if I have any better ideas.

@alyssawilk (Contributor, Author)

> > Now if we get a FIN we send the data an extra time before closing the connection, as shown by the unit tests.
>
> Sorry, could you expand on this? I'm having a hard time seeing the doubling. Wouldn't the first time generally drain? Or is the point that we didn't drain everything, and then on close we re-dispatch because there is remaining data?

Yeah, the second is exactly it. EmptyReadOnCloseTest verifies that if we get a read event which is "close the connection" we don't do a spurious onData. I had to change that test to drain the buffer, because with this current patch, if we get the FIN we don't disambiguate between it being a real or fake kick, and so we send the buffered data up.

> > We could also latch "this was a faked event" and send up buffered data any time we try a read when there's a faked event?
>
> This sounds like the right idea; agreed the other stuff is scary. I will look at this in more detail in the morning when I am fresher to see if I have any better ideas.

SGTM, I'll plan on tackling that after you take a look with the extra context. It felt like a bit of a hacky one-off and I was hoping for a more elegant solution, but I'm happy to have the extra state to avoid the "onData on close" weirdness.

@mattklein123 (Member)

In looking at it more this is the code I was thinking of:

if (codec_->protocol() != Protocol::Http2) {
  if (read_callbacks_->connection().state() == Network::Connection::State::Open &&
      data.length() > 0 && streams_.empty()) {
    redispatch = true;
  }
}
I don't recall the history now, but yeah clearly that won't work in the case in which we finish the stream not during dispatch. Whatever we end up doing here though, I feel like we should remove ^ because it would now be redundant assuming readDisable() correctly redispatches, right?

So I guess then my next question would be: can we just have readDisable() inline dispatch data instead of scheduling a fake event at all? I'm guessing the answer is no because we might get into a situation in which we recursively call onData() (the reason the code above was there in the first place), but I will throw that out for consideration, as I think it would simplify things.

Assuming we can't do direct dispatch, I think your idea to keep track of the fake event and handle it differently is the right one. I don't think it would be too hard to implement. As an aside, if we add another bool state field, I might consider making all of those bools into a bit field at this point. There would probably be 64+ bytes of savings per-connection which is probably worthwhile.

WDYT?

@alyssawilk (Contributor, Author)

Yeah, I think we should be able to clean up the redispatch logic if this works cleanly (though I think I'd prefer to TODO it and land one dangerous PR at a time!)

I agree with avoiding a codec dispatch from onMessageComplete. It feels like asking for an opportunity to blow the stack if a local reply can happen inline, and while unwinding from multiple requests is probably safe if unwinding from one is, it would still make me twitchy.

I'll boolean and bitfield when I get a chance, unless someone has any better suggestions.

-    // gets processed regardless.
+    // gets processed regardless and ensure that we push it up via onRead.
     if (read_buffer_.length() > 0) {
       force_on_read_ = true;
@alyssawilk (Contributor, Author):

So we call this on the connection in two places, here and setReadBufferReady()

Looking at call sites of setReadBufferReady it looks like we're doing that where we actually want to resume reading from the socket (and don't want to cause a spurious onRead if there's no further data). I think I'm inclined to just leave this as-is, but if we think future users of the network connection might not fully read data and use setReadBufferReady to trigger the dispatch (rather than readEnable/Disable as we do here), we may want to add a boolean to force on read there as well.

}
}

// TODO(alyssawilk) clean this up after #8352 is well vetted.
@alyssawilk (Contributor, Author):

I also added this, but now I'm wondering whether we can land the connection fix as-is, or whether we have to clean this up inline or something weird will happen. I'll take a look tomorrow.

@alyssawilk (Contributor, Author)

Ok, I think separate clean up is fine now.
Assuming pipelined messages A, B, when we finish with message A we'll:

  • do the "kick",
  • redispatch
    -- if we consume all the data, the kick will not have buffered data and will (now) be a no-op
    -- if we don't consume all the data, I believe we'll still be processing message B and will readDisable, so I think either way we avoid extra dispatch

@mattklein123 (Member) left a comment:

Thanks this looks great! Small test question/comment.

/wait-any


// The HTTP/1 codec handles pipelined connections by relying on readDisable(false) resulting in the
// subsequent request being dispatched. Regression test this behavior.
TEST_P(ConnectionImplTest, ReadEnableDispatches) {
@mattklein123 (Member):

Is it possible to have an explicit test for the disconnect case to make sure we don't dispatch on close?

@alyssawilk (Contributor, Author):

We have one! If you check out a16cdee, I had to change TEST_P(ConnectionImplTest, EmptyReadOnCloseTest) to drain the buffer (or it got an extra dispatch), and added TEST_P(ConnectionImplTest, ReadEventIfBufferedDataOnClose) to show, with that draft (which I didn't like), how we got the event when we had buffered data.

Now that we have a fix which tests for an explicit kick, ConnectionImplTest.EmptyReadOnCloseTest shows that when we close while there's buffered data, we don't push it up.

The one exception is if we explicitly ask for a kick, and then do a read and get the FIN: we will push the data before we push the close. I think that's the correct behavior, as it means the behavior when we have buffered data and ask for dispatch matches when there's kernel-buffered data, which is to say we always pass up the data before closing.

@mattklein123 (Member):

OK, awesome, thanks for clarifying.

@mattklein123 (Member) left a comment:

Awesome!

@mattklein123 (Member)

/azp run envoy-macos

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mattklein123 mattklein123 merged commit b0ab9fb into envoyproxy:master Sep 26, 2019
danzh2010 pushed a commit to danzh2010/envoy that referenced this pull request Oct 4, 2019
@alyssawilk alyssawilk deleted the pipeline branch April 20, 2020 13:29