
[EventEngine] Windows Endpoint: optimize reads by chaining synchronous WSARecv operations #32563

Merged: drfloob merged 7 commits into grpc:master from drfloob:moar-bytes on Mar 14, 2023
Conversation

@drfloob (Member) commented Mar 8, 2023

Built on #32560

When calling EventEngine::Read, if a synchronous WSARecv call completes successfully and 1) the read buffer is not full, and 2) the stream remains open, then the endpoint will now chain execution of more synchronous WSARecvs. The chain is broken and the on_read callback is called when either there are errors, the next call would block, the buffer is full, or the stream is closed.

Something like this is helpful to prevent excessive read callback execution under a flood of tiny payloads, presuming messages are not being combined as one would usually expect (see //test/core/iomgr:endpoint_pair_test, and Nagle's algorithm).

@drfloob drfloob changed the title [WIP][EventEngine] Windows Endpoint: optimize reads by chaining synchronous WSARecv operations [EventEngine] Windows Endpoint: optimize reads by chaining synchronous WSARecv operations Mar 8, 2023
@drfloob drfloob marked this pull request as ready for review March 8, 2023 23:28
@drfloob drfloob requested review from ctiller and yijiem March 8, 2023 23:28
```cpp
SliceBuffer* buffer, const ReadArgs* /* args */) {
  GRPC_EVENT_ENGINE_ENDPOINT_TRACE("WindowsEndpoint::%p reading", this);
  if (io_state_->socket->IsShutdown()) {
    executor_->Run([on_read = std::move(on_read)]() mutable {
```
Member:
Could drop `mutable`?

Member Author (drfloob):
Moving the `AnyInvocable` requires it. The error with `mutable` removed:

```
src/core/lib/event_engine/windows/windows_endpoint.cc(132,7): error: no matching function for call to object of type 'const remove_reference_t<absl::AnyInvocable<void (absl::Status)> &>' (aka 'const absl::AnyInvocable<void (absl::Status)>')
      on_read(absl::UnavailableError("Socket is shutting down."));
```

@yijiem (Member) commented Mar 10, 2023:
TIL: it seems one cannot call a non-const method on a const `absl::AnyInvocable<>`. I guess something like `absl::AnyInvocable<void (absl::Status) const>` would also work, but that would change the EE interface and probably change the callback's meaning too.

```cpp
    });
    return false;
  }
  buffer->Clear();
```
Member:
Question: does this mean the caller of the endpoint should provide an empty SliceBuffer, but in some cases the buffer might not be empty (maybe this simplifies the caller?), hence the Clear here?

Member Author (drfloob):

That's right. Ideally the buffer should be empty. It would be great if any pre-existing slices in the buffer could be reused, but there's no such contract, and it may be the case that they are in use elsewhere and simply not cleaned out of the SliceBuffer.

That said, I would love to change the EventEngine::Read contract to say that any slices in the buffer are donated to the cause; it could save some allocations. But we'd want to investigate all call sites to see if there is any value in doing so, and we may introduce bugs if slices are left in unintentionally, so I don't want to attempt it yet.

```cpp
// Choose an appropriate size.
size_t min_read_size = kDefaultTargetReadSize;
if (buffer->Length() < min_read_size && buffer->Count() < kMaxWSABUFCount) {
  buffer->AppendIndexed(Slice(allocator_.MakeSlice(min_read_size)));
```
Member:
pedantic: should this be `allocator_.MakeSlice(min_read_size - buffer->Length())`?

Member Author (drfloob):

I'm not sure if it should, honestly. As currently written, the buffer size could end up being between `min_read_size` and `(2 * min_read_size) - 1`. If we only allocate up to `min_read_size`, we may end up doing more allocations over the life of the process because we've only allocated the minimum. It is a tradeoff between memory and CPU utilization, especially in applications with many open endpoints. I'd wager that there aren't many low-memory Windows systems that gRPC is running on, so I'd opt for a larger memory footprint here.

Member:

Gotcha, interesting. Thanks for the explanation!

```cpp
    io_state_->handle_read_event.ExecuteCallbackAndReset(status);
  });
}
return false;
```
Member:
No action required: it seems Read always returns false now, so this change makes the API async in all cases, which matches the spec:

```
/// \a buffer. If the read succeeds immediately, it returns true and the \a
/// on_read callback is not executed. Otherwise it returns false and the \a
```

Member Author (drfloob):

That's right. Other implementations needed this optimization, and the Windows EE can take advantage of it if we end up needing it. There aren't any Windows benchmarks that I'm aware of, so deciding when and what to optimize on the platform is a bit fuzzy. For this PR, the performance difference was stark and the improvement was necessary: one specific test timed out with the EE, whereas the iomgr version finished in < 2s.

@yijiem (Member) left a review:

Great optimization!

@drfloob (Member Author) commented Mar 14, 2023
Internal checks passed, merging.

@drfloob drfloob merged commit ae55fb0 into grpc:master Mar 14, 2023
@copybara-service bot added the `imported` label Mar 14, 2023
XuanWang-Amos pushed a commit to XuanWang-Amos/grpc that referenced this pull request May 1, 2023
wanlin31 pushed a commit that referenced this pull request May 18, 2023

Labels

bloat/none, imported (Specifies if the PR has been imported to the internal repository), lang/core, per-call-memory/neutral, per-channel-memory/neutral
