Skip to content

HTTP2 proactive GOAWAY on drain-start#16201

Closed
murray-stripe wants to merge 116 commits intoenvoyproxy:mainfrom
murray-stripe:murray/http2-drain-proactive-goaway
Closed

HTTP2 proactive GOAWAY on drain-start#16201
murray-stripe wants to merge 116 commits intoenvoyproxy:mainfrom
murray-stripe:murray/http2-drain-proactive-goaway

Conversation

@murray-stripe
Copy link
Copy Markdown
Contributor

@murray-stripe murray-stripe commented Apr 28, 2021

Signed-off-by: John Murray murray@stripe.com

Commit Message:
Change the default behavior of draining from a pull-based to a push-based model, proactively notifying interested parties when a drain sequence has begun. For low-volume listeners/connections, this allows for a graceful termination by proactively sending GOAWAY signals. This change resolves #14350.

This change relies the concept of "drain trees" (introduced in #17026), where the drain-manager exists as either a parent or child drain-manager. Allowing drain actions to cascade downward and also allowing for independent "sub-tree" draining (e.g. draining a specific Listener or specific FilterChain).

Additional Description:
Risk Level: moderate
Testing: Additional test coverage added for the HTTP ConnectionManagerImpl as well as the DrainManagerImpl. Many tests updated to reflect changes in behavior (push vs pull).
Docs Changes: Updated comments on DrainStrategy to convey some specifics of timing in how draining is initiated.
Release Notes:
Fixes #14350

<--
[Optional API Considerations:]
-->

Signed-off-by: John Murray <murray@stripe.com>
@murray-stripe murray-stripe changed the title wip [wip] HTTP2 proactive GOAWAY on drain-start Apr 28, 2021
Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: John Murray <murray@stripe.com>
Copy link
Copy Markdown
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improvements to the drain sequence. Here's some early feedback on your WIP changes.

Copy link
Copy Markdown
Contributor Author

@murray-stripe murray-stripe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antoniovicente Thank you for the review!! 😄

I've added replies to all of your comments to either add a follow-up action (I'll be working through these) or to answer a question or start a discussion.

Signed-off-by: John Murray <murray@stripe.com>
…truction

Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: John Murray <murray@stripe.com>
@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
envoyproxy/api-shepherds assignee is @lizan
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #16201 was synchronize by murray-stripe.

see: more, trace.

…proactive-goaway

Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: John Murray <murray@stripe.com>
@murray-stripe
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #16201 (comment) was created by @murray-stripe.

see: more, trace.

Signed-off-by: John Murray <murray@stripe.com>
@mattklein123
Copy link
Copy Markdown
Member

@murray-stripe sorry I've lost the thread on this. Are you ready to have this looked at again? Can you fix CI?

/wait

Signed-off-by: John Murray <murray@stripe.com>
@murray-stripe
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #16201 (comment) was created by @murray-stripe.

see: more, trace.

@murray-stripe
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #16201 (comment) was created by @murray-stripe.

see: more, trace.

@murray-stripe
Copy link
Copy Markdown
Contributor Author

murray-stripe commented Sep 28, 2021

/retest

@mattklein123 This change is ready for re-review. Just currently dealing with a test timeout that I can't repro locally and assuming it's a flake.

update - Seems it was a flake, feel free to proceed with a review.

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #16201 (comment) was created by @murray-stripe.

see: more, trace.

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks generally LGTM. Flushing out some more comments.

This change is sufficiently scary that I would appreciate a smoke test. Can you please deploy this change at Stripe in some test capacity to make sure there are no obvious crashes/issues?

/wait

enum DrainStrategy {
// Gradually discourage connections over the course of the drain period.
// Gradually discourage connections over the beginning course of the drain period, discouraging
// all connections by the time 25% of the drain-time has expired.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping on this. Can you make this comment more robust to explain how you came up with 25% and/or link to the arch overview docs if it is explained there.

additional consideration for draining. "Draining" is the state in which components may discourage
new connections and signal on existing connections the intent to terminate. "Completed Draining"
is the final state of the drain-process in which components will refuse new connections and
terminate any remaining connections (read more at :ref:`Draining<arch_overview_draining>`).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should this just be part of the other draining docs? Seems closely related? Just a section there?

for "gRPC config stream closed" is now reduced to debug when the status is ``Ok`` or has been
retriable (``DeadlineExceeded``, ``ResourceExhausted``, or ``Unavailable``) for less than 30
seconds.
* grpc: gRPC async client can be cached and shared accross filter instances in the same thread, this feature is turned off by default, can be turned on by setting runtime guard ``envoy.reloadable_features.enable_grpc_async_client_cache`` to true.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge issue, delete

seconds.
* grpc: gRPC async client can be cached and shared accross filter instances in the same thread, this feature is turned off by default, can be turned on by setting runtime guard ``envoy.reloadable_features.enable_grpc_async_client_cache`` to true.
* grpc: gRPC async client can be cached and shared across filter instances in the same thread, this feature is turned off by default, can be turned on by setting runtime guard ``envoy.reloadable_features.enable_grpc_async_client_cache`` to true.
* http: connection draining is now proactive and does not require traffic to trigger graceful draining. This
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ref link to arch overview / enum.

// Register callback for drain-close events.
if (use_proactive_draining_) {
start_drain_cb_ = drain_close_.addOnDrainCloseCb([this](std::chrono::milliseconds drain_delay) {
// de-register callback since we only want this to fire once
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: all comments start with capital, end with period, proper grammar, etc. please check the diff.

void ConnectionManagerImpl::createStartDrainTimer(std::chrono::milliseconds drain_delay) {
if (!codec_) {
stats_.named_.downstream_cx_drain_close_.inc();
doConnectionClose(Network::ConnectionCloseType::FlushWrite, absl::nullopt, "");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO handling this case isn't worth it. I would just let it get handled by the common code / spread out naturally.

start_drain_cb_.reset();

// create timer to _begin_ draining
stats_.named_.downstream_cx_drain_close_.inc();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a chance that this can get incremented twice? I see it's still incremented in the legacy path. If not, please add comments in both places explaining why.

// prevent any new streams.
connection_manager_.startDrainSequence();
connection_manager_.stats_.named_.downstream_cx_drain_close_.inc();
ENVOY_STREAM_LOG(debug, "drain closing connection", *this);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this removed but the stat wasn't? related to my comment above about possible double stat increments.

// Network::DrainDecision
// TODO(junr03): hook up draining to listener state management.
bool drainClose() const override { return false; }
Common::CallbackHandlePtr addOnDrainCloseCb(DrainCloseCb) const override { return nullptr; }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add more comments.

@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Oct 29, 2021
@murray-stripe
Copy link
Copy Markdown
Contributor Author

Haven't forgotten about this PR, but will have to circle back around in Q1 due to internal reprioritization in order to provide a sufficient "bake" in Stripe's environment. I have done some initial testing, but would like to do some more sustained testing to provide better assurance of the safety of the change.

@github-actions github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Nov 1, 2021
@github-actions
Copy link
Copy Markdown

github-actions bot commented Dec 1, 2021

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Dec 1, 2021
@github-actions
Copy link
Copy Markdown

github-actions bot commented Dec 8, 2021

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this Dec 8, 2021
@bernardoVale
Copy link
Copy Markdown

@murray-stripe any plans to continue this work? I'd be happy to this this in my environment too, we're dealing with the same problem

alyssawilk added a commit that referenced this pull request Dec 9, 2024
…reamble (#17026)" (#37465)

This reverts commit 96dd735.


96dd735
was a precursor to
#16201
which never landed.

Unwinding the change
envoyproxy/envoy-mobile#176

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
phlax pushed a commit to phlax/envoy-openssl that referenced this pull request Mar 7, 2025
…reamble (#17026)" (#37465)

This reverts commit 96dd735.


envoyproxy/envoy@96dd735
was a precursor to
envoyproxy/envoy#16201
which never landed.

Unwinding the change
envoyproxy/envoy-mobile#176

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api deps Approval required for changes to Envoy's external dependencies stale stalebot believes this issue/PR has not been touched recently waiting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HTTP 2 connection draining is non-graceful for low-volume listeners

8 participants