rfc: optimize the draining process with connection_wait and relevant reporting systems#73586
Conversation
Force-pushed ebfffc5 to 0e980c9
otan
left a comment
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff and @ZhouXing19)
docs/RFCS/20211207_graceful_draining.md, line 24 at r1 (raw file):
Specifically, to close a server, in PostgreSQL’s case, customers are given the option to use the [“Smart Shutdown” mode](https://www.postgresql.org/docs/current/server-shutdown.html),
if these are getting long and unwieldy, you can do [text](link)
and at the bottom of the doc have something like
[link](https://www.postgresql.org/docs/current/server-shutdown.html)
see: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#links
docs/RFCS/20211207_graceful_draining.md, line 27 at r1 (raw file):
in which the server disallows new connections and waits until all sessions are closed by clients to shut down. While in CockroachDB’s case, the draining process consists of two consecutive periods, drain_wait and query_wait.
nit: for drain_wait and query_wait, use
`drain_wait`
`query_wait`
docs/RFCS/20211207_graceful_draining.md, line 89 at r1 (raw file):
similar problems again, but since they are encountered by a small number of customers, we don’t want to change too much the default draining workflow, and bring the risk of a longer draining process. Just aiming at giving more
nit, last sentence reads better as: Instead, we're aiming at giving the user more configuration options.
docs/RFCS/20211207_graceful_draining.md, line 139 at r1 (raw file):
during which the server waits for clients to close connections. New connections are not allowed, and any connections without query in-flight will be killed by the connection pool once their lifespans are
nit: replace "killed" with "expire the MaxLifeTime and be closed". it tries to gracefully exit (and wait for the existing txn / query to end), not aggressively die.
docs/RFCS/20211207_graceful_draining.md, line 141 at r1 (raw file):
in-flight will be killed by the connection pool once their lifespans are longer than the pool’s `MaxLifeTime`. I.e. **clients are not encouraged to start new queries within this period, and the connection pools are expected
replace "clients are not encouraged to start new queries" with "no more client queries will establish to the existing node"
docs/RFCS/20211207_graceful_draining.md, line 148 at r1 (raw file):
not be recycled until the user puts it back in the pool. The timeout can be set to be infinitely long. Clients should be notified that the existing connections that are not executing queries will be closed.
wait, are we notifying the user somehow? are we doing something more than a log line?
docs/RFCS/20211207_graceful_draining.md, line 221 at r1 (raw file):
remaining connections and close them by the connection_wait timeout.) For an infinite connection_wait, there are concerns that it may introduce
what value is the cluster setting set in this case?
i would rather we don't add the infinite connection time as an option unless we know people need it. at the same time, however, i think any limitation to the cluster setting time should be enforced by CC / some other mechanism than hardcoded in the database (i.e. a CC job that goes around all clusters / tenants and sets the wait to a lower value if we detect it is too high, send an email to the CC customer). OR we have a HIDDEN cluster setting which controls the maximum infinite connection time settable.
docs/RFCS/20211207_graceful_draining.md, line 237 at r1 (raw file):
- drain_wait: health?ready=1 starts to return error, and load balancer listening to this http for health check stops routing new connections to this draining node. But other new connections are still allowed.
nit: specify other connections are still allowed and existing connections still work.
docs/RFCS/20211207_graceful_draining.md, line 243 at r1 (raw file):
will be recycled. Connections not returned to the pool won’t be recycled. If all connections are closed by the user, it does an early exit before the timeout, and proceed to lease_transfer_wait.
nit: specify existing connections still work.
docs/RFCS/20211207_graceful_draining.md, line 246 at r1 (raw file):
- query_wait: Any connections without open queries in-flight will be closed. Any connection, once its query is finished, will be closed. If all
hmm, if the query is running infinitely long, do we have a "safeguard". i'd expect the query to be interrupted. but not important for this RFC.
docs/RFCS/20211207_graceful_draining.md, line 277 at r1 (raw file):
# Survey: configuration for C-Pool and Load Balancer We made [a survey](https://docs.google.com/forms/d/1_BWgza8n5iYBwVR7WSrKT4jL8RClE-PMoOmzlUWvnl8/edit?usp=sharing)
nit: make note this is an internal doc
docs/RFCS/20211207_graceful_draining.md, line 292 at r1 (raw file):
about how to customize the process based on their needs. If wrongly configured, the draining process can lead to existing issues, such as unexpected disconnection or uncommitted queries.
will CC clearly communicate the times that must be set to the user? i expect CC to have very hard limits.
Force-pushed 0e980c9 to e72a96a
knz
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @otan, and @ZhouXing19)
docs/RFCS/20211207_graceful_draining.md, line 221 at r1 (raw file):
i would rather we don't add the infinite connection time as an option unless we know people need it
👍
docs/RFCS/20211207_graceful_draining.md, line 9 at r2 (raw file):
# Summary This RFC proposes to optimize the draining process to be more legible for
nit: before this merges, ensure that you strip all the trailing spaces on every line.
(NB: you should configure your text editor to do so automatically)
docs/RFCS/20211207_graceful_draining.md, line 63 at r2 (raw file):
from a different node. The current workaround that CockroachDB provides is to set the
This paragraph repeats the previous one, but is slightly different. Which one should we read?
docs/RFCS/20211207_graceful_draining.md, line 100 at r2 (raw file):
and a demo recording can be found [here][demo recording current draining]. As shown in the demo, here are some facts about the current draining process:
Please do not refer solely to empirical experiments to describe our drain process.
The current implementation works as designed, and our understanding can be derived directly from the source code.
(I appreciate that you as the RFC author became acquainted with this process via experiments, however the team of folk who will read this RFC will want to get more confidence than can be given by an experiment. That evidence can be provided by making the RFC point at specific steps in the CockroachDB source code.)
I think there is space for an experiment to demonstrate the problem to solve (i.e. why the current implementation, which is working as designed, is not suitable for real-world workloads), but then the RFC should introduce that experiment with that purpose.
docs/RFCS/20211207_graceful_draining.md, line 158 at r2 (raw file):
declared as a `*settings.DurationSetting` as follows:
tip: use go after the triple backquotes, to get syntax highlighting.
Also, some indentation spaces would be welcome.
docs/RFCS/20211207_graceful_draining.md, line 176 at r2 (raw file):
function `func (s *Server) DrainConnections(ctx context.Context, connectionWait time.Duration)` at `/pkg/server/drain.go`.
Are you sure about that? Why can't all the logic be implemented in pgwire alongside the existing conn shutdown code?
docs/RFCS/20211207_graceful_draining.md, line 210 at r2 (raw file):
- **Notify customers of the connection status**: we propose to enhance three notification channels:
nit: wrong indentation, need two spaces before "we"
docs/RFCS/20211207_graceful_draining.md, line 213 at r2 (raw file):
- **CLI prompt**: if user start the draining the node using CLI (e.g. "ctrl-c" ), the interface will prompt out messages with current draining stage and
nit: parenthesis should be at end of previous line, without spaces.
docs/RFCS/20211207_graceful_draining.md, line 214 at r2 (raw file):
- **CLI prompt**: if user start the draining the node using CLI (e.g. "ctrl-c" ), the interface will prompt out messages with current draining stage and expected timeout. E.g. `connection_wait starts,
Generally it's important for UX to not refer to internal configuration variables inside error messages (which the end-user has no control over); instead favor phrasing that explains to the user what is going on in prose.
docs/RFCS/20211207_graceful_draining.md, line 231 at r2 (raw file):
eventually leading the upgrade to fail (as we need some timeout); this will frustrate the customer and maybe, even more, it will frustrate CRL eng including SRE who own patching fleet (instead of the customer)._ ” With
nit: erroneous _ here
docs/RFCS/20211207_graceful_draining.md, line 293 at r2 (raw file):
# Edge Cases - If the user calls the `Undrain()` method during draining, the server
What is this mechanism?
Currently, there's no mechanism in CockroachDB to abort a graceful shutdown. Adding such a mechanism is not in-scope for this RFC and would also be largely more complex than the proposal made here.
docs/RFCS/20211207_graceful_draining.md, line 306 at r2 (raw file):
# Questions not fully answered - How does this work in a multi-tenant deployment?
Good question - we do not properly implement graceful drain for multi-tenant SQL servers yet.
This need to be filed as an issue and pushed to the server team.
See this discussion thread: https://cockroachlabs.slack.com/archives/C02HWA24541/p1639406405152900
docs/RFCS/20211207_graceful_draining.md, line 312 at r2 (raw file):
- If certain nodes in the cluster are not updated to support this feature, it may lead to confusion. We may need a version gate for `server.shutdown.connection_wait`.
I don't think so. Presumably, when upgrading from a version that doesn't have the cluster setting yet, only the nodes in the old version need to get restarted. Naturally, because they're running at the previous version still, they won't support the new mechanism. There's not much we can do about that.
I encourage you to add a section in the RFC that explains how an operator would go about introducing that feature in their deployment during (after) an upgrade.
docs/RFCS/20211207_graceful_draining.md, line 315 at r2 (raw file):
- Is the result/usage of this change different for CC end-users than for on-prem deployments? How?
In fact, the RFC will need to talk about how our multitenant SQL proxy code will use/handle the new connection mode.
docs/RFCS/20211207_graceful_draining.md, line 321 at r2 (raw file):
how they will troubleshoot things? - More informative log messages of the whole draining process should be added, especially the monitoring of alive connections.
Very good point! That's a good answer, and so I wouldn't call it "not fully answered". You can move it to the technical section above.
docs/RFCS/20211207_graceful_draining.md, line 324 at r2 (raw file):
- Does the change impact performance? How? -It will likely increase the draining process’s duration since we now
nit: missing space after -
docs/RFCS/20211207_graceful_draining.md, line 338 at r2 (raw file):
an error in the implementation? - It will not affect the node and cluster stability as long as multiple nodes are not draining at the same time.
Can you explain this more? Is this a new failure mode introduced by the change of the RFC, or is that a situation that exists no matter what? If the latter, it doesn't need to be mentioned in the RFC.
docs/RFCS/20211207_graceful_draining.md, line 367 at r2 (raw file):
- Is the change affecting asynchronous / background subsystems? Are there new APIs, or API changes (either internal or external)? - We may add a new endpoint to allow users to query their draining status.
Can you explain this more? why is this useful?
docs/RFCS/20211207_graceful_draining.md, line 372 at r2 (raw file):
- Is the change visible to users of CockroachDB or operators who run CockroachDB clusters? - We may print the draining status to the cli os as well.
Can you explain this more? What is the "cli os"?
docs/RFCS/20211207_graceful_draining.md, line 372 at r2 (raw file):
- Is the change visible to users of CockroachDB or operators who run CockroachDB clusters? - We may print the draining status to the cli os as well.
I also think you should mention that the operators need to be educated about the change.
For example, the RFC could mention which documentation needs to be upgraded to explain the new mechanisms.
Talk to @taroface and @a-entin about which documentation should be changed.
Force-pushed e72a96a to 03fe82b
Force-pushed 03fe82b to 8be3031
ZhouXing19
left a comment
Dismissed @otan from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @a-entin, @joshimhoff, @knz, @otan, @taroface, and @ZhouXing19)
docs/RFCS/20211207_graceful_draining.md, line 141 at r1 (raw file):
Previously, otan (Oliver Tan) wrote…
replace "clients are not encouraged to start new queries" with "no more client queries will establish to the existing node"
Done.
docs/RFCS/20211207_graceful_draining.md, line 148 at r1 (raw file):
Previously, otan (Oliver Tan) wrote…
wait, are we notifying the user somehow? are we doing something more than a log line?
Ah, changed it to "Clients can notice from the log."
docs/RFCS/20211207_graceful_draining.md, line 221 at r1 (raw file):
Previously, knz (kena) wrote…
i would rather we don't add the infinite connection time as an option unless we know people need it
👍
Hmm maybe `--infinite-connection-wait` as a HIDDEN flag for the `cockroach node drain` CLI command? I.e. this flag won't be documented.
(The thing I need to verify is how the server triggers the draining process when it undergoes an upgrade. )
docs/RFCS/20211207_graceful_draining.md, line 246 at r1 (raw file):
Previously, otan (Oliver Tan) wrote…
hmm, if the query is running infinitely long, do we have a "safeguard". i'd expect the query to be interrupted. but not important for this RFC.
Ah if the server waits till query_wait times out, it will force proceeding to the next draining stage. So it won't wait infinitely for the query to finish. Added explanatory lines as well.
docs/RFCS/20211207_graceful_draining.md, line 292 at r1 (raw file):
Previously, otan (Oliver Tan) wrote…
will CC clearly communicate the times that must be set to the user? i expect CC to have very hard limits.
Could you explain more about "i expect CC to have very hard limits"? Would a tutorial about how to set `server.shutdown.{drain,connection,query}_wait` be sufficient?
docs/RFCS/20211207_graceful_draining.md, line 63 at r2 (raw file):
Previously, knz (kena) wrote…
This paragraph repeats the previous one, but is slightly different. Which one should we read?
Removed one of them, thanks!
docs/RFCS/20211207_graceful_draining.md, line 100 at r2 (raw file):
Previously, knz (kena) wrote…
Please do not refer solely to empirical experiments to describe our drain process.
The current implementation works as designed, and our understanding can be derived directly from the source code. (I appreciate that you as the RFC author became acquainted with this process via experiments, however the team of folk who will read this RFC will want to get more confidence than can be given by an experiment. That evidence can be provided by making the RFC point at specific steps in the CockroachDB source code.)
I think there is space for an experiment to demonstrate the problem to solve (i.e. why the current implementation, which is working as designed, is not suitable for real-world workloads), but then the RFC should introduce that experiment with that purpose.
Thanks! Moved the experiment to the end of this Background section as an auxiliary part.
docs/RFCS/20211207_graceful_draining.md, line 158 at r2 (raw file):
Previously, knz (kena) wrote…
tip: use `go` after the triple backquotes, to get syntax highlighting. Also, some indentation spaces would be welcome.
Done.
docs/RFCS/20211207_graceful_draining.md, line 176 at r2 (raw file):
Previously, knz (kena) wrote…
Are you sure about that? Why can't all the logic be implemented in `pgwire` alongside the existing conn shutdown code?
Yep moved them to pgwire.
docs/RFCS/20211207_graceful_draining.md, line 214 at r2 (raw file):
Previously, knz (kena) wrote…
Generally it's important for UX to not refer to internal configuration variables inside error messages (which the end-user has no control over); instead favor phrasing that explains to the user what is going on in prose.
Could you explain more what "explains to the user what is going on in prose" means?
Also I was thinking to let the CLI prompt show the value of `server.shutdown.{drain,connection,query}_wait`, and they are configurable by the user. So are they counted as "internal configuration variables inside error messages (which the end-user has no control over)"?
docs/RFCS/20211207_graceful_draining.md, line 293 at r2 (raw file):
Previously, knz (kena) wrote…
What is this mechanism?
Currently, there's no mechanism in CockroachDB to abort a graceful shutdown. Adding such a mechanism is not in-scope for this RFC and would also be largely more complex than the proposal made here.
I saw this `Undrain()` function here, but yeah I've removed this part.
docs/RFCS/20211207_graceful_draining.md, line 306 at r2 (raw file):
Previously, knz (kena) wrote…
Good question - we do not properly implement graceful drain for multi-tenant SQL servers yet.
This need to be filed as an issue and pushed to the server team.
See this discussion thread: https://cockroachlabs.slack.com/archives/C02HWA24541/p1639406405152900
Yep mentioning PR 74412 now. Thanks!
docs/RFCS/20211207_graceful_draining.md, line 312 at r2 (raw file):
Previously, knz (kena) wrote…
I don't think so. Presumably, when upgrading from a version that doesn't have the cluster setting yet, only the nodes in the old version need to get restarted. Naturally, because they're running at the previous version still, they won't support the new mechanism. There's not much we can do about that.
I encourage you to add a section in the RFC that explains how an operator would go about introducing that feature in their deployment during (after) an upgrade.
Not sure if I understand "how an operator would go about introducing that feature in their deployment during (after) an upgrade." correctly. Do you mean "how a user would set server.shutdown.{drain,connection,query}_wait after all nodes in a cluster are upgraded and support this new draining process"? If so, would the Doc changes section in this RFC be sufficient?
docs/RFCS/20211207_graceful_draining.md, line 321 at r2 (raw file):
Previously, knz (kena) wrote…
Very good point! That's a good answer, and so I wouldn't call it "not fully answered". You can move it to the technical section above.
Moved it to the technical design section.
docs/RFCS/20211207_graceful_draining.md, line 338 at r2 (raw file):
Previously, knz (kena) wrote…
Can you explain this more? Is this a new failure mode introduced by the change of the RFC, or is that a situation that exists no matter what? If the latter, it doesn't need to be mentioned in the RFC.
Ah I was referring to the RFC template when proposing these questions. But yeah this is a situation that exists no matter what. Removed it.
docs/RFCS/20211207_graceful_draining.md, line 367 at r2 (raw file):
Previously, knz (kena) wrote…
Can you explain this more? why is this useful?
I was thinking that "we may add a new endpoint to allow the user to query their current draining status so that they can take actions accordingly. e.g. The endpoint shows that query_wait is going to end in 3 minutes, the user may want to expedite or abort the ongoing queries to shorten the remaining draining duration." But since this info can be obtained from the log as well, an additional endpoint may not be necessary. Removed it for now.
docs/RFCS/20211207_graceful_draining.md, line 372 at r2 (raw file):
Previously, knz (kena) wrote…
Can you explain this more? What is the "cli os"?
Renamed it to CLI prompt.
docs/RFCS/20211207_graceful_draining.md, line 372 at r2 (raw file):
Previously, knz (kena) wrote…
I also think you should mention that the operators need to be educated about the change.
For example, the RFC could mention which documentation needs to be upgraded to explain the new mechanisms.Talk to @taroface and @a-entin about which documentation should be changed.
Yep added a section called Doc changes below, and yes I'll talk to them as well. Thanks!
knz
left a comment
I like the general gist of the proposal, and I think the proposed delay semantics will help the customers who need it.
I still think the RFC is a bit confused about how this will be reported to system administrators though. Let's polish this a bit. Maybe the polishing needs to happen in the context of the code change PR (is there one already?) so we can discuss over code.
Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @knz, @otan, and @ZhouXing19)
docs/RFCS/20211207_graceful_draining.md, line 221 at r1 (raw file):
Previously, ZhouXing19 (Jane Xing) wrote…
Hmm maybe `--infinite-connection-wait` as a HIDDEN flag for the `cockroach node drain` CLI command? I.e. this flag won't be documented.
(The thing I need to verify is how the server triggers the draining process when it undergoes an upgrade. )
You still have not motivated why offering the option for administrators to configure an infinite connection_wait is even desirable.
Why is that useful in the first place?
docs/RFCS/20211207_graceful_draining.md, line 214 at r2 (raw file):
Previously, ZhouXing19 (Jane Xing) wrote…
Could you explain more what is "explains to the user what is going on in prose."?
Also I was thinking to let the CLI prompt show the value of `server.shutdown.{drain,connection,query}_wait`, and they are configurable by the user. So are they counted as "internal configuration variables inside error messages (which the end-user has no control over)"?
When I reviewed the RFC the first time around, it was unclear where the messages were being printed. I had understood that they would be reported for the users of the SQL shell. This would make no sense: these users don't have control over cluster settings.
But now you have changed the text of the RFC, so it's become clearer.
Now I have two comments:
- I'd like to request that you familiarize yourself with the existing drain reporting machinery available inside CockroachDB. This is a semi-elaborate (albeit currently poorly documented) system that ensures that an ongoing server shutdown gets a progress report, with periodic messages, both in logs and the output of `node drain`.
- In light of the above, I'd like to hear your thoughts about whether the new information you propose here can be integrated in that system. If you think that a new solution is necessary, I'd like to hear why, and then I'd also like to hear how to replace the old mechanism by the new one. We don't want to maintain two reporting systems side-by-side.
docs/RFCS/20211207_graceful_draining.md, line 25 at r3 (raw file):
draining process consists of two consecutive periods, `drain_wait` and `query_wait`. Their duration is configured via two
nit: add an empty line before this one.
docs/RFCS/20211207_graceful_draining.md, line 36 at r3 (raw file):
`query_wait`, all connections will be forced to shut down, regardless of active queries. As part of CockroachDB’s draining process, the clients are not notified of
nit: add an empty line before this one.
docs/RFCS/20211207_graceful_draining.md, line 46 at r3 (raw file):
and the workaround setting for `drain_wait`. ![Current Draining Process][current draining pic] The current workaround that CockroachDB provides is to set the
nit: add an empty line before this one.
docs/RFCS/20211207_graceful_draining.md, line 55 at r3 (raw file):
connection pool would then recycle the connection and open a new connection, from a different node. The downsides of this workaround are:
nit: add an empty line before this one.
docs/RFCS/20211207_graceful_draining.md, line 62 at r3 (raw file):
connections are closed before the (true) draining starts, it brings a long downtime for this node. To solve these issues, this design proposes to optimize the waiting stages
nit: add an empty line before this one.
docs/RFCS/20211207_graceful_draining.md, line 65 at r3 (raw file):
during the draining process, and implement an early exit when all connections are closed by the client. Note that our ultimate goal is to provide clear guidance to avoid having
nit: add an empty line before this one.
docs/RFCS/20211207_graceful_draining.md, line 70 at r3 (raw file):
and bring the risk of a longer draining process. Instead, we're aiming at giving the user more configuration options. # Background
nit: add an empty line before this one.
Have you looked at your rendered RFC? https://github.com/ZhouXing19/cockroach/blob/rfc-draining/docs/RFCS/20211207_graceful_draining.md
All the paragraphs are running into each other.
It makes it super hard to read.
Throughout the remainder of this doc: add empty lines to separate paragraphs, and to separate paragraphs from titles.
docs/RFCS/20211207_graceful_draining.md, line 85 at r3 (raw file):
but new connections are **always** allowed to be established during the whole span of `drain_wait`. This is because the load balancer, which is used in a multi-node cluster, needs to make new connections and ping this
"This is because..."
Actually, this is not the reason. If anything, the implication would denounce a design flaw.
This paragraph would do good to separate HTTP connections from SQL connections.
- HTTP conns/requests remain possible throughout most of the drain process, regardless of the various timeouts discussed here. The continued availability of the HTTP server is indeed designed so that load balancers (and monitoring systems) can continue to receive information about the running process.
- SQL connections also remain possible during `drain_wait`, but for a very different reason. At a first approximation, it appears to not be a good idea. As the remainder of the RFC explains, that's actually counter-productive: clients potentially get the wrong impression that the server is still running, even though it's really shutting down. So why do we continue doing this? Why don't we stop accepting SQL connections as soon as the `drain_wait` period starts?

  Here the reason is different: it's because load balancers do not react to a change in the health status immediately. It may take a LB a few seconds to "notice" that the server has started shutting down. This delay is because the LB is only requesting health every few seconds, network delay, etc. If we were to stop accepting SQL conns right away, we'd introduce a risk for LBs to encounter a "closed door" before they notice the server is shutting down.

  However, nothing precludes us from stopping accepting SQL conns a few seconds after the start of `drain_wait`, even if `drain_wait` is set to a long value. (Not saying it's important for us to change this now - but there's no reason to keep accepting SQL conns for that long, unlike what your current explanation suggests.)
Code quote:
whole span of `drain_wait`. This is because the load balancer, which is
used in a multi-node cluster, needs to make new connections and ping this
endpoint to check the node health. Once the load balancer receives that error
docs/RFCS/20211207_graceful_draining.md, line 95 at r3 (raw file):
2. New connections and queries are not allowed from the start of `query_wait`. 3. Neither the command line interface ui nor the log file informs the user which waiting period they are at in the draining process.
This is factually false. The drain process reports messages periodically in the log files, and in the output of the node drain command. This informs the user of the progression.
If we think this progress report is not detailed enough, we can improve it. But saying "there's no information" is actually incorrect.
docs/RFCS/20211207_graceful_draining.md, line 100 at r3 (raw file):
5. If all connections become query-free at a timestamp within `query_wait`, the server performs an early exit. 6. If a user acquire a connection before `query_wait` starts, but send the
nit: sends
docs/RFCS/20211207_graceful_draining.md, line 117 at r3 (raw file):
New connections are not allowed, and any connections that are not pulled by the user and expire the `MaxLifeTime` will be closed by the connection pool. I.e. **no more client queries will establish to this draining node within this
"no more client connections will be allowed to the draining node ..."
docs/RFCS/20211207_graceful_draining.md, line 190 at r3 (raw file):
queries from the beginning of the draining process. E.g. log current alive connections / queries every 3 seconds when the draining starts. - **`health?ready=1` endpoint**: print the current draining stage when
A HTTP service does not "print" anything.
Do you mean that you're thinking about a new field inside the response payload that indicates the state of the server? That may be a good idea, but I don't see it motivated anywhere. How would that information be used? Are LB systems even able to consume that data and do anything useful with it?
docs/RFCS/20211207_graceful_draining.md, line 191 at r3 (raw file):
connections / queries every 3 seconds when the draining starts. - **`health?ready=1` endpoint**: print the current draining stage when users are calling this endpoint. This helps the user to take instant
What's the "user" in this sentence?
(There's no human on the other side of a HTTP health probe. This is exclusively used by programs in non-interactive systems.)
docs/RFCS/20211207_graceful_draining.md, line 216 at r3 (raw file):
this draining node. But new connections from other channels are still allowed, and existing connections still work. - `connection_wait`: No new connections are allowed. Existing connections
"No new SQL connections..."
(http conns / RPC conns still allowed)
docs/RFCS/20211207_graceful_draining.md, line 266 at r3 (raw file):
# Questions not fully answered - How does this work in a multi-tenant deployment? [PR 74412][PR 74412] is to support the drain protocol in a multi-tenant server
This is not a PR, it's an issue.
docs/RFCS/20211207_graceful_draining.md, line 310 at r3 (raw file):
- Is the change visible to users of CockroachDB or operators who run CockroachDB clusters? - We may print the draining status to the CLI prompt as well.
Which CLI prompt?
docs/RFCS/20211207_graceful_draining.md, line 323 at r3 (raw file):
# Docs changes Since we expect the user to approriately set the length of these three waiting
nit: de-indent the two spaces for this section.
docs/RFCS/20211207_graceful_draining.md, line 325 at r3 (raw file):
Since we expect the user to approriately set the length of these three waiting period during the draining process, sufficient tutorial is needed in the docments.
nit: documentation
docs/RFCS/20211207_graceful_draining.md, line 346 at r3 (raw file):
- We now set the server attribute `s.mu.draining` to disallow all new connections. Should we instead send a more connection-pool-dedicated message to stop new connections from a specific pool?
Can you explain this question more?
Force-pushed 8be3031 to 52bca9f
ZhouXing19
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cameronnunez, @joshimhoff, @knz, and @otan)
docs/RFCS/20211207_graceful_draining.md, line 221 at r1 (raw file):
Previously, knz (kena) wrote…
You still have not motivated why offering the option for administrators to configure an infinite connection_wait is even desirable.
Why is that useful in the first place?
Yep let's pause applying an infinite wait for now.
docs/RFCS/20211207_graceful_draining.md, line 214 at r2 (raw file):
Previously, knz (kena) wrote…
When I reviewed the RFC the first time around, it was unclear where the messages were being printed. I had understood that they would be reported for the users of the SQL shell. This would make no sense: these users don't have control over cluster settings.
But now you have changed the text of the RFC, so it's become clearer.
Now I have two comments:
I'd like to request that you familiarize yourself with the existing drain reporting machinery available inside CockroachDB. This is a semi-elaborate (albeit currently poorly documented) system that ensures that an ongoing server shutdown gets a progress report, with periodic messages, both in the logs and in the output of `node drain`.
In light of the above, I'd like to hear your thoughts about whether the new information you propose here can be integrated into that system. If you think that a new solution is necessary, I'd like to hear why, and then I'd also like to hear how to replace the old mechanism with the new one. We don't want to maintain two reporting systems side by side.
Thanks! Modified this part.
docs/RFCS/20211207_graceful_draining.md, line 315 at r2 (raw file):
Previously, knz (kena) wrote…
In fact, the RFC will need to talk about how our multitenant SQL proxy code will use/handle the new connection mode.
Mentioning SQL proxy now.
docs/RFCS/20211207_graceful_draining.md, line 85 at r3 (raw file):
Previously, knz (kena) wrote…
"This is because..."
Actually, this is not the reason. If anything, the implication would denounce a design flaw.
This paragraph would do good to separate HTTP connections from SQL connections.
HTTP conns/requests remain possible throughout most of the drain process, regardless of the various timeouts discussed here. The continued availability of the HTTP server is indeed design so that load balancers (and monitoring systems) can continue to receive information about the running process.
SQL connections also remain possible during `drain_wait`, but for a very different reason. At a first approximation, it appears to not be a good idea. As the remainder of the RFC explains, that's actually counter-productive: clients potentially get the wrong impression that the server is still running, even though it's really shutting down. So why do we continue doing this? Why don't we stop accepting SQL connections as soon as the `drain_wait` period starts?
Here the reason is different: it's because load balancers do not react to a change in the health status immediately. It may take an LB a few seconds to "notice" that the server has started shutting down. This delay is because the LB is only requesting health every few seconds, network delay, etc. If we were to stop accepting SQL conns right away, we'd introduce a risk for LBs to encounter a "closed door" before they notice the server is shutting down.
However, nothing precludes us from stopping accepting SQL conns a few seconds after the start of `drain_wait`, even if `drain_wait` is set to a long value. (Not saying it's important for us to change this now - but there's no reason to keep accepting SQL conns for that long, unlike what your current explanation suggests.)
Thanks for the explanation! Done rephrasing this part.
docs/RFCS/20211207_graceful_draining.md, line 95 at r3 (raw file):
Previously, knz (kena) wrote…
This is factually false. The drain process reports messages periodically in the log files, and in the output of the `node drain` command. This informs the user of the progression.
If we think this progress report is not detailed enough, we can improve it. But saying "there's no information" is actually incorrect.
Done.
docs/RFCS/20211207_graceful_draining.md, line 117 at r3 (raw file):
Previously, knz (kena) wrote…
"no more client connections will be allowed to the draining node ..."
Done.
docs/RFCS/20211207_graceful_draining.md, line 190 at r3 (raw file):
Previously, knz (kena) wrote…
A HTTP service does not "print" anything.
Do you mean that you're thinking about a new field inside the response payload that indicates the state of the server? That may be a good idea, but I don't see it motivated anywhere. How would that information be used? Are LB systems even able to consume that data and do anything useful with it?
Removed this section for the health?ready=1 endpoint
docs/RFCS/20211207_graceful_draining.md, line 191 at r3 (raw file):
Previously, knz (kena) wrote…
What's the "user" in this sentence?
(There's no human on the other side of a HTTP health probe. This is exclusively used by programs in non-interactive systems.)
Removed this section.
docs/RFCS/20211207_graceful_draining.md, line 216 at r3 (raw file):
Previously, knz (kena) wrote…
"No new SQL connections..."
(http conns / RPC conns still allowed)
Done.
docs/RFCS/20211207_graceful_draining.md, line 266 at r3 (raw file):
Previously, knz (kena) wrote…
This is not a PR, it's an issue.
Done.
docs/RFCS/20211207_graceful_draining.md, line 310 at r3 (raw file):
Previously, knz (kena) wrote…
Which CLI prompt?
Deleted this sentence.
docs/RFCS/20211207_graceful_draining.md, line 346 at r3 (raw file):
Previously, knz (kena) wrote…
Can you explain this question more?
Deleted this.
knz
left a comment
Reviewed 3 of 4 files at r1, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cameronnunez, @joshimhoff, @otan, and @ZhouXing19)
docs/RFCS/20211207_graceful_draining.md, line 158 at r2 (raw file):
Previously, ZhouXing19 (Jane Xing) wrote…
Done.
Without a space between the backquote and "go".
docs/RFCS/20211207_graceful_draining.md, line 312 at r2 (raw file):
Previously, ZhouXing19 (Jane Xing) wrote…
Not sure if I understand "how an operator would go about introducing that feature in their deployment during (after) an upgrade." correctly. Do you mean "how a user would set `server.shutdown.{drain,connection,query}_wait` after all nodes in a cluster are upgraded and support this new draining process"? If so, would the `Doc changes` section in this RFC be sufficient?
yes
docs/RFCS/20211207_graceful_draining.md, line 166 at r4 (raw file):
continue with the rest of the shutdown process before the end of connection_wait.”, 600*time.Second,
NB: It's customary to add a comment inside the source code that explains how the default value was chosen.
docs/RFCS/20211207_graceful_draining.md, line 225 at r4 (raw file):
eventually leading the upgrade to fail (as we need some timeout); this will frustrate the customer and maybe, even more, it will frustrate CRL eng including SRE who own patching fleet (instead of the customer).” With
nit: stray "
Force-pushed 52bca9f to 2d39498.
@knz thanks for the approval! I changed the default duration of
cc @rafiss
Force-pushed 2d39498 to 9f5d36a.
Note that here's a conversation on a buried commit but it's not shown on the timeline: 8be3031#commitcomment-63846504
Please copy-paste the current state of the conversation, and give the participants a prompt. We can't continue the convo on a buried/resolved link.
Forwarding conversation from here to this timeline: From @a-entin:
From @rafiss:
cc @knz
This is a proposal for optimizing the draining process to be more legible for customers, and introduce a new step, connection_wait, to the draining process, which allows the server to early exit when all connections are closed. Release note: None
cameronnunez
left a comment
I really like the idea of this proposal, good work!
However, I think this needs to rephrase the drain process.
I do not think the cluster setting names (drain_wait, query_wait, lease_transfer_wait) should be used to refer to the draining steps. These should be disentangled. This is not too much of an issue for the first two steps, the initial wait and the proposed wait for SQL connection closure, which are correctly characterized as waits, but I think the last two steps of the draining process – draining SQL connections and range leases – both comprise the heart of the process and are not best characterized as waiting periods.
As pointed out by @a-entin, these cluster settings work in different ways. For example, the lease and leadership transferral step is executed with an infinite loop, with the timeout of each iteration in that loop being lease_transfer_wait. It is because of this that this step should not be described as lease_transfer_wait, since this name refers to a setting that simply controls the maximum duration of an iteration (thereby only really affecting the frequency of progress messages). Similarly, I think the third step of the final design state of the drain process should not be referred to as the query_wait step but something more along the lines of the “server-side SQL connection closure” stage; I think this more aptly captures the key part of this step. Connections are actively closed as their queries finish up, and after the duration of query_wait (assuming there are remaining connections), all the connections are closed by the server.
Our current documentation is misleading and needs to be updated to reflect this information. Perhaps we should consider renaming these cluster settings (lease_transfer_timeout for instance) to prevent implications of similarity in purpose.
I think for all references to the draining steps in the RFC it would be best to either assign names to the steps or to simply use step numbers. I think this will avoid a lot of confusion. Personally, I think of the process as (1) the initial wait, (2) the proposed wait for client-side SQL connection closure, (3) the server-side SQL connection closure period, and finally (4) lease and leadership transferral.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @joshimhoff, @knz, @otan, and @ZhouXing19)
docs/RFCS/20211207_graceful_draining.md, line 53 at r6 (raw file):
![Current Draining Process][current draining pic] The current workaround that CockroachDB provides is to set the `server.shutdown.drain_wait` longer than the sum of the maximum lifetime of a
nit: "set the cluster setting server.shutdown.drain_wait longer..."
docs/RFCS/20211207_graceful_draining.md, line 64 at r6 (raw file):
The downsides of this workaround are: 1. The `server.shutdown.drain_wait` has to be adjusted based on clients’
nit: "The setting server.shutdown.drain_wait has to be adjusted..."
docs/RFCS/20211207_graceful_draining.md, line 172 at r6 (raw file):
The default duration length of `connection_wait` is set to be 0 second.
nit: "0 seconds"
docs/RFCS/20211207_graceful_draining.md, line 192 at r6 (raw file):
at `func (s *Server) ServeConn` in `pkg/sql/pgwire/server.go`. - **Early exit if all SQL connections are closed**: we let the server keeps listening to the registered SQL connections (`s.mu.connCancelMap`).
nit: "keep"
docs/RFCS/20211207_graceful_draining.md, line 216 at r6 (raw file):
currently active SQL connections and queries. This is for the case where the operator doesn't have access to the logs, but they want to query the draing progress through network. Note that both the HTTP API and RPC API
nit: "draining"
docs/RFCS/20211207_graceful_draining.md, line 249 at r6 (raw file):
`lease_transfer_wait` phase. In this case, all on-going queries will be left uncommitted. - `lease_transfer_wait`: the server waits to transfer range leases.
Needs a more accurate and lengthy description (and a more reflective step name). This step actively transfers all range leases and raft leaderships to other nodes. It completes only when all transfers are done.
docs/RFCS/20211207_graceful_draining.md, line 360 at r6 (raw file):
Specifically, we should add the following in [Cluster Settings Page][Cluster Setting Page]: - The CockroachDB server's draining procedures. I.e. the behavior of the
nit: "procedures i.e."
docs/RFCS/20211207_graceful_draining.md, line 363 at r6 (raw file):
system during each waiting period. We can refer to the `final state of this design` in this RFC. - How to tune the length of each waiting period when encounter errors in
Something along the lines of "How to correctly tune the cluster settings" would work better here. The end part is unnecessary since these guidelines are important to avoid errors to begin with.
I think this should spell out a bit more how these cluster settings differ as well (e.g. connection_wait is a timeout whereas drain_wait is a minimum duration, clarification that drain_wait is not synonymous with --drain_wait flag, etc.)
docs/RFCS/20211207_graceful_draining.md, line 365 at r6 (raw file):
- How to tune the length of each waiting period when encounter errors in draining. - `server.shutdown.drain_wait` should be longer than the load balancer's
"should be customized in coordination with the load balancer's settings" since there are more factors than just the health check interval, such as the number of tolerable unsuccessful health checks.
docs/RFCS/20211207_graceful_draining.md, line 369 at r6 (raw file):
This should be a bit more specific. Take a look at the following from the node shutdown doc written by @a-entin:
This setting should provide a sufficient time for all transactions in progress to complete. Therefore the server.shutdown.query_wait should be greater than any of the following:
- The longest possible transaction in the workload expected to complete successfully under normal operating conditions.
- The sql.defaults.idle_in_transaction_session_timeout cluster setting.
- The sql.defaults.statement_timeout cluster setting.
docs/RFCS/20211207_graceful_draining.md, line 372 at r6 (raw file):
longer `query_wait` can give guarantee that on-going queries will be committed before the server fully shutdown, but may lead to a longer draining process. We should also note that `query_wait` let the server
nit: "lets"
@cameronnunez Thanks for the comment!
I agree that So maybe we can keep the names of the first 3 stages as
It seems that:

```go
// We've seen all the replicas once. Now we're going to iterate
// until they're all gone, up to the configured timeout.
transferTimeout := leaseTransferWait.Get(&s.cfg.Settings.SV)

drainLeasesOp := "transfer range leases"
if err := contextutil.RunWithTimeout(ctx, drainLeasesOp, transferTimeout,
	func(ctx context.Context) error {
		opts := retry.Options{
			InitialBackoff: 10 * time.Millisecond,
			MaxBackoff:     time.Second,
			Multiplier:     2,
		}
		everySecond := log.Every(time.Second)
		var err error
		// Avoid retry.ForDuration because of https://github.com/cockroachdb/cockroach/issues/25091.
		for r := retry.StartWithCtx(ctx, opts); r.Next(); {
			err = nil
			if numRemaining := transferAllAway(ctx); numRemaining > 0 {
				// Report progress to the Drain RPC.
				if reporter != nil {
					reporter(numRemaining, "range lease iterations")
				}
				err = errors.Errorf("waiting for %d replicas to transfer their lease away", numRemaining)
				if everySecond.ShouldLog() {
					log.Infof(ctx, "%v", err)
				}
			}
			if err == nil {
				// All leases transferred. We can stop retrying.
				break
			}
		}
```

and here the But yes I agree that
Yes, step numbers are a good idea. I'll go with
You have to go higher in scope to find the infinite loop. The draining sequence does not begin here. A call to In
rafiss
left a comment
this looks great! my comments are all just suggestions
similar functionality as
[Postgres’s smart shutdown service][postgres_shutdown].
...
Note that in the current design, we do support an unlimited `connection_wait`

what did you end up deciding about enforcing a max allowed value for the new cluster setting? (line 224 mentioned that infinite won't be allowed)
like to reserve access to the node when the draining process started.
The possible solution is to add a gate in the connection register that
looks at a connection’s metadata (e.g. if it’s from admin or not) during
`drain_wait`.

this could even change to allow admins to log in during connection_wait also, but that is also just a possible future extension
declared as a `*settings.DurationSetting` as follows:

```go
connectionWait = settings.RegisterDurationSetting(
```

cluster settings now need to be marked as allowed or disallowed for multitenant. I think this one should be "TenantReadOnly" (but not tenant writable)
Force-pushed cf7973c to c87eaa4.
ZhouXing19
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @a-entin, @cameronnunez, @joshimhoff, @knz, @otan, @rafiss, and @ZhouXing19)
docs/RFCS/20211207_graceful_draining.md, line 158 at r2 (raw file):
Previously, knz (kena) wrote…
Without a space between the backquote and "go".
Done.
docs/RFCS/20211207_graceful_draining.md, line 160 at r6 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
cluster settings now need to be marked as allowed or disallowed for multitenant. I think this one should be "TenantReadOnly" (but not tenant writable)
Done.
docs/RFCS/20211207_graceful_draining.md, line 249 at r6 (raw file):
Previously, cameronnunez (Cameron Nunez) wrote…
Needs a more accurate and lengthy description (and more a reflective step name). This step actively transfers all range leases and raft leaderships to other nodes. It completes only when all transfers are done.
Thanks! Added more lines and opened an issue for the renaming.
docs/RFCS/20211207_graceful_draining.md, line 276 at r6 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
this could even change to allow admins to login during connection_wait also, but that is also just a possible future extension
Done adding it.
docs/RFCS/20211207_graceful_draining.md, line 363 at r6 (raw file):
Previously, cameronnunez (Cameron Nunez) wrote…
Something along the lines of "How to correctly tune the cluster settings" would work better here. The end part is unnecessary since these guidelines are important to avoid errors to begin with.
I think this should spell out a bit more how these cluster settings differ as well (e.g. `connection_wait` is a timeout whereas `drain_wait` is a minimum duration, clarification that `drain_wait` is not synonymous with the `--drain_wait` flag, etc.)
Done.
docs/RFCS/20211207_graceful_draining.md, line 365 at r6 (raw file):
Previously, cameronnunez (Cameron Nunez) wrote…
"should be customized in coordination with the load balancer's settings" since there are more factors than just the health check interval, such as the number of tolerable unsuccessful health checks.
Done.
docs/RFCS/20211207_graceful_draining.md, line 369 at r6 (raw file):
Previously, cameronnunez (Cameron Nunez) wrote…
This should be a bit more specific. Take a look at the following from the node shutdown doc written by @a-entin:
This setting should provide a sufficient time for all transactions in progress to complete. Therefore the server.shutdown.query_wait should be greater than any of the following:
- The longest possible transaction in the workload expected to complete successfully under normal operating conditions.
- The sql.defaults.idle_in_transaction_session_timeout cluster setting.
- The sql.defaults.statement_timeout cluster setting.
Done. Thanks!
docs/RFCS/20211207_graceful_draining.md, line 385 at r6 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
what did you end up deciding about enforcing a max allowed value for the new cluster setting? (line 224 mentioned that infinite won't be allowed)
Revised this part.
Force-pushed c87eaa4 to 1683650.
Force-pushed 1683650 to dc8a9e9.
Thanks!
73586: rfc: optimize the draining process with connection_wait and relevant reporting systems r=ZhouXing19 a=ZhouXing19

This is a proposal for optimizing the draining process to be more legible for customers, and introduce a new step, connection_wait, to the draining process, which allows the server to early exit when all connections are closed.

Release note: None

76410: kv: disallow GC requests that bump GC threshold and GC expired versions r=nvanbenschoten a=nvanbenschoten

Related to #55293. This commit adds a safeguard to GC requests that prevents them from bumping the GC threshold at the same time that they GC individual MVCC versions. This was found to be unsafe in #55293 because performing both of these actions at the same time could lead to a race where a read request is allowed to evaluate without error while also failing to see MVCC versions that are concurrently GCed.

This race is possible because foreground traffic consults the in-memory version of the GC threshold (`r.mu.state.GCThreshold`), which is updated after (in `handleGCThresholdResult`), not atomically with, the application of the GC request's WriteBatch to the LSM (in `ApplyToStateMachine`). This allows a read request to see the effect of a GC on MVCC state without seeing its effect on the in-memory GC threshold.

The latches acquired by GC requests look like they will help with this race, but in practice they do not, for two reasons:
1. The latches do not protect timestamps below the GC request's batch timestamp. This means that they only conflict with concurrent writes, but not all concurrent reads.
2. The read could be served off a follower, which could be applying the GC request's effect from the raft log. Latches held on the leaseholder would have no impact on a follower read.

Thankfully, the GC queue has split these two steps for the past few releases, at least since 87e85eb, so we do not have a bug today.

The commit also adds a test that reliably exercises the bug with a few well-placed calls to `time.Sleep`. The test contains a variant where the read is performed on the leaseholder and a variant where it is performed on a follower. Both fail by default. If we switch the GC request to acquire non-MVCC latches then the leaseholder variant passes, but the follower read variant still fails.

76417: ccl/sqlproxyccl: add connector component and support for session revival token r=JeffSwenson a=jaylim-crl

Informs #76000. Previously, all the connection establishment logic was coupled with the handler function within proxy_handler.go. This makes connecting to a new SQL pod during connection migration difficult. This commit refactors all of that connection logic out of the proxy handler into a connector component, as described in the connection migration RFC. At the same time, we also add support for the session revival token within this connector component. Note that the overall behavior of the SQL proxy should be unchanged with this commit.

Release note: None

76545: cmd/reduce: add -tlp option r=yuzefovich a=yuzefovich

**cmd/reduce: remove stdin option and require -file argument**

We tend to not use the option of passing input SQL via stdin, so this commit removes it. An additional argument in favor of doing that is that the follow-up commit will introduce another mode of behavior that requires the `-file` argument to be specified, so it's just cleaner to always require it now.

Release note: None

**cmd/reduce: add -tlp option**

This commit adds a `-tlp` boolean flag that changes the behavior of `reduce`. It is required that `-file` is specified whenever the `-tlp` flag is used. The behavior is such that the last two queries (delimited by empty lines) in the file contain unpartitioned and partitioned queries that return different results although they are equivalent.

If a TLP check is requested, then we remove the last two queries from the input, which we then use to construct a special TLP check query that results in an error if the two removed queries return different results. We do not just include the TLP check query in the input string because the reducer would then reduce the check query itself, making the reduction meaningless.

Release note: None

76598: server: use channel for DisableAutomaticVersionUpgrade r=RaduBerinde a=RaduBerinde

DisableAutomaticVersionUpgrade is an atomic integer which is rechecked in a retry loop. This is not a very clean mechanism, and can lead to issues where you're unknowingly dealing with a copy of the knobs and setting the wrong atomic. The retry loop can also add unnecessary delays in tests.

This commit changes DisableAutomaticVersionUpgrade from an atomic integer to a channel. If the channel is set, auto-upgrade waits until the channel is closed.

Release note: None

Co-authored-by: Jane Xing <zhouxing@uchicago.edu>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Jay <jay@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
Build failed (retrying...)
a-entin
left a comment
Reviewed 3 of 4 files at r1, 1 of 1 files at r9, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @a-entin, @cameronnunez, @joshimhoff, @knz, @otan, @rafiss, and @ZhouXing19)
Build succeeded