
rfc: optimize the draining process with connection_wait and relevant reporting systems #73586

Merged
craig[bot] merged 1 commit into cockroachdb:master from ZhouXing19:rfc-draining
Feb 16, 2022

Conversation

@ZhouXing19
Collaborator

This is a proposal for optimizing the draining process to be more legible for
customers. It introduces a new step, `connection_wait`, to the draining process,
which allows the server to exit early once all connections are closed.

Release note: None

@ZhouXing19 ZhouXing19 requested a review from a team as a code owner December 8, 2021 00:53
@cockroach-teamcity
Member

This change is Reviewable

@ZhouXing19 ZhouXing19 requested review from a team and joshimhoff December 8, 2021 01:17
Contributor

@otan otan left a comment


Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff and @ZhouXing19)


docs/RFCS/20211207_graceful_draining.md, line 24 at r1 (raw file):

Specifically, to close a server, in PostgreSQL’s case, customers are given 
the option to use the [“Smart Shutdown” mode](https://www.postgresql.org/docs/current/server-shutdown.html), 

if these are getting long and unwieldy, you can do [text](link)

and at the bottom of the doc have something like

[link](https://www.postgresql.org/docs/current/server-shutdown.html)

see: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#links


docs/RFCS/20211207_graceful_draining.md, line 27 at r1 (raw file):

in which the server disallows new connections and waits until all sessions 
are closed by clients to shut down.  While in CockroachDB’s case, the 
draining process consists of two consecutive periods, drain_wait and query_wait. 

nit: for drain_wait and query wait, use

`drain_wait`
`query_wait`

docs/RFCS/20211207_graceful_draining.md, line 89 at r1 (raw file):

similar problems again, but since they are encountered by a small number of 
customers, we don’t want to change too much the default draining workflow, 
and bring the risk of a longer draining process. Just aiming at giving more 

nit, last sentence reads better as: Instead, we're aiming at giving the user more configuration options.


docs/RFCS/20211207_graceful_draining.md, line 139 at r1 (raw file):

during which the server waits for clients to close connections. 
New connections are not allowed, and any connections without query 
in-flight will be killed by the connection pool once their lifespans are 

nit: replace "killed" with "expire the MaxLifeTime and be closed". it tries to gracefully exit (and wait for the existing txn / query to end), not aggressively die.


docs/RFCS/20211207_graceful_draining.md, line 141 at r1 (raw file):

in-flight will be killed by the connection pool once their lifespans are 
longer than the pool’s `MaxLifeTime`. I.e. **clients are not encouraged to 
start new queries within this period, and the connection pools are expected 

replace "clients are not encouraged to start new queries" with "no more client queries will establish to the existing node"


docs/RFCS/20211207_graceful_draining.md, line 148 at r1 (raw file):

not be recycled until the user puts it back in the pool. The timeout can be 
set to be infinitely long. Clients should be notified that the existing 
connections that are not executing queries will be closed.

wait, are we notifying the user somehow? are we doing something more than a log line?


docs/RFCS/20211207_graceful_draining.md, line 221 at r1 (raw file):

    remaining connections and close them by the connection_wait timeout.)

For an infinite connection_wait, there are concerns that it may introduce 

what value is the cluster setting set in this case?

i would rather we don't add the infinite connection time as an option unless we know people need it. at the same time, however, i think any limitation to the cluster setting time should be enforced by CC / some other mechanism than hardcoded in the database (i.e. a CC job that goes around all clusters / tenants and sets the wait to a lower value if we detect it is too high, send an email to the CC customer). OR we have a HIDDEN cluster setting which controls the maximum infinite connection time settable.


docs/RFCS/20211207_graceful_draining.md, line 237 at r1 (raw file):

- drain_wait: health?ready=1 starts to return error, and load balancer 
  listening to this http for health check stops routing new connections to 
  this draining node. But other new connections are still allowed.

nit: specify other connections are still allowed and existing connections still work.


docs/RFCS/20211207_graceful_draining.md, line 243 at r1 (raw file):

  will be recycled. Connections not returned to the pool won’t be recycled. 
  If all connections are closed by the user, it does an early exit before 
  the timeout, and proceed to lease_transfer_wait.

nit: specify existing connections still work.


docs/RFCS/20211207_graceful_draining.md, line 246 at r1 (raw file):

- query_wait: Any connections without open queries in-flight will be closed.
  Any connection, once its query is finished, will be closed. If all 

hmm, if the query is running infinitely long, do we have a "safeguard". i'd expect the query to be interrupted. but not important for this RFC.


docs/RFCS/20211207_graceful_draining.md, line 277 at r1 (raw file):

# Survey: configuration for C-Pool and Load Balancer
We made [a survey](https://docs.google.com/forms/d/1_BWgza8n5iYBwVR7WSrKT4jL8RClE-PMoOmzlUWvnl8/edit?usp=sharing) 

nit: make note this is an internal doc


docs/RFCS/20211207_graceful_draining.md, line 292 at r1 (raw file):

  about how to customize the process based on their needs. If wrongly 
  configured, the draining process can lead to existing issues, such as 
  unexpected disconnection or uncommitted queries.

will CC clearly communicate the times that must be set to the user? i expect CC to have very hard limits.

Contributor

@knz knz left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @otan, and @ZhouXing19)


docs/RFCS/20211207_graceful_draining.md, line 221 at r1 (raw file):

i would rather we don't add the infinite connection time as an option unless we know people need it

👍


docs/RFCS/20211207_graceful_draining.md, line 9 at r2 (raw file):

# Summary
This RFC proposes to optimize the draining process to be more legible for 

nit: before this merges, ensure that you strip all the trailing spaces on every line.

(NB: you should configure your text editor to do so automatically)


docs/RFCS/20211207_graceful_draining.md, line 63 at r2 (raw file):

from a different node. 

The current workaround that CockroachDB provides is to set the 

This paragraph repeats the previous one, but is slightly different. Which one should we read?


docs/RFCS/20211207_graceful_draining.md, line 100 at r2 (raw file):

and a demo recording can be found [here][demo recording current draining]. 

As shown in the demo, here are some facts about the current draining process:

Please do not refer solely to empirical experiments to describe our drain process.
The current implementation works as designed, and our understanding can be derived directly from the source code.

(I appreciate that you as the RFC author became acquainted with this process via experiments, however the team of folk who will read this RFC will want to get more confidence than can be given by an experiment. That evidence can be provided by making the RFC point at specific steps in the CockroachDB source code.)

I think there is space for an experiment to demonstrate the problem to solve (i.e. why the current implementation, which is working as designed, is not suitable for real-world workloads), but then the RFC should introduce that experiment with that purpose.


docs/RFCS/20211207_graceful_draining.md, line 158 at r2 (raw file):

declared as a `*settings.DurationSetting` as follows:

tip: use go after the triple backquotes, to get syntax highlighting.

Also, some indentation spaces would be welcome.


docs/RFCS/20211207_graceful_draining.md, line 176 at r2 (raw file):

function 
`func (s *Server) DrainConnections(ctx context.Context, connectionWait time.Duration`) 
at `/pkg/server/drain.go`. 

Are you sure about that? Why can't all the logic be implemented in pgwire alongside the existing conn shutdown code?


docs/RFCS/20211207_graceful_draining.md, line 210 at r2 (raw file):

- **Notify customers of the connection status**:

we propose to enhance three notification channels:

nit: wrong indentation, need two spaces before "we"


docs/RFCS/20211207_graceful_draining.md, line 213 at r2 (raw file):

  
  - **CLI prompt**: if user start the draining the node using CLI (e.g. "ctrl-c"
    ), the interface will prompt out messages with current draining stage and 

nit: parenthesis should be at end of previous line, without spaces.


docs/RFCS/20211207_graceful_draining.md, line 214 at r2 (raw file):

  - **CLI prompt**: if user start the draining the node using CLI (e.g. "ctrl-c"
    ), the interface will prompt out messages with current draining stage and 
    expected timeout. E.g. `connection_wait starts, 

Generally it's important for UX to not refer to internal configuration variables inside error messages (which the end-user has no control over); instead favor phrasing that explains to the user what is going on in prose.


docs/RFCS/20211207_graceful_draining.md, line 231 at r2 (raw file):

eventually leading the upgrade to fail (as we need some timeout); this will 
frustrate the customer and maybe, even more, it will frustrate CRL eng 
including SRE who own patching fleet (instead of the customer)._ ” With 

nit: erroneous _ here


docs/RFCS/20211207_graceful_draining.md, line 293 at r2 (raw file):

# Edge Cases
- If the user calls the `Undrain()` method during draining, the server 

What is this mechanism?
Currently, there's no mechanism in CockroachDB to abort a graceful shutdown. Adding such a mechanism is not in-scope for this RFC and would also be largely more complex than the proposal made here.


docs/RFCS/20211207_graceful_draining.md, line 306 at r2 (raw file):

# Questions not fully answered
- How does this work in a multi-tenant deployment?

Good question - we do not properly implement graceful drain for multi-tenant SQL servers yet.

This need to be filed as an issue and pushed to the server team.
See this discussion thread: https://cockroachlabs.slack.com/archives/C02HWA24541/p1639406405152900


docs/RFCS/20211207_graceful_draining.md, line 312 at r2 (raw file):

  - If certain nodes in the cluster are not updated to support this feature,
    it may lead to confusion. We may need a version gate for 
    `server.shutdown.connection_wait`.

I don't think so. Presumably, when upgrading from a version that doesn't have the cluster setting yet, only the nodes in the old version need to get restarted. Naturally, because they're running at the previous version still, they won't support the new mechanism. There's not much we can do about that.

I encourage you to add a section in the RFC that explains how an operator would go about introducing that feature in their deployment during (after) an upgrade.


docs/RFCS/20211207_graceful_draining.md, line 315 at r2 (raw file):

- Is the result/usage of this change different for CC end-users than for 
  on-prem deployments? How?

In fact, the RFC will need to talk about how our multitenant SQL proxy code will use/handle the new connection mode.


docs/RFCS/20211207_graceful_draining.md, line 321 at r2 (raw file):

  how they will troubleshoot things?
  - More informative log messages of the whole draining process should be 
    added, especially the monitoring of alive connections.

Very good point! That's a good answer, and so I wouldn't call it "not fully answered". You can move it to the technical section above.


docs/RFCS/20211207_graceful_draining.md, line 324 at r2 (raw file):

- Does the change impact performance? How?
  -It will likely increase the draining process’s duration since we now 

nit: missing space after -


docs/RFCS/20211207_graceful_draining.md, line 338 at r2 (raw file):

  an error in the implementation?
  - It will not affect the node and cluster stability as long as multiple 
    nodes are not draining at the same time.

Can you explain this more? Is this a new failure mode introduced by the change of the RFC, or is that a situation that exists no matter what? If the latter, it doesn't need to be mentioned in the RFC.


docs/RFCS/20211207_graceful_draining.md, line 367 at r2 (raw file):

- Is the change affecting asynchronous / background subsystems? Are there 
  new APIs, or API changes (either internal or external)?
  - We may add a new endpoint to allow users to query their draining status.

Can you explain this more? why is this useful?


docs/RFCS/20211207_graceful_draining.md, line 372 at r2 (raw file):

- Is the change visible to users of CockroachDB or operators who run 
  CockroachDB clusters?
  - We may print the draining status to the cli os as well.

Can you explain this more? What is the "cli os"?


docs/RFCS/20211207_graceful_draining.md, line 372 at r2 (raw file):

- Is the change visible to users of CockroachDB or operators who run 
  CockroachDB clusters?
  - We may print the draining status to the cli os as well.

I also think you should mention that the operators need to be educated about the change.
For example, the RFC could mention which documentation needs to be upgraded to explain the new mechanisms.

Talk to @taroface and @a-entin about which documentation should be changed.

Collaborator Author

@ZhouXing19 ZhouXing19 left a comment


Dismissed @otan from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @a-entin, @joshimhoff, @knz, @otan, @taroface, and @ZhouXing19)


docs/RFCS/20211207_graceful_draining.md, line 141 at r1 (raw file):

Previously, otan (Oliver Tan) wrote…

replace "clients are not encouraged to start new queries" with "no more client queries will establish to the existing node"

Done.


docs/RFCS/20211207_graceful_draining.md, line 148 at r1 (raw file):

Previously, otan (Oliver Tan) wrote…

wait, are we notifying the user somehow? are we doing something more than a log line?

Ah, changed it to "Clients can notice from the log".


docs/RFCS/20211207_graceful_draining.md, line 221 at r1 (raw file):

Previously, knz (kena) wrote…

i would rather we don't add the infinite connection time as an option unless we know people need it

👍

Hmm maybe ----infinite-connection-wait as a HIDDEN flag for the cockroach node drain CLI command? I.e. this flag won't be documented.
(The thing I need to verify is how the server triggers the draining process when it undergoes an upgrade.)


docs/RFCS/20211207_graceful_draining.md, line 246 at r1 (raw file):

Previously, otan (Oliver Tan) wrote…

hmm, if the query is running infinitely long, do we have a "safeguard". i'd expect the query to be interrupted. but not important for this RFC.

Ah if the server waits till query_wait times out, it will force proceeding to the next draining stage. So it won't wait infinitely for the query to finish. Added explanatory lines as well.


docs/RFCS/20211207_graceful_draining.md, line 292 at r1 (raw file):

Previously, otan (Oliver Tan) wrote…

will CC clearly communicate the times that must be set to the user? i expect CC to have very hard limits.

Could you explain more what "i expect CC to have very hard limits."? Would a tutorial about how to set server.shutdown.{drain,connection,query}_wait be sufficient?


docs/RFCS/20211207_graceful_draining.md, line 63 at r2 (raw file):

Previously, knz (kena) wrote…

This paragraph repeats the previous one, but is slightly different. Which one should we read?

Removed one of them, thanks!


docs/RFCS/20211207_graceful_draining.md, line 100 at r2 (raw file):

Previously, knz (kena) wrote…

Please do not refer solely to empirical experiments to describe our drain process.
The current implementation works as designed, and our understanding can be derived directly from the source code.

(I appreciate that you as the RFC author became acquainted with this process via experiments, however the team of folk who will read this RFC will want to get more confidence than can be given by an experiment. That evidence can be provided by making the RFC point at specific steps in the CockroachDB source code.)

I think there is space for an experiment to demonstrate the problem to solve (i.e. why the current implementation, which is working as designed, is not suitable for real-world workloads), but then the RFC should introduce that experiment with that purpose.

Thanks! Moved the experiment to the end of this Background section as an auxiliary part.


docs/RFCS/20211207_graceful_draining.md, line 158 at r2 (raw file):

Previously, knz (kena) wrote…

tip: use go after the triple backquotes, to get syntax highlighting.

Also, some indentation spaces would be welcome.

Done.


docs/RFCS/20211207_graceful_draining.md, line 176 at r2 (raw file):

Previously, knz (kena) wrote…

Are you sure about that? Why can't all the logic be implemented in pgwire alongside the existing conn shutdown code?

Yep moved them to pgwire.


docs/RFCS/20211207_graceful_draining.md, line 214 at r2 (raw file):

Previously, knz (kena) wrote…

Generally it's important for UX to not refer to internal configuration variables inside error messages (which the end-user has no control over); instead favor phrasing that explains to the user what is going on in prose.

Could you explain more what is "explains to the user what is going on in prose."?
Also I was thinking to let the CLI prompt show the value of server.shutdown.{drain,connection,query}_wait, and they are configurable by the user. So are they counted as "internal configuration variables inside error messages (which the end-user has no control over)"?


docs/RFCS/20211207_graceful_draining.md, line 293 at r2 (raw file):

Previously, knz (kena) wrote…

What is this mechanism?
Currently, there's no mechanism in CockroachDB to abort a graceful shutdown. Adding such a mechanism is not in-scope for this RFC and would also be largely more complex than the proposal made here.

I saw this Undrain() function here, but yeah I've removed this part.


docs/RFCS/20211207_graceful_draining.md, line 306 at r2 (raw file):

Previously, knz (kena) wrote…

Good question - we do not properly implement graceful drain for multi-tenant SQL servers yet.

This need to be filed as an issue and pushed to the server team.
See this discussion thread: https://cockroachlabs.slack.com/archives/C02HWA24541/p1639406405152900

Yep mentioning PR 74412 now. Thanks!


docs/RFCS/20211207_graceful_draining.md, line 312 at r2 (raw file):

Previously, knz (kena) wrote…

I don't think so. Presumably, when upgrading from a version that doesn't have the cluster setting yet, only the nodes in the old version need to get restarted. Naturally, because they're running at the previous version still, they won't support the new mechanism. There's not much we can do about that.

I encourage you to add a section in the RFC that explains how an operator would go about introducing that feature in their deployment during (after) an upgrade.

Not sure if I understand "how an operator would go about introducing that feature in their deployment during (after) an upgrade." correctly. Do you mean "how a user would set server.shutdown.{drain,connection,query}_wait after all nodes in a cluster are upgraded and support this new draining process"? If so, would the Doc changes section in this RFC be sufficient?


docs/RFCS/20211207_graceful_draining.md, line 321 at r2 (raw file):

Previously, knz (kena) wrote…

Very good point! That's a good answer, and so I wouldn't call it "not fully answered". You can move it to the technical section above.

Moved it to the technical design section.


docs/RFCS/20211207_graceful_draining.md, line 338 at r2 (raw file):

Previously, knz (kena) wrote…

Can you explain this more? Is this a new failure mode introduced by the change of the RFC, or is that a situation that exists no matter what? If the latter, it doesn't need to be mentioned in the RFC.

Ah I was referring to the RFC template when proposing these questions. But yeah this is a situation that exists no matter what. Removed it.


docs/RFCS/20211207_graceful_draining.md, line 367 at r2 (raw file):

Previously, knz (kena) wrote…

Can you explain this more? why is this useful?

I was thinking that "we may add a new endpoint to allow the user to query their current draining status so that they can take actions accordingly. e.g. The endpoint shows that query_wait is going to end in 3 minutes, the user may want to expedite or abort the ongoing queries to shorten the remaining draining duration." But since this info can be obtained from the log as well, an additional endpoint may not be necessary. Removed it for now.


docs/RFCS/20211207_graceful_draining.md, line 372 at r2 (raw file):

Previously, knz (kena) wrote…

Can you explain this more? What is the "cli os"?

Renamed it to CLI prompt.


docs/RFCS/20211207_graceful_draining.md, line 372 at r2 (raw file):

Previously, knz (kena) wrote…

I also think you should mention that the operators need to be educated about the change.
For example, the RFC could mention which documentation needs to be upgraded to explain the new mechanisms.

Talk to @taroface and @a-entin about which documentation should be changed.

Yep added a section called Doc changes below, and yes I'll talk to them as well. Thanks!

Contributor

@knz knz left a comment


I like the general gist of the proposal, and I think the proposed delay semantics will help the customers who need it.

I still think the RFC is a bit confused about how this will be reported to system administrators though. Let's polish this a bit. Maybe the polishing needs to happen in the context of the code change PR (is there one already?) so we can discuss over code.

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @knz, @otan, and @ZhouXing19)


docs/RFCS/20211207_graceful_draining.md, line 221 at r1 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

Hmm maybe ----infinite-connection-wait as a HIDDEN flag for the cockroach node drain CLI command? I.e. this flag won't be documented.
(The thing I need to verify is how the server triggers the draining process when it undergoes an upgrade. )

You still have not motivated why offering the option for administrators to configure an infinite connection_wait is even desirable.
Why is that useful in the first place?


docs/RFCS/20211207_graceful_draining.md, line 214 at r2 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

Could you explain more what is "explains to the user what is going on in prose."?
Also I was thinking to let the CLI prompt show the value of server.shutdown.{drain,connection,query}_wait, and they are configurable by the user. So are they counted as "internal configuration variables inside error messages (which the end-user has no control over)"?

When I reviewed the RFC the first time around, it was unclear where the messages were being printed. I had understood that they would be reported for the users of the SQL shell. This would make no sense: these users don't have control over cluster settings.

But now you have changed the text of the RFC, so it's become clearer.

Now I have two comments:

  1. I'd like to request that you familiarize yourself with the existing drain reporting machinery available inside CockroachDB. This is a semi-elaborate (albeit currently poorly documented) system that ensures that an ongoing server shutdown gets a progress report, with periodic messages, both in logs and the output of node drain.

  2. in light of the above, I'd like to hear your thoughts about whether the new information you propose here can be integrated in that system. If you think that a new solution is necessary, I'd like to hear why, and then I'd also like to hear how to replace the old mechanism by the new one. We don't want to maintain two reporting systems side-by-side.


docs/RFCS/20211207_graceful_draining.md, line 25 at r3 (raw file):

draining process consists of two consecutive periods, `drain_wait` and
`query_wait`.
Their duration is configured via two

nit: add an empty line before this one.


docs/RFCS/20211207_graceful_draining.md, line 36 at r3 (raw file):

`query_wait`, all connections will be forced to shut down, regardless of active
queries.
As part of CockroachDB’s draining process, the clients are not notified of

nit: add an empty line before this one.


docs/RFCS/20211207_graceful_draining.md, line 46 at r3 (raw file):

and the workaround setting for `drain_wait`.
![Current Draining Process][current draining pic]
The current workaround that CockroachDB provides is to set the

nit: add an empty line before this one.


docs/RFCS/20211207_graceful_draining.md, line 55 at r3 (raw file):

connection pool would then recycle the connection and open a new connection,
from a different node.
The downsides of this workaround are:

nit: add an empty line before this one.


docs/RFCS/20211207_graceful_draining.md, line 62 at r3 (raw file):

     connections are closed before the (true) draining starts, it brings a
     long downtime for this node.
To solve these issues, this design proposes to optimize the waiting stages

nit: add an empty line before this one.


docs/RFCS/20211207_graceful_draining.md, line 65 at r3 (raw file):

during the draining process, and implement an early exit when all
connections are closed by the client.
Note that our ultimate goal is to provide clear guidance to avoid having

nit: add an empty line before this one.


docs/RFCS/20211207_graceful_draining.md, line 70 at r3 (raw file):

and bring the risk of a longer draining process. Instead, we're aiming at giving
the user more configuration options.
# Background

nit: add an empty line before this one.

Have you looked at your rendered RFC? https://github.com/ZhouXing19/cockroach/blob/rfc-draining/docs/RFCS/20211207_graceful_draining.md
All the paragraphs are running into each other.

It makes it super hard to read.

Throughout the remainder of this doc: add empty lines to separate paragraphs, and to separate paragraphs from titles.


docs/RFCS/20211207_graceful_draining.md, line 85 at r3 (raw file):

  but new connections are **always** allowed to be established during the
  whole span of `drain_wait`. This is because the load balancer, which is
  used in a multi-node cluster, needs to make new connections and ping this

"This is because..."

Actually, this is not the reason. If anything, the implication would denounce a design flaw.

This paragraph would do good to separate HTTP connections from SQL connections.

  1. HTTP conns/requests remain possible throughout most of the drain process, regardless of the various timeouts discussed here. The continued availability of the HTTP server is indeed design so that load balancers (and monitoring systems) can continue to receive information about the running process.

  2. SQL connections also remain possible during drain_wait, but for a very different reason. At a first approximation, it appears to not be a good idea. As the remainder of the RFC explains, that's actually counter-productive: clients potentially get the wrong impression that the server is still running, even though it's really shutting down. So why do we continue doing this? Why don't we stop accepting SQL connections as soon as the drain_wait period starts?

    Here the reason is different: it's because load balancers do not react to a change in the health status immediately. It may take a LB a few seconds to "notice" that the server has started shutting down. This delay is because the LB is only requesting health every few seconds, network delay, etc. If we were to stop accepting SQL conns right away, we'd introduce a risk for LBs to encounter a "closed door" before they notice the server is shutting down.

    However, nothing precludes us from stopping accepting SQL conns a few seconds after the start of drain_wait, even if drain_wait is set to a long value. (Not saying it's important for us to change this now - but there's no reason to keep accepting SQL conns for that long, unlike what your current explanation suggests.)

Code quote:

  whole span of `drain_wait`. This is because the load balancer, which is
  used in a multi-node cluster, needs to make new connections and ping this
  endpoint to check the node health. Once the load balancer receives that error

docs/RFCS/20211207_graceful_draining.md, line 95 at r3 (raw file):

2. New connections and queries are not allowed from the start of `query_wait`.
3. Neither the command line interface ui nor the log file informs the user which
   waiting period they are at in the draining process.

This is factually false. The drain process reports messages periodically in the log files, and in the output of the node drain command. This informs the user of the progression.

If we think this progress report is not detailed enough, we can improve it. But saying "there's no information" is actually incorrect.


docs/RFCS/20211207_graceful_draining.md, line 100 at r3 (raw file):

5. If all connections become query-free at a timestamp within `query_wait`,
   the server performs an early exit.
6. If a user acquire a connection before `query_wait` starts, but send the

nit: sends


docs/RFCS/20211207_graceful_draining.md, line 117 at r3 (raw file):

New connections are not allowed, and any connections that are not pulled by the
user and expire the `MaxLifeTime` will be closed by the connection pool.
I.e. **no more client queries will establish to this draining node within this

"no more client connections will be allowed to the draining node ..."


docs/RFCS/20211207_graceful_draining.md, line 190 at r3 (raw file):

    queries from the beginning of the draining process. E.g. log current alive
    connections / queries every 3 seconds when the draining starts.
  - **`health?ready=1` endpoint**: print the current draining stage when

A HTTP service does not "print" anything.

Do you mean that you're thinking about a new field inside the response payload that indicates the state of the server? That may be a good idea, but I don't see it motivated anywhere. How would that information be used? Are LB systems even able to consume that data and do anything useful with it?


docs/RFCS/20211207_graceful_draining.md, line 191 at r3 (raw file):

    connections / queries every 3 seconds when the draining starts.
  - **`health?ready=1` endpoint**: print the current draining stage when
    users are calling this endpoint. This helps the user to take instant

What's the "user" in this sentence?

(There's no human on the other side of a HTTP health probe. This is exclusively used by programs in non-interactive systems.)


docs/RFCS/20211207_graceful_draining.md, line 216 at r3 (raw file):

  this draining node. But new connections from other channels are still allowed,
  and existing connections still work.
- `connection_wait`: No new connections are allowed. Existing connections

"No new SQL connections..."

(http conns / RPC conns still allowed)


docs/RFCS/20211207_graceful_draining.md, line 266 at r3 (raw file):

# Questions not fully answered
- How does this work in a multi-tenant deployment?
  [PR 74412][PR 74412] is to support the drain protocol in a multi-tenant server

This is not a PR, it's an issue.


docs/RFCS/20211207_graceful_draining.md, line 310 at r3 (raw file):

- Is the change visible to users of CockroachDB or operators who run
  CockroachDB clusters?
  - We may print the draining status to the CLI prompt as well.

Which CLI prompt?


docs/RFCS/20211207_graceful_draining.md, line 323 at r3 (raw file):

# Docs changes
  Since we expect the user to approriately set the length of these three waiting

nit: de-indent the two spaces for this section.


docs/RFCS/20211207_graceful_draining.md, line 325 at r3 (raw file):

  Since we expect the user to approriately set the length of these three waiting
  period during the draining process, sufficient tutorial is needed in the
  docments.

nit: documentation


docs/RFCS/20211207_graceful_draining.md, line 346 at r3 (raw file):

- We now set the server attribute `s.mu.draining` to disallow all new
  connections. Should we instead send a more connection-pool-dedicated
  message to stop new connections from a specific pool?

Can you explain this question more?

Collaborator Author

@ZhouXing19 ZhouXing19 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cameronnunez, @joshimhoff, @knz, and @otan)


docs/RFCS/20211207_graceful_draining.md, line 221 at r1 (raw file):

Previously, knz (kena) wrote…

You still have not motivated why offering the option for administrators to configure an infinite connection_wait is even desirable.
Why is that useful in the first place?

Yep let's pause applying an infinite wait for now.


docs/RFCS/20211207_graceful_draining.md, line 214 at r2 (raw file):

Previously, knz (kena) wrote…

When I reviewed the RFC the first time around, it was unclear where the messages were being printed. I had understood that they would be reported for the users of the SQL shell. This would make no sense: these users don't have control over cluster settings.

But now you have changed the text of the RFC, so it's become clearer.

Now I have two comments:

  1. I'd like to request that you familiarize yourself with the existing drain reporting machinery available inside CockroachDB. This is a semi-elaborate (albeit currently poorly documented) system that ensures that an ongoing server shutdown gets a progress report, with periodic messages, both in logs and the output of node drain.

  2. in light of the above, I'd like to hear your thoughts about whether the new information you propose here can be integrated in that system. If you think that a new solution is necessary, I'd like to hear why, and then I'd also like to hear how to replace the old mechanism by the new one. We don't want to maintain two reporting systems side-by-side.

Thanks! Modified this part.


docs/RFCS/20211207_graceful_draining.md, line 315 at r2 (raw file):

Previously, knz (kena) wrote…

In fact, the RFC will need to talk about how our multitenant SQL proxy code will use/handle the new connection mode.

Mentioning SQL proxy now.


docs/RFCS/20211207_graceful_draining.md, line 85 at r3 (raw file):

Previously, knz (kena) wrote…

"This is because..."

Actually, this is not the reason. If anything, the implication would denounce a design flaw.

This paragraph would do good to separate HTTP connections from SQL connections.

  1. HTTP conns/requests remain possible throughout most of the drain process, regardless of the various timeouts discussed here. The continued availability of the HTTP server is indeed designed so that load balancers (and monitoring systems) can continue to receive information about the running process.

  2. SQL connections also remain possible during drain_wait, but for a very different reason. At a first approximation, it appears to not be a good idea. As the remainder of the RFC explains, that's actually counter-productive: clients potentially get the wrong impression that the server is still running, even though it's really shutting down. So why do we continue doing this? Why don't we stop accepting SQL connections as soon as the drain_wait period starts?

    Here the reason is different: it's because load balancers do not react to a change in the health status immediately. It may take a LB a few seconds to "notice" that the server has started shutting down. This delay is because the LB is only requesting health every few seconds, network delay, etc. If we were to stop accepting SQL conns right away, we'd introduce a risk for LBs to encounter a "closed door" before they notice the server is shutting down.

    However, nothing precludes us from stopping accepting SQL conns a few seconds after the start of drain_wait, even if drain_wait is set to a long value. (Not saying it's important for us to change this now - but there's no reason to keep accepting SQL conns for that long, unlike what your current explanation suggests.)

Thanks for the explanation! Done rephrasing this part.


docs/RFCS/20211207_graceful_draining.md, line 95 at r3 (raw file):

Previously, knz (kena) wrote…

This is factually false. The drain process reports messages periodically in the log files, and in the output of the node drain command. This informs the user of the progression.

If we think this progress report is not detailed enough, we can improve it. But saying "there's no information" is actually incorrect.

Done.


docs/RFCS/20211207_graceful_draining.md, line 117 at r3 (raw file):

Previously, knz (kena) wrote…

"no more client connections will be allowed to the draining node ..."

Done.


docs/RFCS/20211207_graceful_draining.md, line 190 at r3 (raw file):

Previously, knz (kena) wrote…

A HTTP service does not "print" anything.

Do you mean that you're thinking about a new field inside the response payload that indicates the state of the server? That may be a good idea, but I don't see it motivated anywhere. How would that information be used? Are LB systems even able to consume that data and do anything useful with it?

Removed this section for the health?ready=1 endpoint


docs/RFCS/20211207_graceful_draining.md, line 191 at r3 (raw file):

Previously, knz (kena) wrote…

What's the "user" in this sentence?

(There's no human on the other side of a HTTP health probe. This is exclusively used by programs in non-interactive systems.)

Removed this section.


docs/RFCS/20211207_graceful_draining.md, line 216 at r3 (raw file):

Previously, knz (kena) wrote…

"No new SQL connections..."

(http conns / RPC conns still allowed)

Done.


docs/RFCS/20211207_graceful_draining.md, line 266 at r3 (raw file):

Previously, knz (kena) wrote…

This is not a PR, it's an issue.

Done.


docs/RFCS/20211207_graceful_draining.md, line 310 at r3 (raw file):

Previously, knz (kena) wrote…

Which CLI prompt?

Deleted this sentence.


docs/RFCS/20211207_graceful_draining.md, line 346 at r3 (raw file):

Previously, knz (kena) wrote…

Can you explain this question more?

Deleted this.

Contributor

@knz knz left a comment

:lgtm:

Reviewed 3 of 4 files at r1, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cameronnunez, @joshimhoff, @otan, and @ZhouXing19)


docs/RFCS/20211207_graceful_draining.md, line 158 at r2 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

Done.

Without a space between the backquote and "go".


docs/RFCS/20211207_graceful_draining.md, line 312 at r2 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

Not sure if I understand "how an operator would go about introducing that feature in their deployment during (after) an upgrade." correctly. Do you mean "how a user would set server.shutdown.{drain,connection,query}_wait after all nodes in a cluster are upgraded and support this new draining process"? If so, would the Doc changes section in this RFC be sufficient?

yes


docs/RFCS/20211207_graceful_draining.md, line 166 at r4 (raw file):

    continue with the rest of the shutdown process before the end of
    connection_wait.”,
    600*time.Second,

NB: It's customary to add a comment inside the source code that explains how the default value was chosen.


docs/RFCS/20211207_graceful_draining.md, line 225 at r4 (raw file):

eventually leading the upgrade to fail (as we need some timeout); this will
frustrate the customer and maybe, even more, it will frustrate CRL eng
including SRE who own patching fleet (instead of the customer).” With

nit: stray "

@ZhouXing19
Collaborator Author

ZhouXing19 commented Jan 18, 2022

@knz thanks for the approval!

I changed the default duration of connection_wait back to 0 seconds. This is because this workaround is dedicated to a small number of users who encountered an error during draining, while adding a default waiting period that's longer than 0s will potentially make the draining process longer for all users.

cc @rafiss

@ZhouXing19
Collaborator Author

Note that here's a conversation on a buried commit but it's not shown on the timeline: 8be3031#commitcomment-63846504

cc @rafiss @a-entin @knz

@knz
Contributor

knz commented Jan 18, 2022

Note that here's a conversation on a buried commit

Please copy-paste the current state of the conversation, and give the participants a prompt. We can't continue the convo on a buried/resolved link.

@ZhouXing19
Collaborator Author

ZhouXing19 commented Jan 18, 2022

Forwarding conversation from here to this timeline:

From @a-entin:

Here is the recipe for uninterrupted routine maintenance we (cs) provide to customers: https://github.com/cockroachlabs/cockroachdb-runbook-template/blob/main/routine-maintenance/node-shutdown.md.

TLDR; The current product allows an app to implement uninterrupted node shutdown (drain) handling. There are 3 distinct periods (the 2 periods mentioned in the "motivation" are just a prelude to the actual drain). Adding yet another, 4th, period with its own config parameter may confuse users. And potentially add an extra step overwriting a seemingly disruptive connection_wait unlimited default (set it to 0 to disable). It appears that the new period/config setting is purposely implemented to support a specific workaround relying on a max connection life setting in a connection pool (most pools have it). This workaround would be undesirable in many environments I observed.

Comments

  1. Relying on max connection life is not palatable for most user environments. Max connection life should be set to something around 1h to avoid a harmful reconnect overhead. In any case, it should be much longer than a reasonable drain_wait. Setting max life to under a minute or a small # of minutes is not an efficient practice, users should not be encouraged to do that. in short, setting the drain-wait to something comparable to conn max life is not a solution for uninterrupted handling of a node shutdown. (The quoted write up suggests holistic best practices and the product currently provides the necessary means imho.)

  2. If I understood correctly, the point in "Rationale and Alternative" covers a typical situation - unlimited connection_timeout is not desirable, while the practical effect if set to a finite value could be achieved with the currently available means.

  3. There are usability issues with the current implementation. Correcting them would bring tangible benefits, reducing user errors and increasing out of the box use simplicity:
    3a) the cluster setting drain_wait and --drain_wait flag for cockroach node drain imply they are related, while in fact they are entirely different. And should be set to dramatically different values - the former to a few seconds, the latter to a few minutes. The former refers to the period 1 while the latter to the period 3. The former (cluster config) is not a "drain wait". Customers are routinely confused by the name similarity and make wrong settings.
    3b. The defaults for the 3 cluster settings (drain_wait, query_wait, lease_transfer_wait) can be 10s each, which should work for most requirements. Currently the defaults must be adjusted.
    3c. The semantic of each of the 3 settings (for the corresponding periods of the node drain) have different semantics.
    3d. server.shutdown.lease_transfer_wait is a particular misnomer. It's not a timeout, but merely controls progress messages in the log.

From @rafiss:

Adding yet another, 4th, period with its own config parameter may confuse users.

On balance, I still think most users will benefit. We've heard directly from some of our most prominent customers that this setting is exactly what they want. For one example, see this internal thread: https://cockroachlabs.slack.com/archives/CFR37CA4A/p1642187568061000?thread_ts=1642171305.041300&cid=CFR37CA4A @vy-ton may be able to help with more examples.

The quoted write up suggests holistic best practices and the product currently provides the necessary means imho.

I disagree. The write up you linked to says: "Transactions that were not admitted by the server or were interrupted because of the forced connection closure may fail with error 57P01 server is shutting down or error 08006 An I/O error occurred while sending to the backend. Other error codes may be encountered as well." These errors are the exact problem that several customers have told us they need fixed, and it is why we are doing the connection_wait project. For these customers, it is important to see exactly 0 errors during the shutdown process.

Max connection life should be set to something around 1h to avoid a harmful reconnect overhead. In any case, it should be much longer than a reasonable drain_wait. Setting max life to under a minute or a small # of minutes is not an efficient practice, users should not be encouraged to do that.

This project is presenting users with an opt-in setting. We should only encourage users to use the setting if they cannot tolerate errors during shutdown. As with most engineering problems there is a tradeoff:

  1. Don't use the new setting, and make the application able to handle connection errors during draining.
  2. If errors are unacceptable, then use the new connection_wait and lower the MaxLifetime in the application which leads to more frequent reconnection.
  3. If more frequent reconnection is unacceptable, then use a longer conn_wait which leads to a longer draining process.

Which one of those 3 choices the user chooses depends on which problems are more important to them. We don’t currently have the ability to create a silver bullet that solves all of the problems at the same time.

the cluster setting drain_wait and --drain_wait flag for cockroach node drain imply they are related, while in fact they are entirely different.

I completely agree that this naming is confusing. It is a pre-existing problem. I suggest to file an issue for the DB Server team and PM.

cc @knz

ZhouXing19 referenced this pull request Jan 18, 2022
This is a proposal for optimizing the draining process to be more legible for
customers, and introduce a new step, connection_wait, to the draining process,
which allows the server to early exit when all connections are closed.

Release note: None
@ZhouXing19 ZhouXing19 changed the title rfc: optimize the draining process with connection_wait and improved log rfc: optimize the draining process with connection_wait and relevant reporting systems Jan 18, 2022
Contributor

@cameronnunez cameronnunez left a comment

I really like the idea of this proposal, good work!

However, I think this needs to rephrase the drain process.

I do not think the cluster setting names (drain_wait, query_wait, lease_transfer_wait) should be used to refer to the draining steps. These should be disentangled. This is not too much of an issue for the first two steps, the initial wait and the proposed wait for SQL connection closure, which are correctly characterized as waits, but I think the last two steps of the draining process – draining SQL connections and range leases – both comprise the heart of the process and are not best characterized as waiting periods.

As pointed out by @a-entin, these cluster settings work in different ways. For example, the lease and leadership transferral step is executed with an infinite loop, with the timeout of each iteration in that loop being lease_transfer_wait. It is because of this that this step should not be described as lease_transfer_wait, since this name refers to a setting that simply controls the maximum duration of an iteration (thereby only really affecting the frequency of progress messages). Similarly, I think the third step of the final design state of the drain process should not be referred to as the query_wait step but something more along the lines of the “server-side SQL connection closure” stage; I think this more aptly captures the key part of this step. Connections are actively closed as their queries finish up, and after the duration of query_wait (assuming there are remaining connections), all the connections are closed by the server.

Our current documentation is misleading and needs to be updated to reflect this information. Perhaps we should consider renaming these cluster settings (lease_transfer_timeout for instance) to prevent implications of similarity in purpose.

I think for all references to the draining steps in the RFC it would be best to either assign names to the steps or to simply use step numbers. I think this will avoid a lot of confusion. Personally, I think of the process as (1) the initial wait, (2) the proposed wait for client-side SQL connection closure, (3) the server-side SQL connection closure period, and finally (4) lease and leadership transferral.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @joshimhoff, @knz, @otan, and @ZhouXing19)


docs/RFCS/20211207_graceful_draining.md, line 53 at r6 (raw file):

![Current Draining Process][current draining pic]
The current workaround that CockroachDB provides is to set the
`server.shutdown.drain_wait` longer than the sum of the maximum lifetime of a

nit: "set the cluster setting server.shutdown.drain_wait longer..."


docs/RFCS/20211207_graceful_draining.md, line 64 at r6 (raw file):

The downsides of this workaround are:

  1. The `server.shutdown.drain_wait` has to be adjusted based on clients’

nit: "The setting server.shutdown.drain_wait has to be adjusted..."


docs/RFCS/20211207_graceful_draining.md, line 172 at r6 (raw file):

The default duration length of connection_wait is set to be 0 second.

nit: "0 seconds"


docs/RFCS/20211207_graceful_draining.md, line 192 at r6 (raw file):

  at `func (s *Server) ServeConn` in `pkg/sql/pgwire/server.go`.
- **Early exit if all SQL connections are closed**:
  we let the server keeps listening to the registered SQL connections (`s.mu.connCancelMap`).

nit: "keep"


docs/RFCS/20211207_graceful_draining.md, line 216 at r6 (raw file):

    currently active SQL connections and queries. This is for the case where
    the operator doesn't have access to the logs, but they want to query the
    draing progress through network. Note that both the HTTP API and RPC API

nit: "draining"


docs/RFCS/20211207_graceful_draining.md, line 249 at r6 (raw file):

  `lease_transfer_wait` phase.
  In this case, all on-going queries will be left uncommitted.
- `lease_transfer_wait`: the server waits to transfer range leases.

Needs a more accurate and lengthy description (and a more reflective step name). This step actively transfers all range leases and raft leaderships to other nodes. It completes only when all transfers are done.


docs/RFCS/20211207_graceful_draining.md, line 360 at r6 (raw file):

  Specifically, we should add the following in
  [Cluster Settings Page][Cluster Setting Page]:
  - The CockroachDB server's draining procedures. I.e. the behavior of the

nit: "procedures i.e."


docs/RFCS/20211207_graceful_draining.md, line 363 at r6 (raw file):

    system during each waiting period. We can refer to the `final state of this
    design` in this RFC.
  - How to tune the length of each waiting period when encounter errors in

Something along the lines of "How to correctly tune the cluster settings" would work better here. The end part is unnecessary since these guidelines are important to avoid errors to begin with.

I think this should spell out a bit more how these cluster settings differ as well (e.g. connection_wait is a timeout whereas drain_wait is a minimum duration, clarification that drain_wait is not synonymous with --drain_wait flag, etc.)


docs/RFCS/20211207_graceful_draining.md, line 365 at r6 (raw file):

  - How to tune the length of each waiting period when encounter errors in
    draining.
      - `server.shutdown.drain_wait` should be longer than the load balancer's

"should be customized in coordination with the load balancer's settings" since there are more factors than just the health check interval, such as the number of tolerable unsuccessful health checks.


docs/RFCS/20211207_graceful_draining.md, line 369 at r6 (raw file):
This should be a bit more specific. Take a look at the following from the node shutdown doc written by @a-entin:

This setting should provide a sufficient time for all transactions in progress to complete. Therefore the server.shutdown.query_wait should be greater than any of the following:

  • The longest possible transaction in the workload expected to complete successfully under normal operating conditions.
  • The sql.defaults.idle_in_transaction_session_timeout cluster setting.
  • The sql.defaults.statement_timeout cluster setting.

docs/RFCS/20211207_graceful_draining.md, line 372 at r6 (raw file):

        longer `query_wait` can give guarantee that on-going queries will be
        committed before the server fully shutdown, but may lead to a longer
        draining process. We should also note that `query_wait` let the server

nit: "lets"

@ZhouXing19
Collaborator Author

@cameronnunez Thanks for the comment!

but I think the last two steps of the draining process – draining SQL connections and range leases – both comprise the heart of the process and are not best characterized as waiting periods.

I agree that query_wait naming doesn't reflect that "connections are actively closed as their queries finish up, and after the duration of query_wait (assuming there are remaining connections), all the connections are closed by the server."
But I still think query_wait can be counted as a waiting period -- at this stage, the server is actually waiting for the ongoing SQL queries to finish. Also, if we keep the stage names consistent with the cluster settings, I think the customer will know more clearly which duration they are tuning.

So maybe we can keep the names of the first 3 stages as drain_wait, connection_wait and query_wait, better document them, and maybe give a more informative log at each stage? E.g. when connection_wait phase starts, we log "query_wait phase starts, no new SQL connections are allowed now. When any SQL query finishes, the server closes the underlying SQL connections. The draining will proceed to the lease_transfer stage once all SQL queries are finished or timeout after n seconds."
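The per-stage message proposed here could be built from a small helper; a sketch (the stage names mirror the cluster settings, but the function and wording below are illustrative only, not existing CockroachDB code):

```go
package main

import "fmt"

// stageBanner builds the informative per-stage log line proposed above.
// Stage names and message wording are illustrative.
func stageBanner(stage string, timeoutSeconds int) string {
	switch stage {
	case "connection_wait":
		return fmt.Sprintf(
			"connection_wait phase starts: no new SQL connections are allowed; "+
				"draining proceeds once all connections are closed or after %d seconds",
			timeoutSeconds)
	case "query_wait":
		return fmt.Sprintf(
			"query_wait phase starts: connections are closed as their queries "+
				"finish; remaining connections are force-closed after %d seconds",
			timeoutSeconds)
	default:
		return fmt.Sprintf("%s phase starts (timeout: %d seconds)", stage, timeoutSeconds)
	}
}

func main() {
	fmt.Println(stageBanner("connection_wait", 600))
}
```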

For example, the lease and leadership transferral step is executed with an infinite loop, with the timeout of each iteration in that loop being lease_transfer_wait.

It seems that lease_transfer_wait specifies the timeout for the whole lease transfer process, instead of the duration of each iteration:

// We've seen all the replicas once. Now we're going to iterate
// until they're all gone, up to the configured timeout.
transferTimeout := leaseTransferWait.Get(&s.cfg.Settings.SV)

drainLeasesOp := "transfer range leases"
if err := contextutil.RunWithTimeout(ctx, drainLeasesOp, transferTimeout,
	func(ctx context.Context) error {
		opts := retry.Options{
			InitialBackoff: 10 * time.Millisecond,
			MaxBackoff:     time.Second,
			Multiplier:     2,
		}
		everySecond := log.Every(time.Second)
		var err error
		// Avoid retry.ForDuration because of https://github.com/cockroachdb/cockroach/issues/25091.
		for r := retry.StartWithCtx(ctx, opts); r.Next(); {
			err = nil
			if numRemaining := transferAllAway(ctx); numRemaining > 0 {
				// Report progress to the Drain RPC.
				if reporter != nil {
					reporter(numRemaining, "range lease iterations")
				}
				err = errors.Errorf("waiting for %d replicas to transfer their lease away", numRemaining)
				if everySecond.ShouldLog() {
					log.Infof(ctx, "%v", err)
				}
			}
			if err == nil {
				// All leases transferred. We can stop retrying.
				break
			}
		}

and here the transferTimeout is outside of the retry for loop. Or am I missing something?

But yes I agree that lease_transfer_wait should be renamed since it's not waiting for the completion of client-side operations. lease_transfer_timeout sounds good to me.

I think for all references to the draining steps in the RFC it would be best to either assign names to the steps or to simply use step numbers.

Yes, step numbers are a good idea. I'll go with (1) drain_wait, (2) connection_wait, (3) query_wait, and (4) lease_transfer_timeout.

@cameronnunez
Contributor

cameronnunez commented Jan 21, 2022

It seems that lease_transfer_wait specifies the timeout for the whole lease transfer process, instead of the duration of each iteration:
and here the transferTimeout is outside of the retry for loop. Or am I missing something?

You have to go higher in scope to find the infinite loop. The draining sequence does not begin here. A call to SetDraining, the function where that code resides, represents a single iteration. See below:

In cli/start.go:

// Perform a graceful drain. We keep retrying forever, in
// case there are many range leases or some unavailability
// preventing progress. If the operator wants to expedite
// the shutdown, they will need to make it ungraceful
// via a 2nd signal.
for {
	remaining, _, err := s.Drain(drainCtx)
	if err != nil {
		log.Ops.Errorf(drainCtx, "graceful drain failed: %v", err)
		break
	}
	if remaining == 0 {
		// No more work to do.
		break
	}
	// Avoid a busy wait with high CPU usage if the server replies
	// with an incomplete drain too quickly.
	time.Sleep(200 * time.Millisecond)
}

Drain is iterated on, and each iteration is another call to SetDraining. And lease_transfer_wait is a timeout pertaining to a single call to SetDraining, and so a single iteration. It's for this reason that this stage of the draining process is not a wait, but a period where the server is actively shedding leases and raft leadership. And the drain will only end when all leases have been transferred.

Collaborator

@rafiss rafiss left a comment

this looks great! my comments are all just suggestions

similar functionality as
[Postgres’s smart shutdown service][postgres_shutdown].

Note that in the current design, we do support an unlimited `connection_wait`
Collaborator

what did you end up deciding about enforcing a max allowed value for the new cluster setting? (line 224 mentioned that infinite won't be allowed)

like to reserve access to the node when the draining process started.
The possible solution is to add a gate in the connection register that
looks at a connection’s metadata (e.g. if it’s from admin or not) during
`drain_wait`.
Collaborator

this could even change to allow admins to login during connection_wait also, but that is also just a possible future extension

declared as a `*settings.DurationSetting` as follows:

```go
connectionWait = settings.RegisterDurationSetting(
Collaborator

cluster settings now need to be marked as allowed or disallowed for multitenant. i think this one should be "TenantReadOnly" (but not tenant writable)

@ZhouXing19 ZhouXing19 force-pushed the rfc-draining branch 2 times, most recently from cf7973c to c87eaa4 Compare January 26, 2022 21:00
Collaborator Author

@ZhouXing19 ZhouXing19 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @a-entin, @cameronnunez, @joshimhoff, @knz, @otan, @rafiss, and @ZhouXing19)


docs/RFCS/20211207_graceful_draining.md, line 158 at r2 (raw file):

Previously, knz (kena) wrote…

Without a space between the backquote and "go".

Done.


docs/RFCS/20211207_graceful_draining.md, line 160 at r6 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

cluster settings now need to marked as allowed or disallowed for multitenant. i think this one should be "TenantReadOnly" (but not tenant writable)

Done.


docs/RFCS/20211207_graceful_draining.md, line 249 at r6 (raw file):

Previously, cameronnunez (Cameron Nunez) wrote…

Needs a more accurate and lengthy description (and more a reflective step name). This step actively transfers all range leases and raft leaderships to other nodes. It completes only when all transfers are done.

Thanks! Added more lines and opened an issue for the renaming.


docs/RFCS/20211207_graceful_draining.md, line 276 at r6 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

this could even change to allow admins to login during connection_wait also, but that is also just a possible future extension

Done adding it.


docs/RFCS/20211207_graceful_draining.md, line 363 at r6 (raw file):

Previously, cameronnunez (Cameron Nunez) wrote…

Something along the lines of "How to correctly tune the cluster settings" would work better here. The end part is unnecessary since these guidelines are important to avoid errors to begin with.

I think this should spell out a bit more how these cluster settings differ as well (e.g. connection_wait is a timeout whereas drain_wait is a minimum duration, clarification that drain_wait is not synonymous with --drain_wait flag, etc.)

Done.


docs/RFCS/20211207_graceful_draining.md, line 365 at r6 (raw file):

Previously, cameronnunez (Cameron Nunez) wrote…

"should be customized in coordination with the load balancer's settings" since there are more factors than just the health check interval, such as the number of tolerable unsuccessful health checks.

Done.


docs/RFCS/20211207_graceful_draining.md, line 369 at r6 (raw file):

Previously, cameronnunez (Cameron Nunez) wrote…

This should be a bit more specific. Take a look at the following from the node shutdown doc written by @a-entin:

This setting should provide a sufficient time for all transactions in progress to complete. Therefore the server.shutdown.query_wait should be greater than any of the following:

  • The longest possible transaction in the workload expected to complete successfully under normal operating conditions.
  • The sql.defaults.idle_in_transaction_session_timeout cluster setting.
  • The sql.defaults.statement_timeout cluster setting.

Done. Thanks!


docs/RFCS/20211207_graceful_draining.md, line 385 at r6 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

what did you end up deciding about enforcing a max allowed value for the new cluster setting? (line 224 mentioned that infinite won't be allowed)

Revised this part.

This is a proposal to make the draining process more legible for customers and
to introduce a new step, connection_wait, to the draining process, which
allows the server to exit early once all connections are closed.

Release note: None
@ZhouXing19
Collaborator Author

Thanks!
bors r+

craig bot pushed a commit that referenced this pull request Feb 15, 2022
73586: rfc: optimize the draining process with connection_wait and relevant reporting systems r=ZhouXing19 a=ZhouXing19

This is a proposal to make the draining process more legible for customers and
to introduce a new step, connection_wait, to the draining process, which
allows the server to exit early once all connections are closed.

Release note: None

76410: kv: disallow GC requests that bump GC threshold and GC expired versions r=nvanbenschoten a=nvanbenschoten

Related to #55293.

This commit adds a safeguard to GC requests that prevents them from
bumping the GC threshold at the same time that they GC individual MVCC
versions. This was found to be unsafe in #55293 because performing both
of these actions at the same time could lead to a race where a read
request is allowed to evaluate without error while also failing to see
MVCC versions that are concurrently GCed.

This race is possible because foreground traffic consults the in-memory
version of the GC threshold (`r.mu.state.GCThreshold`), which is updated
after (in `handleGCThresholdResult`), not atomically with, the application
of the GC request's WriteBatch to the LSM (in `ApplyToStateMachine`). This
allows a read request to see the effect of a GC on MVCC state without seeing
its effect on the in-memory GC threshold.

The latches acquired by GC requests look like they should help with this race,
but in practice they do not, for two reasons:
1. the latches do not protect timestamps below the GC request's batch
   timestamp. This means that they only conflict with concurrent writes,
   but not all concurrent reads.
2. the read could be served off a follower, which could be applying the
   GC request's effect from the raft log. Latches held on the leaseholder
   would have no impact on a follower read.

Thankfully, the GC queue has split these two steps for the past few
releases, at least since 87e85eb, so we do not have a bug today.

The commit also adds a test that reliably exercises the bug with a few
well-placed calls to `time.Sleep`. The test contains a variant where the
read is performed on the leaseholder and a variant where it is performed
on a follower. Both fail by default. If we switch the GC request to
acquire non-MVCC latches then the leaseholder variant passes, but the
follower read variant still fails.
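The ordering hazard described in the commit message can be sketched with a toy model (illustrative names and types only, not actual CockroachDB code): the MVCC versions are removed first, and the in-memory threshold is bumped only afterwards, so a read that interleaves between the two steps passes the threshold check yet silently misses the GCed data.

```go
package main

import (
	"errors"
	"fmt"
)

// toyReplica models only the two pieces of state relevant to the race.
type toyReplica struct {
	gcThreshold int            // in-memory threshold; reads at or below it must error
	versions    map[int]string // MVCC versions keyed by (toy integer) timestamp
}

// applyGCBatch mimics ApplyToStateMachine: versions at or below ts are removed.
func (r *toyReplica) applyGCBatch(ts int) {
	for t := range r.versions {
		if t <= ts {
			delete(r.versions, t)
		}
	}
}

// bumpThreshold mimics handleGCThresholdResult, which runs after the batch.
func (r *toyReplica) bumpThreshold(ts int) { r.gcThreshold = ts }

// read checks only the in-memory threshold, like foreground traffic does.
func (r *toyReplica) read(ts int) (string, error) {
	if ts <= r.gcThreshold {
		return "", errors.New("batch timestamp must be after GC threshold")
	}
	// A GCed version yields no value and no error: the read silently
	// misses data instead of being rejected.
	return r.versions[ts], nil
}

func main() {
	r := &toyReplica{versions: map[int]string{5: "old", 10: "new"}}
	r.applyGCBatch(5) // step 1: MVCC versions removed
	v, err := r.read(5)
	fmt.Println(v, err) // the race window: no error, but the data is gone
	r.bumpThreshold(5)  // step 2: threshold finally bumped
	_, err = r.read(5)
	fmt.Println(err != nil) // now the read is correctly rejected
}
```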

76417: ccl/sqlproxyccl: add connector component and support for session revival token r=JeffSwenson a=jaylim-crl

Informs #76000.

Previously, all of the connection establishment logic was coupled with the handler
function within proxy_handler.go. This made connecting to a new SQL pod during
connection migration difficult. This commit refactors all of that connection
logic out of the proxy handler into a connector component, as described in the
connection migration RFC. At the same time, we also add support for the session
revival token within this connector component.

Note that the overall behavior of the SQL proxy should be unchanged with this
commit.

Release note: None

76545: cmd/reduce: add -tlp option r=yuzefovich a=yuzefovich

**cmd/reduce: remove stdin option and require -file argument**

We tend not to use the option of passing input SQL via stdin, so this
commit removes it. An additional argument in favor of doing so is that
the follow-up commit will introduce another mode of behavior that
requires the `-file` argument to be specified, so it's cleaner to
always require it now.

Release note: None

**cmd/reduce: add -tlp option**

This commit adds a `-tlp` boolean flag that changes the behavior of
`reduce`. It is required that `-file` be specified whenever the `-tlp`
flag is used. In this mode, the last two queries in the file (delimited
by empty lines) are the unpartitioned and partitioned queries, which
are equivalent yet return different results.

If a TLP check is requested, then we remove the last two queries from the
input and use them to construct a special TLP check query that results
in an error if the two removed queries return different results.

We do not simply include the TLP check query in the input string because
the reducer would then reduce the check query itself, making the
reduction meaningless.
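The extraction step described above (queries delimited by empty lines, last two removed) can be sketched like this; the function name and exact splitting rules are illustrative, not the actual `cmd/reduce` code:

```go
package main

import (
	"fmt"
	"strings"
)

// splitTLPInput separates the reducible input from the last two queries in
// the file, which are assumed to be the unpartitioned and partitioned TLP
// queries. Queries are delimited by empty lines, per the -tlp contract.
func splitTLPInput(input string) (rest, unpartitioned, partitioned string) {
	parts := strings.Split(strings.TrimSpace(input), "\n\n")
	if len(parts) < 2 {
		// Not enough queries to extract a TLP pair; leave the input alone.
		return input, "", ""
	}
	unpartitioned = parts[len(parts)-2]
	partitioned = parts[len(parts)-1]
	rest = strings.Join(parts[:len(parts)-2], "\n\n")
	return rest, unpartitioned, partitioned
}

func main() {
	input := "CREATE TABLE t (a INT);\n\nSELECT a FROM t;\n\nSELECT a FROM t WHERE a > 0;"
	rest, unpart, part := splitTLPInput(input)
	fmt.Println(rest)
	fmt.Println(unpart)
	fmt.Println(part)
}
```

Keeping the two queries out of the reducible input is what prevents the reducer from mangling the check itself.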

Release note: None

76598: server: use channel for DisableAutomaticVersionUpgrade r=RaduBerinde a=RaduBerinde

DisableAutomaticVersionUpgrade is an atomic integer which is rechecked
in a retry loop. This is not a very clean mechanism, and can lead to
issues where you're unknowingly dealing with a copy of the knobs and
setting the wrong atomic. The retry loop can also add unnecessary
delays in tests.

This commit changes DisableAutomaticVersionUpgrade from an atomic
integer to a channel. If the channel is set, auto-upgrade waits until
the channel is closed.
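The channel-based knob described above can be sketched as follows (names are illustrative, not the exact cockroachdb symbols): a nil channel means auto-upgrade proceeds immediately, while a non-nil channel parks it until the channel is closed, with no retry loop and no atomic to copy by accident.

```go
package main

import (
	"fmt"
	"time"
)

// upgradeKnobs mimics a testing-knobs struct holding the gate channel.
type upgradeKnobs struct {
	DisableAutomaticVersionUpgrade chan struct{}
}

// maybeUpgrade blocks on the gate if one is set. Receiving from a closed
// channel returns immediately, so closing the gate re-enables auto-upgrade.
func maybeUpgrade(knobs *upgradeKnobs) string {
	if knobs.DisableAutomaticVersionUpgrade != nil {
		<-knobs.DisableAutomaticVersionUpgrade
	}
	return "upgraded"
}

func main() {
	gate := make(chan struct{})
	knobs := &upgradeKnobs{DisableAutomaticVersionUpgrade: gate}
	done := make(chan string)
	go func() { done <- maybeUpgrade(knobs) }()
	time.Sleep(10 * time.Millisecond) // the upgrade is still parked on the gate
	close(gate)                       // re-enable auto-upgrade; no polling needed
	fmt.Println(<-done)
}
```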

Release note: None

Co-authored-by: Jane Xing <zhouxing@uchicago.edu>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Jay <jay@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
@craig
Contributor

craig bot commented Feb 15, 2022

Build failed (retrying...):


@a-entin a-entin left a comment


Reviewed 3 of 4 files at r1, 1 of 1 files at r9, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @a-entin, @cameronnunez, @joshimhoff, @knz, @otan, @rafiss, and @ZhouXing19)

@craig
Contributor

craig bot commented Feb 16, 2022

Build succeeded:

@craig craig bot merged commit b429f25 into cockroachdb:master Feb 16, 2022