rfc: RFC for migrating pgwire connections between gateways #47060
andreimatei wants to merge 1 commit into cockroachdb:master
Conversation
The proposal is to introduce an optional proxy in the CRDB architecture which, in coordination with SQL gateways, can ensure that gateway shutdowns are transparent to clients in the majority of cases. Release note: None
After a Friday of discussing and very minor prototyping this with @ajwerner and @nvanbenschoten, I've written this RFC draft to record our thoughts (although perhaps what's in it goes beyond what has support from Andrew and Nathan, we'll see). It's all about how to get connections (and even transactions, and even most queries!) to survive gateway restarts. Generally useful, but in particular useful for emerging architectures based on ephemeral SQL gateways. @bdarnell @andy-kimball @tbg I'd be curious to get your opinions. No hurry, though.
knz
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
docs/RFCS/20200404_connection_migration_between_gateways.md, line 84 at r1 (raw file):
of pgwire used by a gateway signaling that it's about to shut down. As such, the proxy has to decrypt the traffic, so it'll terminate the client TLS connections and perform certificate-based authentication. The connections to the gateways
I recommend that you leave the word "certificate-based" out here and instead create a section about authentication where you can make the point that it doesn't matter much for this RFC so long as the proxy owns the authn. I think right now it's very likely we won't be doing cert-based authn and instead do something else, but this should be irrelevant to this RFC. No need to derail your proposal by over-selling certs here.
docs/RFCS/20200404_connection_migration_between_gateways.md, line 93 at r1 (raw file):
are still open connections coming from our enlightened proxies, the gw will coordinate with each proxy for saving the state of its respective connections into the database. The proxy receives a token for every connection, which can be used to
I am not fond of this idea to "tell clients to go away, and only if there are clients that don't want to go away coordinate with the proxy". This seems unnecessarily complicated.
Why not have the proxy be informed first of a node shutdown, and take all its clients elsewhere first proactively, before the node even starts to drain? If we did that we would not even need to change the node's own drain logic.
docs/RFCS/20200404_connection_migration_between_gateways.md, line 119 at r1 (raw file):
2. Open transactions, additionally: - transaction record - savepoint information (read spans, lock spans, in-flight writes)
add:
- pending schema changes
- current error state
- current savepoint stack
- current sql-level txn state: read-only bit, number of DDL statements, phasetimes (for txn stats), etc
docs/RFCS/20200404_connection_migration_between_gateways.md, line 124 at r1 (raw file):
query started running - the in-flight query - information on the query's results that were already sent to the client
add: phasetimes for execution stats
docs/RFCS/20200404_connection_migration_between_gateways.md, line 130 at r1 (raw file):
"deterministic": we'll verify that the first results generated on the new gw correspond to what was sent to the client already. If they don't, then the query needs to be aborted.
- add a notice here that this is intended to also catch results that are dependent on the node ID, such as queries to crdb_internal vtables
- add a disclaimer that this will cause most SQL mutations to fail, because unique_rowid() and uuid_v4() are non-reproducible.
bdarnell
left a comment
Can existing postgres proxies (like pgpool and pgbouncer) do any of this? I think they can at least handle transitions cleanly for connections that are idle when the server goes down. I don't think we can avoid building our own proxy here but this might be some interesting prior art for preserving the state of a connection.
docs/RFCS/20200404_connection_migration_between_gateways.md, line 66 at r1 (raw file):
ephemeral gw architecture, the amount of time idle connections or transactions are allowed to stay alive should be a matter of policy, divorced from the scaling down of gw capacity. The implementation of the three stages will be
I'm skeptical that it will ever make sense to scale down SQL node capacity while there is an open transaction. I think it should always remain viable to set the transaction timeout high enough that all reasonable transactions are allowed to complete while the node is shutting down.
docs/RFCS/20200404_connection_migration_between_gateways.md, line 82 at r1 (raw file):
The proxy has to be protocol-aware, since it'll have to understand an extension of pgwire used by a gateway signaling that it's about to shut down. As such, the
I'm not sure about returning a shutdown signal in-band in the pgwire connection (I think I'd prefer health checks by the proxy). The proxy still has to be protocol-aware, though, to detect when the connection is in the idle state.
docs/RFCS/20200404_connection_migration_between_gateways.md, line 86 at r1 (raw file):
and perform certificate-based authentication. The connections to the gateways will be encrypted in some cheaper way, or perhaps still with TLS but with node certs instead of client certs.
Mucking with authentication here is scary. If the multi-tenant proxy must be trusted by the single-tenant backends, we lose a lot of security. I think it's important that the authentication be end-to-end. I'm not sure if that's possible for certificate auth, but fortunately cloud uses password auth.
andreimatei
left a comment
docs/RFCS/20200404_connection_migration_between_gateways.md, line 66 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I'm skeptical that it will ever make sense to scale down SQL node capacity while there is an open transaction. I think it should always remain viable to set the transaction timeout high enough that all reasonable transactions are allowed to complete while the node is shutting down.
I think there are some use cases for long-running transactions. In particular, any transaction that has queued up schema changes waits for those changes to complete before returning from the COMMIT (and I think @ajwerner wants to make the semantics of schema change transactions even stronger).
docs/RFCS/20200404_connection_migration_between_gateways.md, line 82 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I'm not sure about returning a shutdown signal in-band in the pgwire connection (I think I'd prefer health checks by the proxy). The proxy still has to be protocol-aware, though to detect when the connection is in the idle state.
As described later, there's still an out of band notification that a gw sends to let proxies know that it's draining (e.g. the health checks). On this signal, the proxies start routing new connections away, and start buffering traffic.
But we still need, I think, a separate signal from the gw on a particular connection that says that the gw has finished processing all the commands that the proxy has sent (all the commands up to the no_more_traffic packet). Without it, I think it'd be hard for the proxy to know when there's nothing in flight. I think it'd be a trickier proposition to get the proxy in the business of understanding the protocol enough and tracking the communication enough so that it knows when all the responses have come back from the gw. Plus tracking the transaction state, if we only support saving conns without an open txn.
docs/RFCS/20200404_connection_migration_between_gateways.md, line 93 at r1 (raw file):
Previously, knz (kena) wrote…
I am not fond of this idea to "tell clients to go away, and only if there are clients that don't want to go away coordinate with the proxy". This seems unnecessarily complicated.
Why not have the proxy be informed first of a node shutdown, and take all its clients elsewhere first proactively, before the node even starts to drain? If we did that we would not even need to change the node's own drain logic.
I think what you say is the intention here, I'll clarify more. When a gw begins draining, the proxy finds out and starts directing new connections away. It's only connections that have not closed during the draining procedure that will do more handshake with the proxy.
bdarnell
left a comment
docs/RFCS/20200404_connection_migration_between_gateways.md, line 66 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think there are some use cases for long-running transactions. In particular, any transaction that has queued up schema changes waits for those changes to complete before returning from the COMMIT (and I think @ajwerner wants to make the semantics of schema change transactions even stronger).
But as long as that schema change is running, a SQL node will be processing it, so you can't scale down to zero. Maybe there's room for some improvement (consolidating multiple schema changes that started on separate nodes onto a single one), or we could make changes in the future (like a separate pool of job servers for schema changes, separate from SQL frontends), but for now we seem pretty far away from supporting a scenario in which it would make sense to turn off a SQL node with an open transaction.
docs/RFCS/20200404_connection_migration_between_gateways.md, line 82 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
As described later, there's still an out of band notification that a gw sends to let proxies know that it's draining (e.g. the health checks). On this signal, the proxies start routing new connections away, and start buffering traffic.
But we still need, I think, a separate signal from the gw on a particular connection that says that the gw has finished processing all the commands that the proxy has sent (all the commands up to the no_more_traffic packet). Without it, I think it'd be hard for the proxy to know when there's nothing in flight. I think it'd be a trickier proposition to get the proxy in the business of understanding the protocol enough and tracking the communication enough so that it knows when all the responses have come back from the gw. Plus tracking the transaction state, if we only support saving conns without an open txn.
Can't the server just close the connection when it's processed everything it's going to process?
This is a case where perfection is hard (a new packet type, which requires a customized implementation of the protocol...) and the cost of imperfection is low (a dropped connection that might have been salvageable). I prefer to just use timers in such a situation.
andreimatei
left a comment
docs/RFCS/20200404_connection_migration_between_gateways.md, line 82 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Can't the server just close the connection when it's processed everything it's going to process?
This is a case where perfection is hard (a new packet type, which requires a customized implementation of the protocol...) and the cost of imperfection is low (a dropped connection that might have been salvageable). I prefer to just use timers in such a situation.
The gateway needs to return some sort of continuation token to the proxy - the encoded state of the connection, or a pointer to this state. So the proxy does need to understand this in the protocol... Or, how do you propose the proxy gets this state? Some other out of band communication with the gw after the gw closes the conn?
bdarnell
left a comment
docs/RFCS/20200404_connection_migration_between_gateways.md, line 82 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
The gateway needs to return some sort of continuation token to the proxy - the encoded state of the connection, or a pointer to this state. So the proxy does need to understand this in the protocol... Or, how do you propose the proxy gets this state? Some other out of band communication with the gw after the gw closes the conn?
Hmm, good point. I was mainly focusing on moving open transactions out of scope to limit how much needs to be retained, but that doesn't get it down to zero. We'd still need to retain session variables and prepared statements (maybe more?). Is it feasible for the proxy to understand enough of the protocol to collect and replay this state? That's what some connection pools do, although they're able to operate at a slightly higher level than the protocol.
petermattis
left a comment
docs/RFCS/20200404_connection_migration_between_gateways.md, line 117 at r1 (raw file):
- pgwire commands that are queued up for execution (see [Gateway shutdown handshake](gateway-shutdown-handshake)) 2. Open transactions, additionally:
I wonder if the complexity of migrating in-flight queries and transactions is worthwhile. I imagine one of the use cases for this migration capability is to migrate from one SQL pod size to a larger SQL pod size. When we support multiple SQL pods per tenant, the only downside to not supporting migration of a transaction or in-flight query is that we have to keep the old SQL pod around for longer. We take on expense in exchange for reduced engineering effort.
That said, migrating an open transaction seems feasible. Maybe not worth the effort, but feasible. Migrating an in-flight query feels really difficult. I know you've thought about this more than I have, so please correct me if these assessments are wrong.
andreimatei
left a comment
docs/RFCS/20200404_connection_migration_between_gateways.md, line 117 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I wonder if the complexity of migrating in-flight queries and transactions is worthwhile. I imagine one of the use cases for this migration capability is to migrate from one SQL pod size to a larger SQL pod size. When we support multiple SQL pods per tenant, the only downside to not supporting migration of a transaction or in-flight query is that we have to keep the old SQL pod around for longer. We take on expense in exchange for reduced engineering effort.
That said, migrating an open transaction seems feasible. Maybe not worth the effort, but feasible. Migrating an in-flight query feels really difficult. I know you've thought about this more than I have, so please correct me if these assessments are wrong.
Right, migrating a running query is more difficult, and I only know how to do it for queries that return rows in a deterministic order. Which is a lot of our queries, particularly since so far we've been constantly reducing the amount of parallelism used within a query (for better or worse).