ccl/sqlproxyccl: complete connection migration support in the forwarder#76805
Conversation
jaylim-crl
left a comment
The structure for the connection migration work is there. Note that this is untested, and was mostly inspired by the initial prototype. Putting this up in case anyone has thoughts on the direction. I'll be adding tests next.
When it comes to the forwarder's API, I plan to have a Forwarder interface that exposes just two methods: RequestTransfer and Close. With the current design, callers wouldn't know whether the transfer has failed or succeeded after calling RequestTransfer. I think that is fine for now, and if we really need to know the last transfer error, we could address that in the upcoming RFC. That said, the initial thought is to have the forwarder notify a "forwarder tracking" component (i.e. a map of pods to forwarders) when it connects/disconnects from serverConn. The proxy can then check which pod the forwarder is assigned to (versus the previous pod) to know whether the transfer has succeeded.
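A minimal sketch of the proposed two-method API. The noopForwarder implementation is purely illustrative and not part of the actual change:

```go
package main

import "fmt"

// Forwarder is a sketch of the proposed two-method API. Callers request a
// transfer and close the forwarder; they do not learn the transfer outcome
// directly.
type Forwarder interface {
	// RequestTransfer requests that the forwarder migrate its server
	// connection to another SQL pod. Callers do not learn whether the
	// transfer succeeded.
	RequestTransfer()
	// Close terminates the forwarder.
	Close()
}

// noopForwarder is a placeholder implementation used only to illustrate the
// shape of the interface.
type noopForwarder struct{ closed bool }

func (f *noopForwarder) RequestTransfer() {}
func (f *noopForwarder) Close()           { f.closed = true }

func main() {
	var fwd Forwarder = &noopForwarder{}
	fwd.RequestTransfer()
	fwd.Close()
	fmt.Println("closed:", fwd.(*noopForwarder).closed) // prints "closed: true"
}
```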
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)
pkg/ccl/sqlproxyccl/forwarder.go, line 52 at r1 (raw file):
var (
These error messages might be unnecessary, at least for the ones where they are only used once. We could just do return errors.New there if we wanted to.
jaylim-crl
left a comment
TFTRs!
This draft PR is ready for another look. I added more tests, and changed how we handle timeout handlers. I also tested this, and it works as expected.
This test is still WIP, and is not included in this PR. Will update that later today. Here, we call RequestTransfer in a loop, and constantly query the SQL pod details. We can see that we alternate between pods, which implies that connection migration works:
=== RUN TestConnectionMigration/successful_connection
{postgresql://root@127.0.0.1:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
There are three more items to address:
- Implement deserialize_session. Given that this PR is huge, I might just push that to a subsequent PR.
- Perhaps move the changes in authentication.go in another PR. One thing that was overlooked is that the server still sends back initial connection messages in the token-based auth approach, so we'd have to slurp those messages.
- Add TestConnectionMigration (which is an end-to-end test) to this PR.
pkg/ccl/sqlproxyccl/forwarder.go, line 269 at r2 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
Can we do this without atomics? I do not like using atomics for control flow. Getting atomic programming right is infamously difficult. In this case there is a race condition that occurs if RequestTransfer is called concurrently and it's not clear to me how to fix it. At a minimum CompareAndSwap needs to be used as the guard for this state transfer. But then there is a race with setting transferKey and transferCh.
Done - good catch re: concurrent calls. As discussed, I have moved this to use locks. My benchmarks show that using locks within a 30s workload run only slows the overall time by a few hundred ms, so I'd be fine with using locks everywhere instead of a mixture of both.
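The lock-based guard discussed here can be sketched as follows. Field and method names are illustrative, not the actual forwarder code; the point is that a mutex-protected flag lets exactly one concurrent RequestTransfer win, which a plain atomic load-then-store cannot guarantee:

```go
package main

import (
	"fmt"
	"sync"
)

// forwarder sketches a mutex-guarded transfer flag. Concurrent calls to
// tryStartTransfer cannot both start a transfer.
type forwarder struct {
	mu             sync.Mutex
	isTransferring bool
}

// tryStartTransfer marks the forwarder as transferring. It returns false if a
// transfer is already in progress, so only one caller wins the race.
func (f *forwarder) tryStartTransfer() bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.isTransferring {
		return false
	}
	f.isTransferring = true
	return true
}

// finishTransfer clears the flag once the transfer completes or fails.
func (f *forwarder) finishTransfer() {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.isTransferring = false
}

func main() {
	f := &forwarder{}
	var wg sync.WaitGroup
	var winsMu sync.Mutex
	wins := 0
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if f.tryStartTransfer() {
				winsMu.Lock()
				wins++
				winsMu.Unlock()
			}
		}()
	}
	wg.Wait()
	fmt.Println("winners:", wins) // prints "winners: 1"
}
```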
pkg/ccl/sqlproxyccl/forwarder.go, line 440 at r2 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
idea: Instead of implementing the forwarding goroutines as a state machine, initiating the transfer could ask both of the goroutines to terminate. That allows us to implement transfer the same way authenticate is implemented. As a single function and goroutine. Once the transfer is canceled or completed, the goroutine managing the transfer could restart the forwarding goroutines.
IMO the main advantage to implementing it by stopping the forwarding goroutines is we can keep the steady state code simple and we can group all of the transfer logic together.
As discussed, this would increase the overall transfer latency since we'd have to wait for the server-to-client processor to stop before sending the transfer state message, so I kept the current approach. I do like the idea since it makes things cleaner, but the tradeoff wouldn't be worth it.
pkg/ccl/sqlproxyccl/forwarder.go, line 514 at r2 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
This logic can be simplified by context.WithTimeout.
transferCtx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
err := transfer(transferCtx)
if transferCtx.Err() != nil && ctx.Err() == nil {
    return errTransferTimeout
}
return err
Alternatively, if we do not specifically care about errTransferTimeout, it can use context.WithTimeout directly.
Hm, this probably isn't relevant anymore since I moved it elsewhere.
jaylim-crl
left a comment
There are three more items to address:
1. Implement deserialize_session. Given that this PR is huge, I might just push that to a subsequent PR.
2. Perhaps move the changes in authentication.go in another PR. One thing that was overlooked is that the server still sends back initial connection messages in the token-based auth approach, so we'd have to slurp those messages.
3. Add TestConnectionMigration (which is an end-to-end test) to this PR.
(2) and (3) are done. I've decided to just leave them in this PR, but happy to break them down into separate PRs if reviewers prefer otherwise. (1) will definitely come in a later PR.
pkg/ccl/sqlproxyccl/forwarder.go, line 629 at r3 (raw file):
// NOTE: This should only be called during a transfer process. We assume
// that transferCloserCh has already been initialized.
func (f *forwarder) runTransferTimeoutHandler(waitTimeout time.Duration) {
The end-to-end transfer process works, but I'm not happy with this function. The unfortunate thing is that we unblock clients through context, but for servers, we'd do that by closing the forwarder. This goroutine is needed to ensure that we can terminate if we timeout when sending the SHOW TRANSFER STATE query to the server. Also perhaps I ran into a rabbit hole with all the transferConnRecoverable stuff, but I don't see a better way to address that. Happy to discuss.
closer is needed instead of relying on transferCtx.Done() to ensure that we can distinguish between a transfer timeout, and an actual transfer completion. If we block on transferCtx.Done() in the client-to-server processor, then when the timeout handler gets triggered, we will unblock automatically, which is incorrect.
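The closer-vs-context distinction described above can be sketched like this. Channel names are illustrative, not the actual forwarder fields; the point is that a blocked processor can tell a timeout apart from a normal completion by selecting on two separate channels:

```go
package main

import (
	"fmt"
	"time"
)

// waitForTransfer sketches why a dedicated closer channel is used instead of
// transferCtx.Done(): "closer" is closed only by the timeout handler, while
// "transferDoneCh" is closed on a normal completion, so the two outcomes are
// distinguishable.
func waitForTransfer(transferDoneCh, closer <-chan struct{}) string {
	select {
	case <-transferDoneCh:
		return "completed"
	case <-closer:
		return "timed out"
	}
}

func main() {
	// Simulate a timeout handler firing after the wait period elapses.
	done := make(chan struct{})
	closer := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Millisecond)
		close(closer)
	}()
	fmt.Println(waitForTransfer(done, closer)) // prints "timed out"

	// Simulate a transfer that completes normally.
	done2 := make(chan struct{})
	closer2 := make(chan struct{})
	close(done2)
	fmt.Println(waitForTransfer(done2, closer2)) // prints "completed"
}
```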
Informs cockroachdb#76000. Extracted from cockroachdb#76805.

This commit adds helpers related to connection migration. This includes support for retrieving the transfer state through SHOW TRANSFER STATE, as well as deserializing the session through crdb_internal.deserialize_session.

We also introduce a new error marker for errors that denote connections which are recoverable. In our case, we convert SQL errors from the first column of the SHOW TRANSFER STATE query (e.g. serialization errors) to Go errors, and mark them with the errConnRecoverableSentinel sentinel.

Release note: None
jaylim-crl
left a comment
FYI, I extracted some of the work here in #77109. I'll do the same for the auth stuff.
Informs cockroachdb#76000. Extracted from cockroachdb#76805.

Previously, we assumed that with the token-based authentication, the server is ready to accept queries the moment we connect to it. This assumption has been proved wrong during the integration tests with the forwarder, and there's an implicit AuthenticateOK step that happens after connecting to the server. During that time, initial connection data such as ParameterStatus and BackendKeyData messages will be sent to the client as well. For now, we will ignore those messages. Once we start implementing query cancellation within the proxy, that has to be updated to cache the new BackendKeyData.

Release note: None
jaylim-crl
left a comment
TFTRs!
OK. I have been mulling over this for a couple of days, and I like the new approach that I have. See the last commit in #77419. It has a couple of major architectural changes:
- We no longer keep track of forwarder states.
- Request and response processors have been split into their own internal components. We can suspend/resume these processors at any point in time by setting their appropriate deadlines. I took this idea from net/http: https://github.com/golang/go/blob/b5e06545b3c3b3b9c2ed8204e5ba3fd25a85db0f/src/net/http/server.go#L671-L723.
- Due to the suspend/resume ability in (2), readTimeoutConn has been removed (This was pointed out in a previous comment about being slow. I didn't like this approach either, but couldn't seem to figure out a way to unblock the processors easily back then.)
- All transfer related operations are now scoped in a single runTransfer method within conn_migration.go.
- One major drawback to this design is that we incur more mutex calls during forwarding.
Thoughts?
I ran major tests, and they seem to work as intended:
=== RUN TestConnectionMigration
test_log_scope.go:79: test logs captured to: /var/folders/79/72cqh3ps56v7ts4h7zv92vz80000gq/T/logTestConnectionMigration4246808810
test_log_scope.go:80: use -show-logs to present logs inline
=== RUN TestConnectionMigration/successful
=== RUN TestConnectionMigration/successful/normal_transfer
=== RUN TestConnectionMigration/successful/failed_transfers_with_tx
=== RUN TestConnectionMigration/successful/failed_transfers_with_dial_issues
=== RUN TestConnectionMigration/transfer_timeout_in_response
=== CONT TestConnectionMigration
proxy_handler_test.go:1135: -- test log scope end --
--- PASS: TestConnectionMigration (12.97s)
--- PASS: TestConnectionMigration/successful (8.92s)
--- PASS: TestConnectionMigration/successful/normal_transfer (5.28s)
--- PASS: TestConnectionMigration/successful/failed_transfers_with_tx (0.54s)
--- PASS: TestConnectionMigration/successful/failed_transfers_with_dial_issues (3.03s)
--- PASS: TestConnectionMigration/transfer_timeout_in_response (3.34s)
PASS
Have you made sure to run all available tests/scenarios using the Go -race detector? I always do this for any non-trivial multi-threaded code.
Not explicitly. I relied on the stress/race test infra that we have on TeamCity. That did catch a couple of bugs, but they have all been addressed.
pkg/ccl/sqlproxyccl/forwarder.go, line 224 at r16 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
I'm a bit worried that we're "waking up" every second on every connection, since I expect the majority of connections to be otherwise idle. I'd expect that to have real overhead. Maybe if perf tests did show a lot of overhead, we would want to consider a 3rd goroutine to do the transfer...
Even if there was a 3rd goroutine, it would be tricky to do the transfer without unblocking the request processor. That said, I found a new approach to suspend/resume processors, so I think we should do that instead. One benefit to the 3rd goroutine is that we can now decouple most of the transferring logic from the forwarding logic.
pkg/ccl/sqlproxyccl/forwarder.go, line 372 at r16 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
I assume this custom wrapper will also allow unblocking the call to PeekMsg above, right? That seems to be the much more important case, since that might be blocked for very long periods of time simply b/c there are no client messages to read.
Yes, that was the original intention of the readTimeoutConn wrapper, but as you've noted, we'd need to wake up every second for idle connections:
cockroach/pkg/sql/pgwire/conn.go
Lines 1800 to 1804 in 3b7f761
Note that we're also using this same approach within the SQL server.
pkg/ccl/sqlproxyccl/forwarder.go, line 410 at r16 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
While technically you're protected from race conditions by the state mutex, it's pretty non-standard to be reading fields on one goroutine without a lock when another goroutine will later update them. Typically you'd just put all fields that are read or written on different goroutines under the protection of a mutex.
Hm yeah, but the original intention was to avoid grabbing a mutex when we don't need it. I used a similar idea in the new approach on the serverConn field: 226d7d8#diff-f50d75f5a5e7b6d840c918b7ac35bb1f7605fca1f5434189c6eb99cd4bf67db2R397-R399. That said, if we wanted to use a mutex, that's fine too.
I like the alternate approach, since idle connections incur no overhead - that seems to be the common case to optimize for. One thing I wasn't able to find when I took a look was the additional mutex calls during forwarding that you mention.
jaylim-crl
left a comment
One thing I wasn't able to find when I took a look was the additional mutex calls during forwarding that you mention.
That's within the request and response processors. Here's one example:
cockroach/pkg/ccl/sqlproxyccl/forwarder.go
Lines 260 to 329 in 226d7d8
jaylim-crl
left a comment
OK - this is ready for another round of review. Some tests are TODO, which will be addressed before merging, but I'm putting this up for a review first. Only look at the last three commits.
jeffswenson
left a comment
This is looking great! I have a few minor comments.
f.mu.Lock()
This mutex is held for most of the transfer process. The processors are thread safe, and there is the isTransferring state, so we can avoid most of this locking if we create helpers for retrieving and setting the processors.
The code as written is correct. I like to keep critical sections relatively short in an effort to reduce the probability of deadlocks and contention.
I'm thinking something like:
func (f *forwarder) getProcessors() (request processor, response processor) {
f.mu.Lock()
defer f.mu.Unlock()
return f.mu.request, f.mu.response
}
func (f *forwarder) replaceServerConn(newServerConn net.Conn) {
f.mu.Lock()
defer f.mu.Unlock()
// Transfer was successful.
clockFn := makeLogicalClockFn()
f.mu.serverConn.Close()
f.mu.serverConn = newServerConn
f.mu.request = newProcessor(clockFn, f.mu.clientConn, f.mu.serverConn)
f.mu.response = newProcessor(clockFn, f.mu.serverConn, f.mu.clientConn)
}
...
request, response := f.getProcessors()
// Suspend both processors before starting the transfer.
if err := request.suspend(ctx); err != nil {
return errors.Wrap(err, "suspending request processor")
}
if err := response.suspend(ctx); err != nil {
return errors.Wrap(err, "suspending response processor")
}
// Transfer the connection.
newServerConn, err := transferConnection(ctx, f.connector, f.mu.clientConn, f.mu.serverConn)
if err != nil {
return errors.Wrap(err, "transferring connection")
}
f.replaceServerConn(newServerConn)
return nil
This commit adds last message details to the forwarder's processor, and in particular, lastMessageType and lastMessageTransferredAt. The latter is implemented using a simple logical clock, which will be used to determine ordering of events. These two will be used during connection migration to determine a safe transfer point.

At the same time, we plumb the connector and metrics into the processor, which will be used in subsequent commits.

Release justification: Low risk, sqlproxy-only change.

Release note: None
…resumption

Previously, there could be a race where suspend() was called right after resuming the processors. If the processor goroutines have not started, suspend will implicitly return, leading to a violation of an invariant, where we want the processors to be suspended before proceeding. This commit adds a new waitResumed method on the processor that allows callers to block until the processors have been resumed.

Release justification: sqlproxy-only change.

Release note: None
jaylim-crl
left a comment
TFTR!
pkg/ccl/sqlproxyccl/conn_migration.go, line 125 at r22 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
I don't think logCtx is necessary. I looked at the implementation of log and I did not find any calls that check ctx.Err(). If it did, I would argue that is a bug. It is normal to log something that goes wrong during the cancellation process.
I didn't have this initially, but the process panics with a span-use-after-close error:
panic: use of Span after Finish. Span: proxy-con-serve. Finish previously called at: <stack not captured. Set debugUseAfterFinish>
goroutine 2224 [running]:
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).detectUseAfterFinish(0xb7faeb8)
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:185 +0x134
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).RecordingType(0xc0011d2900)
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:387 +0x1e
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).IsVerbose(...)
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:395
github.com/cockroachdb/cockroach/pkg/util/log.getSpanOrEventLog({0x8f543d0, 0xc005301580})
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/trace.go:92 +0x3a
github.com/cockroachdb/cockroach/pkg/util/log.logfDepth({0x8f543d0, 0xc005301580}, 0x1, 0x1, 0x0, {0x7d9f468, 0x1f}, {0xc0051a7d60, 0x1, 0x1})
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/channels.go:55 +0x2a8
github.com/cockroachdb/cockroach/pkg/util/log.Infof(...)
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/log_channels_generated.go:263
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).runTransfer.func2()
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/conn_migration.go:145 +0x1f8
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).runTransfer(0xc00517d200)
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/conn_migration.go:174 +0x5d7
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).RequestTransfer.func1()
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:154 +0x1d
created by github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).RequestTransfer
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:154 +0x5b
FAIL github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl 2.852s
pkg/ccl/sqlproxyccl/conn_migration.go, line 148 at r22 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
This mutex is held for most of the transfer process. The processors are thread safe, and there is the isTransferring state, so we can avoid most of this locking if we create helpers for retrieving and setting the processors.
The code as written is correct. I like to keep critical sections relatively short in an effort to reduce the probability of deadlocks and contention.
I'm thinking something like:
func (f *forwarder) getProcessors() (request processor, response processor) {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.mu.request, f.mu.response
}

func (f *forwarder) replaceServerConn(newServerConn net.Conn) {
	f.mu.Lock()
	defer f.mu.Unlock()
	// Transfer was successful.
	clockFn := makeLogicalClockFn()
	f.mu.serverConn.Close()
	f.mu.serverConn = newServerConn
	f.mu.request = newProcessor(clockFn, f.mu.clientConn, f.mu.serverConn)
	f.mu.response = newProcessor(clockFn, f.mu.serverConn, f.mu.clientConn)
}
...
request, response := f.getProcessors()
// Suspend both processors before starting the transfer.
if err := request.suspend(ctx); err != nil {
	return errors.Wrap(err, "suspending request processor")
}
if err := response.suspend(ctx); err != nil {
	return errors.Wrap(err, "suspending response processor")
}
// Transfer the connection.
newServerConn, err := transferConnection(ctx, f.connector, f.mu.clientConn, f.mu.serverConn)
if err != nil {
	return errors.Wrap(err, "transferring connection")
}
f.replaceServerConn(newServerConn)
return nil
Done. Updated it to your suggestion, but also introduced getConns for the client and server connections.
pkg/ccl/sqlproxyccl/forwarder.go, line 190 at r18 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
😄 On a 100gbps ethernet line, sending the smallest possible pg message, with no protocol overhead, it would take 217 years of continuous transmission to overflow an int64 counter.
>>> line_rate = 100 * 2 ** 30  # 100 gibs
>>> message_size = 5 * 8  # minimum message size of 8 bytes
>>> msg_per_second = line_rate / message_size
>>> seconds_to_overflow = 2 ** 64 / msg_per_second
>>> years_to_overflow = seconds_to_overflow / (60 * 60 * 24 * 365)
>>> years_to_overflow
217.908031253171
😄
pkg/ccl/sqlproxyccl/forwarder.go, line 320 at r18 (raw file):
It would require passing peekErr and typ as arguments.
Doing this muddies the logic in exitPeek, and I'd rather keep it this way with the extra mutex lock+unlock. If this proves to be an issue, we could easily optimize this at a later time. Doing it the current way also makes sure that updating the last message details is tied to the ForwardMsg call, and not when exiting.
pkg/ccl/sqlproxyccl/forwarder.go, line 83 at r20 (raw file):
Done. Added a note about that.
Generally I would advise against acquiring multiple locks at the same time because it is a deadlock risk.
I think as long as we're aware about it, such risks can be minimized. Also, using a single lock for all three structs is going to cause contention between the request and response processors especially when queries are pipelined.
Informs cockroachdb#76000.

This commit completes the connection migration feature in the forwarder within sqlproxy. The idea is as described in the RFC.

A couple of new sqlproxy metrics have been added as well:
- proxy.conn_migration.success
- proxy.conn_migration.error_fatal
- proxy.conn_migration.error_recoverable
- proxy.conn_migration.attempted

For more details, see metrics.go in the sqlproxyccl package.

Release justification: This completes the first half of the connection migration feature. This is low risk as part of the code is guarded behind the connection migration feature, which is currently not being used in production. To add on, CockroachCloud is the only user of sqlproxy.

Release note: None
jeffswenson
left a comment
LGTM!
pkg/ccl/sqlproxyccl/conn_migration.go, line 125 at r22 (raw file):
Previously, jaylim-crl (Jay Lim) wrote…
I didn't have this initially, but the process panics with a span-use-after-close error:
panic: use of Span after Finish. Span: proxy-con-serve. Finish previously called at: <stack not captured. Set debugUseAfterFinish>
goroutine 2224 [running]:
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).detectUseAfterFinish(0xb7faeb8)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:185 +0x134
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).RecordingType(0xc0011d2900)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:387 +0x1e
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).IsVerbose(...)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:395
github.com/cockroachdb/cockroach/pkg/util/log.getSpanOrEventLog({0x8f543d0, 0xc005301580})
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/trace.go:92 +0x3a
github.com/cockroachdb/cockroach/pkg/util/log.logfDepth({0x8f543d0, 0xc005301580}, 0x1, 0x1, 0x0, {0x7d9f468, 0x1f}, {0xc0051a7d60, 0x1, 0x1})
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/channels.go:55 +0x2a8
github.com/cockroachdb/cockroach/pkg/util/log.Infof(...)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/log_channels_generated.go:263
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).runTransfer.func2()
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/conn_migration.go:145 +0x1f8
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).runTransfer(0xc00517d200)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/conn_migration.go:174 +0x5d7
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).RequestTransfer.func1()
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:154 +0x1d
created by github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).RequestTransfer
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:154 +0x5b
FAIL	github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl	2.852s
Jay and I dug into this off line. The root cause is the sqlproxy is using stopper incorrectly. This will be fixed in a separate PR. #77689
pkg/ccl/sqlproxyccl/forwarder.go, line 158 at r16 (raw file):
Previously, jaylim-crl (Jay Lim) wrote…
@jeffswenson, how strongly do you feel that we need to guard these variables with a mutex when they don't need them? I've attempted to do so, but it was extremely difficult to read/follow, so I aborted the idea.
This was resolved by the redesign.
jaylim-crl
left a comment
TFTR!
pkg/ccl/sqlproxyccl/conn_migration.go, line 125 at r22 (raw file):
The root cause is the sqlproxy is using stopper incorrectly.
I guess it's more that we're not using the stopper consistently. Some places use stopper.RunAsyncTask, whereas others use the go keyword directly. The latter is handy as we don't need to thread the stopper all the way down. Fixing all uses of the go keyword isn't straightforward. We'd also have to rethink how we want traces to be displayed.
bors r=JeffSwenson
Build succeeded:
Previously, we added support for connection migration in cockroachdb#76805.

This commit adds a new `proxy.conn_migration.attempted.latency` metric that tracks latency for attempted connection migrations. Having this metric would be beneficial as we can now know how long users were blocked during connection migrations.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None
…e size

Informs cockroachdb#76000, and follow-up to cockroachdb#76805.

This commit adds a new proxy.conn_migration.transfer_response.message_size metric that will track the distribution of the transfer response message size. This will be used to tune a value for the SQL cluster setting sql.session_transfer.max_session_size.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None
ccl/sqlproxyccl: add last message details to the forwarder's processor
This commit adds last message details to the forwarder's processor, and in
particular, lastMessageType and lastMessageTransferredAt. The latter is
implemented using a simple logical clock, which will be used to determine
ordering of events. These two will be used during connection migration to
determine a safe transfer point.
At the same time, we plumb the connector and metrics into the processor, which
will be used in subsequent commits.
Release justification: Low risk, sqlproxy-only change.
Release note: None
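A minimal sketch of such a logical clock, assuming it is simply a closure over an atomic counter (the actual implementation in the PR may differ):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// makeLogicalClockFn returns a function that yields a strictly increasing
// value on each call, which is enough to order lastMessageTransferredAt
// events across the request and response processors.
func makeLogicalClockFn() func() uint64 {
	var counter uint64
	return func() uint64 {
		return atomic.AddUint64(&counter, 1)
	}
}

func main() {
	clock := makeLogicalClockFn()
	fmt.Println(clock(), clock(), clock()) // prints "1 2 3"
}
```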
ccl/sqlproxyccl: support waitResumed on the processor to block until resumption
Previously, there could be a race where suspend() was called right after
resuming the processors. If the processor goroutines have not started, suspend
will implicitly return, leading to a violation of an invariant, where we want
the processors to be suspended before proceeding. This commit adds a new
waitResumed method on the processor that allows callers to block until the
processors have been resumed.
Release justification: sqlproxy-only change.
Release note: None
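The waitResumed mechanism described in this commit message can be sketched with a condition variable. Names mirror the message, but this is an illustration under assumed semantics, not the actual sqlproxy code; the forwarding loop itself is elided:

```go
package main

import (
	"fmt"
	"sync"
)

// processor sketches waitResumed: resume() marks the processor as running
// before entering its loop, and waitResumed() blocks callers until that
// happens, so a subsequent suspend() cannot race with a goroutine that has
// not started yet.
type processor struct {
	mu      sync.Mutex
	cond    *sync.Cond
	resumed bool
}

func newProcessor() *processor {
	p := &processor{}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// resume marks the processor as running and wakes any waiters.
func (p *processor) resume() {
	p.mu.Lock()
	p.resumed = true
	p.mu.Unlock()
	p.cond.Broadcast()
	// ... forwarding loop would run here ...
}

// waitResumed blocks until resume has marked the processor as running.
func (p *processor) waitResumed() {
	p.mu.Lock()
	defer p.mu.Unlock()
	for !p.resumed {
		p.cond.Wait()
	}
}

func main() {
	p := newProcessor()
	done := make(chan struct{})
	go func() {
		p.waitResumed()
		close(done)
	}()
	go p.resume()
	<-done
	fmt.Println("processor resumed")
}
```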
ccl/sqlproxyccl: complete connection migration support in the forwarder
Informs #76000.
This commit completes the connection migration feature in the forwarder
within sqlproxy. The idea is as described in the RFC.
A couple of new sqlproxy metrics have been added as well:
- proxy.conn_migration.success
- proxy.conn_migration.error_fatal
- proxy.conn_migration.error_recoverable
- proxy.conn_migration.attempted
For more details, see metrics.go in the sqlproxyccl package.
Release justification: This completes the first half of the connection
migration feature. This is low risk as part of the code is guarded behind the
connection migration feature, which is currently not being used in production.
To add on, CockroachCloud is the only user of sqlproxy.
Release note: None