
ccl/sqlproxyccl: complete connection migration support in the forwarder #76805

Merged

craig[bot] merged 3 commits into cockroachdb:master from jaylim-crl:jay/connection-migration on Mar 13, 2022

Conversation

Contributor

@jaylim-crl jaylim-crl commented Feb 19, 2022

ccl/sqlproxyccl: add last message details to the forwarder's processor

This commit adds last message details to the forwarder's processor, and in
particular, lastMessageType and lastMessageTransferredAt. The latter is
implemented using a simple logical clock, which will be used to determine
ordering of events. These two will be used during connection migration to
determine a safe transfer point.
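
For illustration, a logical clock of this sort can be as small as an atomic counter; a minimal sketch of makeLogicalClockFn (the PR's helper may differ in detail):

// makeLogicalClockFn returns a function that yields a monotonically
// increasing value on each call. Concurrent callers get distinct,
// ordered values, which is all that is needed to order the last
// message events. Sketch only.
func makeLogicalClockFn() func() uint64 {
	var counter uint64
	return func() uint64 {
		return atomic.AddUint64(&counter, 1)
	}
}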

At the same time, we plumb the connector and metrics into the processor, which
will be used in subsequent commits.

Release justification: Low risk, sqlproxy-only change.

Release note: None

ccl/sqlproxyccl: support waitResumed on the processor to block until resumption

Previously, there was a race where suspend() could be called right after
resuming the processors. If the processor goroutines had not yet started,
suspend() would return immediately as a no-op, violating the invariant that
the processors must be suspended before proceeding. This commit adds a new
waitResumed method on the processor that allows callers to block until the
processors have been resumed.
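
For illustration, a blocking waitResumed could look like the following sketch, assuming the processor guards a resumed flag with a mutex and a sync.Cond that is broadcast whenever the flag changes (as the Resume example later in this thread suggests):

// waitResumed blocks until the processor's run loop has set resumed.
// Sketch only; the real method may also observe context cancellation.
func (p *processor) waitResumed() {
	p.mu.Lock()
	defer p.mu.Unlock()
	for !p.mu.resumed {
		p.mu.cond.Wait()
	}
}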

Release justification: sqlproxy-only change.

Release note: None

ccl/sqlproxyccl: complete connection migration support in the forwarder

Informs #76000.

This commit completes the connection migration feature in the forwarder
within sqlproxy. The idea is as described in the RFC.

A couple of new sqlproxy metrics have been added as well:

  • proxy.conn_migration.success
  • proxy.conn_migration.error_fatal
  • proxy.conn_migration.error_recoverable
  • proxy.conn_migration.attempted

For more details, see metrics.go in the sqlproxyccl package.

Release justification: This completes the first half of the connection
migration feature. It is low risk, as part of the code is guarded behind the
connection migration feature, which is currently not used in production.
Additionally, CockroachCloud is the only user of sqlproxy.

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

Contributor Author

@jaylim-crl jaylim-crl left a comment


The structure for the connection migration work is there. Note that this is untested, and was mostly inspired by the initial prototype. Putting this up in case anyone has thoughts on the direction. I'll be adding tests next.

When it comes to the forwarder's API, I plan to have a Forwarder interface that exposes just two methods: RequestTransfer and Close. With the current design, callers wouldn't know whether the transfer failed or succeeded after calling RequestTransfer. I think that is fine for now, and if we really need to know the last transfer error, we could address that in the upcoming RFC. That said, the initial thought is to have the forwarder notify a "forwarder tracking" component (i.e. a map of pods to forwarders) whenever it connects to or disconnects from serverConn. The proxy can then check which pod the forwarder is assigned to (versus the previous pod) to know whether the transfer succeeded.
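
A sketch of that interface (signatures are assumptions, not the final API):

// Forwarder is the narrow API the proxy would see. Sketch only.
type Forwarder interface {
	// RequestTransfer requests that the forwarder migrate its server
	// connection to another SQL pod. It does not report the outcome.
	RequestTransfer()
	// Close terminates the forwarder and closes its connections.
	Close()
}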

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/forwarder.go, line 52 at r1 (raw file):

)

var (

These error messages might be unnecessary, at least for the ones where they are only used once. We could just do return errors.New there if we wanted to.

@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 009f2d6 to e7730d1 on February 24, 2022 21:19
Contributor Author

@jaylim-crl jaylim-crl left a comment


TFTRs!

This draft PR is ready for another look. I added more tests, and changed how we handle timeout handlers. I also tested this, and it works as expected.

This test is still WIP, and is not included in this PR. Will update that later today. Here, we call RequestTransfer in a loop, and constantly query the SQL pod details. We can see that we alternate between pods, which implies that connection migration works:

=== RUN   TestConnectionMigration/successful_connection
{postgresql://root@127.0.0.1:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://root@127.0.0.1:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}

There are three more items to address:

  1. Implement deserialize_session. Given that this PR is huge, I might just push that to a subsequent PR.
  2. Perhaps move the changes in authentication.go in another PR. One thing that was overlooked is that the server still sends back initial connection messages in the token-based auth approach, so we'd have to slurp those messages.
  3. Add TestConnectionMigration (which is an end-to-end test) to this PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/forwarder.go, line 269 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

Can we do this without atomics? I do not like using atomics for control flow. Getting atomic programming right is infamously difficult. In this case there is a race condition that occurs if RequestTransfer is called concurrently and it's not clear to me how to fix it. At a minimum CompareAndSwap needs to be used as the guard for this state transfer. But then there is a race with setting transferKey and transferCh.

Done - good catch re: concurrent calls. As discussed, I have moved this to use locks. My benchmarks show that using locks within a 30s workload run only slows the overall time by a few hundred ms, so I'd be fine with using locks everywhere instead of a mixture of both.
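
For illustration, the lock-based guard could be shaped like this sketch (field names follow the discussion in this thread; details are assumed):

// RequestTransfer starts at most one transfer at a time. Checking and
// setting isTransferring under the same mutex avoids the
// CompareAndSwap races described above. Sketch only.
func (f *forwarder) RequestTransfer() {
	f.mu.Lock()
	if f.mu.isTransferring {
		f.mu.Unlock()
		return // a transfer is already in progress
	}
	f.mu.isTransferring = true
	f.mu.Unlock()
	// runTransfer resets isTransferring once the transfer finishes.
	go f.runTransfer()
}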


pkg/ccl/sqlproxyccl/forwarder.go, line 440 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

idea: Instead of implementing the forwarding goroutines as a state machine, initiating the transfer could ask both of the goroutines to terminate. That would allow us to implement transfer the same way authenticate is implemented: as a single function and goroutine. Once the transfer is canceled or completed, the goroutine managing the transfer could restart the forwarding goroutines.

IMO the main advantage to implementing it by stopping the forwarding goroutines is we can keep the steady state code simple and we can group all of the transfer logic together.

As discussed, this would increase the overall transfer latency since we'd have to wait for the server-to-client processor to stop before sending the transfer state message, so I kept the current approach. I like the idea since it makes things cleaner, but the tradeoff wouldn't be worth it.


pkg/ccl/sqlproxyccl/forwarder.go, line 514 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

This logic can be simplified with context.WithTimeout:

transferCtx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
err := transfer(transferCtx)
if transferCtx.Err() != nil && ctx.Err() == nil {
	return errTransferTimeout
}
return err

Alternatively, if we do not specifically care about errTransferTimeout, context.WithTimeout alone suffices.

Hm, this probably isn't relevant anymore since I moved it elsewhere.

@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from e7730d1 to 6a8d610 on February 25, 2022 05:32
@jaylim-crl jaylim-crl changed the title from "[DNM] ccl/sqlproxyccl: add connection migration support to the forwarder" to "ccl/sqlproxyccl: add connection migration support to the forwarder" on Feb 25, 2022
@jaylim-crl jaylim-crl changed the title from "ccl/sqlproxyccl: add connection migration support to the forwarder" to "ccl/sqlproxyccl: add partial connection migration support to the forwarder" on Feb 25, 2022
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 6a8d610 to 87b8ea5 on February 25, 2022 05:33
@jaylim-crl jaylim-crl marked this pull request as ready for review on February 25, 2022 05:35
@jaylim-crl jaylim-crl requested a review from a team as a code owner on February 25, 2022 05:35
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 87b8ea5 to e3bb0d3 on February 25, 2022 06:21
Contributor Author

@jaylim-crl jaylim-crl left a comment


There are three more items to address:

  1. Implement deserialize_session. Given that this PR is huge, I might just push that to a subsequent PR.
  2. Perhaps move the changes in authentication.go in another PR. One thing that was overlooked is that the server still sends back initial connection messages in the token-based auth approach, so we'd have to slurp those messages.
  3. Add TestConnectionMigration (which is an end-to-end test) to this PR.

(2) and (3) are done. I've decided to just leave them in this PR, but happy to break them down into separate PRs if reviewers prefer otherwise. (1) will definitely come in a later PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/forwarder.go, line 629 at r3 (raw file):

// NOTE: This should only be called during a transfer process. We assume that
// transferCloserCh has already been initialized.
func (f *forwarder) runTransferTimeoutHandler(waitTimeout time.Duration) {

The end-to-end transfer process works, but I'm not happy with this function. The unfortunate thing is that we unblock clients through context, but for servers, we'd do that by closing the forwarder. This goroutine is needed to ensure that we can terminate if we time out when sending the SHOW TRANSFER STATE query to the server. I may also have gone down a rabbit hole with all the transferConnRecoverable stuff, but I don't see a better way to address that. Happy to discuss.

closer is needed instead of relying on transferCtx.Done() to ensure that we can distinguish between a transfer timeout, and an actual transfer completion. If we block on transferCtx.Done() in the client-to-server processor, then when the timeout handler gets triggered, we will unblock automatically, which is incorrect.
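
For illustration, the closer-based timeout handler described here might be shaped roughly like this (names from this thread; details are assumed):

// runTransferTimeoutHandler closes the forwarder if the transfer does
// not finish within waitTimeout. Closing transferCloserCh marks
// completion (success or failure), which stays distinguishable from a
// timeout, unlike blocking on transferCtx.Done(). Sketch only.
func (f *forwarder) runTransferTimeoutHandler(waitTimeout time.Duration) {
	go func() {
		select {
		case <-f.transferCloserCh:
			// Transfer finished in time; nothing to do.
		case <-time.After(waitTimeout):
			// Timed out while waiting on the server; closing the
			// forwarder is the only way to unblock server reads.
			f.Close()
		}
	}()
}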

jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Feb 27, 2022
Informs cockroachdb#76000. Extracted from cockroachdb#76805.

This commit adds helpers related to connection migration. This includes support
for retrieving the transfer state through SHOW TRANSFER STATE, as well as
deserializing the session through crdb_internal.deserialize_session.

We also introduce a new error marker for errors that denote connections which
are recoverable. In our case, we convert SQL errors from the first column of
the SHOW TRANSFER STATE query (e.g. serialization errors) to Go errors, and
mark them with the errConnRecoverableSentinel sentinel.

Release note: None
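
With the github.com/cockroachdb/errors package (already used throughout this code), a sentinel marker like the one in the commit message above can be implemented roughly as follows (sketch; the actual helper names may differ):

// errConnRecoverableSentinel marks errors after which the connection
// remains usable, e.g. serialization errors surfaced by the first
// column of SHOW TRANSFER STATE.
var errConnRecoverableSentinel = errors.New("connection recoverable")

// markAsConnRecoverableError tags err with the sentinel.
func markAsConnRecoverableError(err error) error {
	return errors.Mark(err, errConnRecoverableSentinel)
}

// isConnRecoverableError reports whether err carries the sentinel,
// even through wrapping.
func isConnRecoverableError(err error) bool {
	return errors.Is(err, errConnRecoverableSentinel)
}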
Contributor Author

@jaylim-crl jaylim-crl left a comment


FYI, I extracted some of the work here in #77109. I'll do the same for the auth stuff.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)

jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Feb 28, 2022
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Feb 28, 2022
Informs cockroachdb#76000. Extracted from cockroachdb#76805.

Previously, we assumed that with the token-based authentication, the server is
ready to accept queries the moment we connect to it. This assumption has been
proved wrong during the integration tests with the forwarder, and there's
an implicit AuthenticateOK step that happens after connecting to the server.
During that time, initial connection data such as ParameterStatus and
BackendKeyData messages will be sent to the client as well. For now, we will
ignore those messages. Once we start implementing query cancellation within
the proxy, that has to be updated to cache the new BackendKeyData.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Feb 28, 2022
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from e3bb0d3 to 4357e45 on February 28, 2022 06:17
@jaylim-crl jaylim-crl changed the title from "ccl/sqlproxyccl: add partial connection migration support to the forwarder" to "ccl/sqlproxyccl: complete connection migration support in the forwarder" on Feb 28, 2022
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch 2 times, most recently from a095f21 to cd6e8dd on February 28, 2022 12:47
@jaylim-crl jaylim-crl requested a review from andy-kimball on March 7, 2022 01:59
Contributor Author

@jaylim-crl jaylim-crl left a comment


TFTRs!

OK. I have been mulling over this for a couple of days, and I like the new approach that I have. See the last commit in #77419, which has several major architectural changes:

  1. We no longer keep track of forwarder states.
  2. Request and response processors have been split into their own internal components. We can suspend/resume these processors at any point in time by setting their appropriate deadlines (see the sketch after this list). I took this idea from net/http: https://github.com/golang/go/blob/b5e06545b3c3b3b9c2ed8204e5ba3fd25a85db0f/src/net/http/server.go#L671-L723.
  3. Due to the suspend/resume ability in (2), readTimeoutConn has been removed. (This was called out in a previous comment for being slow. I didn't like that approach either, but couldn't figure out a way to unblock the processors easily back then.)
  4. All transfer-related operations are now scoped to a single runTransfer method within conn_migration.go.
  5. One major drawback of this design is that we incur more mutex calls during forwarding.

Thoughts?
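
The deadline trick in item 2, in miniature: a read deadline in the past forces a blocked Read to return immediately with a timeout error, and a zero deadline re-arms the connection (sketch using the standard net.Conn API; the PR's actual suspend/resume logic does more bookkeeping):

// suspend unblocks a goroutine stuck in conn.Read: the pending Read
// returns a net.Error whose Timeout() is true.
func suspend(conn net.Conn) error {
	return conn.SetReadDeadline(time.Now())
}

// resume clears the deadline so subsequent Reads block normally.
func resume(conn net.Conn) error {
	return conn.SetReadDeadline(time.Time{})
}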

I ran major tests, and they seem to work as intended:

=== RUN   TestConnectionMigration
    test_log_scope.go:79: test logs captured to: /var/folders/79/72cqh3ps56v7ts4h7zv92vz80000gq/T/logTestConnectionMigration4246808810
    test_log_scope.go:80: use -show-logs to present logs inline
=== RUN   TestConnectionMigration/successful
=== RUN   TestConnectionMigration/successful/normal_transfer
=== RUN   TestConnectionMigration/successful/failed_transfers_with_tx
=== RUN   TestConnectionMigration/successful/failed_transfers_with_dial_issues
=== RUN   TestConnectionMigration/transfer_timeout_in_response
=== CONT  TestConnectionMigration
    proxy_handler_test.go:1135: -- test log scope end --
--- PASS: TestConnectionMigration (12.97s)
    --- PASS: TestConnectionMigration/successful (8.92s)
        --- PASS: TestConnectionMigration/successful/normal_transfer (5.28s)
        --- PASS: TestConnectionMigration/successful/failed_transfers_with_tx (0.54s)
        --- PASS: TestConnectionMigration/successful/failed_transfers_with_dial_issues (3.03s)
    --- PASS: TestConnectionMigration/transfer_timeout_in_response (3.34s)
PASS

Have you made sure to run all available tests/scenarios using the Go -race detector? I always do this for any non-trivial multi-threaded code.

Not explicitly. I relied on the stress/race test infra that we have on TeamCity. That did catch a couple of bugs, but they have all been addressed.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, and @jeffswenson)


pkg/ccl/sqlproxyccl/forwarder.go, line 224 at r16 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

I'm a bit worried that we're "waking up" every second on every connection, since I expect the majority of connections to be otherwise idle. I'd expect that to have real overhead. Maybe if perf tests did show a lot of overhead, we would want to consider a 3rd goroutine to do the transfer...

Even if there was a 3rd goroutine, it would be tricky to do the transfer without unblocking the request processor. That said, I found a new approach to suspend/resume processors, so I think we should do that instead. One benefit to the 3rd goroutine is that we can now decouple most of the transferring logic from the forwarding logic.


pkg/ccl/sqlproxyccl/forwarder.go, line 372 at r16 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

I assume this custom wrapper will also allow unblocking the call to PeekMsg above, right? That seems to be the much more important case, since that might be blocked for very long periods of time simply b/c there are no client messages to read.

Yes, that was the original intention of the readTimeoutConn wrapper, but as you've noted, we'd need to wake up every second for idle connections:

// readTimeout is the amount of time ReadTimeoutConn should wait on a
// read before checking for exit conditions. The tradeoff is between the
// time it takes to react to session context cancellation and the overhead
// of waking up and checking for exit conditions.
const readTimeout = 1 * time.Second

Note that we're also using this same approach within the SQL server.


pkg/ccl/sqlproxyccl/forwarder.go, line 410 at r16 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

While technically you're protected from race conditions by the state mutex, it's pretty non-standard to be reading fields on one goroutine without a lock when another goroutine will later update them. Typically you'd just put all fields that are read or written on different goroutines under the protection of a mutex.

Hm yeah, but the original intention was to avoid grabbing a mutex when we don't need it. I used a similar idea in the new approach on the serverConn field: 226d7d8#diff-f50d75f5a5e7b6d840c918b7ac35bb1f7605fca1f5434189c6eb99cd4bf67db2R397-R399. That said, if we wanted to use a mutex, that's fine too.

@andy-kimball
Contributor

I like the alternate approach, since idle connections incur no overhead - that seems to be the common case to optimize for. One thing I wasn't able to find when I took a look was the additional mutex calls during forwarding that you mention.

Contributor Author

@jaylim-crl jaylim-crl left a comment


One thing I wasn't able to find when I took a look was the additional mutex calls during forwarding that you mention.

That's within the request and response processors. Here's one example:

func (p *requestProcessor) Resume() error {
	p.wg.Add(1)
	defer p.wg.Done()

	if p.f.ctx.Err() != nil {
		return p.f.ctx.Err()
	}

	p.lock()
	if p.mu.resumed {
		p.unlock()
		return nil
	}
	p.mu.resumed = true
	p.unlock()

	defer func() {
		p.lock()
		defer p.unlock()
		p.mu.resumed = false
	}()

	for p.f.ctx.Err() == nil {
		// inPeek has to be false before this because the resumed field guards
		// against concurrent runs.
		p.lock()
		if p.mu.suspendReq {
			p.mu.suspendReq = false
			p.unlock()
			return nil
		}
		p.mu.inPeek = true
		p.unlock()

		// Always peek the message to ensure that we're blocked on reading the
		// header, rather than when forwarding.
		typ, _, err := p.f.clientConn.PeekMsg()

		p.lock()
		suspend := p.mu.suspendReq
		p.mu.suspendReq = false
		p.mu.inPeek = false
		p.unlock()
		p.mu.cond.Broadcast()

		if err != nil {
			if ne, ok := err.(net.Error); ok && suspend && ne.Timeout() {
				// Do nothing.
				err = nil
			} else {
				err = errors.Wrap(err, "peeking message in client-to-server")
			}
		}
		if err != nil || suspend {
			return err
		}

		p.f.mu.Lock()
		p.f.mu.clientMessageTypeSent = typ
		p.f.mu.isServerMsgReadyReceived = false
		p.f.mu.Unlock()

		// NOTE: No need to obtain lock for serverConn since that can only
		// be updated during a transfer, and when that happens, the request
		// processor must have already been suspended.
		if _, err := p.f.clientConn.ForwardMsg(p.f.mu.serverConn); err != nil {
			return errors.Wrap(err, "forwarding message in client-to-server")
		}
	}
	return p.f.ctx.Err()
}

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, and @jeffswenson)

@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 2fdda2d to fddfd30 on March 10, 2022 19:29
@jaylim-crl jaylim-crl requested a review from a team as a code owner on March 10, 2022 19:29
@jaylim-crl jaylim-crl removed the request for review from a team on March 10, 2022 19:30
Contributor Author

@jaylim-crl jaylim-crl left a comment


OK - this is ready for another round of review. Some tests are TODO, which will be addressed before merging, but I'm putting this up for a review first. Only look at the last three commits.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, and @jeffswenson)

@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch 2 times, most recently from 97f79e3 to 9661512 on March 10, 2022 21:02
Collaborator

@jeffswenson jeffswenson left a comment


This is looking great! I have a few minor comments.

}
}()

f.mu.Lock()
Collaborator


This mutex is held for most of the transfer process. The processors are thread safe, and there is the isTransferring state, so we can avoid most of this locking if we create helpers for retrieving and setting the processors.

The code as written is correct. I like to keep critical sections relatively short in an effort to reduce the probability of deadlocks and contention.

I'm thinking something like:

func (f *forwarder) getProcessors() (request processor, response processor) {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.mu.request, f.mu.response
}

func (f *forwarder) replaceServerConn(newServerConn net.Conn) {
	f.mu.Lock()
	defer f.mu.Unlock()
	// Transfer was successful.
	clockFn := makeLogicalClockFn()
	f.mu.serverConn.Close()
	f.mu.serverConn = newServerConn
	f.mu.request = newProcessor(clockFn, f.mu.clientConn, f.mu.serverConn)
	f.mu.response = newProcessor(clockFn, f.mu.serverConn, f.mu.clientConn)
}

...

request, response := f.getProcessors()

// Suspend both processors before starting the transfer.
if err := request.suspend(ctx); err != nil {
	return errors.Wrap(err, "suspending request processor")
}
if err := response.suspend(ctx); err != nil {
	return errors.Wrap(err, "suspending response processor")
}

// Transfer the connection.
newServerConn, err := transferConnection(ctx, f.connector, f.mu.clientConn, f.mu.serverConn)
if err != nil {
	return errors.Wrap(err, "transferring connection")
}
f.replaceServerConn(newServerConn)

return nil

This commit adds last message details to the forwarder's processor, and in
particular, lastMessageType and lastMessageTransferredAt. The latter is
implemented using a simple logical clock, which will be used to determine
ordering of events. These two will be used during connection migration to
determine a safe transfer point.

At the same time, we plumb the connector and metrics into the processor, which
will be used in subsequent commits.

Release justification: Low risk, sqlproxy-only change.

Release note: None
ccl/sqlproxyccl: support waitResumed on the processor to block until resumption

Previously, there was a race where suspend() could be called right after
resuming the processors. If the processor goroutines had not yet started,
suspend() would return immediately as a no-op, violating the invariant that
the processors must be suspended before proceeding. This commit adds a new
waitResumed method on the processor that allows callers to block until the
processors have been resumed.

Release justification: sqlproxy-only change.

Release note: None
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 9661512 to d6a7488 on March 11, 2022 16:50
@jaylim-crl jaylim-crl requested a review from jeffswenson on March 11, 2022 16:51
Contributor Author

@jaylim-crl jaylim-crl left a comment


TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/conn_migration.go, line 125 at r22 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

I don't think logCtx is necessary. I looked at the implementation of log and I did not find any calls that check ctx.Err(). If it did, I would argue that is a bug. It is normal to log something that goes wrong during the cancellation process.

I didn't have this initially, but the process panics with a span-use-after-close error:

panic: use of Span after Finish. Span: proxy-con-serve. Finish previously called at: <stack not captured. Set debugUseAfterFinish>

goroutine 2224 [running]:
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).detectUseAfterFinish(0xb7faeb8)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:185 +0x134
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).RecordingType(0xc0011d2900)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:387 +0x1e
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).IsVerbose(...)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:395
github.com/cockroachdb/cockroach/pkg/util/log.getSpanOrEventLog({0x8f543d0, 0xc005301580})
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/trace.go:92 +0x3a
github.com/cockroachdb/cockroach/pkg/util/log.logfDepth({0x8f543d0, 0xc005301580}, 0x1, 0x1, 0x0, {0x7d9f468, 0x1f}, {0xc0051a7d60, 0x1, 0x1})
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/channels.go:55 +0x2a8
github.com/cockroachdb/cockroach/pkg/util/log.Infof(...)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/log_channels_generated.go:263
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).runTransfer.func2()
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/conn_migration.go:145 +0x1f8
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).runTransfer(0xc00517d200)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/conn_migration.go:174 +0x5d7
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).RequestTransfer.func1()
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:154 +0x1d
created by github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).RequestTransfer
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:154 +0x5b
FAIL	github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl	2.852s

pkg/ccl/sqlproxyccl/conn_migration.go, line 148 at r22 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

This mutex is held for most of the transfer process. The processors are thread safe, and there is the isTransferring state, so we can avoid most of this locking if we create helpers for retrieving and setting the processors.

The code as written is correct. I like to keep critical sections relatively short in an effort to reduce the probability of deadlocks and contention.

I'm thinking something like:

func (f *forwarder) getProcessors() (request processor, response processor) {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.mu.request, f.mu.response
}

func (f *forwarder) replaceServerConn(newServerConn net.Conn) {
	f.mu.Lock()
	defer f.mu.Unlock()
	// Transfer was successful.
	clockFn := makeLogicalClockFn()
	f.mu.serverConn.Close()
	f.mu.serverConn = newServerConn
	f.mu.request = newProcessor(clockFn, f.mu.clientConn, f.mu.serverConn)
	f.mu.response = newProcessor(clockFn, f.mu.serverConn, f.mu.clientConn)
}

...

request, response := f.getProcessors()

// Suspend both processors before starting the transfer.
if err := request.suspend(ctx); err != nil {
	return errors.Wrap(err, "suspending request processor")
}
if err := response.suspend(ctx); err != nil {
	return errors.Wrap(err, "suspending response processor")
}

// Transfer the connection.
newServerConn, err := transferConnection(ctx, f.connector, f.mu.clientConn, f.mu.serverConn)
if err != nil {
	return errors.Wrap(err, "transferring connection")
}
f.replaceServerConn(newServerConn)

return nil

Done. Updated it to your suggestion, but also introduced getConns for the client and server connections.
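
getConns presumably mirrors getProcessors; a sketch (return types assumed):

// getConns returns the current client and server connections under
// the forwarder's lock, mirroring getProcessors. Sketch only.
func (f *forwarder) getConns() (clientConn, serverConn net.Conn) {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.mu.clientConn, f.mu.serverConn
}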


pkg/ccl/sqlproxyccl/forwarder.go, line 190 at r18 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

😄 On a 100gbps ethernet line, sending the smallest possible pg message with no protocol overhead, it would take about 217 years of continuous transmission to overflow a 64-bit counter.

>>> line_rate = 100 * 2 ** 30  # 100 Gib/s, counted in bits
>>> message_size = 5 * 8  # minimum 5-byte pg message, in bits
>>> msg_per_second = line_rate / message_size
>>> seconds_to_overflow = 2 ** 64 / msg_per_second
>>> years_to_overflow = seconds_to_overflow / (60 * 60 * 24 * 365)
>>> years_to_overflow
217.908031253171

😄


pkg/ccl/sqlproxyccl/forwarder.go, line 320 at r18 (raw file):

It would require passing peekErr and typ as arguments.

Doing this muddies the logic in exitPeek, and I'd rather keep it this way with the extra mutex lock+unlock. If this proves to be an issue, we could easily optimize this at a later time. Doing it the current way also makes sure that updating the last message details is tied to the ForwardMsg call, and not when exiting.


pkg/ccl/sqlproxyccl/forwarder.go, line 83 at r20 (raw file):
Done. Added a note about that.

Generally I would advise against acquiring multiple locks at the same time because it is a deadlock risk.

I think as long as we're aware of it, such risks can be minimized. Also, using a single lock for all three structs is going to cause contention between the request and response processors, especially when queries are pipelined.

@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from d6a7488 to 64166a9 on March 11, 2022 18:09
Informs cockroachdb#76000.

This commit completes the connection migration feature in the forwarder
within sqlproxy. The idea is as described in the RFC.

A couple of new sqlproxy metrics have been added as well:
- proxy.conn_migration.success
- proxy.conn_migration.error_fatal
- proxy.conn_migration.error_recoverable
- proxy.conn_migration.attempted

For more details, see metrics.go in the sqlproxyccl package.

Release justification: This completes the first half of the connection
migration feature. It is low risk, as part of the code is guarded behind the
connection migration feature, which is currently not used in production.
Additionally, CockroachCloud is the only user of sqlproxy.

Release note: None
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 64166a9 to 8610953 on March 11, 2022 18:28
Collaborator

@jeffswenson jeffswenson left a comment


LGTM!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, and @jeffswenson)


pkg/ccl/sqlproxyccl/conn_migration.go, line 125 at r22 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

I didn't have this initially, but the process panics with a span-use-after-close error:

panic: use of Span after Finish. Span: proxy-con-serve. Finish previously called at: <stack not captured. Set debugUseAfterFinish>

goroutine 2224 [running]:
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).detectUseAfterFinish(0xb7faeb8)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:185 +0x134
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).RecordingType(0xc0011d2900)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:387 +0x1e
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).IsVerbose(...)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:395
github.com/cockroachdb/cockroach/pkg/util/log.getSpanOrEventLog({0x8f543d0, 0xc005301580})
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/trace.go:92 +0x3a
github.com/cockroachdb/cockroach/pkg/util/log.logfDepth({0x8f543d0, 0xc005301580}, 0x1, 0x1, 0x0, {0x7d9f468, 0x1f}, {0xc0051a7d60, 0x1, 0x1})
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/channels.go:55 +0x2a8
github.com/cockroachdb/cockroach/pkg/util/log.Infof(...)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/log_channels_generated.go:263
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).runTransfer.func2()
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/conn_migration.go:145 +0x1f8
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).runTransfer(0xc00517d200)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/conn_migration.go:174 +0x5d7
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).RequestTransfer.func1()
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:154 +0x1d
created by github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).RequestTransfer
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:154 +0x5b
FAIL	github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl	2.852s

Jay and I dug into this offline. The root cause is that sqlproxy is using the stopper incorrectly. This will be fixed in a separate PR: #77689.


pkg/ccl/sqlproxyccl/forwarder.go, line 158 at r16 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

@jeffswenson, how strongly do you feel about guarding these variables with a mutex when they don't need one? I attempted to do so, but it was extremely difficult to read/follow, so I abandoned the idea.

This was resolved by the redesign.

Contributor Author

@jaylim-crl jaylim-crl left a comment


TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/conn_migration.go, line 125 at r22 (raw file):

The root cause is that sqlproxy is using the stopper incorrectly.

I guess it's more that we're not using the stopper consistently. Some places use stopper.RunAsyncTask, whereas others use the go keyword directly. The latter is handy since we don't need to thread the stopper all the way down. Fixing all uses of the go keyword isn't straightforward, and we'd also have to rethink how we want traces to be displayed.
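
For contrast, the stopper-based form looks roughly like this (stop.Stopper.RunAsyncTask is the real CockroachDB API; the task name and body here are illustrative):

// Tying the goroutine to the stopper lets shutdown wait for it and
// keeps tracing spans from being used after they are finished.
if err := stopper.RunAsyncTask(ctx, "conn-migration", func(ctx context.Context) {
	f.runTransfer()
}); err != nil {
	// The stopper is quiescing; the task was never started.
	log.Infof(ctx, "transfer not started: %v", err)
}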

@jaylim-crl
Contributor Author

bors r=JeffSwenson

@craig
Contributor

craig bot commented Mar 13, 2022

Build succeeded:

@craig craig bot merged commit dc5432d into cockroachdb:master on Mar 13, 2022
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Mar 13, 2022
Previously, we added support for connection migration in cockroachdb#76805. This commit
adds a new `proxy.conn_migration.attempted.latency` metric that tracks latency
for attempted connection migrations. This metric is beneficial because it tells
us how long users were blocked during connection migrations.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Mar 13, 2022
…e size

Informs cockroachdb#76000, and follow up to cockroachdb#76805.

This commit adds a new proxy.conn_migration.transfer_response.message_size
metric that will track the distribution of the transfer response message size.
This will be used to tune a value for the SQL cluster setting:
sql.session_transfer.max_session_size.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None