ccl/sqlproxyccl: implement suspend/resume for request/response processors #77467
Conversation
Force-pushed 04921df to bb94cdb (compare)
jeffswenson
left a comment
I really like this approach. It is a massive improvement 💯
Force-pushed bb94cdb to cbb0fec (compare)
jaylim-crl
left a comment
TFTR! Reverting back to reviewable. I believe I have addressed all the comments, but do point out if I missed any. The only action item left would be tests for PGConn, which should be straightforward.
nit: the processor interface is unnecessary. I like the fact both structs conform to a pattern, but the pattern does not need to be enforced via an interface.
Done, and removed.
nit: instead of lazily initializing the condition variable, initialize it when the request processor is constructed.
Done.
nit: delete the lock/unlock helpers. Accessing the locks directly makes it obvious the lock protects the .mu struct.
Done.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)
pkg/ccl/sqlproxyccl/forwarder.go, line 203 at r1 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
nit: What do you think of renaming "resumed" to "running"?
I left this as resumed to be consistent with the resume method.
pkg/ccl/sqlproxyccl/forwarder.go, line 219 at r1 (raw file):
Previously, jaylim-crl (Jay Lim) wrote…
Possibly, but I definitely do not want to do that in this PR. That change is not trivial, and requires more design thought. Also note that this structure looks similar for now, but with the connection migration work in, it will be different because the processors will need to do different operations (e.g. setting various variables), though you can argue that some kind of callbacks may work here.
Done. Merged both, we use a PGConn here. During the transfer, we'll promote clientConn and serverConn (which are both PGConn objects) to their specialized types.
pkg/ccl/sqlproxyccl/forwarder.go, line 241 at r1 (raw file):
Previously, jaylim-crl (Jay Lim) wrote…
Let me think about the suggestion that you mentioned above. There was a reason why I did the current approach, but I don't recall anymore.
Done.
pkg/ccl/sqlproxyccl/forwarder.go, line 247 at r1 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
nit: The manual p.lock() and p.unlock() pairings are error prone.
Option 1: Extract these into helpers. I'm thinking something like:

```go
setResumed func() error
setInPeek func() (suspended bool, err error)
setForwarding func() (suspended bool, err error)
```

Option 2: Use function blocks.

```go
err := func() error {
	p.lock()
	defer p.unlock()
	if p.mu.resumed {
		// Already resumed - don't start another one.
		return errProcessorResumed
	}
	p.mu.resumed = true
	return nil
}()
if err != nil {
	return err
}
```

Done.
Done.
pkg/ccl/sqlproxyccl/forwarder.go, line 255 at r1 (raw file):
Previously, jaylim-crl (Jay Lim) wrote…
I don't think this is possible. The root problem is that SetReadDeadline is asynchronous:
```go
// A deadline is an absolute time after which I/O operations
// fail instead of blocking. The deadline applies to all future
// and pending I/O, not just the immediately following call to
// Read or Write. After a deadline has been exceeded, the
// connection can be refreshed by setting a deadline in the future.
```

Without setting this, I see that we get races in two conditions:

- PeekMsg exits with a deadline error when suspendReq = false
- ForwardMsg exits with a deadline error when that's not the intention
If multiple resumption and suspension operations were called at the same time (as illustrated in one of the tests), both cases are possible. My understanding is that Go uses non-blocking reads under the hood, and it really depends on what the runtime is doing when SetReadDeadline is called.
Consider the following case (where we don't reset deadlines within the transfer loop):

```go
_, _, err := p.f.clientConn.PeekMsg()
p.lock()
suspend := p.mu.suspendReq
p.mu.suspendReq = false
p.mu.inPeek = false
p.unlock()
// Broadcast
...
// ForwardMsg
```

When we signal the blocked wait calls, there's a possibility that a context switch has not occurred, and control continues to ForwardMsg. In that case, ForwardMsg may be unblocked, but we don't want that to happen. This is just one case, but I've seen other possible cases where we return from the initial PeekMsg with a timeout error, but suspendReq is false. If ForwardMsg gets unblocked once it has started forwarding, the connection is non-recoverable.
Done. For some reason, I couldn't reproduce the issue that I described above anymore once I enforced that only one suspend can occur at any point in time, which is fine for our use-case.
pkg/ccl/sqlproxyccl/forwarder.go, line 306 at r1 (raw file):
Previously, jaylim-crl (Jay Lim) wrote…
Hm, I guess doing this also means that we can remove waitUntilSuspended. I'll look into this, and see how this ties back to the connection migration.
Done.
pkg/ccl/sqlproxyccl/forwarder.go, line 136 at r2 (raw file):
nit: Can we remove the Close call here? Reporting an error on the channel causes the proxy_handler goroutine to close the forwarder.
I'd like to keep this here to be safe. I added a comment to describe that.
pkg/ccl/sqlproxyccl/forwarder.go, line 181 at r2 (raw file):
nit: I like to avoid circular references. It is possible to break the circular reference between the forwarder and the requestProcessor/responseProcessor by adding ctx, clientConn, and serverConn to the processor structs.
There will be shared state in the future (with the connection migration PR). Eventually this forwarder field will come back if we remove it now. Perhaps you can also argue that the shared state doesn't need to be stored in the forwarder (i.e. it can be in another struct). Regardless, I can add a TODO comment to revisit this.
Done. Did not feed ctx into processors as both suspend/resume now take in a context object. Also, as discussed offline, since we have a way to avoid this shared state, we can do it this way.
pkg/ccl/sqlproxyccl/forwarder.go (outdated)
```go
}

// Has context been cancelled?
if ctx.Err() != nil {
```
nit: this context check is unnecessary
pkg/ccl/sqlproxyccl/forwarder.go (outdated)
```go
	p.mu.inPeek = true
	return false, nil
}
exitPeek := func() (suspendReq bool, err error) {
```
nit: can we remove the error from enterPeek and exitPeek? They always return nil.
Force-pushed cbb0fec to b804020 (compare)
jaylim-crl
left a comment
TFTR!
pkg/ccl/sqlproxyccl/forwarder.go, line 234 at r2 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
nit: can we remove the error from enterPeek and exitPeek? They always return nil.
Done. Good catch, this was a leftover when the deadline was set within exitPeek.
pkg/ccl/sqlproxyccl/forwarder.go, line 242 at r2 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
nit: this context check is unnecessary
Done.
…sors

Previously, we were treating request/response processors as forwarder methods, and the original intention was to perform a connection migration inline with the forwarding. However, this has proved to cause confusion and complications. The new connection migration design uses a different approach by performing the connection migration out-of-band with the forwarding processors. For this to work, we need to be able to suspend and resume those processors. This commit implements support for that.

Additionally, we no longer wrap clientConn with a custom readTimeoutConn wrapper, as that seems to be problematic with idle connections. There's a similar discussion here: cockroachdb#25585. Due to that, context cancellations from the parent no longer close the connection if we are blocked on I/O. We could theoretically spin up a new goroutine in the forwarder to check for this and close the connections accordingly, but this case is rare, so we'll let the caller handle it.

Release justification: sqlproxy-only change, and only used within CockroachCloud.

Release note: None
Force-pushed b804020 to 2c537f2 (compare)
TFTR! bors r=JeffSwenson

Build failed (retrying...):

This PR was included in a batch that was canceled, it will be automatically retried

This PR was included in a batch that was canceled, it will be automatically retried

Build succeeded: