
ccl/sqlproxyccl: implement suspend/resume for request/response processors #77467

Merged
craig[bot] merged 1 commit into cockroachdb:master from jaylim-crl:jay/suspend-resume-processors
Mar 11, 2022

Conversation

@jaylim-crl
Contributor

Previously, we were treating request/response processors as forwarder methods,
and the original intention was to perform a connection migration inline with
the forwarding. However, this proved to cause confusion and complications.
The new connection migration design uses a different approach by performing
the connection migration out-of-band with the forwarding processors. For this
to work, we would need to be able to suspend and resume those processors.
This commit implements support for that.

Release justification: sqlproxy-only change, and only used within
CockroachCloud.

Release note: None

@jaylim-crl jaylim-crl requested review from a team as code owners March 8, 2022 04:24
@jaylim-crl jaylim-crl removed the request for review from a team March 8, 2022 04:24
@cockroach-teamcity
Member

This change is Reviewable

@jaylim-crl jaylim-crl force-pushed the jay/suspend-resume-processors branch 5 times, most recently from 04921df to bb94cdb Compare March 8, 2022 18:45
Collaborator

@jeffswenson jeffswenson left a comment


I really like this approach. It is a massive improvement 💯

Contributor Author

@jaylim-crl jaylim-crl left a comment


TFTR!

@jaylim-crl jaylim-crl force-pushed the jay/suspend-resume-processors branch from bb94cdb to cbb0fec Compare March 10, 2022 05:30
@jaylim-crl jaylim-crl requested a review from jeffswenson March 10, 2022 06:01
Contributor Author

@jaylim-crl jaylim-crl left a comment


TFTR! Reverting back to reviewable. I believe I have addressed all the comments, but do point out if I missed any. The only action item left would be tests for PGConn, which should be straightforward.

nit: the processor interface is unnecessary. I like the fact both structs conform to a pattern, but the pattern does not need to be enforced via an interface.

Done, and removed.

nit: instead of lazily initializing the condition variable, initialize it when the request processor is constructed.

Done.

nit: delete the lock/unlock helpers. Accessing the locks directly makes it obvious the lock protects the .mu struct.

Done.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/forwarder.go, line 203 at r1 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

nit: What do you think of renaming "resumed" to "running"?

I left this as resumed to be consistent with the resume method.


pkg/ccl/sqlproxyccl/forwarder.go, line 219 at r1 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

Possibly, but I definitely do not want to do that in this PR. That change is not trivial, and requires more design thought. Also note that this structure looks similar for now, but with the connection migration work in, it will be different because the processors will need to do different operations (e.g. setting various variables), though you can argue that some kind of callbacks may work here.

Done. Merged both, we use a PGConn here. During the transfer, we'll promote clientConn and serverConn (which are both PGConn objects) to their specialized types.


pkg/ccl/sqlproxyccl/forwarder.go, line 241 at r1 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

Let me think about the suggestion that you mentioned above. There was a reason why I did the current approach, but I don't recall anymore.

Done.


pkg/ccl/sqlproxyccl/forwarder.go, line 247 at r1 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

nit: The manual p.lock() and p.unlock() pairings are error prone.

Option 1:
Extract these into helpers. I'm thinking something like:
  setResumed func() error
  setInPeek func() (suspended bool, err error)
  setForwarding func() (suspended bool, err error)

Option 2:
Use function blocks.

  err := func() error {
      p.lock()
      defer p.unlock()
      if p.mu.resumed {
          // Already resumed - don't start another one.
          return errProcessorResumed
      }
      p.mu.resumed = true
      return nil
  }()
  if err != nil {
      return err
  }

Done.


pkg/ccl/sqlproxyccl/forwarder.go, line 255 at r1 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

I don't think this is possible. The root problem is that SetReadDeadline is asynchronous:

// A deadline is an absolute time after which I/O operations
// fail instead of blocking. The deadline applies to all future
// and pending I/O, not just the immediately following call to
// Read or Write. After a deadline has been exceeded, the
// connection can be refreshed by setting a deadline in the future.

Without setting this, I see that we get races in two conditions:

  1. PeekMsg exits with a deadline error when suspendReq = false
  2. ForwardMsg exits with a deadline error when that's not the intention

If multiple resumption and suspension operations were called at the same time (as illustrated in one of the tests), both cases are possible. My understanding is that Go uses non-blocking reads under the hood, and it really depends on what the runtime is doing when SetReadDeadline is called.

Consider the following case (where we don't reset deadlines within transfer loop):

_, _, err := p.f.clientConn.PeekMsg()

p.lock()
suspend := p.mu.suspendReq
p.mu.suspendReq = false
p.mu.inPeek = false
p.unlock()
// Broadcast
... 
// ForwardMsg

When we signal the blocked wait calls, there's a possibility where a context switch has not occurred, and the control continues to ForwardMsg. In that case, ForwardMsg may be unblocked, but we don't want that to happen. This is just one case, but I've seen other possible cases, where we return from the initial PeekMsg with a timeout error, but suspendReq is false. If ForwardMsg gets unblocked when it has started forwarding, the connection is non-recoverable.

Done. For some reason, I couldn't reproduce the issue that I described above anymore once I enforced that only one suspend can occur at any point in time, which is fine for our use-case.


pkg/ccl/sqlproxyccl/forwarder.go, line 306 at r1 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

Hm, I guess doing this also means that we can remove waitUntilSuspended. I'll look into this, and see how this ties back to the connection migration.

Done.


pkg/ccl/sqlproxyccl/forwarder.go, line 136 at r2 (raw file):

nit: Can we remove the Close call here? Reporting an error on the channel causes the proxy_handler goroutine to close the forwarder.

I'd like to keep this here to be safe. I added a comment to describe that.


pkg/ccl/sqlproxyccl/forwarder.go, line 181 at r2 (raw file):

nit: I like to avoid circular references. It is possible to break the circular reference between the forwarder and the requestProcessor/responseProcessor by adding ctx, clienConn, and serverConn to the processor structs.

There will be shared state in the future (with the connection migration PR). Eventually this forwarder field will come back if we remove it now. Perhaps you can also argue that the shared state doesn't need to be stored in the forwarder (i.e. it can be in another struct). Regardless, I can add a TODO comment to revisit this.

Done. Did not feed ctx into processors as both suspend/resume now take in a context object. Also, as discussed offline, since we have a way to avoid this shared state, we can do it this way.

Collaborator

@jeffswenson jeffswenson left a comment


LGTM! 💯

pkg/ccl/sqlproxyccl/forwarder.go, line 242 at r2 (inline comment):

    }

    // Has context been cancelled?
    if ctx.Err() != nil {

Collaborator

nit: this context check is unnecessary

pkg/ccl/sqlproxyccl/forwarder.go, line 234 at r2 (inline comment):

    p.mu.inPeek = true
    return false, nil
    }
    exitPeek := func() (suspendReq bool, err error) {

Collaborator

nit: can we remove the error from enterPeek and exitPeek? They always return nil.

@jaylim-crl jaylim-crl force-pushed the jay/suspend-resume-processors branch from cbb0fec to b804020 Compare March 10, 2022 15:41
@jaylim-crl jaylim-crl requested a review from jeffswenson March 10, 2022 15:42
Contributor Author

@jaylim-crl jaylim-crl left a comment


TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/forwarder.go, line 234 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

nit: can we remove the error from enterPeek and exitPeek? They always return nil.

Done. Good catch, this was a leftover from when the deadline was set within exitPeek.


pkg/ccl/sqlproxyccl/forwarder.go, line 242 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

nit: this context check is unnecessary

Done.

ccl/sqlproxyccl: implement suspend/resume for request/response processors

Previously, we were treating request/response processors as forwarder methods,
and the original intention was to perform a connection migration inline with
the forwarding. However, this proved to cause confusion and complications.
The new connection migration design uses a different approach by performing
the connection migration out-of-band with the forwarding processors. For this
to work, we would need to be able to suspend and resume those processors.
This commit implements support for that.

Additionally, we no longer wrap clientConn with a custom readTimeoutConn
wrapper as that seems to be problematic with idle connections. There's a
similar discussion here: cockroachdb#25585. Due to that, context cancellations from the
parent no longer close the connection if we are blocked on IO. We could
theoretically spin up a new goroutine in the forwarder to check for this, and
close the connections accordingly, but this case is rare, so we'll let the
caller handle this.

Release justification: sqlproxy-only change, and only used within
CockroachCloud.

Release note: None
@jaylim-crl jaylim-crl force-pushed the jay/suspend-resume-processors branch from b804020 to 2c537f2 Compare March 10, 2022 15:43
@jaylim-crl
Contributor Author

TFTR!

bors r=JeffSwenson

@craig
Contributor

craig bot commented Mar 10, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Mar 10, 2022

This PR was included in a batch that was canceled, it will be automatically retried

@craig
Contributor

craig bot commented Mar 10, 2022

This PR was included in a batch that was canceled, it will be automatically retried

@craig craig bot merged commit 0fb2c0c into cockroachdb:master Mar 11, 2022
@craig
Contributor

craig bot commented Mar 11, 2022

Build succeeded:

@jaylim-crl jaylim-crl deleted the jay/suspend-resume-processors branch March 11, 2022 02:37