
ccl/sqlproxyccl: implement suspend/resume for request/response processors #77467

Merged
craig[bot] merged 1 commit into cockroachdb:master from jaylim-crl:jay/suspend-resume-processors
Mar 11, 2022

Conversation

@jaylim-crl
Contributor

Previously, we were treating request/response processors as forwarder methods,
and the original intention was to perform a connection migration inline with
the forwarding. However, this proved to cause confusion and complications.
The new connection migration design uses a different approach by performing
the connection migration out-of-band with the forwarding processors. For this
to work, we would need to be able to suspend and resume those processors.
This commit implements support for that.

Release justification: sqlproxy-only change, and only used within
CockroachCloud.

Release note: None

@jaylim-crl jaylim-crl requested review from a team as code owners March 8, 2022 04:24
@jaylim-crl jaylim-crl removed the request for review from a team March 8, 2022 04:24
@cockroach-teamcity
Member

This change is Reviewable

@jaylim-crl jaylim-crl force-pushed the jay/suspend-resume-processors branch 5 times, most recently from 04921df to bb94cdb Compare March 8, 2022 18:45
Collaborator

@jeffswenson jeffswenson left a comment


I really like this approach. It is a massive improvement 💯

Contributor Author

@jaylim-crl jaylim-crl left a comment


TFTR!

@jaylim-crl jaylim-crl force-pushed the jay/suspend-resume-processors branch from bb94cdb to cbb0fec Compare March 10, 2022 05:30
@jaylim-crl jaylim-crl requested a review from jeffswenson March 10, 2022 06:01
Contributor Author

@jaylim-crl jaylim-crl left a comment


TFTR! Reverting back to reviewable. I believe I have addressed all the comments, but do point out if I missed any. The only action item left would be tests for PGConn, which should be straightforward.

nit: the processor interface is unnecessary. I like the fact both structs conform to a pattern, but the pattern does not need to be enforced via an interface.

Done, and removed.

nit: instead of lazily initializing the condition variable, initialize it when the request processor is constructed.

Done.

nit: delete the lock/unlock helpers. Accessing the locks directly makes it obvious the lock protects the .mu struct.

Done.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/forwarder.go, line 203 at r1 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

nit: What do you think of renaming "resumed" to "running"?

I left this as resumed to be consistent with the resume method.


pkg/ccl/sqlproxyccl/forwarder.go, line 219 at r1 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

Possibly, but I definitely do not want to do that in this PR. That change is not trivial, and requires more design thought. Also note that this structure looks similar for now, but with the connection migration work in, it will be different because the processors will need to do different operations (e.g. setting various variables), though you can argue that some kind of callbacks may work here.

Done. Merged both, we use a PGConn here. During the transfer, we'll promote clientConn and serverConn (which are both PGConn objects) to their specialized types.


pkg/ccl/sqlproxyccl/forwarder.go, line 241 at r1 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

Let me think about the suggestion that you mentioned above. There was a reason why I did the current approach, but I don't recall anymore.

Done.


pkg/ccl/sqlproxyccl/forwarder.go, line 247 at r1 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

nit: The manual p.lock() and p.unlock() pairings are error prone.

Option 1:
Extract these into helpers. I'm thinking something like:
  setResumed func() error
  setInPeek func() (suspended bool, err error)
  setForwarding func() (suspended bool, err error)

Option 2:
Use function blocks.

  err := func() error {
      p.lock()
      defer p.unlock()
      if p.mu.resumed {
          // Already resumed - don't start another one.
          return errProcessorResumed
      }
      p.mu.resumed = true
      return nil
  }()
  if err != nil {
      return err
  }

Done.


pkg/ccl/sqlproxyccl/forwarder.go, line 255 at r1 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

I don't think this is possible. The root problem is that SetReadDeadline is asynchronous:

// A deadline is an absolute time after which I/O operations
// fail instead of blocking. The deadline applies to all future
// and pending I/O, not just the immediately following call to
// Read or Write. After a deadline has been exceeded, the
// connection can be refreshed by setting a deadline in the future.

Without setting this, I see that we get races in two conditions:

  1. PeekMsg exits with a deadline error when suspendReq = false
  2. ForwardMsg exits with a deadline error when that's not the intention

If multiple resumption and suspension operations were called at the same time (as illustrated in one of the tests), both cases are possible. My understanding is that Go uses non-blocking reads under the hood, and it really depends on what the runtime is doing when SetReadDeadline is called.

Consider the following case (where we don't reset deadlines within transfer loop):

_, _, err := p.f.clientConn.PeekMsg()

p.lock()
suspend := p.mu.suspendReq
p.mu.suspendReq = false
p.mu.inPeek = false
p.unlock()
// Broadcast
... 
// ForwardMsg

When we signal the blocked wait calls, there's a possibility where a context switch has not occurred, and the control continues to ForwardMsg. In that case, ForwardMsg may be unblocked, but we don't want that to happen. This is just one case, but I've seen other possible cases, where we return from the initial PeekMsg with a timeout error, but suspendReq is false. If ForwardMsg gets unblocked when it has started forwarding, the connection is non-recoverable.

Done. For some reason, I couldn't reproduce the issue that I described above anymore once I enforced that only one suspend can occur at any point in time, which is fine for our use-case.


pkg/ccl/sqlproxyccl/forwarder.go, line 306 at r1 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

Hm, I guess doing this also means that we can remove waitUntilSuspended. I'll look into this, and see how this ties back to the connection migration.

Done.


pkg/ccl/sqlproxyccl/forwarder.go, line 136 at r2 (raw file):

nit: Can we remove the Close call here? Reporting an error on the channel causes the proxy_handler goroutine to close the forwarder.

I'd like to keep this here to be safe. I added a comment to describe that.


pkg/ccl/sqlproxyccl/forwarder.go, line 181 at r2 (raw file):

nit: I like to avoid circular references. It is possible to break the circular reference between the forwarder and the requestProcessor/responseProcessor by adding ctx, clienConn, and serverConn to the processor structs.

There will be shared state in the future (with the connection migration PR). Eventually this forwarder field will come back if we remove it now. Perhaps you can also argue that the shared state doesn't need to be stored in the forwarder (i.e. it can be in another struct). Regardless, I can add a TODO comment to revisit this.

Done. Did not feed ctx into processors as both suspend/resume now take in a context object. Also, as discussed offline, since we have a way to avoid this shared state, we can do it this way.

Collaborator

@jeffswenson jeffswenson left a comment


LGTM! 💯

pkg/ccl/sqlproxyccl/forwarder.go, line 242 at r2 (inline comment):

    }

    // Has context been cancelled?
    if ctx.Err() != nil {

Collaborator

nit: this context check is unnecessary

pkg/ccl/sqlproxyccl/forwarder.go, line 234 at r2 (inline comment):

    p.mu.inPeek = true
    return false, nil
    }
    exitPeek := func() (suspendReq bool, err error) {

Collaborator

nit: can we remove the error from enterPeek and exitPeek? They always return nil.

@jaylim-crl jaylim-crl force-pushed the jay/suspend-resume-processors branch from cbb0fec to b804020 Compare March 10, 2022 15:41
@jaylim-crl jaylim-crl requested a review from jeffswenson March 10, 2022 15:42
Contributor Author

@jaylim-crl jaylim-crl left a comment


TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/forwarder.go, line 234 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

nit: can we remove the error from enterPeek and exitPeek? They always return nil.

Done. Good catch, this was a leftover from when the deadline was set within exitPeek.


pkg/ccl/sqlproxyccl/forwarder.go, line 242 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

nit: this context check is unnecessary

Done.

ccl/sqlproxyccl: implement suspend/resume for request/response processors

Previously, we were treating request/response processors as forwarder methods,
and the original intention was to perform a connection migration inline with
the forwarding. However, this proved to cause confusion and complications.
The new connection migration design uses a different approach by performing
the connection migration out-of-band with the forwarding processors. For this
to work, we would need to be able to suspend and resume those processors.
This commit implements support for that.

Additionally, we no longer wrap clientConn with a custom readTimeoutConn
wrapper as that seems to be problematic with idle connections. There's a
similar discussion here: cockroachdb#25585. Due to that, context cancellations from the
parent no longer close the connection if we are blocked on IO. We could
theoretically spin up a new goroutine in the forwarder to check for this, and
close the connections accordingly, but this case is rare, so we'll let the
caller handle this.

Release justification: sqlproxy-only change, and only used within
CockroachCloud.

Release note: None
@jaylim-crl jaylim-crl force-pushed the jay/suspend-resume-processors branch from b804020 to 2c537f2 Compare March 10, 2022 15:43
@jaylim-crl
Contributor Author

TFTR!

bors r=JeffSwenson

@craig
Contributor

craig bot commented Mar 10, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Mar 10, 2022

This PR was included in a batch that was canceled, it will be automatically retried

@craig
Contributor

craig bot commented Mar 10, 2022

This PR was included in a batch that was canceled, it will be automatically retried

@craig craig bot merged commit 0fb2c0c into cockroachdb:master Mar 11, 2022
@craig
Contributor

craig bot commented Mar 11, 2022

Build succeeded:

@jaylim-crl jaylim-crl deleted the jay/suspend-resume-processors branch March 11, 2022 02:37