
ccl/sqlproxyccl: add rebalancer queue for connection rebalancing#79346

Merged
craig[bot] merged 2 commits into cockroachdb:master from jaylim-crl:220404-add-balancer-queue
Apr 5, 2022

Conversation

@jaylim-crl
Contributor

@jaylim-crl jaylim-crl commented Apr 4, 2022

ccl/sqlproxyccl: add rebalancer queue for rebalance requests

This commit adds a rebalancer queue implementation to the balancer component.
The queue will be used for rebalance requests for the connection migration
work. This is done to ensure that there is a centralized location that invokes
the TransferConnection method on the connection handles. Doing this also
enables us to limit the number of concurrent transfers within the proxy.

Release note: None

ccl/sqlproxyccl: run rebalancer queue processor in the background

The previous commit added a rebalancer queue. This commit connects the queue to
the balancer, and runs the queue processor in the background. By default,
we limit up to 100 concurrent transfers at any point in time, and each transfer
will be retried up to 3 times.

Release note: None

Jira issue: CRDB-14727

@cockroach-teamcity
Member

This change is Reviewable

@jaylim-crl jaylim-crl marked this pull request as ready for review April 4, 2022 16:18
@jaylim-crl jaylim-crl requested review from a team as code owners April 4, 2022 16:18
@jaylim-crl jaylim-crl requested review from andy-kimball and jeffswenson and removed request for a team April 4, 2022 16:18
@jaylim-crl jaylim-crl force-pushed the 220404-add-balancer-queue branch 2 times, most recently from 200259f to 0deab39 Compare April 4, 2022 18:20
@jaylim-crl jaylim-crl changed the title from "ccl/sqlproxyccl: add balancer queue for connection rebalancing" to "ccl/sqlproxyccl: add rebalancer queue for connection rebalancing" Apr 4, 2022
@jaylim-crl jaylim-crl force-pushed the 220404-add-balancer-queue branch from 0deab39 to 5b279ab Compare April 4, 2022 19:29
Collaborator

@jeffswenson jeffswenson left a comment


LGTM

Contributor

@andy-kimball andy-kimball left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, and @jeffswenson)


pkg/ccl/sqlproxyccl/proxy_handler.go, line 180 at r4 (raw file):

	}

	ctx, _ = stopper.WithCancelOnQuiesce(ctx)

Should WithCancelOnQuiesce be called before we use ctx in the NewCertManager call? It's strange that we use different ctx instances in different places.


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 117 at r4 (raw file):

	}

	if err := b.stopper.RunAsyncTask(ctx, "processQueue-closer", func(ctx context.Context) {

How come we need this separate async task? I thought that ctx.Done would be closed when the stopper is quiesced, and therefore we could close the queue in processQueue at that point.

Contributor Author

@jaylim-crl jaylim-crl left a comment


TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/proxy_handler.go, line 180 at r4 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Should WithCancelOnQuiesce be called before we use ctx in the NewCertManager call? It's strange that we use different ctx instances in different places.

Yes, we could, though we would need to thread ctx through to setupIncomingCert. That said, this is already an existing issue today, and I don't see why we'd need a separate context for the cert manager. I can do that here.


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 117 at r4 (raw file):

and therefore we could close the queue in processQueue at that point.

The first part is correct, but the second isn't the case. The queue has no notion of context.Context, and there's nothing to wake the callers up whenever ctx.Done has been closed. The ctx object in processQueue is only used to indicate whether we want to continue reading from the queue. When we get blocked when reading from the queue, someone would need to invoke queue.close() explicitly to wake those callers up.


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 152 at r4 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

nit: the DB uses https://github.com/marusama/semaphore as its semaphore implementation. Conveniently its Acquire method accepts a ctx.

Good point. I can make this change.


pkg/ccl/sqlproxyccl/balancer/balancer_test.go, line 163 at r4 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

nit: Instead of adding these two hooks you can increment count before <-waitCh in the onTransferConnection and decrement it after <-waitCh.

Hm, let me look into this again. Maybe there's a simpler approach.

@jaylim-crl jaylim-crl force-pushed the 220404-add-balancer-queue branch 2 times, most recently from a4224d2 to 2c268bd Compare April 5, 2022 05:38
Contributor Author

@jaylim-crl jaylim-crl left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/proxy_handler.go, line 180 at r4 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

Yes, we could, though we would need to thread ctx through to setupIncomingCert. That said, this is already an existing issue today, and I don't see why we'd need a separate context for the cert manager. I can do that here.

Done.


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 152 at r4 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

Good point. I can make this change.

Done. Actually, I see various approaches:

  1. chan struct{}, e.g.:
    // Semaphore to limit concurrent non-empty snapshot application.
    snapshotApplySem chan struct{}
  2. marusama/semaphore
  3. quotapool.IntPool

Regardless, I've updated this to use (2) since it's cleaner.


pkg/ccl/sqlproxyccl/balancer/balancer_test.go, line 163 at r4 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

Hm, let me look into this again. Maybe there's a simpler approach.

I left it as-is. I still need the afterProcessQueueItem hook for other sub-tests.

Collaborator

@jeffswenson jeffswenson left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jaylim-crl)


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 117 at r4 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

and therefore we could close the queue in processQueue at that point.

The first part is correct, but the second isn't the case. The queue has no notion of context.Context, and there's nothing to wake the callers up whenever ctx.Done has been closed. The ctx object in processQueue is only used to indicate whether we want to continue reading from the queue. When we get blocked when reading from the queue, someone would need to invoke queue.close() explicitly to wake those callers up.

One idea: we could use a semaphore for tracking the size of the queue instead of the condition variable. That allows us to drop the close state and the goroutine. The implementation would look like:

func (q *Queue) Push(item interface{}) {
  q.Lock()
  defer q.Unlock()
  // add the item to the queue
  q.semaphore.Release(1)
}

func (q *Queue) Dequeue(ctx context.Context) (item interface{}, err error) {
  if err := q.semaphore.Acquire(ctx, 1); err != nil {
    return nil, err
  }
  q.Lock()
  defer q.Unlock()
  // remove and return an element from the queue
}


pkg/ccl/sqlproxyccl/balancer/balancer_test.go, line 163 at r4 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

I left it as-is. I still need the afterProcessQueueItem hook for other sub-tests.

It is possible to avoid the need for the afterProcessQueueItem hook. The eventCh<- send is what determines when processing is complete, so the test can wait for processing to finish by limiting the concurrency to 1 and publishing a second event that closes eventCh.

Contributor Author

@jaylim-crl jaylim-crl left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jaylim-crl)


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 117 at r4 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

One idea: we could use a semaphore for tracking the size of the queue instead of the condition variable. That allows us to drop the close state and the goroutine. The implementation would look like:

func (q *Queue) Push(item interface{}) {
  q.Lock()
  defer q.Unlock()
  // add the item to the queue
  q.semaphore.Release(1)
}

func (q *Queue) Dequeue(ctx context.Context) (item interface{}, err error) {
  if err := q.semaphore.Acquire(ctx, 1); err != nil {
    return nil, err
  }
  q.Lock()
  defer q.Unlock()
  // remove and return an element from the queue
}

I've done something like that before, but if we'd like to stick to github.com/marusama/semaphore, the above won't work for two reasons:

  1. We need a size limit for that to work.
  2. Release panics without Acquire: https://github.com/marusama/semaphore/blob/2d3c1eaa054b6e36c7c0dfde398f2b47e4bc5094/semaphore.go#L169-L171.

Does that align with what you think as well?

Collaborator

@jeffswenson jeffswenson left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jaylim-crl)


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 117 at r4 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

I've done something like that before, but if we'd like to stick to github.com/marusama/semaphore, the above won't work for two reasons:

  1. We need a size limit for that to work.
  2. Release panics without Acquire: https://github.com/marusama/semaphore/blob/2d3c1eaa054b6e36c7c0dfde398f2b47e4bc5094/semaphore.go#L169-L171.

Does that align with what you think as well?

During initialization you can set the capacity to something really large and then acquire it all. It looks like the easiest way to do that with the marusama semaphore is:

semaphore := semaphore.New(0)
semaphore.SetLimit(math.MaxUint32)

@jaylim-crl jaylim-crl force-pushed the 220404-add-balancer-queue branch 2 times, most recently from cdaf10f to 0ed578d Compare April 5, 2022 15:20
Contributor Author

@jaylim-crl jaylim-crl left a comment


TFTR! I'll come back again with an updated queue implementation.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball)


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 117 at r4 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

During initialization you can set the capacity to something really large and then acquire it all. It looks like the easiest way to do that with the marusama semaphore is:

semaphore := semaphore.New(0)
semaphore.SetLimit(math.MaxUint32)

😄 I like that idea. I'll rework the queue, since being unblocked automatically when ctx is cancelled is much more ergonomic.


pkg/ccl/sqlproxyccl/balancer/balancer_test.go, line 163 at r4 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

It is possible to avoid the need for the afterProcessQueueItem hook. The eventCh<- send is what determines when processing is complete, so the test can wait for processing to finish by limiting the concurrency to 1 and publishing a second event that closes eventCh.

Done.

This commit adds a rebalancer queue implementation to the balancer component.
The queue will be used for rebalance requests for the connection migration
work. This is done to ensure that there is a centralized location that invokes
the TransferConnection method on the connection handles. Doing this also
enables us to limit the number of concurrent transfers within the proxy.

Release note: None
@jaylim-crl jaylim-crl force-pushed the 220404-add-balancer-queue branch from 0ed578d to 7aca4e3 Compare April 5, 2022 16:00
Contributor Author

@jaylim-crl jaylim-crl left a comment


Done. Everything has been addressed :) Dequeue now reacts to context cancellation and is unblocked automatically when that happens.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball)


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 117 at r4 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

😄 I like that idea. I'll rework the queue, since being unblocked automatically when ctx is cancelled is much more ergonomic.

Done.

@jeffswenson
Collaborator

LGTM

The previous commit added a rebalancer queue. This commit connects the queue to
the balancer, and runs the queue processor in the background. By default,
we limit up to 100 concurrent transfers at any point in time, and each transfer
will be retried up to 3 times.

Release note: None
@jaylim-crl jaylim-crl force-pushed the 220404-add-balancer-queue branch from 7aca4e3 to 02b5be6 Compare April 5, 2022 17:31
@jaylim-crl
Contributor Author

TFTR!

bors r=JeffSwenson

@craig craig bot merged commit 985344a into cockroachdb:master Apr 5, 2022
@craig
Contributor

craig bot commented Apr 5, 2022

Build succeeded:

jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Apr 10, 2022
…G pods

In cockroachdb#79346, we added a rebalancer queue for connection rebalancing. This commit
adds support for transferring connections away from DRAINING pods. The
rebalance loop runs once every 30 seconds for now, and connections will only be
moved away from DRAINING pods if the pod has been draining for at least 1
minute.

At the same time, we also fix an enqueue bug on the rebalancer queue where
we're releasing the semaphore in the case of an update, which is incorrect.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Apr 11, 2022
…G pods

In cockroachdb#79346, we added a rebalancer queue for connection rebalancing. This commit
adds support for transferring connections away from DRAINING pods. The
rebalance loop runs once every 30 seconds for now, and connections will only be
moved away from DRAINING pods if the pod has been draining for at least 1
minute.

At the same time, we also fix an enqueue bug on the rebalancer queue where
we're releasing the semaphore in the case of an update, which is incorrect.

Release note: None
craig bot pushed a commit that referenced this pull request Apr 12, 2022
79725: ccl/sqlproxyccl: add support for moving connections away from DRAINING pods r=JeffSwenson a=jaylim-crl

In #79346, we added a rebalancer queue for connection rebalancing. This commit
adds support for transferring connections away from DRAINING pods. The
rebalance loop runs once every 30 seconds for now, and connections will only be
moved away from DRAINING pods if the pod has been draining for at least 1
minute.

At the same time, we also fix an enqueue bug on the rebalancer queue where
we're releasing the semaphore in the case of an update, which is incorrect.

Release note: None

Co-authored-by: Jay <jay@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this pull request Apr 17, 2022
…G pods

In #79346, we added a rebalancer queue for connection rebalancing. This commit
adds support for transferring connections away from DRAINING pods. The
rebalance loop runs once every 30 seconds for now, and connections will only be
moved away from DRAINING pods if the pod has been draining for at least 1
minute.

At the same time, we also fix an enqueue bug on the rebalancer queue where
we're releasing the semaphore in the case of an update, which is incorrect.

Release note: None
@jaylim-crl jaylim-crl deleted the 220404-add-balancer-queue branch November 30, 2022 16:55
@jaylim-crl jaylim-crl restored the 220404-add-balancer-queue branch November 30, 2022 16:55
@jaylim-crl jaylim-crl deleted the 220404-add-balancer-queue branch November 30, 2022 16:55