
ccl/sqlproxyccl: add rebalancer queue for connection rebalancing#79346

Merged
craig[bot] merged 2 commits into cockroachdb:master from jaylim-crl:220404-add-balancer-queue
Apr 5, 2022

Conversation

@jaylim-crl
Contributor

@jaylim-crl jaylim-crl commented Apr 4, 2022

ccl/sqlproxyccl: add rebalancer queue for rebalance requests

This commit adds a rebalancer queue implementation to the balancer component.
The queue will be used for rebalance requests for the connection migration
work. This is done to ensure that there is a centralized location that invokes
the TransferConnection method on the connection handles. Doing this also
enables us to limit the number of concurrent transfers within the proxy.

Release note: None

ccl/sqlproxyccl: run rebalancer queue processor in the background

The previous commit added a rebalancer queue. This commit connects the queue to
the balancer, and runs the queue processor in the background. By default,
we limit up to 100 concurrent transfers at any point in time, and each transfer
will be retried up to 3 times.

Release note: None

Jira issue: CRDB-14727

@cockroach-teamcity
Member

This change is Reviewable

@jaylim-crl jaylim-crl marked this pull request as ready for review April 4, 2022 16:18
@jaylim-crl jaylim-crl requested review from a team as code owners April 4, 2022 16:18
@jaylim-crl jaylim-crl requested review from andy-kimball and jeffswenson and removed request for a team April 4, 2022 16:18
@jaylim-crl jaylim-crl force-pushed the 220404-add-balancer-queue branch 2 times, most recently from 200259f to 0deab39 Compare April 4, 2022 18:20
@jaylim-crl jaylim-crl changed the title from "ccl/sqlproxyccl: add balancer queue for connection rebalancing" to "ccl/sqlproxyccl: add rebalancer queue for connection rebalancing" Apr 4, 2022
@jaylim-crl jaylim-crl force-pushed the 220404-add-balancer-queue branch from 0deab39 to 5b279ab Compare April 4, 2022 19:29
Collaborator

@jeffswenson jeffswenson left a comment


LGTM

Contributor

@andy-kimball andy-kimball left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, and @jeffswenson)


pkg/ccl/sqlproxyccl/proxy_handler.go, line 180 at r4 (raw file):

	}

	ctx, _ = stopper.WithCancelOnQuiesce(ctx)

Should WithCancelOnQuiesce be called before we use ctx in the NewCertManager call? It's strange that we use different ctx instances in different places.


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 117 at r4 (raw file):

	}

	if err := b.stopper.RunAsyncTask(ctx, "processQueue-closer", func(ctx context.Context) {

How come we need this separate async task? I thought that ctx.Done would be closed when the stopper is quiesced, and therefore we could close the queue in processQueue at that point.

Contributor Author

@jaylim-crl jaylim-crl left a comment


TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/proxy_handler.go, line 180 at r4 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Should WithCancelOnQuiesce be called before we use ctx in the NewCertManager call? It's strange that we use different ctx instances in different places.

Yes, we could, though we would need to thread ctx through to setupIncomingCert. That said, this is already an existing issue today, and I don't see why we'd need a separate context for the cert manager. I can do that here.


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 117 at r4 (raw file):

and therefore we could close the queue in processQueue at that point.

The first part is correct, but the second isn't the case. The queue has no notion of context.Context, and there's nothing to wake the callers up whenever ctx.Done has been closed. The ctx object in processQueue is only used to indicate whether we want to continue reading from the queue. When we get blocked when reading from the queue, someone would need to invoke queue.close() explicitly to wake those callers up.


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 152 at r4 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

nit: the DB uses https://github.com/marusama/semaphore as its semaphore implementation. Conveniently its Acquire method accepts a ctx.

Good point. I can make this change.


pkg/ccl/sqlproxyccl/balancer/balancer_test.go, line 163 at r4 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

nit: Instead of adding these two hooks you can increment count before <-waitCh in the onTransferConnection and decrement it after <-waitCh.

Hm, let me look into this again. Maybe there's a simpler approach.

@jaylim-crl jaylim-crl force-pushed the 220404-add-balancer-queue branch 2 times, most recently from a4224d2 to 2c268bd Compare April 5, 2022 05:38
Contributor Author

@jaylim-crl jaylim-crl left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/proxy_handler.go, line 180 at r4 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

Yes, we could, though we would need to thread ctx through to setupIncomingCert. That said, this is already an existing issue today, and I don't see why we'd need a separate context for the cert manager. I can do that here.

Done.


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 152 at r4 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

Good point. I can make this change.

Done. Actually, I see various approaches:

  1. chan struct{}, e.g.:
    // Semaphore to limit concurrent non-empty snapshot application.
    snapshotApplySem chan struct{}
  2. marusama/semaphore
  3. quotapool.IntPool

Regardless, I've updated this to use (2) since it's cleaner.


pkg/ccl/sqlproxyccl/balancer/balancer_test.go, line 163 at r4 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

Hm, let me look into this again. Maybe there's a simpler approach.

I left it as-is. I still need the afterProcessQueueItem hook for other sub-tests.

Collaborator

@jeffswenson jeffswenson left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jaylim-crl)


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 117 at r4 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

and therefore we could close the queue in processQueue at that point.

The first part is correct, but the second isn't the case. The queue has no notion of context.Context, and there's nothing to wake the callers up whenever ctx.Done has been closed. The ctx object in processQueue is only used to indicate whether we want to continue reading from the queue. When we get blocked when reading from the queue, someone would need to invoke queue.close() explicitly to wake those callers up.

One idea: we could use a semaphore for tracking the size of the queue instead of the condition variable. That allows us to drop the close state and the goroutine. The implementation would look like:

func (q *Queue) Push(item interface{}) {
  q.Lock()
  defer q.Unlock()
  // add the item to the queue
  q.semaphore.Release(1)
}

func (q *Queue) Dequeue(ctx context.Context) (item interface{}, err error) {
  if err := q.semaphore.Acquire(ctx, 1); err != nil {
    return nil, err
  }
  q.Lock()
  defer q.Unlock()
  // remove and return an element from the queue
}


pkg/ccl/sqlproxyccl/balancer/balancer_test.go, line 163 at r4 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

I left it as-is. I still need the afterProcessQueueItem hook for other sub-tests.

It is possible to avoid the need for the afterProcessQueueItem hook. The eventCh<- send is what determines when processing is complete, so the test can wait for processing to finish by limiting the concurrency to 1 and publishing a second event that closes eventCh.

Contributor Author

@jaylim-crl jaylim-crl left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jaylim-crl)


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 117 at r4 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

One idea: we could use a semaphore for tracking the size of the queue instead of the condition variable. That allows us to drop the close state and the goroutine. The implementation would look like:

func (q *Queue) Push(item interface{}) {
  q.Lock()
  defer q.Unlock()
  // add the item to the queue
  q.semaphore.Release(1)
}

func (q *Queue) Dequeue(ctx context.Context) (item interface{}, err error) {
  if err := q.semaphore.Acquire(ctx, 1); err != nil {
    return nil, err
  }
  q.Lock()
  defer q.Unlock()
  // remove and return an element from the queue
}

I've done something like that before, but if we'd like to stick to github.com/marusama/semaphore, the above won't work for two reasons:

  1. We need a size limit for that to work.
  2. Release panics without Acquire: https://github.com/marusama/semaphore/blob/2d3c1eaa054b6e36c7c0dfde398f2b47e4bc5094/semaphore.go#L169-L171.

Does that align with what you think as well?

Collaborator

@jeffswenson jeffswenson left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jaylim-crl)


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 117 at r4 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

I've done something like that before, but if we'd like to stick to github.com/marusama/semaphore, the above won't work for two reasons:

  1. We need a size limit for that to work.
  2. Release panics without Acquire: https://github.com/marusama/semaphore/blob/2d3c1eaa054b6e36c7c0dfde398f2b47e4bc5094/semaphore.go#L169-L171.

Does that align with what you think as well?

During initialization you can set the capacity to something really large and then acquire it all. It looks like the easiest way to do that with the marusama semaphore is:

semaphore := semaphore.New(0)
semaphore.SetLimit(math.MaxUint32)

@jaylim-crl jaylim-crl force-pushed the 220404-add-balancer-queue branch 2 times, most recently from cdaf10f to 0ed578d Compare April 5, 2022 15:20
Contributor Author

@jaylim-crl jaylim-crl left a comment


TFTR! I'll come back again with an updated queue implementation.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball)


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 117 at r4 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

During initialization you can set the capacity to something really large and then acquire it all. It looks like the easiest way to do that with the marusama semaphore is:

semaphore := semaphore.New(0)
semaphore.SetLimit(math.MaxUint32)

😄 I like that idea. I'll rework the queue, since being unblocked automatically when ctx is cancelled is much more ergonomic.


pkg/ccl/sqlproxyccl/balancer/balancer_test.go, line 163 at r4 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

It is possible to avoid the need for the afterProcessQueueItem hook. The eventCh<- send is what determines when processing is complete, so the test can wait for processing to finish by limiting the concurrency to 1 and publishing a second event that closes eventCh.

Done.

This commit adds a rebalancer queue implementation to the balancer component.
The queue will be used for rebalance requests for the connection migration
work. This is done to ensure that there is a centralized location that invokes
the TransferConnection method on the connection handles. Doing this also
enables us to limit the number of concurrent transfers within the proxy.

Release note: None
@jaylim-crl jaylim-crl force-pushed the 220404-add-balancer-queue branch from 0ed578d to 7aca4e3 Compare April 5, 2022 16:00
Contributor Author

@jaylim-crl jaylim-crl left a comment


Done. Everything has been addressed :) Dequeue now reacts to context cancellation and is unblocked automatically when that happens.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball)


pkg/ccl/sqlproxyccl/balancer/balancer.go, line 117 at r4 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

😄 I like that idea. I'll rework the queue, since being unblocked automatically when ctx is cancelled is much more ergonomic.

Done.

@jeffswenson
Collaborator

LGTM

The previous commit added a rebalancer queue. This commit connects the queue to
the balancer, and runs the queue processor in the background. By default,
we limit up to 100 concurrent transfers at any point in time, and each transfer
will be retried up to 3 times.

Release note: None
@jaylim-crl jaylim-crl force-pushed the 220404-add-balancer-queue branch from 7aca4e3 to 02b5be6 Compare April 5, 2022 17:31
@jaylim-crl
Contributor Author

TFTR!

bors r=JeffSwenson

@craig craig bot merged commit 985344a into cockroachdb:master Apr 5, 2022
@craig
Contributor

craig bot commented Apr 5, 2022

Build succeeded:

jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Apr 10, 2022
…G pods

In cockroachdb#79346, we added a rebalancer queue for connection rebalancing. This commit
adds support for transferring connections away from DRAINING pods. The
rebalance loop runs once every 30 seconds for now, and connections will only be
moved away from DRAINING pods if the pod has been draining for at least 1
minute.

At the same time, we also fix an enqueue bug on the rebalancer queue where
we're releasing the semaphore in the case of an update, which is incorrect.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Apr 11, 2022
…G pods

In cockroachdb#79346, we added a rebalancer queue for connection rebalancing. This commit
adds support for transferring connections away from DRAINING pods. The
rebalance loop runs once every 30 seconds for now, and connections will only be
moved away from DRAINING pods if the pod has been draining for at least 1
minute.

At the same time, we also fix an enqueue bug on the rebalancer queue where
we're releasing the semaphore in the case of an update, which is incorrect.

Release note: None
craig bot pushed a commit that referenced this pull request Apr 12, 2022
79725: ccl/sqlproxyccl: add support for moving connections away from DRAINING pods r=JeffSwenson a=jaylim-crl

In #79346, we added a rebalancer queue for connection rebalancing. This commit
adds support for transferring connections away from DRAINING pods. The
rebalance loop runs once every 30 seconds for now, and connections will only be
moved away from DRAINING pods if the pod has been draining for at least 1
minute.

At the same time, we also fix an enqueue bug on the rebalancer queue where
we're releasing the semaphore in the case of an update, which is incorrect.

Release note: None

Co-authored-by: Jay <jay@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this pull request Apr 17, 2022
…G pods

In #79346, we added a rebalancer queue for connection rebalancing. This commit
adds support for transferring connections away from DRAINING pods. The
rebalance loop runs once every 30 seconds for now, and connections will only be
moved away from DRAINING pods if the pod has been draining for at least 1
minute.

At the same time, we also fix an enqueue bug on the rebalancer queue where
we're releasing the semaphore in the case of an update, which is incorrect.

Release note: None
@jaylim-crl jaylim-crl deleted the 220404-add-balancer-queue branch November 30, 2022 16:55
@jaylim-crl jaylim-crl restored the 220404-add-balancer-queue branch November 30, 2022 16:55
@jaylim-crl jaylim-crl deleted the 220404-add-balancer-queue branch November 30, 2022 16:55