
feat: Dynamic channel pool scaling#194

Merged: nimf merged 7 commits into master from dynamic_pool on Aug 6, 2025
@nimf (Collaborator) commented Nov 18, 2024

Provides dynamic scaling functionality which is configured with three parameters:

  • minRpcPerChannel -- minimum desired average concurrent calls per channel.
  • maxRpcPerChannel -- maximum desired average concurrent calls per channel.
  • scaleDownInterval -- how often to check for a possibility to scale down.

When the average number of concurrent calls per channel reaches maxRpcPerChannel, the pool creates and adds a new channel unless it is already at max size.

Every scaleDownInterval a check for downscaling is performed. Based on the maximum total concurrent calls observed since the last check, the desired number of channels is calculated as:

(max_total_concurrent_calls / minRpcPerChannel) rounded up.

If the calculated desired number of channels is lower than the current number of channels, the pool will be downscaled to the desired number or min size (whichever is greater).
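As a rough illustration of the scale-down arithmetic described above, here is a minimal sketch. The class and method names are hypothetical and are not taken from the PR; only the formula and the clamping to min size come from the description.

```java
// Hypothetical helper, not part of grpc-gcp: illustrates the scale-down arithmetic only.
final class ScaleDownMath {
    /**
     * Desired pool size: ceil(maxTotalConcurrentCalls / minRpcPerChannel),
     * but never below minSize.
     */
    static int desiredChannels(int maxTotalConcurrentCalls, int minRpcPerChannel, int minSize) {
        if (minRpcPerChannel <= 0) {
            throw new IllegalArgumentException("minRpcPerChannel must be positive");
        }
        // Integer ceiling division, avoiding floating point.
        int desired = (maxTotalConcurrentCalls + minRpcPerChannel - 1) / minRpcPerChannel;
        return Math.max(desired, minSize);
    }

    /** Pool size after a scale-down check; the check never grows the pool. */
    static int sizeAfterScaleDown(
            int currentSize, int maxTotalConcurrentCalls, int minRpcPerChannel, int minSize) {
        int desired = desiredChannels(maxTotalConcurrentCalls, minRpcPerChannel, minSize);
        return Math.min(currentSize, desired);
    }
}
```

For example, with a peak of 45 concurrent calls since the last check and minRpcPerChannel set to 10, the desired size is 5, so a pool of 10 channels would shrink to 5 if min size allows it.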

When downscaling, channels with the oldest connections are selected. The selected channels are then removed from the pool but are not instructed to shut down until all their calls have completed. If the pool is scaling up while a ready channel is still waiting for its calls to complete, that channel is re-used instead of creating a new channel.
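The oldest-first selection can be sketched as follows. This is an assumption-laden illustration: `ChannelInfo`, its fields, and `pickForRemoval` are hypothetical stand-ins, and the real pool tracks more state (such as in-flight calls) than shown here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not the PR's implementation.
final class OldestFirstSelection {
    /** Stand-in for a pooled channel: an id plus the time its connection was established. */
    static final class ChannelInfo {
        final int id;
        final long connectedAtMillis;

        ChannelInfo(int id, long connectedAtMillis) {
            this.id = id;
            this.connectedAtMillis = connectedAtMillis;
        }
    }

    /** Returns the channels to remove so that the pool shrinks to desiredSize, oldest connection first. */
    static List<ChannelInfo> pickForRemoval(List<ChannelInfo> pool, int desiredSize) {
        int toRemove = Math.max(0, pool.size() - desiredSize);
        List<ChannelInfo> sorted = new ArrayList<>(pool);
        // Smallest timestamp means the oldest connection.
        sorted.sort(Comparator.comparingLong((ChannelInfo c) -> c.connectedAtMillis));
        return new ArrayList<>(sorted.subList(0, toRemove));
    }
}
```

In this sketch the returned channels would only be taken out of rotation; shutting them down would wait until their active calls drain, as the description states.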

@nimf nimf requested a review from fengli79 November 18, 2024 22:15
@nimf nimf marked this pull request as draft December 2, 2024 21:48
@nimf nimf marked this pull request as ready for review December 3, 2024 19:04
Comment thread grpc-gcp/src/main/java/com/google/cloud/grpc/GcpManagedChannel.java

if ((totalActiveStreams.get() / channelRefs.size()) >= maxRpcPerChannel) {
  createNewChannel();
  scaleUpCount++;

Contributor

What if (totalActiveStreams.get() / channelRefs.size()) >= maxRpcPerChannel still holds after you create a new channel? In other words, what if you need to create multiple channels at once to scale up?

Collaborator Author

The condition you described is unlikely, because we perform this check for every new RPC. The count can therefore differ by at most one RPC between checks; it cannot jump by 100 RPCs instantly.

But even if it did jump, this would still be safe. The first check would create one channel, the very next RPC would create another, and so on until the condition is no longer satisfied. This part is synchronized, so we won't create more channels than we need.

Contributor

By synchronized, do you mean that channel creation is on the critical path of an RPC, i.e., an RPC needs to wait for the channel creation? If so, will this be a performance issue?

Collaborator Author

No, it shouldn't be a performance issue, because we first check whether we need to create a new channel, and only if we do, we acquire the lock, check again, and create a channel if it is still needed. And channel creation is quite fast -- we don't wait until it becomes ready or anything like that.

Contributor

Channel creation is not fast; it can take hundreds of milliseconds. For example, in the gRPC RR LB policy, gRPC only assigns RPCs to subchannels that are READY and waits for subchannels that are connecting. That way, channel creation only causes latency at the beginning of the process, and as long as there is at least one connection, RPCs are not blocked. That does not sound like the case here: an RPC will need to wait for channel creation even if other channels are READY, correct? If so, I am worried, since this is a dynamic channel pool and channels can keep being created (and destroyed).

Collaborator Author

Sorry, this does not work as you described in this case, because we don't issue the RPC to the newly created client immediately.

  1. grpc-gcp gets called to pick a channel for a new RPC.
  2. we check if we have enough channels.
  3. if we don't, we acquire the lock.
  4. we check the number of channels again and if still needed we create a channel.
  5. yes, we ask channel to connect immediately by using getState(true) but we don't wait for it to connect to release the lock.
  6. we release the lock while the new channel is still connecting.
  7. then we return the newly created channel to be used for the RPC in (1)
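The steps above follow a check, lock, re-check pattern. A minimal sketch of that pattern is shown below, under stated assumptions: `ScaleUpSketch` and its members are hypothetical, channels are represented as plain strings, and the real code would also ask the new channel to start connecting (the thread mentions getState(true)) without waiting for it.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the check / lock / re-check flow; not the grpc-gcp implementation.
final class ScaleUpSketch {
    private final Object lock = new Object();
    private final List<String> channels = new CopyOnWriteArrayList<>();
    private final AtomicInteger totalActiveStreams = new AtomicInteger();
    private final int maxRpcPerChannel;
    private final int maxSize;

    ScaleUpSketch(int maxRpcPerChannel, int maxSize) {
        this.maxRpcPerChannel = maxRpcPerChannel;
        this.maxSize = maxSize;
        channels.add("channel-0"); // the pool starts with one channel in this sketch
    }

    private boolean needsScaleUp() {
        int size = channels.size();
        // Same integer-division check as the snippet under review, plus a max-size guard.
        return size < maxSize && (totalActiveStreams.get() / size) >= maxRpcPerChannel;
    }

    /** Called once per new RPC (step 1); returns the channel picked for it. */
    String pickChannel() {
        String picked = channels.get(0); // placeholder; real picking is load-based
        if (needsScaleUp()) {            // step 2: cheap check without the lock
            synchronized (lock) {        // step 3: acquire the lock
                if (needsScaleUp()) {    // step 4: re-check under the lock
                    picked = "channel-" + channels.size();
                    // Steps 4-5: create the channel; connecting would start here
                    // without blocking while the lock is held.
                    channels.add(picked);
                }
            }                            // step 6: lock released
        }
        totalActiveStreams.incrementAndGet();
        return picked;                   // step 7: the new channel serves this RPC
    }

    int size() {
        return channels.size();
    }
}
```

Because each RPC adds at most one channel, a sudden burst grows the pool one channel per incoming RPC until the condition clears or max size is reached, which matches the earlier reply in this thread.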

Contributor

We synced up offline. The known risk is that RPCs could be blocked by the new channel creation. If fallback is enabled, only one RPC will be blocked; if fallback is not enabled, multiple RPCs will be blocked, because RPC assignment is based on channel load and the channel that is being created has no load. Cloud Spanner can configure this behavior.

However, we think this risk is OK since:

  1. Channel scale-up and scale-down should be rare (at most once every few minutes).
  2. We have observability by correlating the channel scale up/down metric with the latency metric.

@nimf nimf requested review from mohanli-ml and removed request for fengli79 July 21, 2025 19:03
@nimf nimf merged commit cbc3149 into master Aug 6, 2025
2 checks passed
@nimf nimf deleted the dynamic_pool branch August 6, 2025 16:31