feat: Dynamic channel pool scaling #194
Provides dynamic scaling functionality, which is configured with three parameters:

- `minRpcPerChannel` -- minimum desired average concurrent calls per channel.
- `maxRpcPerChannel` -- maximum desired average concurrent calls per channel.
- `scaleDownInterval` -- how often to check for a possibility to scale down.

When the average number of concurrent calls per channel reaches `maxRpcPerChannel`, the pool will create and add a new channel unless it is already at max size.

Every `scaleDownInterval` a check for downscaling is performed. Based on the maximum total concurrent calls observed since the last check, the desired number of channels is calculated as `(max_total_concurrent_calls / minRpcPerChannel)` rounded up.

If the calculated desired number of channels is lower than the current number of channels, the pool will be downscaled to the desired number or min size (whichever is greater).

When downscaling, channels with the oldest connections are selected. The selected channels are then removed from the pool but are not instructed to shut down until all of their calls have completed. If the pool is scaling up while a ready channel is still awaiting completion of its calls, that channel will be re-used instead of creating a new channel.
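The downscale calculation described above can be sketched as a small helper. This is only an illustration of the arithmetic: the class and method names are hypothetical and not taken from the PR, and it assumes the current size is never below the min size.

```java
/** Hypothetical helper; names are illustrative and not taken from the PR. */
final class ScaleDownMath {
    private ScaleDownMath() {}

    /**
     * Returns the pool size after a scale-down check, or currentSize when no
     * downscaling is warranted. Assumes currentSize >= minSize.
     */
    static int targetSize(int maxTotalConcurrentCalls, int minRpcPerChannel,
                          int currentSize, int minSize) {
        if (minRpcPerChannel <= 0) {
            throw new IllegalArgumentException("minRpcPerChannel must be positive");
        }
        // Ceiling division: (max_total_concurrent_calls / minRpcPerChannel) rounded up.
        int desired = (maxTotalConcurrentCalls + minRpcPerChannel - 1) / minRpcPerChannel;
        if (desired >= currentSize) {
            return currentSize; // the periodic check only ever scales down
        }
        // Downscale to the desired number or min size, whichever is greater.
        return Math.max(desired, minSize);
    }

    public static void main(String[] args) {
        // 10 channels, observed peak of 42 concurrent calls, minRpcPerChannel = 5.
        System.out.println(targetSize(42, 5, 10, 2)); // ceil(42 / 5) = 9
        // Peak of 3 calls: ceil(3 / 5) = 1, which is clamped to the min size of 2.
        System.out.println(targetSize(3, 5, 10, 2));
    }
}
```

Using integer ceiling division avoids floating-point rounding; the clamp to `minSize` is applied only after confirming that the desired size is actually lower than the current one.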

    if ((totalActiveStreams.get() / channelRefs.size()) >= maxRpcPerChannel) {
      createNewChannel();
      scaleUpCount++;

What if `(totalActiveStreams.get() / channelRefs.size()) >= maxRpcPerChannel` still holds after you create a new channel? In other words, what if you need to create multiple channels at the same time for scaling up?
The condition you described is not really possible because we perform this check for every incoming RPC, so there can only be a difference of one RPC between consecutive checks. We cannot jump by 100 RPCs instantly.
But even if we did jump, this would still be quite safe. The first check would create one channel, the very next RPC would create another one, and so on until the condition no longer holds. This part is synchronized, so we won't create more channels than we need.
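The convergence argument above can be replayed with a toy, single-threaded model. It is a sketch under stated assumptions: the class and method names are hypothetical, the check uses the same integer division as the quoted condition, the check is assumed to run before the triggering RPC is counted, and locking is left out entirely.

```java
/** Toy model of the scale-up check; not the grpc-gcp implementation. */
final class ScaleUpConvergence {
    private ScaleUpConvergence() {}

    /**
     * Replays incomingRpcs sequential RPC arrivals and returns the resulting
     * channel count. Each arrival runs the check once and adds at most one channel.
     */
    static int channelsAfter(int activeStreams, int channels, int maxRpcPerChannel,
                             int maxSize, int incomingRpcs) {
        for (int i = 0; i < incomingRpcs; i++) {
            // Integer division, mirroring the condition quoted from the diff.
            if (channels < maxSize && activeStreams / channels >= maxRpcPerChannel) {
                channels++; // stands in for createNewChannel()
            }
            activeStreams++; // the RPC that triggered the check is now in flight
        }
        return channels;
    }

    public static void main(String[] args) {
        // Gradual ramp: 100 RPCs arrive one by one into a single channel.
        System.out.println(channelsAfter(0, 1, 10, 100, 100)); // 10
        // Hypothetical jump: 100 calls already in flight on one channel, then 15 more arrive.
        System.out.println(channelsAfter(100, 1, 10, 100, 15)); // 12
    }
}
```

In the jump scenario each of the first eleven arrivals adds one channel, after which the average drops below `maxRpcPerChannel` and no further channels are created.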
By synchronized, do you mean that channel creation is in the critical path of an RPC, i.e., an RPC needs to wait for the channel creation? If so, will this be a performance issue?
No, it shouldn't be a performance issue, because we first check whether we need to create a new channel, and only if we do, we acquire the lock, check again, and create a channel if it is still needed. Channel creation is also quite fast -- we don't wait until the channel becomes ready or anything like that.
Channel creation is not fast; it can take hundreds of milliseconds. For example, with the gRPC round-robin (RR) LB policy, gRPC only assigns RPCs to subchannels that are READY and waits for subchannels that are connecting. That way, channel creation only causes latency at the beginning of the process, and as long as there is at least one connection, RPCs will not be blocked. That does not sound like the case here: an RPC will need to wait for channel creation even if there are other channels that are READY, correct? If so, I am worried, because this is a dynamic channel pool and channels can keep being created (and destroyed).
Sorry, it does not work as you described in this case, because we don't issue the RPC to the newly created client immediately.
1. grpc-gcp gets called to pick a channel for a new RPC.
2. We check if we have enough channels.
3. If we don't, we acquire the lock.
4. We check the number of channels again and, if still needed, we create a channel.
5. Yes, we ask the channel to connect immediately by using `getState(true)`, but we don't wait for it to connect before releasing the lock.
6. We release the lock while the new channel is still connecting.
7. Then we return the newly created channel to be used for the RPC in (1).
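The steps above can be sketched with a self-contained model. Everything here is hypothetical: `FakeChannel` stands in for a real gRPC channel, and its `requestConnection()` plays the role that `getState(true)` has in the thread (start connecting, return immediately). The least-loaded pick for the non-scaling path is also an assumption made for the example.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative model of the pick flow; all names here are hypothetical. */
final class PickFlowSketch {
    /** Stand-in for a gRPC channel that only records its connection state. */
    static final class FakeChannel {
        final AtomicInteger activeStreams = new AtomicInteger();
        volatile boolean connectRequested;
        volatile boolean ready;

        /** Returns immediately; nothing here waits for the connection. */
        void requestConnection() { connectRequested = true; }
    }

    final ReentrantLock lock = new ReentrantLock();
    final List<FakeChannel> channels = new CopyOnWriteArrayList<>();
    private final AtomicInteger totalActiveStreams = new AtomicInteger();
    private final int maxRpcPerChannel;
    private final int maxSize;

    PickFlowSketch(int maxRpcPerChannel, int maxSize) {
        this.maxRpcPerChannel = maxRpcPerChannel;
        this.maxSize = maxSize;
        FakeChannel first = new FakeChannel();
        first.ready = true; // assume the pool starts with one connected channel
        channels.add(first);
    }

    private boolean needsScaleUp() {
        int size = channels.size();
        return size < maxSize && totalActiveStreams.get() / size >= maxRpcPerChannel;
    }

    /** Step 1: called to pick a channel for a new RPC. */
    FakeChannel pick() {
        FakeChannel picked = null;
        if (needsScaleUp()) {                    // step 2: check without the lock
            lock.lock();                         // step 3
            try {
                if (needsScaleUp()) {            // step 4: re-check under the lock
                    picked = new FakeChannel();
                    picked.requestConnection();  // step 5: ask to connect, do not wait
                    channels.add(picked);
                }
            } finally {
                lock.unlock();                   // step 6: channel may still be connecting
            }
        }
        if (picked == null) {
            picked = leastLoaded();
        }
        picked.activeStreams.incrementAndGet();
        totalActiveStreams.incrementAndGet();
        return picked;                           // step 7
    }

    private FakeChannel leastLoaded() {
        FakeChannel best = channels.get(0);
        for (FakeChannel c : channels) {
            if (c.activeStreams.get() < best.activeStreams.get()) best = c;
        }
        return best;
    }
}
```

In this model the lock is held only for the re-check and the list update, so other callers are delayed by the bookkeeping rather than by the connection handshake. The RPC that receives the new channel still has to wait for it to connect, which is the concern raised in the thread.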
We synced up offline. The known risk is that RPCs could be blocked by the new channel creation (if fallback is enabled, only one RPC will be blocked; if fallback is not enabled, multiple RPCs will be blocked because RPC assignment is based on channel load and the new channel that is being created has no load. Cloud Spanner can configure this behavior).
However, we think this risk is OK since:
- Channel scale-up and scale-down should be rare (at most once every few minutes);
- We have observability by correlating the channel scale up/down metric with the latency metric.
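The difference between the two fallback settings can be illustrated with a toy load-based picker. This is a sketch under assumptions, not grpc-gcp code: the names are hypothetical, and fallback is modeled as "if the least-loaded channel is not ready, use the least-loaded ready channel instead". It only covers RPCs that arrive after the one that triggered channel creation.

```java
import java.util.Comparator;
import java.util.List;

/** Toy load-based picker; the fallback rule here is an assumption for illustration. */
final class FallbackPickSketch {
    /** Hypothetical channel view: readiness flag plus in-flight call count. */
    static final class Ch {
        final String name;
        final boolean ready;
        final int activeStreams;

        Ch(String name, boolean ready, int activeStreams) {
            this.name = name;
            this.ready = ready;
            this.activeStreams = activeStreams;
        }
    }

    private FallbackPickSketch() {}

    static Ch pick(List<Ch> channels, boolean fallbackEnabled) {
        Comparator<Ch> byLoad = Comparator.comparingInt(c -> c.activeStreams);
        Ch leastLoaded = channels.stream().min(byLoad)
            .orElseThrow(() -> new IllegalStateException("empty pool"));
        if (leastLoaded.ready || !fallbackEnabled) {
            // Without fallback, a still-connecting channel with no load keeps winning.
            return leastLoaded;
        }
        // With fallback, prefer the least-loaded READY channel when one exists.
        return channels.stream().filter(c -> c.ready).min(byLoad).orElse(leastLoaded);
    }
}
```

With two ready channels carrying 5 and 7 calls and a new, still-connecting channel carrying none, `pick(pool, false)` returns the new channel for every subsequent RPC until its load catches up, while `pick(pool, true)` returns the ready channel with 5 calls.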