-
-
Notifications
You must be signed in to change notification settings - Fork 4.9k
Description
Currently we have a lot of structures keyed on the device ID, and assume that there will be exactly one or zero connections to a device. This is probably a design mistake, as we end up removing state for connection a when connection b is closed and things like this:
- panic: bug: ClusterConfig called on closed or nonexistent connection #4170, protocol: We somehow leak state between connections #3906, lib/model: panic: bug: ClusterConfig called on closed or nonexistent connection #3633 and further duplicates
I don't particularly want to maintain more than one connection for more than a few seconds, but it could be good to be able to do so and have things actually work during this brief period of parallellity. Apart from the issues above, I think we could solve:
-
The connection switching is racy and slow because the existing connection needs to be dropped by the other side before the new one can be accepted
-
Timeout handling is suboptimal because we don't accept a new connection until the old one has been timed out
If we could accept more than one connection per device we could do more seamless switching, accept new connections and slowly garbage collect older ones. We might also fix the bugs mentioned above.
Some obvious complications that would need to be taken care of:
-
Duplicate index updates. We should probably send index updates on all connections, as we don't know which one is the best / alive, but the receiving side must be able to handle this. Maybe we already do and the update becomes a no-op, but it needs to be thought about.
-
Load balancing requests over multiple connections when there is more than one, taking into account that a connection might be dead and not yet timed out so we shouldn't throw too many requests at it if it's not responding.
-
Reaping worse-priority connections once better-priority ones are established.
The people limited by TCP latency could increase some tuning value connectionsPerDevice to allow multiple, persistent connections per device, if it anyway works technically with more than one.
(#4202 is related but different and I don't want the discussion on that one to bleed over to this ticket.)
As possible implementation steps, we could:
- Key everything on connection IDs instead and retain the rest of the logic as is. This should hopefully solve a couple of bugs while not bringing any other enhancements to the table. If we have more than one connection alive towards a given device, pick one at random and pretend there was only one. This is more or less what happens today but not explicitly.
- Somehow handle when there is more than one connection more gracefully than just picking one at random.