-
Notifications
You must be signed in to change notification settings - Fork 24.4k
Description
Describe the bug
Suppose we have a 2 shards cluster mode enabled Redis cluster, each shard just has one node - the primary. Let's label primary of shard 1 to be P1, and primary of shard 2 to be P2.
Suppose there is a one-way 100% packet loss from P1 to P2, essentially a asymmetric network partition. This could be caused by either an issue in the underlying network, or a bug in the Redis software, etc.
Now, from the perspective of P1, it has two TCP connections with P2 in steady-state - One outbound connection it established toward P2 here, one inbound connection it accepted from P2 here.
In face of the aforementioned asymmetric network partition:
- The outbound connection is associated with a send buffer that can grow unbounded if data(e.g. PubSub messages) keeps being added but can't be sent. And because this memory consumption is not exempted from the
maxmemorythreshold, it will eventually result in total write outage for clients in the form of-OOMerrors. - Both outbound and inbound connections won't be freed because Redis has an existing logic where it only frees the outbound connection if both inbound and outbound connections cannot receive any data - which isn't true in the case of asymmetric network partition, data can still be received from the inbound connection.
- P1 won't mark P2 as
fail?because has an existing logic where it only marks a peerfail?if both inbound and outbound connections cannot receive any data from that peer
This situation can only be possibly resolved if TCP Keepalive eventually kicks in. Redis' default TCP Keepalive timeout is set to 300 seconds/5 mins (recently change to 2*server.cluster_node_timeout, but only for inbound links), which means TCP Keepalive will kick in after 5 mins of radio silence on a TCP connection. And TCP Keepalive will probe 3 more times before it eventually closes the connection, which will collectively take another 5 mins. In total, a radio silent cluster bus connection won't be closed for at least 10 mins (could be longer if users sets TCP Keeplive timeout to be larger). Moreover, TCP Keepalive won't kick in at all if the packet loss isn't 100% or the network partition is formed at Redis software level(Imagine a code bug where P1 doesn't correctly register read/write handlers for its outbound cluster bus link toward P2 even though the underlying TCP connection is perfectly healthy).
We've seen cases where a single cluster link's send buffer can grow to 16GB within 10 mins, in face of a steady stream of PubSub messages of size 1KB. The memory growth is not linear because jemalloc does pre-allocation - as the size of the currently used memory gets larger, the size of each pre-allocation becomes larger as well. Below is a graph that shows the memory growth curve of a single cluster link's send buffer:
(X axis is the number of jemalloc allocations over 10 mins, Y axis is the total memory size of a cluster bus link send buffer in bytes)
Expected behavior
There are couple ways to mitigate the aforementioned issues:
- As a short term thing, IMO we should detect liveness of inbound and outbound connections separately. If we haven't received anything from the outbound connection, we should free it, which will in turn release its send buffers. We shouldn't keep it around just because its inbound counterpart is alive.
- In the longer term, I propose we introduce memory management mechanism for cluster bus link buffers, similar to how we manage memory usage of clients:
a. Introducecluster-link-buffer-limit, similar to howclient-output-buffer-limits work for clients. Disconnect cluster bus links if they overrun this limit.
b. Exclude memory usage of cluster links frommaxmemoryand manage it separately, similar to how we manage client output buffers.
c. Display memory consumption of cluster links in theINFOMemorysection.
If we can reach agreement on these proposals, I can create separate PRs for #1 and #2. (#2 probably will be a bigger one and generates more discussions, so don't want to bundle them together.)
Additional information
Any additional information that is relevant to the problem.
