Replication backlog and replicas use one global shared replication buffer #9166
Conversation
@ShooterIT thank you for this PR.
My thoughts: Note that the above-mentioned roadmap plan indeed doesn't solve the case where a sudden traffic spike on the master accumulates a large replica output buffer on multiple replicas (your PR will mitigate that), but I'm not sure that justifies the complexity price. Note that other cases (slow replicas that can't catch up with the master traffic) can just be considered bad configuration. @redis/core-team @yoav-steinberg feel free to argue or share additional thoughts.
Thanks for your review @oranagra, I want to argue for my PR 😜
For this PR, I have another idea, if you'll allow it: we can discard the replication backlog's memory and use the global shared replication buffer to implement the replication backlog mechanism; the replication backlog is just one more consumer of the replication buffer. This may also save memory, and it avoids memory copies because we just increase the refcount of a node when a replica hits replication backlog content. Moreover, the effective replication backlog size may grow to the largest extent of the replication buffer that slow replicas keep alive.
I don't have answers to everything; I'll try to respond to the parts I can.
I also thought about that idea of using the shared output buffers for the replication backlog, basically saying that a block with a refcount of 0 is still not freed under some conditions (up to a certain size). Overall, this PR is certainly useful... my concern is about the extra complexity it adds and whether or not it is needed if we had the other solution in place. I.e., imagine a case where the other solution is already implemented; then the problem this PR comes to solve is not really that painful. I'll try to give it another look; maybe we can simplify it, or maybe it isn't as complex as I think it is.
Yes, actually, this mechanism is not complex, and I also lightly refactored the surrounding code. I must acknowledge that reserving memory on the replica machine is better than on the master, because master data safety is much more important; notably, the CoW expense is also on the master. In fact, I think this PR doesn't conflict with your solution (sending the RDB and replication buffer by multiplexing); your solution may mitigate the risk of OOM on full synchronization, while my PR aims to reduce the total memory of all replicas' output buffers when there are many replicas and the output buffer accumulates because of a slow network or waiting for the RDB to finish. We always hope that the memory consumption of Redis is predictable and controllable after setting maxmemory. In our deployment we generally set the memory quota of Redis to 1.5~2 times maxmemory (CoW, replication backlog, replicas' output buffers), but Redis may eat much more memory with more replicas and a worse network. Reserving more memory is costly, but not adding memory is risky in this case. In some other disk-based storage services, such as MySQL, there is only one binlog for replication no matter how many replicas there are; copying write commands into every replica's output buffer is really write amplification. In the future, for the replication backlog, I want us to be able to regard the shared replication buffer as the replication backlog whenever the configured backlog is smaller than the shared replication buffer, because referencing the replication buffer is very light. I.e., if the replication backlog is 100MB but the replicas' output buffer limit is 1GB, and one slow replica keeps the replication buffer at 1GB, then another replica could still start partial synchronization when reconnecting even if its offset gap from the master is more than 100MB; of course, the offset gap must not be more than 1GB.
@ShooterIT we discussed this PR in a core-team meeting and decided we want to proceed.
Thanks @oranagra, wow! This branch has some conflicts with the unstable branch; I need to resolve them. I think the current code already implements the feature that all replicas use one global shared replication buffer.
My hunch is that this is better done together; it'll probably impose a few changes on the current mechanisms.
Generally it is ready for review, and I have updated the top comment. @redis/core-team do you have any thoughts?
After PR #9166, the replication backlog is not a real block of memory; it just contains references to the replication buffer's blocks plus a block index (to accelerate searching for an offset during partial sync). So we need to update both the replication buffer blocks' offsets and the replication backlog index's offsets when the master restarts from an RDB, since `server.master_repl_offset` changes. The implication of this bug was just a slow search, not a replication failure.
Since the loop in `incrementalTrimReplicationBacklog` checks the size of `histlen`, we cannot afford to update it only when the loop exits; that may delete many more replication blocks and leave the replication backlog smaller than the configured size. Introduced in #9166. Co-authored-by: sundb <sundbcn@gmail.com>
The issue was that setting `maxmemory` to `used_memory` and expecting eviction is insufficient, since we need to take `mem_not_counted_for_evict` into consideration. This test got broken by #9166.
…ink (#10020) Since #9166 we have an assertion here to make sure replica clients don't write anything to their buffer. But in reality a replica may attempt to write data to its buffer simply by sending a command on the replication link. This command will in most cases be rejected since #8868, but it'll still generate an error. Actually, the only valid command to send on a replication link is `REPLCONF ACK`, which generates no response. We want to keep the design so that replicas can send commands, but we need to avoid any situation where we start putting data in their response buffers, especially since they aren't used anymore. This PR makes sure to disconnect a rogue client which generated a write on the replication link that caused something to be written to the response buffer. To recreate the bug this fixes, simply connect via telnet to a Redis server, write `sync\r\n`, wait for the payload to be written, and then write any command (valid or invalid), such as `ping\r\n`, on the telnet connection. It'll crash the server.
Added regression tests for #10020 / #10081 / #10243. The above PRs fixed some crashes due to an assertion; see function `clientHasPendingReplies` (introduced in #9166). This commit added some tests to cover the above scenarios. These tests would all fail on #9166 without those fixes; although they are fixed now, there is value in adding these tests to cover and verify the changes. They also cover #8868 (verifying the logs). Other changes: 1. Reduce the wait time in `waitForBgsave` and `waitForBgrewriteaof` from 1s to 50ms, which should reduce the time for some tests. 2. Improve the test infra to print context when `assert_match` fails. 3. Improve the test infra to print `$error` when `assert_error` fails. ``` Expected an error matching 'ERR*' but got 'OK' (context: type eval line 4 cmd {assert_error "ERR*" {r set a b}} proc ::test) ```
From #9166, we called `prepareReplicasToWrite` several times when propagating one write command to the replication stream (once per argument, same as we do for normal clients), which is not necessary. Now we only call it once per command, at the beginning of feeding the replication stream. This reduces CPU consumption and gives slightly better performance, specifically when there are many replicas.
…truct (redis#10697) Move the client flags to a more cache-friendly position within the client struct; we regain the 2% of CPU cycles lost since v6.2 (from 630532.57 to 647449.80 ops/sec). These were due to a higher rate of calls to `getClientType`, caused by changes in redis#9166 and redis#10020.
Background
For a Redis master, each replica uses its own copy of the replication buffer, which is a big waste of memory: the more replicas, the more waste, and allocating/freeing memory for every reply list also costs a lot. If we set `client-output-buffer-limit` small and write traffic is heavy, the master may disconnect from replicas and never finish synchronization with them. If we set `client-output-buffer-limit` big, the master may go OOM when there are many replicas that each hold a lot of memory. Because the replication buffers of different replica clients are identical, one simple idea is to have all replicas use one shared replication buffer, which effectively saves memory.
Since the replication backlog content is the same as the replicas' output buffers, we can now discard the replication backlog's own memory and use the global shared replication buffer to implement the replication backlog mechanism.
Implementation
I create one global "replication buffer" which contains the content of the replication stream. The structure of the "replication buffer" is similar to the reply list that exists in every client, but the list node is a `replBufBlock`, which has `id`, `repl_offset`, and `refcount` fields. So now, when we feed the replication stream to the replication backlog and all replicas, we only need to feed the stream into the replication buffer, via `feedReplicationBuffer`. In this function, we set fields of the replication backlog and the replicas to references of the global replication buffer blocks. We also need to check the replicas' output buffer limits, freeing a replica if it exceeds `client-output-buffer-limit`, and trimming the replication backlog if it exceeds `repl-backlog-size`. When sending a reply to replicas, we iterate the replication buffer blocks and send their content; when a block has been fully sent to a replica, we decrease the current node's refcount and increase the next node's refcount, and then free any blocks whose refcount is 0 from the head of the replication buffer.
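The refcount handling described above can be sketched as follows. This is a simplified model, not the Redis source: `replBufBlock` and `feedReplicationBuffer` are named in this PR, but the exact fields beyond `id`/`repl_offset`/`refcount`, the helper `advanceReplicaRef`, and the bare global list are illustrative assumptions.

```c
#include <stdlib.h>
#include <string.h>

typedef struct replBufBlock {
    long long id;            /* monotonically increasing block id */
    long long repl_offset;   /* replication offset of the block's first byte */
    int refcount;            /* backlog + replicas referencing this block */
    size_t used;
    struct replBufBlock *next;
    char buf[];              /* the replication stream payload */
} replBufBlock;

static replBufBlock *buf_head = NULL, *buf_tail = NULL;
static long long next_id = 0;
static long long master_repl_offset = 0;

/* Append a chunk of the replication stream as a new tail block.
 * The backlog always holds a reference, so refcount starts at 1. */
replBufBlock *feedReplicationBuffer(const char *data, size_t len) {
    replBufBlock *b = malloc(sizeof(*b) + len);
    b->id = next_id++;
    b->repl_offset = master_repl_offset + 1;
    b->refcount = 1;                      /* the backlog's reference */
    b->used = len;
    b->next = NULL;
    memcpy(b->buf, data, len);
    if (buf_tail) buf_tail->next = b; else buf_head = b;
    buf_tail = b;
    master_repl_offset += len;
    return b;
}

/* A replica that fully sent `cur` moves on: drop the old reference,
 * take one on the successor, then free unreferenced blocks, but only
 * from the head (middle blocks wait until everything before them is
 * released too). */
replBufBlock *advanceReplicaRef(replBufBlock *cur) {
    replBufBlock *next = cur->next;
    if (next) next->refcount++;
    cur->refcount--;
    while (buf_head && buf_head->refcount == 0) {
        replBufBlock *old = buf_head;
        buf_head = old->next;
        if (!buf_head) buf_tail = NULL;
        free(old);
    }
    return next;
}
```

Note the design point this models: adding a replica never copies the stream, it only bumps a refcount, so N replicas cost O(1) extra memory for the buffered stream instead of N copies.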
Since we now use a linked list to manage the replication backlog, it may cost a lot of time to iterate all the list nodes to find the corresponding replication buffer node. So we create a rax tree to index some of the nodes, but to avoid the rax tree occupying too much memory, we only record one node per 64 as an index entry.
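The lookup strategy can be illustrated like this. Redis uses a rax keyed by replication offset; in this sketch a plain array of block start offsets stands in for both the list and the rax, and `findBlockByOffset` and `REPL_INDEX_STEP` are hypothetical names. The point is the complexity: the coarse index jumps close to the target, so at most 63 blocks are scanned linearly afterwards.

```c
#include <stddef.h>

#define REPL_INDEX_STEP 64   /* one index entry per 64 blocks */

/* Find the block whose range covers `offset`. `starts[i]` is the
 * replication offset of block i's first byte, ascending. Returns the
 * block's position, or -1 if offset precedes the whole buffer. */
long findBlockByOffset(const long long *starts, size_t nblocks,
                       long long offset) {
    if (nblocks == 0 || offset < starts[0]) return -1;
    /* Coarse jump: the last indexed block not past `offset`. */
    size_t pos = 0;
    for (size_t i = 0; i < nblocks; i += REPL_INDEX_STEP) {
        if (starts[i] > offset) break;
        pos = i;
    }
    /* Short linear scan (at most REPL_INDEX_STEP - 1 steps). */
    while (pos + 1 < nblocks && starts[pos + 1] <= offset) pos++;
    return (long)pos;
}
```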
Currently, to make partial resynchronization possible as much as we can, we always let the replication backlog hold the last reference to the replication buffer blocks. The backlog size may exceed the configured size when slow replicas reference a vast number of replication buffer blocks, but this method doesn't increase memory usage, since they share the replication buffer. To avoid freezing the server while freeing unreferenced replication buffer blocks when the backlog exceeds its configured size, we trim the backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks).

Other changes
- `mem_total_replication_buffers`: we add this field to the INFO command; it reports the total memory used by replication buffers.
- `mem_clients_slaves`: now, even if a replica is slow to replicate and its output buffer memory is not 0, this field may still report 0. Since the replication backlog and the replicas share one global replication buffer, only when the replication buffer memory exceeds the configured backlog size do we consider the excess as the replicas' memory; otherwise, we regard the replication buffer memory as the backlog's consumption. In other words, because all replicas and the replication backlog share the global replication buffer, only the part exceeding the backlog size counts as the extra, separate consumption of replicas.
Because we trim the backlog incrementally in the background, the backlog size may exceed the configured size when slow replicas that referenced a vast number of replication buffer blocks disconnect. To avoid a massive eviction loop, we don't count the delay-freed replication backlog in used memory even if there are no replicas; i.e. we also regard this memory as the replicas' memory.
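The incremental trim described above can be sketched like this. It is a simplified model of the behavior, not the Redis function: `incrementalTrimBacklog`, `TRIM_BLOCKS_PER_CALL`, and the minimal `backlogBlock` struct are illustrative. Two properties from this PR are modeled: `histlen` must be updated per freed block so the loop condition stays accurate, and at most 64 blocks are freed per call so a huge trim never freezes the server.

```c
#include <stdlib.h>

typedef struct backlogBlock {
    size_t used;                  /* bytes of stream in this block */
    int refcount;                 /* 1 = referenced by the backlog only */
    struct backlogBlock *next;
} backlogBlock;

#define TRIM_BLOCKS_PER_CALL 64

/* Free head blocks while the backlog would still hold at least
 * `backlog_limit` bytes afterwards, capped at 64 frees per call.
 * Blocks a replica still references (refcount > 1) are never freed,
 * which is why the effective backlog can exceed repl-backlog-size.
 * Returns the number of blocks freed. */
int incrementalTrimBacklog(backlogBlock **head, size_t *histlen,
                           size_t backlog_limit) {
    int freed = 0;
    while (freed < TRIM_BLOCKS_PER_CALL &&
           *head && (*head)->next &&              /* keep >= 1 block */
           *histlen - (*head)->used >= backlog_limit &&
           (*head)->refcount == 1) {
        backlogBlock *old = *head;
        *head = old->next;
        *histlen -= old->used;    /* update inside the loop, per block */
        free(old);
        freed++;
    }
    return freed;
}
```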
`client-output-buffer-limit` check for replica clients
It doesn't make sense to set the replica clients' output buffer limit lower than the `repl-backlog-size` config (partial sync would succeed and then the replica would get disconnected). Such a configuration is ignored (the size of `repl-backlog-size` is used instead). This has no memory consumption implications, since the replica client shares the backlog buffers' memory.
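The limit adjustment amounts to taking the larger of the two values; a minimal sketch, where `effectiveReplicaObufLimit` is a hypothetical name and the assumption that a hard limit of 0 means "no limit" follows the usual `client-output-buffer-limit` semantics:

```c
/* Return the hard output-buffer limit actually enforced for a replica
 * client: a configured limit below repl-backlog-size is pointless
 * (partial sync would succeed, then the replica would be disconnected),
 * so repl-backlog-size is used instead. */
unsigned long long effectiveReplicaObufLimit(unsigned long long hard_limit,
                                             unsigned long long backlog_size) {
    if (hard_limit != 0 && hard_limit < backlog_size) return backlog_size;
    return hard_limit;   /* 0 means "no limit"; keep it */
}
```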
We used to always create a replication backlog if the server is a master; we need it because we put DELs into it when loading expired keys from the RDB. But if the RDB doesn't have replication info, or there is no RDB at all, it is not possible to support partial resynchronization anyway, so to avoid the extra memory of the replication backlog, we drop it.
Since all replicas and the replication backlog use the global replication buffer, if I/O threads are enabled, we must let the main thread handle sending the output buffer to all replicas to guarantee thread-safe data access. Before, other I/O threads could handle sending the output buffers of all replicas.
Other optimizations
This solution also resolves some other problems:
- Checking `client-output-buffer-limit` for replicas: now it doesn't cause freezing.