Conversation

@ShooterIT
Member

@ShooterIT ShooterIT commented Jun 29, 2021

Background

On a Redis master, each replica uses its own copy of the replication buffer, which is a big waste of memory: the more replicas, the more waste, and allocating/freeing memory for every reply list also costs a lot. If client-output-buffer-limit is set small and write traffic is heavy, the master may disconnect replicas and never finish synchronizing with them. If client-output-buffer-limit is set big, the master may go OOM when there are many replicas that each keep a lot of memory. Because the replication buffers of the different replica clients hold the same content, one simple idea is to let all replicas use one shared replication buffer, which effectively saves memory.

Since the replication backlog content is the same as the replicas' output buffers, we can also discard the separate replication backlog memory and use the global shared replication buffer to implement the replication backlog mechanism.

Implementation

I create one global "replication buffer" which contains the content of the replication stream. The structure of the "replication buffer" is similar to the reply list that exists in every client, but each list node is a replBufBlock, which has id, repl_offset and refcount fields.

/* The replication buffer is a list of replBufBlock nodes.
 *
 * +--------------+       +--------------+       +--------------+
 * | refcount = 1 |  ...  | refcount = 0 |  ...  | refcount = 2 |
 * +--------------+       +--------------+       +--------------+
 *      |                                            /       \
 *      |                                           /         \
 *      |                                          /           \
 *  Repl Backlog                               Replica_A     Replica_B
 * 
 * Each replica or the replication backlog increments only the refcount of the
 * 'ref_repl_buf_node' it points to. So when a replica walks to the next
 * node, it first increases the next node's refcount; and when we trim the
 * replication buffer nodes, we always remove nodes from the head whose
 * refcount is 0. If the refcount of the head node is not 0, we must stop
 * trimming and never iterate to the next node. */

/* Similar to 'clientReplyBlock', but used as a shared buffer between
 * all replica clients and the replication backlog. */
typedef struct replBufBlock {
    int refcount;           /* Number of replicas or repl backlog using. */
    long long id;           /* The unique incremental number. */
    long long repl_offset;  /* Start replication offset of the block. */
    size_t size, used;
    char buf[];
} replBufBlock;

So now, instead of feeding the replication stream into the replication backlog and into every replica separately, we only feed it into the shared replication buffer, in feedReplicationBuffer. In this function, we point the relevant fields of the replication backlog and the replicas at the global replication buffer blocks (they only hold references). We also check the replicas' output buffer limits, freeing (disconnecting) a replica that exceeds client-output-buffer-limit, and trim the replication backlog when it exceeds repl-backlog-size.
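
To illustrate how the feeding works, here is a simplified, self-contained sketch (not the code from this PR: the real implementation keeps the blocks in a Redis list and does more bookkeeping; the block size and the `next` pointer are assumptions for brevity):

```c
/* Sketch: append a chunk of the replication stream to the shared buffer.
 * Data is appended to the tail block while it has room; otherwise a new
 * block is allocated and linked at the tail. */
#include <stdlib.h>
#include <string.h>

#define REPL_BLOCK_SIZE (16 * 1024)      /* assumed allocation granularity */

typedef struct replBufBlock {
    struct replBufBlock *next;
    int refcount;                        /* replicas + backlog referencing it */
    long long id;                        /* unique incremental id */
    long long repl_offset;               /* replication offset where the block starts */
    size_t size, used;
    char buf[];
} replBufBlock;

typedef struct replBuffer {
    replBufBlock *head, *tail;
    long long next_id;
    long long master_repl_offset;
} replBuffer;

void feedReplicationBuffer(replBuffer *rb, const char *p, size_t len) {
    rb->master_repl_offset += (long long)len;
    while (len) {
        replBufBlock *tail = rb->tail;
        size_t avail = tail ? tail->size - tail->used : 0;
        if (avail == 0) {
            /* Allocate a fresh block; a reader (replica or backlog) that
             * attaches to it will increment its refcount. */
            replBufBlock *b = malloc(sizeof(*b) + REPL_BLOCK_SIZE);
            b->next = NULL;
            b->refcount = 0;
            b->id = rb->next_id++;
            b->repl_offset = rb->master_repl_offset - (long long)len;
            b->size = REPL_BLOCK_SIZE;
            b->used = 0;
            if (rb->tail) rb->tail->next = b; else rb->head = b;
            rb->tail = b;
            continue;
        }
        size_t n = len < avail ? len : avail;
        memcpy(tail->buf + tail->used, p, n);
        tail->used += n;
        p += n;
        len -= n;
    }
}
```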

When sending replies to replicas, we also iterate the replication buffer blocks and send their content. Once a block has been fully sent to a replica, we increment the next node's refcount, decrement the current node's refcount, and advance the replica's reference; blocks whose refcount has dropped to 0 are then freed from the head of the replication buffer list.
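
A minimal sketch of that hand-over (reusing the simplified types from the sketch above; the `replicaRef` struct and its field names are illustrative, not the real client fields):

```c
/* Sketch: once a replica has sent everything in its current block and a
 * newer block exists, move its reference forward: take the next block's
 * refcount first, then release the current one. */
typedef struct replicaRef {
    replBufBlock *ref_block;   /* block currently being sent to this replica */
    size_t sent_in_block;      /* bytes of that block already sent */
} replicaRef;

void advanceReplicaRef(replicaRef *r) {
    replBufBlock *cur = r->ref_block;
    if (cur == NULL || r->sent_in_block < cur->used) return;  /* not done yet */
    if (cur->next == NULL) return;                            /* nothing newer */
    cur->next->refcount++;     /* reference the next block first... */
    cur->refcount--;           /* ...then drop the current one */
    r->ref_block = cur->next;
    r->sent_in_block = 0;
    /* Head blocks whose refcount dropped to 0 can now be freed by trimming. */
}
```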

Since we now use a linked list to manage the replication backlog, finding the replication buffer node that corresponds to a given offset (for partial resynchronization) could require iterating all the list nodes, which may take a long time. So we create a rax tree that indexes some of the nodes; to avoid the rax tree occupying too much memory, i only record one node per 64 blocks in the index.
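
The indexing idea, sketched (assuming the rax API from Redis's rax.c — raxNew/raxInsert — and the simplified block type from the sketches above; the 64-block stride comes from the description, the rest is illustrative):

```c
/* Sketch: index one out of every 64 appended blocks in a rax keyed by the
 * block's starting replication offset, encoded big-endian so that memcmp
 * order matches numeric order. A PSYNC offset lookup can then seek to a
 * nearby indexed block instead of walking the whole list. */
#include <stdint.h>
#include <string.h>
#include "rax.h"

#define REPL_INDEX_STRIDE 64

static uint64_t toBigEndianU64(uint64_t v) {
    unsigned char b[8];
    for (int i = 0; i < 8; i++) b[i] = (unsigned char)(v >> (56 - 8 * i));
    uint64_t out;
    memcpy(&out, b, 8);
    return out;
}

/* Called each time a new block is appended to the shared buffer. */
void maybeIndexBlock(rax *blocks_index, replBufBlock *b, long long *unindexed) {
    if (++(*unindexed) < REPL_INDEX_STRIDE) return;
    uint64_t key = toBigEndianU64((uint64_t)b->repl_offset);
    raxInsert(blocks_index, (unsigned char *)&key, sizeof(key), b, NULL);
    *unindexed = 0;
}
```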

Currently, to make partial resynchronization possible as much as we can, we always keep the replication backlog as the last reference to the replication buffer blocks. The backlog size may therefore exceed our setting when slow replicas reference a vast number of replication buffer blocks, but this doesn't increase memory usage since they share the replication buffer. To avoid freezing the server while freeing unreferenced replication buffer blocks when the backlog exceeds its configured size, we trim the backlog incrementally (freeing 64 blocks per call now), and do it faster in beforeSleep (freeing 640 blocks).
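
A sketch of the incremental trimming logic (again using the simplified types from the sketches above, not the actual function in this PR; the `backlogRef` struct is illustrative):

```c
/* Sketch: free at most max_blocks head blocks per call, and only while the
 * backlog holds more than its configured size and the head block is
 * referenced by nothing but the backlog itself (the backlog is always the
 * last reference). histlen is updated per freed block, inside the loop. */
typedef struct backlogRef {
    replBufBlock *ref_block;   /* oldest block the backlog still references */
    long long histlen;         /* bytes of history currently kept */
} backlogRef;

void incrementalTrimBacklog(replBuffer *rb, backlogRef *bl,
                            long long backlog_size, int max_blocks) {
    int trimmed = 0;
    while (trimmed < max_blocks && bl->histlen > backlog_size) {
        replBufBlock *head = rb->head;
        /* If a slow replica still references the head block, stop trimming. */
        if (head == NULL || head->refcount != 1 || head->next == NULL) break;
        /* Never shrink the backlog below its configured size. */
        if (bl->histlen - (long long)head->used <= backlog_size) break;
        head->next->refcount++;          /* backlog takes the next block... */
        head->refcount--;                /* ...and releases the head */
        bl->ref_block = head->next;
        bl->histlen -= (long long)head->used;
        rb->head = head->next;
        free(head);                      /* refcount is now 0 */
        trimmed++;
    }
}
```

In the normal cron path this would be called with a small max_blocks (64 in the description above), and with a larger one (640) from beforeSleep.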

Other changes

  • mem_total_replication_buffers: a new field in the INFO command, reporting the total memory used by the replication buffers.
  • mem_clients_slaves: this field may now be 0 even when a replica is slow and holds pending output, because the replication backlog and the replicas share one global replication buffer. Only when the replication buffer memory exceeds the configured repl-backlog-size do we count the excess as the replicas' memory; otherwise the whole replication buffer is attributed to the repl backlog (see the accounting sketch after this list).
  • Key eviction
    Since all replicas and the replication backlog share the global replication buffer, only the part exceeding the configured backlog size is considered the replicas' separate memory consumption.
    Because we trim the backlog incrementally in the background, the backlog may exceed its configured size after slow replicas that referenced many replication buffer blocks disconnect. To avoid a massive eviction loop, we don't count this delayed-freed replication backlog memory as used memory for eviction, even if there are no replicas, i.e. we also regard this memory as the replicas' memory.
  • client-output-buffer-limit check for replica clients
    It doesn't make sense to set the replica clients' output buffer limit lower than the repl-backlog-size config (partial sync would succeed and then the replica would get disconnected). Such a configuration is ignored (the repl-backlog-size value is used instead). This has no memory consumption implications since the replica clients share the backlog buffer's memory.
  • Drop replication backlog after loading data if needed
    We always create the replication backlog if the server is a master; we need it because we put DELs into it when loading expired keys from the RDB. But if the RDB doesn't contain replication info, or there is no RDB at all, partial resynchronization is impossible anyway, so we drop the backlog to avoid its extra memory.
  • Multi IO threads
    Since all replicas and the replication backlog use the global replication buffer, when I/O threads are enabled the main thread must handle sending the output buffer to all replicas to keep data access thread safe. Previously, other I/O threads could also send the replicas' output buffers.
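
To make the accounting rule above concrete (see the mem_clients_slaves bullet), here is a minimal sketch of the idea; it is not the accounting code from this PR, and the names are illustrative:

```c
/* Sketch: attribute shared replication buffer memory.
 * Only the part of the shared buffer that exceeds repl-backlog-size is
 * counted as the replicas' output buffer memory; the rest is considered
 * the replication backlog's consumption. */
#include <stddef.h>

size_t replicasOutputBufferMem(size_t repl_buffer_mem, size_t repl_backlog_size) {
    return repl_buffer_mem > repl_backlog_size
               ? repl_buffer_mem - repl_backlog_size
               : 0;
}
```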

Other optimizations

This solution also resolves some other problems:

  • When a replica is disconnected from the master because it exceeded the output buffer limit, releasing the replica's output buffer could freeze the server if a big client-output-buffer-limit was set for replicas; now it doesn't cause freezing.
  • This implementation also mitigates the cost of copying the reply list (which could also freeze the server) when one replica has a huge reply buffer and another replica joining full synchronization copies that buffer. Now we just copy the reference info, which is very light.
  • If the replication backlog size is set big, copying the replication backlog into a replica's output buffer could also take a long time. This commit eliminates that problem.
  • Resizing the replication backlog no longer empties its current content.

@oranagra
Member

oranagra commented Jul 6, 2021

@ShooterIT thank you for this PR.
I've briefly reviewed the code, trying to sum up what i understand:

  1. the server struct has a mechanism similar to the reply list that exists in every client, this one has a refcount in each node indicating how many replicas are sharing that node.
  2. a replica client can also have a normal (private) reply list, which is normally used on successful psync.
  3. each replica client tracks how many bytes in the shared reply list are meaningful for it (in addition to tracking its own reply buffer size).
  4. when considering the obuf limits disconnection, we look at both the private and shared memory (similar total as before this PR).
  5. when considering the total memory used for replication in the server (e.g. for eviction and info), we count the shared memory just once (counting actual memory usage).

My thoughts:
We have a long term plan to (almost) completely eliminate the slave buffers, see: #8440 (comment)
Had this PR provided a significant improvement with a small price of complexity, i would have merged it right away..
but considering the added complexity, and considering the long term roadmap, i think we may wanna avoid it.

Note that the above mentioned roadmap plan doesn't indeed solve a case where a sudden traffic spike on the master accumulates a large replica output buffer on multiple replicas (your PR will mitigate that), but i'm not sure that justifies the complexity price.

Note that other cases (slow replicas that can't catch up with the master traffic) can just be considered bad configuration.

@redis/core-team @yoav-steinberg feel free to argue or share additional thoughts.

@ShooterIT
Member Author

Thanks for your review @oranagra, i want to argue for my PR 😜
For your plan, i tried to design it before, and there are some problems i encountered; please correct me if i am wrong.

  • The replica stores the output buffer during the RDB transfer, so we need to reserve this big chunk of memory when deploying every instance, because the buffer still exists when loading of the entire RDB finishes, and a master may turn into a replica and a replica into a master. So i think it is better if we only use this reserved memory on the master.

  • On full synchronization with multiple replicas, every replica uses its own output buffer, so in terms of the total memory used in one shard, this method still costs a lot of memory.

  • As you already said, your plan still has this bad case; actually, we encounter it often, because some replicas may be thousands of kilometers away from the master, and the network is sometimes unstable.

    Note that the above mentioned roadmap plan doesn't indeed solve a case where a sudden traffic spike on the master accumulate a large replica output buffer on multiple replicas.

  • Sending the replication stream while transferring the RDB: for diskless replication, there may be more COW memory usage because the child process exists for a longer time.

  • Multiplexing replication packets is quite complicated: replicas need to distinguish different packet types and install different read/write handlers. Especially for the diskless loading solution, i think it is even more complicated. Moreover, we need to consider compatibility when syncing with old version instances.

  • It will not be easy for many replicas to share one RDB with disk-based replication, because we send the replication output buffer ASAP for every replica, so replicas starting a new full sync can't copy an older replica's output buffer.

For this PR, i have another idea if you agree: we can discard the replication backlog memory and use the global shared replication buffer to implement the replication backlog mechanism, so the replication backlog is just another consumer of the replication buffer. This also saves some memory, and saves memory copies, because we just increase the refcount of a node when one replica hits the replication backlog content. Moreover, the effective replication backlog size may grow to the biggest size of the replication buffer that is kept alive by slow replicas.

@oranagra
Member

oranagra commented Jul 6, 2021

I don't have answers to everything, i'll try to respond to the parts i can..

  • regarding reservation of memory on the replica machine, i think this is better than having to reserve that memory on the master machine. one case that's easy to reason with is a case where the machine hosting redis (master) is running low on memory (because the master grew), and now you wanna migrate that master to another machine, so we set up a bigger machine, but we can't get the data out of the old machine, since there's not enough memory on it to host the replica buffers and CoW. if we move the replica buffers to be a problem of the replica machine, the admin can allocate enough memory and save the day.
  • for multiple replicas, they're usually on multiple machines, and each should have enough spare. it's right that if we hosted the replica buffers on the master and shared them we could save total memory, but when a replica is promoted it'll have to have that spare too, and considering a replica per machine, that's the same total.
  • i think the admin should make sure there's enough memory on the machine to support these spikes, or we can throttle them. i agree it's better to share memory if we can, it's just a matter of complexity vs gain.
  • yes indeed it can slow down replication and cause more COW.. but on the bright side it saves the replica buffers, not sure if that's not overall beneficial.
  • first, we don't really need to support old replicas, we don't do that in the rdb format either.. a replica must never be of an older version than the master. however, we may still want to use some replconf capa mechanism to switch this off for compatibility with other tools that pose as a redis replica.
  • yes, disk-based replicas will not be able to join an ongoing fork in that mode.. same as we have with diskless.

i also thought about that idea of using the shared output buffers for the replication backlog.. basically saying that a node with refcount 0 is still not freed under some conditions (up to a certain size).

overall, this PR is certainly useful... my concern is about the extra complexity it adds and whether or not it is needed if we had the other solution in place.. i.e. imagine a case where the other solution is already implemented, then the problem this PR comes to solve is not really that painful.

i'll try to give it another look, maybe we can simplify it, or maybe it isn't as complex as i think it is..

@ShooterIT ShooterIT marked this pull request as ready for review July 9, 2021 06:27
@oranagra oranagra added the approval-needed Waiting for core team approval to be merged label Aug 1, 2021
@ShooterIT
Member Author

Yes, actually, this mechanism is not complex; i also lightly refactored writeToClient, which was a bit long. Without the tests and the refactoring, the changes are fewer.

I must acknowledge that reserving memory on the replica machine is better than on the master, because master data safety is much more important, and especially because the CoW expense is also on the master. In fact, i think this PR doesn't conflict with your solution (sending RDB and replication buffer by multiplexing); your solution may mitigate the risk of OOM on full synchronization. My PR aims to reduce the total memory of all replicas' output buffers when there are many replicas and the output buffer accumulates because of a slow network or waiting for the RDB transfer to finish.

We always hope the memory consumption of redis is predictable and controllable after setting maxmemory. In our deployment we generally set the memory quota of redis to 1.5~2 times maxmemory (to cover CoW, the replication backlog, and replicas' output buffers), but redis may eat too much memory when there are more replicas and a worse network. Reserving more memory is costly, but not adding memory is risky in this case. In some other disk-based storage services, such as MySQL, there is only one binlog for replication no matter how many replicas there are; copying write commands into every replica's output buffer is really write amplification.

In the future, for the replication backlog, i'd like us to treat the shared replication buffer as the replication backlog whenever the backlog is smaller than the shared replication buffer, because referencing the replication buffer is very light. I.e. if the replication backlog is 100MB but the replicas' output buffer limit is 1GB, and one slow replica keeps the replication buffer at 1GB, then another replica could start partial resynchronization when reconnecting even if its offset gap with the master is more than 100MB; of course, the offset gap should not be more than 1GB.

@oranagra
Member

@ShooterIT we discussed this PR in a core-team meeting and decided we wanna proceed.
I didn't review it in detail, and i'm not certain what's the current state (i did take a brief look to realize the concepts).
I suppose we'd want to also proceed with using this for the replication backlog too (unless you think it's a bad idea to add it now, and prefer to add it in a followup PR).
please let me know what you think the next step is... if you want to do some refresh and add the replication backlog into the mix, or you think it's ready for a detailed review.
thanks.

@oranagra oranagra added state:major-decision Requires core team consensus release-notes indication that this issue needs to be mentioned in the release notes and removed approval-needed Waiting for core team approval to be merged labels Aug 17, 2021
@ShooterIT
Member Author

Thanks @oranagra, wow! This branch has some conflicts with the unstable branch, so i need to resolve them. I think the current code already implements the part where all replicas use one global shared replication buffer.
I want to apply this mechanism to the replication backlog too, but initially, to make it easy to review, i'd prefer to do that in another PR; in fact, i haven't implemented it yet. If you want me to implement both in this one PR, that's also fine with me.

@oranagra
Member

my hunch is that this is better done together, it'll probably impose a few changes on the current mechanisms.
but if you feel that it'll be too complicated to review and reason with in one go, i'm ok to do it later.

@ShooterIT ShooterIT marked this pull request as ready for review September 6, 2021 09:44
@ShooterIT
Member Author

Generally it is ready for review, and i updated the top comment. @redis/core-team do you have any thoughts?

ShooterIT added a commit to ShooterIT/redis that referenced this pull request Oct 29, 2021
oranagra pushed a commit that referenced this pull request Oct 29, 2021
oranagra pushed a commit that referenced this pull request Nov 2, 2021
After PR #9166, the replication backlog is no longer a real block of memory; it just holds a
reference to a replication buffer block plus a blocks index (to accelerate searching for an
offset during partial sync). So we need to update both the replication buffer block's offset
and the replication backlog blocks index's offsets when the master restarts from an RDB,
since `server.master_repl_offset` is changed.
The implication of this bug was just a slow search, not a replication failure.
oranagra pushed a commit that referenced this pull request Nov 2, 2021
Since the loop in incrementalTrimReplicationBacklog checks the size of histlen,
we cannot afford to update it only when the loop exits; doing so could delete
many more replication blocks, leaving the replication backlog smaller than the configured size.

Introduced in #9166

Co-authored-by: sundb <sundbcn@gmail.com>
madolson pushed a commit that referenced this pull request Nov 2, 2021
The issue was that setting maxmemory to used_memory and expecting
eviction is insufficient, since we need to take
mem_not_counted_for_evict into consideration.

This test got broken by #9166
oranagra pushed a commit that referenced this pull request Jan 2, 2022
…ink (#10020)

Since #9166 we have an assertion here to make sure replica clients don't write anything to their buffer.
But in reality a replica may attempt to write data to its buffer simply by sending a command on the replication link.
This command in most cases will be rejected since #8868, but it'll still generate an error.
Actually the only valid command to send on a replication link is `REPLCONF ACK`, which generates no response.

We want to keep the design so that replicas can send commands, but we need to avoid any situation where we start
putting data in their response buffers, especially since they aren't used anymore. This PR makes sure to disconnect
a rogue client which generates a write on the replication link that causes something to be written to the response buffer.

To recreate the bug this fixes, simply connect to a redis server via telnet and write sync\r\n, wait for the payload to
be written, and then write any command (valid or invalid), such as ping\r\n, on the telnet connection. It'll crash the server.
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Feb 11, 2022
Added regression tests for redis#10020 / redis#10081 / redis#10243.
The above PRs fixed some crashes caused by an assertion,
see the function `clientHasPendingReplies` (introduced in redis#9166).

This commit adds some tests to cover the above scenarios.
These tests would all have failed on redis#9166 before those fixes;
although they are already fixed, there is value in adding these tests
to cover and verify the changes. They also cover redis#8868 (verifying the logs).

Other changes: reduces the wait time in `waitForBgsave` and
`waitForBgrewriteaof` from 1s to 50ms, which should reduce
the time for some tests.
oranagra pushed a commit that referenced this pull request Feb 13, 2022
Added regression tests for #10020 / #10081 / #10243.
The above PRs fixed some crashes caused by an assertion,
see the function `clientHasPendingReplies` (introduced in #9166).

This commit adds some tests to cover the above scenarios.
These tests would all have failed on #9166 before those fixes;
although they are already fixed, there is value in adding these tests
to cover and verify the changes. They also cover #8868 (verifying the logs).

Other changes: 
1. Reduces the wait time in `waitForBgsave` and `waitForBgrewriteaof`
from 1s to 50ms, which should reduce the time for some tests.
2. Improve the test infra to print context when `assert_match` fails.
3. Improve the test infra to print `$error` when `assert_error` fails.
```
Expected an error matching 'ERR*' but got 'OK' (context: type eval line 4 cmd {assert_error "ERR*" {r set a b}} proc ::test)
```
ShooterIT added a commit to ShooterIT/redis that referenced this pull request Apr 15, 2022
Since redis#9166, we call prepareReplicasToWrite several times when propagating
one write command to the replication stream, which is not necessary. Now we only
call it once, at the beginning of feeding the replication stream.
oranagra pushed a commit that referenced this pull request Apr 17, 2022
Since #9166, we call prepareReplicasToWrite several times when propagating
one write command to the replication stream (once per argument, same as we do for
normal clients), which is not necessary. Now we only call it once per command,
at the beginning of feeding the replication stream.

This results in reducing CPU consumption and slightly better performance,
specifically when there are many replicas.
oranagra pushed a commit that referenced this pull request Jun 1, 2022
…truct (#10697)

Moving the client flags to a more cache-friendly position within the client struct
regains the 2% of CPU cycles lost since v6.2 (from 630532.57 to 647449.80 ops/sec).
The loss was due to a higher rate of calls to getClientType caused by the changes in #9166 and #10020.