Conversation

@ShooterIT
Member

@ShooterIT ShooterIT commented Jun 29, 2021

Background

On a Redis master, each replica uses its own copy of the replication buffer, which is a big waste of memory: the more replicas, the more waste, and allocating/freeing memory for every reply list also costs a lot. If client-output-buffer-limit is set small and write traffic is heavy, the master may disconnect replicas and never finish synchronizing with them. If client-output-buffer-limit is set big, the master may go OOM when there are many replicas that each keep a lot of memory. Because the replication buffers of the different replica clients hold the same content, one simple idea is to let all replicas use one shared replication buffer, which effectively saves memory.

Since the replication backlog content is the same as the replicas' output buffers, we can also discard the separate replication backlog memory and use the global shared replication buffer to implement the replication backlog mechanism.

Implementation

I create one global "replication buffer" which contains the content of the replication stream. The structure of the "replication buffer" is similar to the reply list that exists in every client, but each list node is a replBufBlock, which has id, repl_offset and refcount fields.

/* The replication buffer is a list of replBufBlock nodes.
 *
 * +--------------+       +--------------+       +--------------+
 * | refcount = 1 |  ...  | refcount = 0 |  ...  | refcount = 2 |
 * +--------------+       +--------------+       +--------------+
 *      |                                            /       \
 *      |                                           /         \
 *      |                                          /           \
 *  Repl Backlog                               Replica_A     Replica_B
 * 
 * Each replica or the replication backlog increments only the refcount of the
 * 'ref_repl_buf_node' it points to. So when a replica walks to the next
 * node, it first increases the next node's refcount; and when we trim the
 * replication buffer nodes, we always remove nodes from the head whose
 * refcount is 0. If the refcount of the head node is not 0, we must stop
 * trimming and never iterate to the next node. */

/* Similar to 'clientReplyBlock', but used as a shared buffer between
 * all replica clients and the replication backlog. */
typedef struct replBufBlock {
    int refcount;           /* Number of replicas or repl backlog using. */
    long long id;           /* The unique incremental number. */
    long long repl_offset;  /* Start replication offset of the block. */
    size_t size, used;
    char buf[];
} replBufBlock;

So now, instead of feeding the replication stream into the replication backlog and into every replica separately, we only feed it into the shared replication buffer, in feedReplicationBuffer. In this function, we point the relevant fields of the replication backlog and the replicas at the global replication buffer blocks (they only hold references). We also check the replicas' output buffer limits, freeing (disconnecting) a replica that exceeds client-output-buffer-limit, and trim the replication backlog when it exceeds repl-backlog-size.
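
To illustrate how the feeding works, here is a simplified, self-contained sketch (not the code from this PR: the real implementation keeps the blocks in a Redis list and does more bookkeeping; the block size and the `next` pointer are assumptions for brevity):

```c
/* Sketch: append a chunk of the replication stream to the shared buffer.
 * Data is appended to the tail block while it has room; otherwise a new
 * block is allocated and linked at the tail. */
#include <stdlib.h>
#include <string.h>

#define REPL_BLOCK_SIZE (16 * 1024)      /* assumed allocation granularity */

typedef struct replBufBlock {
    struct replBufBlock *next;
    int refcount;                        /* replicas + backlog referencing it */
    long long id;                        /* unique incremental id */
    long long repl_offset;               /* replication offset where the block starts */
    size_t size, used;
    char buf[];
} replBufBlock;

typedef struct replBuffer {
    replBufBlock *head, *tail;
    long long next_id;
    long long master_repl_offset;
} replBuffer;

void feedReplicationBuffer(replBuffer *rb, const char *p, size_t len) {
    rb->master_repl_offset += (long long)len;
    while (len) {
        replBufBlock *tail = rb->tail;
        size_t avail = tail ? tail->size - tail->used : 0;
        if (avail == 0) {
            /* Allocate a fresh block; a reader (replica or backlog) that
             * attaches to it will increment its refcount. */
            replBufBlock *b = malloc(sizeof(*b) + REPL_BLOCK_SIZE);
            b->next = NULL;
            b->refcount = 0;
            b->id = rb->next_id++;
            b->repl_offset = rb->master_repl_offset - (long long)len;
            b->size = REPL_BLOCK_SIZE;
            b->used = 0;
            if (rb->tail) rb->tail->next = b; else rb->head = b;
            rb->tail = b;
            continue;
        }
        size_t n = len < avail ? len : avail;
        memcpy(tail->buf + tail->used, p, n);
        tail->used += n;
        p += n;
        len -= n;
    }
}
```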

When sending replies to replicas, we also iterate the replication buffer blocks and send their content. Once a block has been fully sent to a replica, we increment the next node's refcount, decrement the current node's refcount, and advance the replica's reference; blocks whose refcount has dropped to 0 are then freed from the head of the replication buffer list.
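
A minimal sketch of that hand-over (reusing the simplified types from the sketch above; the `replicaRef` struct and its field names are illustrative, not the real client fields):

```c
/* Sketch: once a replica has sent everything in its current block and a
 * newer block exists, move its reference forward: take the next block's
 * refcount first, then release the current one. */
typedef struct replicaRef {
    replBufBlock *ref_block;   /* block currently being sent to this replica */
    size_t sent_in_block;      /* bytes of that block already sent */
} replicaRef;

void advanceReplicaRef(replicaRef *r) {
    replBufBlock *cur = r->ref_block;
    if (cur == NULL || r->sent_in_block < cur->used) return;  /* not done yet */
    if (cur->next == NULL) return;                            /* nothing newer */
    cur->next->refcount++;     /* reference the next block first... */
    cur->refcount--;           /* ...then drop the current one */
    r->ref_block = cur->next;
    r->sent_in_block = 0;
    /* Head blocks whose refcount dropped to 0 can now be freed by trimming. */
}
```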

Since we now use a linked list to manage the replication backlog, finding the replication buffer node that corresponds to a given offset (for partial resynchronization) could require iterating all the list nodes, which may take a long time. So we create a rax tree that indexes some of the nodes; to avoid the rax tree occupying too much memory, i only record one node per 64 blocks in the index.
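
The indexing idea, sketched (assuming the rax API from Redis's rax.c — raxNew/raxInsert — and the simplified block type from the sketches above; the 64-block stride comes from the description, the rest is illustrative):

```c
/* Sketch: index one out of every 64 appended blocks in a rax keyed by the
 * block's starting replication offset, encoded big-endian so that memcmp
 * order matches numeric order. A PSYNC offset lookup can then seek to a
 * nearby indexed block instead of walking the whole list. */
#include <stdint.h>
#include <string.h>
#include "rax.h"

#define REPL_INDEX_STRIDE 64

static uint64_t toBigEndianU64(uint64_t v) {
    unsigned char b[8];
    for (int i = 0; i < 8; i++) b[i] = (unsigned char)(v >> (56 - 8 * i));
    uint64_t out;
    memcpy(&out, b, 8);
    return out;
}

/* Called each time a new block is appended to the shared buffer. */
void maybeIndexBlock(rax *blocks_index, replBufBlock *b, long long *unindexed) {
    if (++(*unindexed) < REPL_INDEX_STRIDE) return;
    uint64_t key = toBigEndianU64((uint64_t)b->repl_offset);
    raxInsert(blocks_index, (unsigned char *)&key, sizeof(key), b, NULL);
    *unindexed = 0;
}
```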

Currently, to make partial resynchronization possible as much as we can, we always keep the replication backlog as the last reference to the replication buffer blocks. The backlog size may therefore exceed our setting when slow replicas reference a vast number of replication buffer blocks, but this doesn't increase memory usage since they share the replication buffer. To avoid freezing the server while freeing unreferenced replication buffer blocks when the backlog exceeds its configured size, we trim the backlog incrementally (freeing 64 blocks per call now), and do it faster in beforeSleep (freeing 640 blocks).
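
A sketch of the incremental trimming logic (again using the simplified types from the sketches above, not the actual function in this PR; the `backlogRef` struct is illustrative):

```c
/* Sketch: free at most max_blocks head blocks per call, and only while the
 * backlog holds more than its configured size and the head block is
 * referenced by nothing but the backlog itself (the backlog is always the
 * last reference). histlen is updated per freed block, inside the loop. */
typedef struct backlogRef {
    replBufBlock *ref_block;   /* oldest block the backlog still references */
    long long histlen;         /* bytes of history currently kept */
} backlogRef;

void incrementalTrimBacklog(replBuffer *rb, backlogRef *bl,
                            long long backlog_size, int max_blocks) {
    int trimmed = 0;
    while (trimmed < max_blocks && bl->histlen > backlog_size) {
        replBufBlock *head = rb->head;
        /* If a slow replica still references the head block, stop trimming. */
        if (head == NULL || head->refcount != 1 || head->next == NULL) break;
        /* Never shrink the backlog below its configured size. */
        if (bl->histlen - (long long)head->used <= backlog_size) break;
        head->next->refcount++;          /* backlog takes the next block... */
        head->refcount--;                /* ...and releases the head */
        bl->ref_block = head->next;
        bl->histlen -= (long long)head->used;
        rb->head = head->next;
        free(head);                      /* refcount is now 0 */
        trimmed++;
    }
}
```

In the normal cron path this would be called with a small max_blocks (64 in the description above), and with a larger one (640) from beforeSleep.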

Other changes

  • mem_total_replication_buffers: a new field in the INFO command, reporting the total memory used by the replication buffers.
  • mem_clients_slaves: this field may now be 0 even when a replica is slow and holds pending output, because the replication backlog and the replicas share one global replication buffer. Only when the replication buffer memory exceeds the configured repl-backlog-size do we count the excess as the replicas' memory; otherwise the whole replication buffer is attributed to the repl backlog (see the accounting sketch after this list).
  • Key eviction
    Since all replicas and the replication backlog share the global replication buffer, only the part exceeding the configured backlog size is considered the replicas' separate memory consumption.
    Because we trim the backlog incrementally in the background, the backlog may exceed its configured size after slow replicas that referenced many replication buffer blocks disconnect. To avoid a massive eviction loop, we don't count this delayed-freed replication backlog memory as used memory for eviction, even if there are no replicas, i.e. we also regard this memory as the replicas' memory.
  • client-output-buffer-limit check for replica clients
    It doesn't make sense to set the replica clients' output buffer limit lower than the repl-backlog-size config (partial sync would succeed and then the replica would get disconnected). Such a configuration is ignored (the repl-backlog-size value is used instead). This has no memory consumption implications since the replica clients share the backlog buffer's memory.
  • Drop replication backlog after loading data if needed
    We always create the replication backlog if the server is a master; we need it because we put DELs into it when loading expired keys from the RDB. But if the RDB doesn't contain replication info, or there is no RDB at all, partial resynchronization is impossible anyway, so we drop the backlog to avoid its extra memory.
  • Multi IO threads
    Since all replicas and the replication backlog use the global replication buffer, when I/O threads are enabled the main thread must handle sending the output buffer to all replicas to keep data access thread safe. Previously, other I/O threads could also send the replicas' output buffers.
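
To make the accounting rule above concrete (see the mem_clients_slaves bullet), here is a minimal sketch of the idea; it is not the accounting code from this PR, and the names are illustrative:

```c
/* Sketch: attribute shared replication buffer memory.
 * Only the part of the shared buffer that exceeds repl-backlog-size is
 * counted as the replicas' output buffer memory; the rest is considered
 * the replication backlog's consumption. */
#include <stddef.h>

size_t replicasOutputBufferMem(size_t repl_buffer_mem, size_t repl_backlog_size) {
    return repl_buffer_mem > repl_backlog_size
               ? repl_buffer_mem - repl_backlog_size
               : 0;
}
```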

Other optimizations

This solution also resolves some other problems:

  • When a replica is disconnected from the master because it exceeded the output buffer limit, releasing the replica's output buffer could freeze the server if a big client-output-buffer-limit was set for replicas; now it doesn't cause freezing.
  • This implementation also mitigates the cost of copying the reply list (which could also freeze the server) when one replica has a huge reply buffer and another replica joining full synchronization copies that buffer. Now we just copy the reference info, which is very light.
  • If the replication backlog size is set big, copying the replication backlog into a replica's output buffer could also take a long time. This commit eliminates that problem.
  • Resizing the replication backlog no longer empties its current content.

@oranagra
Member

oranagra commented Jul 6, 2021

@ShooterIT thank you for this PR.
I've briefly reviewed the code, trying to sum up what i understand:

  1. the server struct has a mechanism similar to the reply list that exists in every client, this one has a refcount in each node indicating how many replicas are sharing that node.
  2. a replica client can also have a normal (private) reply list, which is normally used on successful psync.
  3. each replica client tracks how many bytes in the shared reply list are meaningful for it (in addition to tracking its own reply buffer size).
  4. when considering the obuf limits disconnection, we look at both the private and shared memory (similar total as before this PR).
  5. when considering the total memory used for replication in the server (e.g. for eviction and info), we count the shared memory just once (counting actual memory usage).

My thoughts:
We have a long term plan to (almost) completely eliminate the slave buffers, see: #8440 (comment)
Had this PR provided a significant improvement with a small price of complexity, i would have merged it right away..
but considering the added complexity, and considering the long term roadmap, i think we may wanna avoid it.

Note that the above mentioned roadmap plan doesn't indeed solve a case where a sudden traffic spike on the master accumulates a large replica output buffer on multiple replicas (your PR will mitigate that), but i'm not sure that justifies the complexity price.

Note that other cases (slow replicas that can't catch up with the master traffic) can just be considered bad configuration.

@redis/core-team @yoav-steinberg feel free to argue or share additional thoughts.

@ShooterIT
Member Author

Thanks for your review @oranagra, i want to argue for my PR 😜
For your plan, i tried to design it before, and there are some problems i encountered; please correct me if i am wrong.

  • The replica stores the output buffer during the RDB transfer, so we need to reserve this big chunk of memory when deploying every instance, because the buffer still exists when loading of the entire RDB finishes, and a master may turn into a replica and a replica into a master. So i think it is better if we only use this reserved memory on the master.

  • On full synchronization with multiple replicas, every replica uses its own output buffer, so in terms of the total memory used in one shard, this method still costs a lot of memory.

  • As you already said, your plan still has this bad case; actually, we encounter it often, because some replicas may be thousands of kilometers away from the master, and the network is sometimes unstable.

    Note that the above mentioned roadmap plan doesn't indeed solve a case where a sudden traffic spike on the master accumulate a large replica output buffer on multiple replicas.

  • Sending the replication stream while transferring the RDB: for diskless replication, there may be more COW memory usage because the child process exists for a longer time.

  • Multiplexing replication packets is quite complicated: replicas need to distinguish different packet types and install different read/write handlers. Especially for the diskless loading solution, i think it is even more complicated. Moreover, we need to consider compatibility when syncing with old version instances.

  • It will not be easy for many replicas to share one RDB with disk-based replication, because we send the replication output buffer ASAP for every replica, so replicas starting a new full sync can't copy an older replica's output buffer.

For this PR, i have another idea if you agree: we can discard the replication backlog memory and use the global shared replication buffer to implement the replication backlog mechanism, so the replication backlog is just another consumer of the replication buffer. This also saves some memory, and saves memory copies, because we just increase the refcount of a node when one replica hits the replication backlog content. Moreover, the effective replication backlog size may grow to the biggest size of the replication buffer that is kept alive by slow replicas.

@oranagra
Member

oranagra commented Jul 6, 2021

I don't have answers to everything, i'll try to respond to the parts i can..

  • regarding reservation of memory on the replica machine, i think this is better than having to reserve that memory on the master machine. one case that's easy to reason with is a case where the machine hosting redis (master) is running low on memory (because the master grew), and now you wanna migrate that master to another machine, so we set up a bigger machine, but we can't get the data out of the old machine, since there's not enough memory on it to host the replica buffers and CoW. if we move the replica buffers to be a problem of the replica machine, the admin can allocate enough memory and save the day.
  • for multiple replicas, they're usually on multiple machines, and each should have enough spare. it's right that if we hosted the replica buffers on the master and shared them we could save total memory, but when a replica is promoted it'll have to have that spare too, and considering a replica per machine, that's the same total.
  • i think the admin should make sure there's enough memory on the machine to support these spikes, or we can throttle them. i agree it's better to share memory if we can, it's just a matter of complexity vs gain.
  • yes indeed it can slow down replication and cause more COW.. but on the bright side it saves the replica buffers, not sure if that's not overall beneficial.
  • first, we don't really need to support old replicas, we don't do that in the rdb format either.. a replica must never be of an older version than the master. however, we may still want to use some replconf capa mechanism to switch this off for compatibility with other tools that pose as a redis replica.
  • yes, disk-based replicas will not be able to join an ongoing fork in that mode.. same as we have with diskless.

i also thought about that idea of using the shared output buffers for the replication backlog.. basically saying that a node with refcount 0 is still not freed under some conditions (up to a certain size).

overall, this PR is certainly useful... my concern is about the extra complexity it adds and whether or not it is needed if we had the other solution in place.. i.e. imagine a case where the other solution is already implemented, then the problem this PR comes to solve is not really that painful.

i'll try to give it another look, maybe we can simplify it, or maybe it isn't as complex as i think it is..

@ShooterIT ShooterIT marked this pull request as ready for review July 9, 2021 06:27
@oranagra oranagra added the approval-needed Waiting for core team approval to be merged label Aug 1, 2021
@ShooterIT
Member Author

Yes, actually, this mechanism is not complex; i also lightly refactored writeToClient, which was a bit long. Without the tests and the refactoring, the changes are fewer.

I must acknowledge that reserving memory on the replica machine is better than on the master, because master data safety is much more important, and especially because the CoW expense is also on the master. In fact, i think this PR doesn't conflict with your solution (sending RDB and replication buffer by multiplexing); your solution may mitigate the risk of OOM on full synchronization. My PR aims to reduce the total memory of all replicas' output buffers when there are many replicas and the output buffer accumulates because of a slow network or waiting for the RDB transfer to finish.

We always hope the memory consumption of redis is predictable and controllable after setting maxmemory. In our deployment we generally set the memory quota of redis to 1.5~2 times maxmemory (to cover CoW, the replication backlog, and replicas' output buffers), but redis may eat too much memory when there are more replicas and a worse network. Reserving more memory is costly, but not adding memory is risky in this case. In some other disk-based storage services, such as MySQL, there is only one binlog for replication no matter how many replicas there are; copying write commands into every replica's output buffer is really write amplification.

In the future, for the replication backlog, i'd like us to treat the shared replication buffer as the replication backlog whenever the backlog is smaller than the shared replication buffer, because referencing the replication buffer is very light. I.e. if the replication backlog is 100MB but the replicas' output buffer limit is 1GB, and one slow replica keeps the replication buffer at 1GB, then another replica could start partial resynchronization when reconnecting even if its offset gap with the master is more than 100MB; of course, the offset gap should not be more than 1GB.

@oranagra
Member

@ShooterIT we discussed this PR in a core-team meeting and decided we wanna proceed.
I didn't review it in detail, and i'm not certain what's the current state (i did take a brief look to realize the concepts).
I suppose we'd want to also proceed with using this for the replication backlog too (unless you think it's a bad idea to add it now, and prefer to add it in a followup PR).
please let me know what you think the next step is... if you want to do some refresh and add the replication backlog into the mix, or you think it's ready for a detailed review.
thanks.

@oranagra oranagra added state:major-decision Requires core team consensus release-notes indication that this issue needs to be mentioned in the release notes and removed approval-needed Waiting for core team approval to be merged labels Aug 17, 2021
@ShooterIT
Member Author

Thanks @oranagra, wow! This branch has some conflicts with the unstable branch, so i need to resolve them. I think the current code already implements the part where all replicas use one global shared replication buffer.
I want to apply this mechanism to the replication backlog too, but initially, to make it easy to review, i'd prefer to do that in another PR; in fact, i haven't implemented it yet. If you want me to implement both in this one PR, that's also fine with me.

@oranagra
Member

my hunch is that this is better done together, it'll probably impose a few changes on the current mechanisms.
but if you feel that it'll be too complicated to review and reason with in one go, i'm ok to do it later.

@ShooterIT ShooterIT marked this pull request as ready for review September 6, 2021 09:44
@ShooterIT
Member Author

Generally it is ready for review, and i updated the top comment. @redis/core-team do you have any thoughts?

ShooterIT added a commit to ShooterIT/redis that referenced this pull request Oct 29, 2021
oranagra pushed a commit that referenced this pull request Oct 29, 2021
oranagra pushed a commit that referenced this pull request Nov 2, 2021
After PR #9166, the replication backlog is no longer a real block of memory; it just holds a
reference to a replication buffer block plus a blocks index (to accelerate searching for an
offset during partial sync). So we need to update both the replication buffer block's offset
and the replication backlog blocks index's offsets when the master restarts from an RDB,
since `server.master_repl_offset` is changed.
The implication of this bug was just a slow search, not a replication failure.
oranagra pushed a commit that referenced this pull request Nov 2, 2021
Since the loop in incrementalTrimReplicationBacklog checks the size of histlen,
we cannot afford to update it only when the loop exits; doing so could delete
many more replication blocks, leaving the replication backlog smaller than the configured size.

Introduced in #9166

Co-authored-by: sundb <sundbcn@gmail.com>
madolson pushed a commit that referenced this pull request Nov 2, 2021
The issue was that setting maxmemory to used_memory and expecting
eviction is insufficient, since we need to take
mem_not_counted_for_evict into consideration.

This test got broken by #9166
oranagra pushed a commit that referenced this pull request Jan 2, 2022
…ink (#10020)

Since #9166 we have an assertion here to make sure replica clients don't write anything to their buffer.
But in reality a replica may attempt to write data to its buffer simply by sending a command on the replication link.
This command in most cases will be rejected since #8868, but it'll still generate an error.
Actually the only valid command to send on a replication link is `REPLCONF ACK`, which generates no response.

We want to keep the design so that replicas can send commands, but we need to avoid any situation where we start
putting data in their response buffers, especially since they aren't used anymore. This PR makes sure to disconnect
a rogue client which generates a write on the replication link that causes something to be written to the response buffer.

To recreate the bug this fixes, simply connect to a redis server via telnet and write sync\r\n, wait for the payload to
be written, and then write any command (valid or invalid), such as ping\r\n, on the telnet connection. It'll crash the server.
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Feb 11, 2022
Added regression tests for redis#10020 / redis#10081 / redis#10243.
The above PRs fixed some crashes caused by an assertion,
see the function `clientHasPendingReplies` (introduced in redis#9166).

This commit adds some tests to cover the above scenarios.
These tests would all have failed on redis#9166 before those fixes;
although they are already fixed, there is value in adding these tests
to cover and verify the changes. They also cover redis#8868 (verifying the logs).

Other changes: reduces the wait time in `waitForBgsave` and
`waitForBgrewriteaof` from 1s to 50ms, which should reduce
the time for some tests.
oranagra pushed a commit that referenced this pull request Feb 13, 2022
Added regression tests for #10020 / #10081 / #10243.
The above PRs fixed some crashes caused by an assertion,
see the function `clientHasPendingReplies` (introduced in #9166).

This commit adds some tests to cover the above scenarios.
These tests would all have failed on #9166 before those fixes;
although they are already fixed, there is value in adding these tests
to cover and verify the changes. They also cover #8868 (verifying the logs).

Other changes: 
1. Reduces the wait time in `waitForBgsave` and `waitForBgrewriteaof`
from 1s to 50ms, which should reduce the time for some tests.
2. Improve the test infra to print context when `assert_match` fails.
3. Improve the test infra to print `$error` when `assert_error` fails.
```
Expected an error matching 'ERR*' but got 'OK' (context: type eval line 4 cmd {assert_error "ERR*" {r set a b}} proc ::test)
```
ShooterIT added a commit to ShooterIT/redis that referenced this pull request Apr 15, 2022
Since redis#9166, we call prepareReplicasToWrite several times when propagating
one write command to the replication stream, which is not necessary. Now we only
call it once, at the beginning of feeding the replication stream.
oranagra pushed a commit that referenced this pull request Apr 17, 2022
Since #9166, we call prepareReplicasToWrite several times when propagating
one write command to the replication stream (once per argument, same as we do for
normal clients), which is not necessary. Now we only call it once per command,
at the beginning of feeding the replication stream.

This results in reducing CPU consumption and slightly better performance,
specifically when there are many replicas.
oranagra pushed a commit that referenced this pull request Jun 1, 2022
…truct (#10697)

Moving the client flags to a more cache-friendly position within the client struct
regains the 2% of CPU cycles lost since v6.2 (from 630532.57 to 647449.80 ops/sec).
The loss was due to a higher rate of calls to getClientType caused by the changes in #9166 and #10020.