
Conversation

@ny0312 (Contributor) commented Nov 12, 2021

This PR introduces memory management for cluster bus link send buffers, to address the issues raised in #9688.

The interface changes, which need core-team consensus, are listed below (a usage sketch follows the list):

  • Introduced a new cluster-link-sendbuf-limit config that caps memory usage of cluster bus link send buffers.
    • Its default is 0 (i.e., unlimited).
  • Introduced a new CLUSTER LINKS command that displays current TCP links to/from peers as a RESP3 map.
  • Added mem_cluster_links field under the Memory section of INFO command output, which displays the overall memory usage by all current cluster links.
    • The value of mem_cluster_links is also deducted from used_memory_dataset, to better isolate the memory used by actual key data.
    • Memory usage of cluster links is still counted toward the maxmemory limit, so overconsumption by cluster links can still cause evictions or OOM errors (no behavior change).
  • Added total_cluster_links_buffer_limit_exceeded field under the CLUSTER INFO command output, which displays the accumulated count of cluster links freed due to cluster-link-sendbuf-limit.
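
For illustration, here is a minimal usage sketch of the new interfaces, written in the style of the cluster test suite (the R helper and the exact reply parsing are assumptions for the sketch, not verbatim from this PR):

# Sketch only: exercise the new config and introspection surfaces.
# Cap each cluster bus link send buffer at 1MB (default 0 means unlimited).
R 0 config set cluster-link-sendbuf-limit 1048576

# CLUSTER LINKS returns one entry per TCP link; with N peers a healthy node
# reports 2*N links (one outbound "to" and one inbound "from" per peer).
set links [R 0 cluster links]
puts "node 0 has [llength $links] cluster links"

# Overall memory held by cluster link buffers, reported under INFO memory.
regexp {mem_cluster_links:(\d+)} [R 0 info memory] -> mem_links
puts "mem_cluster_links: $mem_links bytes"

# Accumulated count of links freed for exceeding cluster-link-sendbuf-limit.
regexp {total_cluster_links_buffer_limit_exceeded:(\d+)} [R 0 cluster info] -> freed
puts "links freed over the limit so far: $freed"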

On an implementation level:

  • In the clusterNode struct, added an inbound_link field pointing to the inbound link accepted from that node. Before this change, inbound links were not associated with their corresponding clusterNode and were only loosely managed.

@ny0312 (Contributor, Author) commented Nov 12, 2021

This is a new PR that replaces the old PR: ny0312#1

There were some comments left on the old PR that are now addressed in this new PR.

@ny0312 ny0312 changed the title from "Introduce memory management on cluster link buffers #1" to "Introduce memory management on cluster link buffers" Nov 13, 2021
@madolson madolson added the release-notes (indication that this issue needs to be mentioned in the release notes) and state:major-decision (requires core team consensus) labels Nov 14, 2021
@madolson (Contributor) left a comment

Still some thoughts on the other PR, but this one seems OK to me. We also need a change in redis.conf to describe the parameter.

@madolson (Contributor) left a comment

LGTM otherwise.

@madolson madolson added the approval-needed (waiting for core team approval to be merged) and state:needs-code-doc (the PR requires documentation inside the redis repository) labels Nov 28, 2021
@madolson madolson requested a review from yossigo November 28, 2021 02:30
@madolson (Contributor) commented

@redis/core-team Please review; there are two core-team decisions. The first is the config to limit the maximum amount of outbound data, and the second is the command that describes the information for the cluster links.

@oranagra (Member) left a comment

I didn't review the majority of the code, just the parts related to the new interfaces.
I suppose the feature is valid, but I'd like to argue about the default value for the config.

@madolson (Contributor) commented Dec 7, 2021

@ny0312 The decision was to set the default to 0 (i.e., unlimited) for now, and we'll evaluate how it works. We'll document that the limit can be changed if needed.

@bashanyy commented Dec 7, 2021 via email

@ny0312 (Contributor, Author) commented Dec 14, 2021

Accompanying doc PR: redis/redis-doc#1710

Thanks.

@madolson (Contributor) left a comment

LGTM, @redis/core-team, back to y'all for approval.

@oranagra (Member) left a comment

One comment, otherwise LGTM.

@oranagra (Member) commented

The top comment of the PR is outdated too.

@oranagra (Member) left a comment

LGTM

@madolson (Contributor) commented

@ny0312 Sorry to ask more from you, but we moved to a new command definition system:

If you have any PR that modifies the command table, you'll need to rebase, modify the json files, and run the python script that generates commands.c (and commit the result too).
The same goes for other command metadata (arguments, response, etc.); it's all documented there now.

If you have any redis-doc PR that changes commands.json, make sure the change is reflected in the json files in the redis repo. The commands.json in redis-doc is now in "code" freeze (for now you can keep the change in both repos, but make sure they're identical); we'll soon delete the json file from redis-doc and take the data from the redis repo in some way.

You can transfer the contents of the doc PR over to this CR now.

@yossigo Can you approve the major changes if you have time, so we can get this merged?

@madolson (Contributor) commented

@oranagra Can you verify that commands.c and the json file were implemented correctly? They look right to me, but I haven't looked very closely. Besides that, I think this is ready to merge along with the doc PR.

@oranagra (Member) left a comment

LGTM

@madolson madolson merged commit 792afb4 into redis:unstable Dec 17, 2021
@sundb (Collaborator) commented Dec 17, 2021

The "Disconnect link when send buffer limit reached" test failed in my daily CI.
https://github.com/sundb/redis/runs/4560035116?check_suite_focus=true
https://github.com/sundb/redis/runs/4560034978?check_suite_focus=true

hwware pushed a commit to hwware/redis that referenced this pull request Dec 20, 2021
Introduce memory management on cluster link buffers:
 * Introduce a new `cluster-link-sendbuf-limit` config that caps memory usage of cluster bus link send buffers.
 * Introduce a new `CLUSTER LINKS` command that displays current TCP links to/from peers.
 * Introduce a new `mem_cluster_links` field under `INFO` command output, which displays the overall memory usage by all current cluster links.
 * Introduce a new `total_cluster_links_buffer_limit_exceeded` field under `CLUSTER INFO` command output, which displays the accumulated count of cluster links freed due to `cluster-link-sendbuf-limit`.
@oranagra (Member) commented

@ny0312 Can you please look into these failures?
See also:
https://github.com/redis/redis/runs/4654435937?check_suite_focus=true#step:10:1262

01:17:54> Each node has two links with each peer: FAILED: Expected 19*2 eq 37 (context: type eval line 11 cmd {assert {$num_peers*2 eq $num_links}} proc ::foreach_instance_id)

@madolson (Contributor) commented

@oranagra @ny0312 That issue specifically happens because the cluster tests do hard resets in between tests, and the nodes may not have re-established all links yet, especially in slower runs. I was looking into the other failures a bit last night and don't have any other suggested fixes at the moment.
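
One way the suite usually deals with this is to poll for the links to come back before asserting. A minimal sketch, assuming the framework's R/wait_for_condition helpers and that $num_peers is already known from the test topology (this is not the actual fix):

# Sketch: after a hard reset, wait until the node reports two links per peer
# (one inbound and one outbound) instead of asserting right away.
wait_for_condition 50 100 {
    [llength [R 0 cluster links]] == $num_peers * 2
} else {
    fail "node 0 did not re-establish all cluster links in time"
}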

@oranagra (Member) commented Jan 2, 2022

There was a failure in the FreeBSD CI in "Disconnect link when send buffer limit reached":
https://github.com/redis/redis/runs/4681750680?check_suite_focus=true#step:4:7614
This may be an issue with the CI itself (it's not stable), but there may also be a timing issue in the test.

@madolson (Contributor) commented Jan 2, 2022

I'll ping @ny0312 on our AWS Slack; I'm not sure he's noticed this.

@oranagra (Member) commented Jan 2, 2022

@madolson a few days ago he told me he's on vacation 8-)

@madolson (Contributor) commented Jan 2, 2022

I probably should have known that :D.

I was able to slowly validate that it is this commit that is causing both the crashes on the sanitizer runs and the buffer disconnect issues. I have no idea why, though; it might just be that we're dumping a lot of memory into the test, which doesn't seem to be done elsewhere in the cluster tests.

@ny0312 (Contributor, Author) commented Jan 3, 2022

I'm back. I will sync up with Maddy, investigate, and fix whatever is going on. Sorry, and thanks.

@oranagra (Member) commented Jan 9, 2022

Any news?
I must admit I didn't notice it fail lately, but I got this one now:
https://github.com/redis/redis/runs/4733591841?check_suite_focus=true

00:47:22> Disconnect link when send buffer limit reached: error writing "sock802fbc590": broken pipe

It could be just a random failure of the slow FreeBSD CI (we see broken pipes in random places).
Note that in the past we saw:

00:44:36> Disconnect link when send buffer limit reached: error writing "sock80264c2d0": connection reset by peer

as well as other types of errors on non-FreeBSD CI:

  • a hang triggering a SIGSEGV kill by the test framework on a sanitizer run.
  • the following failure on centos+tls:
01:17:54> Each node has two links with each peer: FAILED: Expected 19*2 eq 37 (context: type eval line 11 cmd {assert {$num_peers*2 eq $num_links}} proc ::foreach_instance_id)

@madolson madolson mentioned this pull request Jan 10, 2022
@ny0312 (Contributor, Author) commented Jan 14, 2022

Hey, sorry for the late response. I have been trying to reproduce this test failure, but couldn't.

This is the command I run (I copied it from https://github.com/redis/redis/runs/4654435988?check_suite_focus=true):

gmake || exit 1 ; if echo "" | grep -vq redis ; then ./runtest --verbose --timeout 2400 --no-latency --dump-logs  || exit 1 ; fi ; if echo "" | grep -vq modules ; then MAKE=gmake ./runtest-moduleapi --verbose --timeout 2400 --no-latency --dump-logs  || exit 1 ; fi ; if echo "" | grep -vq sentinel ; then ./runtest-sentinel  || exit 1 ; fi ; if echo "" | grep -vq cluster ; then ./runtest-cluster  || exit 1 ; fi ;

I've run it numerous times locally and numerous times on a FreeBSD 64-bit EC2 instance, all with the latest unstable branch, and I was never able to reproduce the failure.

From the error message, it looks like Redis crashed during my test. The crash must be flaky and non-deterministic.

Any ideas that could help reproduce it?

Another question - how do I preserve redis logs from these test runs?

@sundb (Collaborator) commented Jan 14, 2022

@ny0312 It happens every time on my local FreeBSD VM; I've only allocated 1 core and 256MB of RAM to it. Since I don't know much about clusters, there's no way for me to solve this problem directly, but maybe I can package the logs and send them to you.

I use the following to reproduce.

./runtest-cluster --single 24-links --dont-clean

@sundb (Collaborator) commented Jan 14, 2022

When memory is exhausted, this test is bound to fail. Here are some logs after enabling loglevel debug:

16316:M 14 Jan 2022 12:48:51.257 . I/O error reading from node link: connection closed
16316:M 14 Jan 2022 12:48:51.315 . I/O error reading from node link: connection closed
16316:M 14 Jan 2022 12:48:51.315 . I/O error reading from node link: connection closed
16316:M 14 Jan 2022 12:48:51.318 . Unable to connect to Cluster Node [127.0.0.1]:40000 -> accept: Resource temporarily unavailable
16316:M 14 Jan 2022 12:48:51.318 . Unable to connect to Cluster Node [127.0.0.1]:40002 -> accept: Resource temporarily unavailable
16316:M 14 Jan 2022 12:48:51.318 . Unable to connect to Cluster Node [127.0.0.1]:40001 -> accept: Resource temporarily unavailable
16314:M 14 Jan 2022 12:48:51.254 . I/O error reading from node link: Connection reset by peer
16314:M 14 Jan 2022 12:48:51.257 . I/O error reading from node link: connection closed

@ny0312 (Contributor, Author) commented Jan 14, 2022

Huh, looks like file descriptors are exhausted?

@sundb,

You mentioned you reproduced this on a FreeBSD VM. Could you please share exactly what your setup is? Which virtualization software?

In the meantime, I will try to simulate your setup on EC2 to reproduce it, i.e. I will provision an EC2 instance with FreeBSD, 1 core, and 256MB of memory.

Also, if you could share the full Redis logs for the failed test, that'd be great. (For example you could attach a file here, or use something like https://justpaste.it/)

Thanks a lot!

@sundb (Collaborator) commented Jan 14, 2022

@ny0312 justpaste.it can't upload files, so I created a repository to hold the logs (https://github.com/sundb/log).

  • system: FreeBSD 12 64-bit

  • vm: VMware

  • ulimit

-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          33554432
-s: stack size (kbytes)             524288
-c: core file size (blocks)         unlimited
-m: resident set size (kbytes)      unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       10240
-n: file descriptors                10240
-b: socket buffer size (bytes)      unlimited
-v: virtual memory size (kbytes)    unlimited
-p: pseudo-terminals                unlimited
-w: swap size (kbytes)              unlimited
-k: kqueues                         unlimited
-o: umtx shared locks               unlimited
  • change loglevel
diff --git a/tests/cluster/run.tcl b/tests/cluster/run.tcl
index c81d8f39d..ddf9986aa 100644
--- a/tests/cluster/run.tcl
+++ b/tests/cluster/run.tcl
@@ -17,6 +17,7 @@ proc main {} {
         "appendonly yes"
         "enable-protected-configs yes"
         "enable-debug-command yes"
+        "loglevel debug"
     }
     run_tests
     cleanup

@ny0312 (Contributor, Author) commented Jan 18, 2022

OK, I am able to reproduce the test failure on a 64-bit FreeBSD VM with 1 core and 512MB of RAM.

The failure is Redis getting OOM-killed by the kernel after running out of swap. In the test, I'm allowing cluster link buffers to grow up to 32MB, which proved to be too much for the FreeBSD test environment used by the daily runs.

I'm currently experimenting with a new cluster link buffer limit in my test. The challenge is that in order to trigger the condition under test, which is a cluster bus link being freed once it is over the limit, I need to fill up the cluster link buffer. But to fill up the link buffer, I first need to fill up the kernel-level TCP write buffer, whose size varies from OS to OS.

I will post a fix PR as soon as I can get a good stable buffer limit.

@oranagra (Member) commented

Note that in the "other" test suite we have the --large-memory flag, which enables us to run certain tests only on systems with a lot of memory (currently only manually).

Also, what about the test failures in the sanitizer runs, like this one:
https://github.com/redis/redis/runs/4829401183?check_suite_focus=true#step:9:630

Not sure if this one is related:
https://github.com/redis/redis/runs/4810931899?check_suite_focus=true#step:9:466

@ny0312 (Contributor, Author) commented Jan 19, 2022

also, what about the test failures in the sanitizer runs, like this one:
https://github.com/redis/redis/runs/4829401183?check_suite_focus=true#step:9:630

This one has a different root cause. In the test, I assume that as soon as I send a large PUBLISH command that fills up a cluster link, the link will be freed. But in reality the link only gets freed on the next clusterCron run, whenever that happens. My test is not accounting for this race condition. I will fix it too.
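
Roughly, the test needs to poll for the effect rather than assert right after the oversized PUBLISH. A hedged sketch of that idea, assuming the framework's R/wait_for_condition helpers (the actual fix is in #10157):

# Sketch: the over-limit link is reclaimed by clusterCron, not synchronously,
# so poll for the counter instead of checking it right after the PUBLISH.
# (In practice the receiving peer may also need to be paused, e.g. with
# DEBUG SLEEP, so the buffer is not drained before clusterCron runs.)
R 0 config set cluster-link-sendbuf-limit 1048576
R 0 publish __big__ [string repeat x 2097152]   ;# 2MB payload over a 1MB limit

set n 0
wait_for_condition 50 100 {
    [regexp {total_cluster_links_buffer_limit_exceeded:(\d+)} [R 0 cluster info] -> n] && $n > 0
} else {
    fail "cluster link was not freed after exceeding the send buffer limit"
}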

not sure if this one is related:
https://github.com/redis/redis/runs/4810931899?check_suite_focus=true#step:9:466

This link has two failures: test-sanitizer-address (clang) and test-freebsd.

test-sanitizer-address (clang) is not related to my test. test-freebsd is related to my test, and it should have the same OOM root cause.

@oranagra (Member) commented

I didn't look at the test, but I want to mention that there are other tests in which we struggled to fill output buffers and had to deal with the OS socket buffers swallowing our bytes.

I think these are in:
tests/unit/client-eviction.tcl
tests/unit/maxmemory.tcl
tests/integration/replication-buffer.tcl

IIRC, the best approach we had was to pause the destination process and then gradually write more and more data until we see it start piling up, then stop.
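
Adapted to cluster bus links, that approach might look roughly like this (the peer pause itself, e.g. via SIGSTOP, is left out, and the send-buffer-used field name is assumed from the PR description):

# Sketch of the gradual-fill approach: with the receiving peer already paused,
# push small chunks over the bus and stop as soon as the sender's link buffer
# visibly starts to accumulate.
set grew 0
for {set i 0} {$i < 1000 && !$grew} {incr i} {
    # Pubsub messages are forwarded over the cluster bus, so a PUBLISH on
    # node 0 ends up in its link send buffers toward the paused peer.
    R 0 publish __filler__ [string repeat x 16384]
    foreach link [R 0 cluster links] {
        if {[dict get $link send-buffer-used] > 0} { set grew 1 }
    }
}
assert {$grew}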

@ny0312 (Contributor, Author) commented Jan 20, 2022

@madolson @oranagra @sundb

PR to fix flaky test "Disconnect link when send buffer limit reached" - #10157

Sorry for all the inconvenience. And thanks for the help and advice.

@tezc (Collaborator) commented Jan 21, 2022

@ny0312
Test failed: https://github.com/redis/redis/runs/4890129923?check_suite_focus=true#step:9:627

Testing unit: 24-links.tcl
00:39:15> (init) Restart killed instances: OK
00:39:15> Cluster nodes are reachable: OK
00:39:15> Cluster nodes hard reset: OK
00:39:15> Cluster Join and auto-discovery test: OK
00:39:17> Before slots allocation, all nodes report cluster failure: OK
00:39:17> Create a cluster with two single-node shards: OK
00:39:21> Cluster should start ok: OK
00:39:21> Each node has two links with each peer: FAILED: Expected 19*2 eq 37 (context: type eval line 11 cmd {assert {$num_peers*2 eq $num_links}} proc ::foreach_instance_id)
(Jumping to next unit after error)

I don't know much about this test, but it might be related to timing. I increased "cluster-node-timeout" for something else on my branch, and I see this test fail there. Just FYI.

@ny0312 (Contributor, Author) commented Jan 21, 2022

@tezc Thanks for reporting the failure, and sorry for the inconvenience. I updated my PR (#10157) to fix this test as well.
