Prevent huge WAL sizes when rocksdb sharding enabled #35277

Merged
tchaikov merged 1 commit into ceph:master from aclamk:wip-limit-wal-sharding on Jun 17, 2020
Conversation

@aclamk (Contributor) commented May 27, 2020:

This fixes a problem that occurs when sharding is turned on ('bluestore_rocksdb_cf=true').
The default value (0) caused rocksdb to set a maximum of 16GB for WALs.
Now this is 1GB. This can still be overridden by setting the rocksdb options
'max_total_wal_size' and 'max_write_buffer_number'.
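To make the numbers concrete: around the rocksdb versions in use here, leaving max_total_wal_size at 0 makes rocksdb cap the WAL at roughly 4x the total in-memory write-buffer state summed over all column families, so the cap grows with the shard count. The sketch below is illustrative only; the column-family count and buffer sizes are assumptions, not Ceph's exact sharded configuration.

```python
# Rough sketch of why max_total_wal_size == 0 blows the WAL up under sharding.
# Assumption: rocksdb's implicit cap is ~4x the total memtable budget across
# all column families (per rocksdb behavior at the time of this PR).

def implicit_wal_cap(write_buffer_size, max_write_buffer_number, num_cfs):
    """Approximate implicit WAL cap (bytes) when max_total_wal_size == 0."""
    return 4 * write_buffer_size * max_write_buffer_number * num_cfs

MiB = 2**20
GiB = 2**30

# With a 256 MiB write buffer and 4 buffers per CF (values mentioned in this
# thread), even a hypothetical 4 column families already yield a 16 GiB cap:
cap = implicit_wal_cap(256 * MiB, 4, 4)
print(cap // GiB, "GiB")  # → 16 GiB
```

With 15+ shards, as discussed below, the implicit cap grows far past anything a WAL volume is typically sized for, which is why the patch pins it explicitly.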

@aclamk requested review from ifed01 and markhpc on May 27, 2020 14:26
@aclamk force-pushed the wip-limit-wal-sharding branch from 072abff to a4a3dd1 on May 27, 2020 14:27
// This is only used if we have some non-default column families,
// but the value rocksdb uses by default is way too large.
if (opt.max_total_wal_size == 0) {
  opt.max_total_wal_size = opt.write_buffer_size;
}
Contributor:

Isn't it better to adjust the bluestore_rocksdb_options parameter and insert a proper value for max_total_wal_size there?
BTW, I can see that write_buffer_size is set to 256MB in Ceph by default...

Member:

> Isn't it better to adjust the bluestore_rocksdb_options parameter and insert a proper value for max_total_wal_size there?
> BTW, I can see that write_buffer_size is set to 256MB in Ceph by default...

I think that makes sense. It would be clear what it's set to without having to read the code. Also, I think you are right: in this case wouldn't we end up with max_total_wal_size = 256MB, since it's not accounting for max_write_buffer_number?

I'm also wondering if we might want smaller memtables now. I'm not sure what will happen with 4 large buffers and 15 CFs. This thread may be helpful: facebook/rocksdb#5789

Contributor Author:

I modified store_test a bit (master...aclamk:dnm-test-wal-sharding) to test different settings of max_total_wal_size.

Test procedure:
bin/ceph_test_objectstore --gtest_filter=*SpilloverTest*/2 --bluestore_rocksdb_options "max_total_wal_size=${wal},compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,writable_file_max_buffer_size=0,compaction_readahead_size=2097152" --bluestore_rocksdb_cf true --no-log-to-stderr 2>/dev/null |grep WAL;
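The sweep over WAL sizes can be scripted; a minimal sketch, where the loop and helper are mine and only the command line itself comes from the comment above:

```python
# Hedged sketch: generate the ceph_test_objectstore command for each
# max_total_wal_size value tested below.  Running the commands (and the
# "| grep WAL" filtering) is left to the shell.
wal_sizes = [200_000_000, 400_000_000, 600_000_000, 800_000_000,
             1_000_000_000, 1_500_000_000, 2_000_000_000,
             2_500_000_000, 3_000_000_000]

BASE_OPTS = ("compression=kNoCompression,max_write_buffer_number=4,"
             "min_write_buffer_number_to_merge=1,recycle_log_file_num=4,"
             "writable_file_max_buffer_size=0,compaction_readahead_size=2097152")

def rocksdb_opts(wal):
    # Same option string the test command uses, with max_total_wal_size
    # substituted in for each run.
    return f"max_total_wal_size={wal},{BASE_OPTS}"

for wal in wal_sizes:
    print(f'bin/ceph_test_objectstore --gtest_filter="*SpilloverTest*/2" '
          f'--bluestore_rocksdb_options "{rocksdb_opts(wal)}" '
          f'--bluestore_rocksdb_cf true --no-log-to-stderr')
```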

Results:

WAL=200000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         384 MiB     0 B         0 B         0 B         383 MiB     2           
WAL         0 B         384 MiB     0 B         0 B         0 B         383 MiB     
WAL=400000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         1.4 GiB     138 MiB     0 B         0 B         445 MiB     8           
WAL         0 B         1.6 GiB     138 MiB     0 B         0 B         684 MiB     
WAL=600000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         1.5 GiB     51 MiB      0 B         0 B         659 MiB     11          
WAL         0 B         1.6 GiB     51 MiB      0 B         0 B         818 MiB     
WAL=800000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         1.3 GiB     28 MiB      0 B         0 B         547 MiB     10          
WAL         0 B         1.6 GiB     56 MiB      0 B         0 B         851 MiB     
WAL=1000000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         1.3 GiB     33 MiB      0 B         0 B         571 MiB     11          
WAL         0 B         1.5 GiB     197 MiB     0 B         0 B         970 MiB     
WAL=1500000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         1.1 GiB     33 MiB      0 B         0 B         567 MiB     10          
WAL         0 B         1.4 GiB     173 MiB     0 B         0 B         1.4 GiB     
WAL=2000000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         1.1 GiB     38 MiB      0 B         0 B         567 MiB     10          
WAL         0 B         1.9 GiB     150 MiB     0 B         0 B         1.9 GiB     
WAL=2500000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         1.2 GiB     61 MiB      0 B         0 B         547 MiB     10          
WAL         0 B         2.2 GiB     146 MiB     0 B         0 B         2.3 GiB     
WAL=3000000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         1.2 GiB     44 MiB      0 B         0 B         567 MiB     10          
WAL         0 B         2.2 GiB     637 MiB     0 B         0 B         2.8 GiB

It seems that if the WAL limit is smaller than 1GB, rocksdb has trouble keeping the WAL below the set value.
For limits of 1GB and larger, rocksdb does not go over them.

@ifed01 (Contributor) commented May 29, 2020:

@aclamk - you might want to increase object count for the test case. Here are my results for 4K objects and 1GB WAL:
CFamily on, 4096 obj, 1GB wal
1 : device size 0xc0000000 : own 0x[2000~bfffe000] = 0xbfffe000 : using 0xbc7fe000(2.9 GiB)
2 : device size 0x105aa15000 : own 0x[10000~3e03d0000,7d9940000~a7780000] = 0x487b50000 : using 0x454c20000(17 GiB) : bluestore has 0xbd2eb0000(47 GiB) available
RocksDBBlueFSVolumeSelector: wal_total:0, db_total:3060164198, slow_total:66727998054, db_avail:0
Usage matrix:
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
LOG         0 B         16 MiB      0 B         0 B         0 B         11 MiB      1
WAL         0 B         2.8 GiB     271 MiB     0 B         0 B         1.2 GiB     7
DB          0 B         177 MiB     2.1 GiB     0 B         0 B         2.3 GiB     30
SLOW        0 B         0 B         15 GiB      0 B         0 B         15 GiB      229
TOTALS      0 B         2.9 GiB     17 GiB      0 B         0 B         0 B         267
MAXIMUMS:
LOG         0 B         16 MiB      0 B         0 B         0 B         11 MiB
WAL         0 B         2.8 GiB     271 MiB     0 B         0 B         1.6 GiB
DB          0 B         1.7 GiB     3.1 GiB     0 B         0 B         3.6 GiB
SLOW        0 B         0 B         15 GiB      0 B         0 B         15 GiB
TOTALS      0 B         3.0 GiB     18 GiB      0 B         0 B         0 B

Please note the WAL maximum at 1.6 GiB; hence I'm not sure your point is 100% valid.

Contributor:

And in the results I shared for a 256MB WAL, it never exceeds 512+ MB, which is in line with what I have observed before. Personally I'd prefer to preserve these actual numbers. Not many pros and cons though; I just prefer being conservative here, as one should be in data storage solutions ;)

Contributor Author:

@ifed01 So, basically, you propose to give a default of 256MB, but document that it is likely to grow to as much as 512MB?

Contributor:

Originally I didn't imply any documentation on that, and moreover I'm not 100% sure it is always below 512MB. So maybe it's better to simply avoid this sort of information in the log. Just recommend 4GB for the WAL volume and omit the rationale.

@ifed01 (Contributor) commented May 28, 2020:

@aclamk @markhpc - maybe another topic to discuss is the actual cap we want for the WAL size.
Originally I have never seen a WAL size higher than 0.5GB, even in the field.
Currently for my spillover test case the maximum seems to be around 3GB. That isn't dramatic, but it may still make sense to be more conservative and try to preserve the original numbers.
Below are DB stats from running the spillover test case with different 'max_total_wal_size' values: 1GB, 0.5GB and 0.25GB.

CFamily on, 4096 obj, 1GB wal
1 : device size 0xc0000000 : own 0x[2000~bfffe000] = 0xbfffe000 : using 0xbc7fe000(2.9 GiB)
2 : device size 0x105aa15000 : own 0x[10000~3e03d0000,7d9940000~a7780000] = 0x487b50000 : using 0x454c20000(17 GiB) : bluestore has 0xbd2eb0000(47 GiB) available
RocksDBBlueFSVolumeSelector: wal_total:0, db_total:3060164198, slow_total:66727998054, db_avail:0
Usage matrix:
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
LOG         0 B         16 MiB      0 B         0 B         0 B         11 MiB      1
WAL         0 B         2.8 GiB     271 MiB     0 B         0 B         1.2 GiB     7
DB          0 B         177 MiB     2.1 GiB     0 B         0 B         2.3 GiB     30
SLOW        0 B         0 B         15 GiB      0 B         0 B         15 GiB      229
TOTALS      0 B         2.9 GiB     17 GiB      0 B         0 B         0 B         267
MAXIMUMS:
LOG         0 B         16 MiB      0 B         0 B         0 B         11 MiB
WAL         0 B         2.8 GiB     271 MiB     0 B         0 B         1.6 GiB
DB          0 B         1.7 GiB     3.1 GiB     0 B         0 B         3.6 GiB
SLOW        0 B         0 B         15 GiB      0 B         0 B         15 GiB
TOTALS      0 B         3.0 GiB     18 GiB      0 B         0 B         0 B

db_used:3162497024
slow_used:18601869312

CFamily on, 4096 obj, 0.5GB wal
1 : device size 0xc0000000 : own 0x[2000~bfffe000] = 0xbfffe000 : using 0xbfffe000(3.0 GiB)
2 : device size 0x105aa15000 : own 0x[10000~3dea20000,7d9940000~a7780000] = 0x4861a0000 : using 0x442cc0000(17 GiB) : bluestore has 0xbd4860000(47 GiB) available
RocksDBBlueFSVolumeSelector: wal_total:0, db_total:3060164198, slow_total:66727998054, db_avail:0
Usage matrix:
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
LOG         0 B         12 MiB      4 MiB       0 B         0 B         12 MiB      1
WAL         0 B         2.1 GiB     207 MiB     0 B         0 B         1.0 GiB     5
DB          0 B         929 MiB     432 MiB     0 B         0 B         1.3 GiB     21
SLOW        0 B         0 B         16 GiB      0 B         0 B         16 GiB      250
TOTALS      0 B         3.0 GiB     17 GiB      0 B         0 B         0 B         277
MAXIMUMS:
LOG         0 B         12 MiB      4 MiB       0 B         0 B         12 MiB
WAL         0 B         2.1 GiB     207 MiB     0 B         0 B         1.6 GiB
DB          0 B         1.8 GiB     1.0 GiB     0 B         0 B         2.2 GiB
SLOW        0 B         0 B         16 GiB      0 B         0 B         16 GiB
TOTALS      0 B         3.0 GiB     17 GiB      0 B         0 B         0 B

db_used:3221217280
slow_used:18300534784

CFamily on, 4096 obj, 0.25GB wal
1 : device size 0xc0000000 : own 0x[2000~bfffe000] = 0xbfffe000 : using 0x76ffe000(1.9 GiB)
2 : device size 0x105aa15000 : own 0x[10000~3a0e80000,7d9940000~a7780000] = 0x448600000 : using 0x41c2d0000(16 GiB) : bluestore has 0xc12400000(48 GiB) available
RocksDBBlueFSVolumeSelector: wal_total:0, db_total:3060164198, slow_total:66727998054, db_avail:0
Usage matrix:
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
LOG         0 B         12 MiB      0 B         0 B         0 B         8.7 MiB     1
WAL         0 B         522 MiB     0 B         0 B         0 B         521 MiB     2
DB          0 B         1.3 GiB     0 B         0 B         0 B         1.3 GiB     26
SLOW        0 B         0 B         16 GiB      0 B         0 B         16 GiB      263
TOTALS      0 B         1.9 GiB     16 GiB      0 B         0 B         0 B         292
MAXIMUMS:
LOG         0 B         12 MiB      0 B         0 B         0 B         8.7 MiB
WAL         0 B         522 MiB     0 B         0 B         0 B         522 MiB
DB          0 B         2.5 GiB     65 MiB      0 B         0 B         2.5 GiB
SLOW        0 B         0 B         16 GiB      0 B         0 B         16 GiB
TOTALS      0 B         3.0 GiB     16 GiB      0 B         0 B         0 B

db_used:1996480512
slow_used:17652580352

This fixes a problem that occurs when sharding is turned on ('bluestore_rocksdb_cf=true').
The default value (0) caused rocksdb to set a maximum of 16GB for WALs.
Now this is 1GB, set via max_total_wal_size.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
@yuriw (Contributor) commented Jun 3, 2020

@markhpc (Member) commented Jun 4, 2020:

@ifed01 Regarding WAL size, I suspect that it depends on the throttling settings and how long memtable flushes (and, indirectly, L0 compactions) are taking. My assumption is that we want plenty of runway, since we are only talking about a few GB here and space is (relatively) cheap. I.e., we don't want to end up blocking during long flushes, preferring instead to throttle writes to some equilibrium. Especially on NVMe, I think we'll need a fair amount of WAL to do that?

@ifed01 (Contributor) commented Jun 5, 2020:

@markhpc - your comment is generally valid. Larger WAL is rather more preferable on its own.

Potential hidden issues with existing deployments (and existing QA cases) are what makes me a bit nervous. I presume WAL has never (very rarely?) exceeded 512MB before. With this patch that's not the case anymore for sure - wouldn't this result in different behavior during QA runs and/or some issues in upgraded production clusters which "got used to" smaller WALs?

Have we ever seen a real WAL larger than 512MB? Do we have any numbers showing a 1+GB WAL is better? Maybe it's better to be more conservative and try to preserve the original numbers?

Generally IMO we might want to become more conservative in our modifications to stable components like BlueStore - as one has to be in storage solutions.... Just some grunting ;)

And finally I'm absolutely not sure my concerns are that important this time. Just wanted to share these thoughts.

@jdurgin (Member) commented Jun 5, 2020:

> Have we ever seen a real WAL larger than 512MB? Do we have any numbers showing a 1+GB WAL is better? Maybe it's better to be more conservative and try to preserve the original numbers?

Yes, in practice I recall WALs of at least 30GB, especially on hardware that was originally designed with filestore in mind, where there isn't enough space for db on flash. I'm not sure whether it had any impact on the performance compared to a smaller WAL. No correctness issues at least.

@markhpc (Member) commented Jun 11, 2020:

@ifed01 FWIW, when we originally did the rocksdb testing for bluestore, we arrived at a 1GB aggregate WAL size because that was the point at which we no longer saw a performance benefit on the Incerta nodes (P3700 NVMe). Ultimately it's all tied to flush latency, ingest rate, and throttling/stalling behavior:

https://github.com/facebook/rocksdb/wiki/Write-Stalls

I.e., I would suspect that things like level0_slowdown_writes_trigger and level0_stop_writes_trigger have a large impact on how large the WAL can grow.
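As a concrete illustration of the knobs being discussed, one could extend a bluestore_rocksdb_options-style string like this. The option names are real rocksdb options; the values shown are rocksdb's stock defaults, given purely for illustration, not as a tuning recommendation.

```python
# Hedged sketch: appending the write-stall triggers to an options string in
# the comma-separated key=value format bluestore_rocksdb_options uses.
base = "max_total_wal_size=1073741824,compression=kNoCompression"
stall = {
    # rocksdb stock defaults: throttle foreground writes at 20 L0 files,
    # hard-stop them at 36 (illustrative values only).
    "level0_slowdown_writes_trigger": 20,
    "level0_stop_writes_trigger": 36,
}
options = base + "," + ",".join(f"{k}={v}" for k, v in stall.items())
print(options)
```

How far these triggers are from the steady-state L0 file count is what determines whether writers stall while the WAL (and memtables) drain, which is the interplay the comment above is pointing at.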

@tchaikov tchaikov merged commit ff3a51f into ceph:master Jun 17, 2020

6 participants