Prevent huge WAL sizes when rocksdb sharding enabled #35277
tchaikov merged 1 commit into ceph:master
Conversation
Force-pushed from 072abff to a4a3dd1
src/kv/RocksDBStore.cc (outdated diff)
// This is only used if we have some non-default column families,
// But value rocksdb uses by default is way too large
if (opt.max_total_wal_size == 0) {
  opt.max_total_wal_size = opt.write_buffer_size;
Isn't it better to adjust the bluestore_rocksdb_options parameter and insert a proper value for max_total_wal_size there?
BTW, I can see that write_buffer_size is set to 256MB in Ceph by default...
> Isn't it better to adjust the bluestore_rocksdb_options parameter and insert a proper value for max_total_wal_size there?
> BTW, I can see that write_buffer_size is set to 256MB in Ceph by default...
I think that makes sense. It would be clear what it's set to without having to read the code. Also, I think you are right: in this case wouldn't we end up with max_total_wal_size = 256MB, since it's not accounting for max_write_buffer_number?
I'm also wondering if we might want smaller memtables now. I'm not sure what will happen with 4 large buffers and 15 CFs. This thread may be helpful: facebook/rocksdb#5789
I modified store_test a bit (master...aclamk:dnm-test-wal-sharding) to test different settings of max_total_wal_size.
Test procedure:
bin/ceph_test_objectstore --gtest_filter=*SpilloverTest*/2 --bluestore_rocksdb_options "max_total_wal_size=${wal},compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,writable_file_max_buffer_size=0,compaction_readahead_size=2097152" --bluestore_rocksdb_cf true --no-log-to-stderr 2>/dev/null |grep WAL;
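The single invocation above can be swept over the WAL sizes tested here. A dry-run sketch (binary path and gtest filter taken from the command above; it only prints each command rather than running the test binary, which is assumed to live in a built Ceph tree):

```shell
# Build one ceph_test_objectstore invocation for a given max_total_wal_size.
gen_cmd() {
  wal="$1"
  opts="max_total_wal_size=${wal},compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,writable_file_max_buffer_size=0,compaction_readahead_size=2097152"
  printf '%s\n' "bin/ceph_test_objectstore --gtest_filter=*SpilloverTest*/2 --bluestore_rocksdb_options \"$opts\" --bluestore_rocksdb_cf true"
}

# Dry run: echo the command for each WAL size from the results below.
for wal in 200000000 400000000 600000000 1000000000 2000000000 3000000000; do
  gen_cmd "$wal"
done
```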
Results:
WAL=200000000
DEV/LEV WAL DB SLOW * * REAL FILES
WAL 0 B 384 MiB 0 B 0 B 0 B 383 MiB 2
WAL 0 B 384 MiB 0 B 0 B 0 B 383 MiB
WAL=400000000
DEV/LEV WAL DB SLOW * * REAL FILES
WAL 0 B 1.4 GiB 138 MiB 0 B 0 B 445 MiB 8
WAL 0 B 1.6 GiB 138 MiB 0 B 0 B 684 MiB
WAL=600000000
DEV/LEV WAL DB SLOW * * REAL FILES
WAL 0 B 1.5 GiB 51 MiB 0 B 0 B 659 MiB 11
WAL 0 B 1.6 GiB 51 MiB 0 B 0 B 818 MiB
WAL=800000000
DEV/LEV WAL DB SLOW * * REAL FILES
WAL 0 B 1.3 GiB 28 MiB 0 B 0 B 547 MiB 10
WAL 0 B 1.6 GiB 56 MiB 0 B 0 B 851 MiB
WAL=1000000000
DEV/LEV WAL DB SLOW * * REAL FILES
WAL 0 B 1.3 GiB 33 MiB 0 B 0 B 571 MiB 11
WAL 0 B 1.5 GiB 197 MiB 0 B 0 B 970 MiB
WAL=1500000000
DEV/LEV WAL DB SLOW * * REAL FILES
WAL 0 B 1.1 GiB 33 MiB 0 B 0 B 567 MiB 10
WAL 0 B 1.4 GiB 173 MiB 0 B 0 B 1.4 GiB
WAL=2000000000
DEV/LEV WAL DB SLOW * * REAL FILES
WAL 0 B 1.1 GiB 38 MiB 0 B 0 B 567 MiB 10
WAL 0 B 1.9 GiB 150 MiB 0 B 0 B 1.9 GiB
WAL=2500000000
DEV/LEV WAL DB SLOW * * REAL FILES
WAL 0 B 1.2 GiB 61 MiB 0 B 0 B 547 MiB 10
WAL 0 B 2.2 GiB 146 MiB 0 B 0 B 2.3 GiB
WAL=3000000000
DEV/LEV WAL DB SLOW * * REAL FILES
WAL 0 B 1.2 GiB 44 MiB 0 B 0 B 567 MiB 10
WAL 0 B 2.2 GiB 637 MiB 0 B 0 B 2.8 GiB
It seems that if the WAL limit is smaller than 1GB, rocksdb has trouble keeping the WAL below the set value.
For limits of 1GB and larger, rocksdb does not go over the limit.
@aclamk - you might want to increase the object count for the test case. Here are my results for 4K objects and a 1GB WAL:
CFamily on, 4096 obj, 1GB wal
1 : device size 0xc0000000 : own 0x[2000~bfffe000] = 0xbfffe000 : using 0xbc7fe000 (2.9 GiB)
2 : device size 0x105aa15000 : own 0x[10000~3e03d0000,7d9940000~a7780000] = 0x487b50000 : using 0x454c20000 (17 GiB) : bluestore has 0xbd2eb0000 (47 GiB) available
RocksDBBlueFSVolumeSelector: wal_total:0, db_total:3060164198, slow_total:66727998054, db_avail:0
Usage matrix:
DEV/LEV WAL DB SLOW * * REAL FILES
LOG 0 B 16 MiB 0 B 0 B 0 B 11 MiB 1
WAL 0 B 2.8 GiB 271 MiB 0 B 0 B 1.2 GiB 7
DB 0 B 177 MiB 2.1 GiB 0 B 0 B 2.3 GiB 30
SLOW 0 B 0 B 15 GiB 0 B 0 B 15 GiB 229
TOTALS 0 B 2.9 GiB 17 GiB 0 B 0 B 0 B 267
MAXIMUMS:
LOG 0 B 16 MiB 0 B 0 B 0 B 11 MiB
WAL 0 B 2.8 GiB 271 MiB 0 B 0 B 1.6 GiB
DB 0 B 1.7 GiB 3.1 GiB 0 B 0 B 3.6 GiB
SLOW 0 B 0 B 15 GiB 0 B 0 B 15 GiB
TOTALS 0 B 3.0 GiB 18 GiB 0 B 0 B 0 B
Please note the WAL maximum at 1.6 GiB.
Hence I'm not sure your point is 100% valid.
And in the results I shared, a 256MB WAL never exceeded 512+ MB, which is in line with what I have observed before. Personally I'd prefer to preserve these actual numbers. Not many pros and cons though; I just prefer being conservative here, as one should be in data storage solutions ;)
@ifed01 So, basically, you propose a default of 256MB, but documenting that it is likely to grow as large as 512MB?
Originally I didn't imply any documentation on that, and moreover I'm not 100% sure it is always below 512MB. So maybe it's better to simply avoid this sort of information in the log. Just recommend 4GB for the WAL volume and omit the rationale.
@aclamk @markhpc - maybe another topic to discuss is the actual cap we want for the WAL size.
This fixes a problem when sharding is turned on ('bluestore_rocksdb_cf=true').
The default value (0) caused rocksdb to set a maximum of 16GB for WALs.
Now this is 1GB, by setting max_total_wal_size.
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
Force-pushed from a4a3dd1 to 2c4e4dc
@ifed01 Regarding WAL size, I suspect that it depends on the throttling settings and on how long memtable flushes (and, indirectly, L0 compactions) are taking. My assumption is that we want plenty of runway, since we are only talking about a few GB here and space is (relatively) cheap. I.e. we don't want to end up blocking during long flushes, preferring instead to throttle writes to some equilibrium. Especially on NVMe, I think we'll need a fair amount of WAL to do that?
@markhpc - your comment is generally valid. A larger WAL is rather preferable on its own. Potential hidden issues with existing deployments (and existing QA cases) are what make me a bit nervous. I presume the WAL has never (or very rarely) exceeded 512MB before. With this patch that's not the case anymore for sure - wouldn't this result in different behavior during QA runs and/or some issues in upgraded production clusters which "got used to" smaller WALs? Have we ever seen a real WAL larger than 512MB? Do we have any numbers showing a 1+GB WAL is better? Maybe it's better to be more conservative and try to preserve the original numbers? Generally, IMO we might want to become more conservative in our modifications to stable components like BlueStore - as one has to be in storage solutions... Just some grunting ;) And finally, I'm absolutely not sure my concerns are that important this time. Just wanted to share these thoughts.
Yes, in practice I recall WALs of at least 30GB, especially on hardware that was originally designed with filestore in mind, where there isn't enough space for the db on flash. I'm not sure whether it had any impact on performance compared to a smaller WAL; no correctness issues, at least.
@ifed01 FWIW, when we originally did the rocksdb testing for bluestore, we arrived at a 1GB aggregate WAL size because that was the point at which we no longer saw a performance benefit on the Incerta nodes (P3700 NVMe). Ultimately it's all tied to flush latency, ingest rate, and throttling/stalling behavior: https://github.com/facebook/rocksdb/wiki/Write-Stalls. I.e. I would suspect that things like level0_slowdown_writes_trigger and level0_stop_writes_trigger have a large impact on how large the WAL can grow.
This fixes a problem when sharding is turned on ('bluestore_rocksdb_cf=true').
The default value (0) caused rocksdb to set a maximum of 16GB for WALs.
Now this is 1GB. This can still be overridden by setting the rocksdb options
'max_total_wal_size' and 'max_write_buffer_number'.
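Per the commit message, the new 1GB cap is only a default and can be overridden through the usual rocksdb option string. A hypothetical override in ceph.conf (note that setting bluestore_rocksdb_options replaces the entire default option string, so a real deployment would carry the other default options along):

```ini
[osd]
# Hypothetical override: raise the aggregate WAL cap to 2 GiB (2147483648 bytes).
# A real config would include the rest of the default rocksdb option string.
bluestore_rocksdb_options = max_total_wal_size=2147483648,max_write_buffer_number=4
```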