Prevent huge WAL sizes when rocksdb sharding enabled #35277

Merged
tchaikov merged 1 commit into ceph:master from aclamk:wip-limit-wal-sharding on Jun 17, 2020
Conversation

@aclamk (Contributor) commented May 27, 2020:

This fixes a problem that occurs when sharding is turned on ('bluestore_rocksdb_cf=true').
The default value (0) caused rocksdb to set a maximum of 16GB for WALs.
Now this is 1GB. This can still be overridden by setting the rocksdb options
'max_total_wal_size' and 'max_write_buffer_number'.
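To make the numbers concrete: around the rocksdb versions in use here, leaving max_total_wal_size at 0 makes rocksdb cap the WAL at roughly 4x the total in-memory write-buffer state summed over all column families, so the cap grows with the shard count. The sketch below is illustrative only; the column-family count and buffer sizes are assumptions, not Ceph's exact sharded configuration.

```python
# Rough sketch of why max_total_wal_size == 0 blows the WAL up under sharding.
# Assumption: rocksdb's implicit cap is ~4x the total memtable budget across
# all column families (per rocksdb behavior at the time of this PR).

def implicit_wal_cap(write_buffer_size, max_write_buffer_number, num_cfs):
    """Approximate implicit WAL cap (bytes) when max_total_wal_size == 0."""
    return 4 * write_buffer_size * max_write_buffer_number * num_cfs

MiB = 2**20
GiB = 2**30

# With a 256 MiB write buffer and 4 buffers per CF (values mentioned in this
# thread), even a hypothetical 4 column families already yield a 16 GiB cap:
cap = implicit_wal_cap(256 * MiB, 4, 4)
print(cap // GiB, "GiB")  # → 16 GiB
```

With 15+ shards, as discussed below, the implicit cap grows far past anything a WAL volume is typically sized for, which is why the patch pins it explicitly.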

@aclamk requested review from ifed01 and markhpc on May 27, 2020 14:26
@aclamk force-pushed the wip-limit-wal-sharding branch from 072abff to a4a3dd1 on May 27, 2020 14:27
// This is only used if we have some non-default column families,
// but the value rocksdb uses by default is way too large.
if (opt.max_total_wal_size == 0) {
  opt.max_total_wal_size = opt.write_buffer_size;
}
Contributor:

Isn't it better to adjust the bluestore_rocksdb_options parameter and insert a proper value for max_total_wal_size there?
BTW, I can see that write_buffer_size is set to 256MB in Ceph by default...

Member:

> Isn't it better to adjust the bluestore_rocksdb_options parameter and insert a proper value for max_total_wal_size there?
> BTW, I can see that write_buffer_size is set to 256MB in Ceph by default...

I think that makes sense. It would be clear what it's set to without having to read the code. Also, I think you are right: in this case wouldn't we end up with max_total_wal_size = 256MB, since it's not accounting for max_write_buffer_number?

I'm also wondering if we might want smaller memtables now. I'm not sure what will happen with 4 large buffers and 15 CFs. This thread may be helpful: facebook/rocksdb#5789

Contributor Author:

I modified store_test a bit (master...aclamk:dnm-test-wal-sharding) to test different settings of max_total_wal_size.

Test procedure:
bin/ceph_test_objectstore --gtest_filter=*SpilloverTest*/2 --bluestore_rocksdb_options "max_total_wal_size=${wal},compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,writable_file_max_buffer_size=0,compaction_readahead_size=2097152" --bluestore_rocksdb_cf true --no-log-to-stderr 2>/dev/null |grep WAL;
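The sweep over WAL sizes can be scripted; a minimal sketch, where the loop and helper are mine and only the command line itself comes from the comment above:

```python
# Hedged sketch: generate the ceph_test_objectstore command for each
# max_total_wal_size value tested below.  Running the commands (and the
# "| grep WAL" filtering) is left to the shell.
wal_sizes = [200_000_000, 400_000_000, 600_000_000, 800_000_000,
             1_000_000_000, 1_500_000_000, 2_000_000_000,
             2_500_000_000, 3_000_000_000]

BASE_OPTS = ("compression=kNoCompression,max_write_buffer_number=4,"
             "min_write_buffer_number_to_merge=1,recycle_log_file_num=4,"
             "writable_file_max_buffer_size=0,compaction_readahead_size=2097152")

def rocksdb_opts(wal):
    # Same option string the test command uses, with max_total_wal_size
    # substituted in for each run.
    return f"max_total_wal_size={wal},{BASE_OPTS}"

for wal in wal_sizes:
    print(f'bin/ceph_test_objectstore --gtest_filter="*SpilloverTest*/2" '
          f'--bluestore_rocksdb_options "{rocksdb_opts(wal)}" '
          f'--bluestore_rocksdb_cf true --no-log-to-stderr')
```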

Results:

WAL=200000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         384 MiB     0 B         0 B         0 B         383 MiB     2           
WAL         0 B         384 MiB     0 B         0 B         0 B         383 MiB     
WAL=400000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         1.4 GiB     138 MiB     0 B         0 B         445 MiB     8           
WAL         0 B         1.6 GiB     138 MiB     0 B         0 B         684 MiB     
WAL=600000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         1.5 GiB     51 MiB      0 B         0 B         659 MiB     11          
WAL         0 B         1.6 GiB     51 MiB      0 B         0 B         818 MiB     
WAL=800000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         1.3 GiB     28 MiB      0 B         0 B         547 MiB     10          
WAL         0 B         1.6 GiB     56 MiB      0 B         0 B         851 MiB     
WAL=1000000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         1.3 GiB     33 MiB      0 B         0 B         571 MiB     11          
WAL         0 B         1.5 GiB     197 MiB     0 B         0 B         970 MiB     
WAL=1500000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         1.1 GiB     33 MiB      0 B         0 B         567 MiB     10          
WAL         0 B         1.4 GiB     173 MiB     0 B         0 B         1.4 GiB     
WAL=2000000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         1.1 GiB     38 MiB      0 B         0 B         567 MiB     10          
WAL         0 B         1.9 GiB     150 MiB     0 B         0 B         1.9 GiB     
WAL=2500000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         1.2 GiB     61 MiB      0 B         0 B         547 MiB     10          
WAL         0 B         2.2 GiB     146 MiB     0 B         0 B         2.3 GiB     
WAL=3000000000
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
WAL         0 B         1.2 GiB     44 MiB      0 B         0 B         567 MiB     10          
WAL         0 B         2.2 GiB     637 MiB     0 B         0 B         2.8 GiB

It seems that if the WAL limit is smaller than 1GB, rocksdb has trouble keeping the WAL below the set value.
For limits of 1GB and larger, rocksdb does not go over them.

@ifed01 (Contributor) commented May 29, 2020:

@aclamk - you might want to increase object count for the test case. Here are my results for 4K objects and 1GB WAL:
CFamily on, 4096 obj, 1GB wal
1 : device size 0xc0000000 : own 0x[2000~bfffe000] = 0xbfffe000 : using 0xbc7fe000(2.9 GiB)
2 : device size 0x105aa15000 : own 0x[10000~3e03d0000,7d9940000~a7780000] = 0x487b50000 : using 0x454c20000(17 GiB) : bluestore has 0xbd2eb0000(47 GiB) available
RocksDBBlueFSVolumeSelector: wal_total:0, db_total:3060164198, slow_total:66727998054, db_avail:0
Usage matrix:
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
LOG         0 B         16 MiB      0 B         0 B         0 B         11 MiB      1
WAL         0 B         2.8 GiB     271 MiB     0 B         0 B         1.2 GiB     7
DB          0 B         177 MiB     2.1 GiB     0 B         0 B         2.3 GiB     30
SLOW        0 B         0 B         15 GiB      0 B         0 B         15 GiB      229
TOTALS      0 B         2.9 GiB     17 GiB      0 B         0 B         0 B         267
MAXIMUMS:
LOG         0 B         16 MiB      0 B         0 B         0 B         11 MiB
WAL         0 B         2.8 GiB     271 MiB     0 B         0 B         1.6 GiB
DB          0 B         1.7 GiB     3.1 GiB     0 B         0 B         3.6 GiB
SLOW        0 B         0 B         15 GiB      0 B         0 B         15 GiB
TOTALS      0 B         3.0 GiB     18 GiB      0 B         0 B         0 B

Please note the WAL maximum at 1.6 GiB; hence I'm not sure your point is 100% valid.

Contributor:

And in the results I shared for a 256MB WAL, it never exceeds 512+ MB, which is in line with what I have observed before. Personally I'd prefer to preserve these actual numbers. Not many pros and cons though; I just prefer being conservative here, as one should be in data storage solutions ;)

Contributor Author:

@ifed01 So, basically, you propose to give a default of 256MB, but document that it is likely to grow to as much as 512MB?

Contributor:

Originally I didn't imply any documentation on that, and moreover I'm not 100% sure it is always below 512MB. So maybe it's better to simply avoid this sort of information in the log. Just recommend 4GB for the WAL volume and omit the rationale.

@ifed01 (Contributor) commented May 28, 2020:

@aclamk @markhpc - maybe another topic to discuss is the actual cap we want for the WAL size.
Originally I have never seen a WAL size higher than 0.5GB, even in the field.
Currently for my spillover test case the maximum seems to be around 3GB. That isn't dramatic, but it may still make sense to be more conservative and try to preserve the original numbers.
Below are DB stats from running the spillover test case with different 'max_total_wal_size' values: 1GB, 0.5GB and 0.25GB.

CFamily on, 4096 obj, 1GB wal
1 : device size 0xc0000000 : own 0x[2000~bfffe000] = 0xbfffe000 : using 0xbc7fe000(2.9 GiB)
2 : device size 0x105aa15000 : own 0x[10000~3e03d0000,7d9940000~a7780000] = 0x487b50000 : using 0x454c20000(17 GiB) : bluestore has 0xbd2eb0000(47 GiB) available
RocksDBBlueFSVolumeSelector: wal_total:0, db_total:3060164198, slow_total:66727998054, db_avail:0
Usage matrix:
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
LOG         0 B         16 MiB      0 B         0 B         0 B         11 MiB      1
WAL         0 B         2.8 GiB     271 MiB     0 B         0 B         1.2 GiB     7
DB          0 B         177 MiB     2.1 GiB     0 B         0 B         2.3 GiB     30
SLOW        0 B         0 B         15 GiB      0 B         0 B         15 GiB      229
TOTALS      0 B         2.9 GiB     17 GiB      0 B         0 B         0 B         267
MAXIMUMS:
LOG         0 B         16 MiB      0 B         0 B         0 B         11 MiB
WAL         0 B         2.8 GiB     271 MiB     0 B         0 B         1.6 GiB
DB          0 B         1.7 GiB     3.1 GiB     0 B         0 B         3.6 GiB
SLOW        0 B         0 B         15 GiB      0 B         0 B         15 GiB
TOTALS      0 B         3.0 GiB     18 GiB      0 B         0 B         0 B

db_used:3162497024
slow_used:18601869312

CFamily on, 4096 obj, 0.5GB wal
1 : device size 0xc0000000 : own 0x[2000~bfffe000] = 0xbfffe000 : using 0xbfffe000(3.0 GiB)
2 : device size 0x105aa15000 : own 0x[10000~3dea20000,7d9940000~a7780000] = 0x4861a0000 : using 0x442cc0000(17 GiB) : bluestore has 0xbd4860000(47 GiB) available
RocksDBBlueFSVolumeSelector: wal_total:0, db_total:3060164198, slow_total:66727998054, db_avail:0
Usage matrix:
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
LOG         0 B         12 MiB      4 MiB       0 B         0 B         12 MiB      1
WAL         0 B         2.1 GiB     207 MiB     0 B         0 B         1.0 GiB     5
DB          0 B         929 MiB     432 MiB     0 B         0 B         1.3 GiB     21
SLOW        0 B         0 B         16 GiB      0 B         0 B         16 GiB      250
TOTALS      0 B         3.0 GiB     17 GiB      0 B         0 B         0 B         277
MAXIMUMS:
LOG         0 B         12 MiB      4 MiB       0 B         0 B         12 MiB
WAL         0 B         2.1 GiB     207 MiB     0 B         0 B         1.6 GiB
DB          0 B         1.8 GiB     1.0 GiB     0 B         0 B         2.2 GiB
SLOW        0 B         0 B         16 GiB      0 B         0 B         16 GiB
TOTALS      0 B         3.0 GiB     17 GiB      0 B         0 B         0 B

db_used:3221217280
slow_used:18300534784

CFamily on, 4096 obj, 0.25GB wal
1 : device size 0xc0000000 : own 0x[2000~bfffe000] = 0xbfffe000 : using 0x76ffe000(1.9 GiB)
2 : device size 0x105aa15000 : own 0x[10000~3a0e80000,7d9940000~a7780000] = 0x448600000 : using 0x41c2d0000(16 GiB) : bluestore has 0xc12400000(48 GiB) available
RocksDBBlueFSVolumeSelector: wal_total:0, db_total:3060164198, slow_total:66727998054, db_avail:0
Usage matrix:
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
LOG         0 B         12 MiB      0 B         0 B         0 B         8.7 MiB     1
WAL         0 B         522 MiB     0 B         0 B         0 B         521 MiB     2
DB          0 B         1.3 GiB     0 B         0 B         0 B         1.3 GiB     26
SLOW        0 B         0 B         16 GiB      0 B         0 B         16 GiB      263
TOTALS      0 B         1.9 GiB     16 GiB      0 B         0 B         0 B         292
MAXIMUMS:
LOG         0 B         12 MiB      0 B         0 B         0 B         8.7 MiB
WAL         0 B         522 MiB     0 B         0 B         0 B         522 MiB
DB          0 B         2.5 GiB     65 MiB      0 B         0 B         2.5 GiB
SLOW        0 B         0 B         16 GiB      0 B         0 B         16 GiB
TOTALS      0 B         3.0 GiB     16 GiB      0 B         0 B         0 B

db_used:1996480512
slow_used:17652580352

This fixes a problem that occurs when sharding is turned on ('bluestore_rocksdb_cf=true').
The default value (0) caused rocksdb to set a maximum of 16GB for WALs.
Now this is 1GB, set via max_total_wal_size.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
@yuriw (Contributor) commented Jun 3, 2020

@markhpc (Member) commented Jun 4, 2020:

@ifed01 Regarding WAL size, I suspect that it depends on the throttling settings and how long memtable flushes (and, indirectly, L0 compactions) are taking. My assumption is that we want plenty of runway, since we are only talking about a few GB here and space is (relatively) cheap. I.e., we don't want to end up blocking during long flushes, preferring instead to throttle writes to some equilibrium. Especially on NVMe, I think we'll need a fair amount of WAL to do that?

@ifed01 (Contributor) commented Jun 5, 2020:

@markhpc - your comment is generally valid. Larger WAL is rather more preferable on its own.

Potential hidden issues with existing deployments (and existing QA cases) are what makes me a bit nervous. I presume WAL has never (very rarely?) exceeded 512MB before. With this patch that's not the case anymore for sure - wouldn't this result in different behavior during QA runs and/or some issues in upgraded production clusters which "got used to" smaller WALs?

Have we ever seen a real WAL larger than 512MB? Do we have any numbers showing a 1+GB WAL is better? Maybe it's better to be more conservative and try to preserve the original numbers?

Generally IMO we might want to become more conservative in our modifications to stable components like BlueStore - as one has to be in storage solutions.... Just some grunting ;)

And finally I'm absolutely not sure my concerns are that important this time. Just wanted to share these thoughts.

@jdurgin (Member) commented Jun 5, 2020:

> Have we ever seen a real WAL larger than 512MB? Do we have any numbers showing a 1+GB WAL is better? Maybe it's better to be more conservative and try to preserve the original numbers?

Yes, in practice I recall WALs of at least 30GB, especially on hardware that was originally designed with filestore in mind, where there isn't enough space for db on flash. I'm not sure whether it had any impact on the performance compared to a smaller WAL. No correctness issues at least.

@markhpc (Member) commented Jun 11, 2020:

@ifed01 FWIW, when we originally did the rocksdb testing for bluestore, we arrived at a 1GB aggregate WAL size because that was the point at which we no longer saw a performance benefit on the Incerta nodes (P3700 NVMe). Ultimately it's all tied to flush latency, ingest rate, and throttling/stalling behavior:

https://github.com/facebook/rocksdb/wiki/Write-Stalls

I.e., I would suspect that things like level0_slowdown_writes_trigger and level0_stop_writes_trigger have a large impact on how large the WAL can grow.
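As a concrete illustration of the knobs being discussed, one could extend a bluestore_rocksdb_options-style string like this. The option names are real rocksdb options; the values shown are rocksdb's stock defaults, given purely for illustration, not as a tuning recommendation.

```python
# Hedged sketch: appending the write-stall triggers to an options string in
# the comma-separated key=value format bluestore_rocksdb_options uses.
base = "max_total_wal_size=1073741824,compression=kNoCompression"
stall = {
    # rocksdb stock defaults: throttle foreground writes at 20 L0 files,
    # hard-stop them at 36 (illustrative values only).
    "level0_slowdown_writes_trigger": 20,
    "level0_stop_writes_trigger": 36,
}
options = base + "," + ",".join(f"{k}={v}" for k, v in stall.items())
print(options)
```

How far these triggers are from the steady-state L0 file count is what determines whether writers stall while the WAL (and memtables) drain, which is the interplay the comment above is pointing at.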

@tchaikov tchaikov merged commit ff3a51f into ceph:master Jun 17, 2020

6 participants