
[RFC][WIP]os/bluestore: framework for more intelligent DB space usage#28960

Closed
ifed01 wants to merge 3 commits into ceph:master from ifed01:wip-ifed-flex-bluefs-size

Conversation

@ifed01
Contributor

@ifed01 ifed01 commented Jul 10, 2019

The idea is to force RocksDB to "hint" the corresponding DB level when opening a file.
This is implemented by passing level-size-aligned folders when opening the DB; RocksDB then opens each file under one of these folders, thereby denoting the DB level the file belongs to. The accuracy of such hints looks pretty good.
As a result one can build a volume usage matrix (DEVICE x DB_LEVEL), which allows more intelligent decisions about where to allocate bluefs extents for a specific file.
Currently this patch is mainly about infrastructure rather than actually making such decisions, except for an improvement for levels 4+ which allows partial DB space usage even when the whole L4 doesn't fit into the DB device.
One more improvement we can consider is using the WAL device for L0/L1 when WAL is underused...
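To illustrate the idea, here is a minimal self-contained sketch of the two pieces described above: inferring a file's DB level from the level-aligned directory it is opened in, and choosing a device for its extents, including the partial "L4+" spill policy. All identifiers (`level_for_dir`, `select_device`, the `db.L<N>` naming) are illustrative, not the PR's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

enum Device { DEV_WAL = 0, DEV_DB = 1, DEV_SLOW = 2 };

// Infer the DB level from the directory a file is created in, assuming the DB
// was opened with per-level directories like "db.L2". Returns -1 for files
// outside level-aligned dirs (WAL, MANIFEST, unsorted service files).
int level_for_dir(const std::string& dir) {
  auto pos = dir.rfind(".L");
  if (pos == std::string::npos) return -1;
  return std::stoi(dir.substr(pos + 2));
}

// Pick a device for a new extent.
// policy 0: levels >= db_cut_level always go to SLOW;
// policy 1: they may still use leftover DB space (the "L4+" improvement).
Device select_device(int level, int db_cut_level, int policy,
                     uint64_t db_free, uint64_t need) {
  if (level < 0) return DEV_DB;  // service/unsorted files stay on fast DB
  if (level >= db_cut_level) {
    if (policy == 1 && db_free >= need) return DEV_DB;
    return DEV_SLOW;
  }
  return db_free >= need ? DEV_DB : DEV_SLOW;
}
```

With policy 1 an L4 file can consume whatever DB space remains and only the rest spills to the slow device, which is exactly the difference visible between Case 1 and Case 2 below.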

Here is an example of DB levels and bluefs volume usage statistics collected by both the new framework and the existing methods. The new framework keeps two DEVICE x LEVEL matrices:
a) current values
b) maximum observed values
In the reports below one can check that the REAL column in the current-values matrix matches rocksdb's per-level stats,
and that the TOTALS row matches the bluefs-bdev-sizes output.
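The bookkeeping behind those two matrices can be sketched as follows (a simplified illustration, not the PR's actual structures): the current matrix is updated on every extent allocation/release, and the maximums matrix records the high-water mark per cell.

```cpp
#include <cassert>
#include <cstdint>

constexpr int NUM_DEVS = 3;  // columns: WAL, DB, SLOW
constexpr int NUM_ROWS = 6;  // rows: L0-1, L2, L3, L4+, WAL, UNSORTED

struct UsageMatrix {
  uint64_t cur[NUM_ROWS][NUM_DEVS] = {};  // current values
  uint64_t max[NUM_ROWS][NUM_DEVS] = {};  // maximum observed values

  // Account for a newly allocated extent and track the high-water mark.
  void add(int row, int dev, uint64_t len) {
    cur[row][dev] += len;
    if (cur[row][dev] > max[row][dev]) max[row][dev] = cur[row][dev];
  }
  // Account for a released extent; maximums are intentionally kept.
  void sub(int row, int dev, uint64_t len) {
    cur[row][dev] -= len;
  }
};
```

Summing a column of `cur` gives the per-device TOTALS row, which is what should agree with the bluefs-bdev-sizes output.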

Case 1: Original allocation policy:

2019-07-10T15:40:13.791+0300 7f32bb255700 1 RocksDBBlueFSVolumeSelector: wal_total:5368709120, db_total:96636764160, slow_total:107374182400, db_cut_level:4, policy:0 usage matrix:
**** current values matrix starts here ****
LEVEL, WAL, DB, SLOW, ****, **, REAL
L0-1 0,0,0,0,0,0
L2 0,1009778688,0,0,0,1002696637
L3 0,15006171136,0,0,0,14928509229
L4+ 0,0,78671511552,0,0,78313236820
WAL 530579456,1048576,0,0,0,345309959
UNSORTED 0,6291456,0,0,0,1285822
TOTALS 530579456,16023289856,78671511552,0,0,0
^^^^^ current values matrix ends here, maximums matrix follow
MAXIMUMS:
0,3279945728,0,0,0,3264581164
0,10412359680,0,0,0,10369342915
0,48643440640,0,0,0,48425182335
0,0,93072654336,0,0,92586709652
538968064,1048576,0,0,0,528943749
0,7340032,0,0,0,1285822
538968064,55071211520,93072654336,0,0,0
2019-07-10T15:40:13.791+0300 7f32bb255700 1 bluestore(/home/if/ceph/build/dev/osd0) bluefs bdev sizes:
0 : device size 0x140000000 : own 0x[1000~13ffff000] = 0x13ffff000 : using 0x1faff000(507 MiB)
1 : device size 0x1680000000 : own 0x[2000~167fffe000] = 0x167fffe000 : using 0x3bb1fe000(15 GiB)
2 : device size 0x1900000000 : own 0x[3600000~300000,3a00000~52ac00000,52ea00000~600d00000,b33d00000~a60d00000,1595000000~3dc00000] = 0x15ca500000 : using 0x1251300000(73 GiB)
db_statistics {
"rocksdb_compaction_statistics": "",
"": "",
"": "Compaction Stats [default] **",
"": "Level Files Size ...",
"": "--------------------- ...",
"": " L0 0/0 0.00 KB ...",
"": " L1 0/0 0.00 KB ...",
"": " L2 19/1 956.25 MB ...",
"": " L3 234/27 13.00 GB ...",
"": " L4 1161/0 72.93 GB ...",
"": " Sum 1414/28 86.87 GB ..."

Case 2: Use some extra space for L4+ policy:
2019-07-10T16:21:45.827+0300 7f5f6f6d7700 1 RocksDBBlueFSVolumeSelector: wal_total:5368709120, db_total:96636764160, slow_total:107374182400, db_cut_level:4, policy:1 usage matrix:
LEVEL, WAL, DB, SLOW, ****, **, REAL
L0-1 0,419430400,0,0,0,417663585
L2 0,2645557248,0,0,0,2634051067
L3 0,26923237376,0,0,0,26823171800
L4+ 0,21681405952,0,0,0,21550869858
WAL 530579456,1048576,0,0,0,524066537
UNSORTED 0,5242880,0,0,0,511651
TOTALS 530579456,51675922432,0,0,0,0
MAXIMUMS:
0,6149898240,0,0,0,6121377338
0,19858980864,0,0,0,19782445877
0,46491762688,0,0,0,46268971009
0,21681405952,0,0,0,21550869858
538968064,1048576,0,0,0,531368421
0,7340032,0,0,0,511651
538968064,53198454784,0,0,0,0
2019-07-10T16:21:45.831+0300 7f5f6f6d7700 1 bluestore(/home/if/ceph/build/dev/osd0) bluefs bdev sizes:
0 : device size 0x140000000 : own 0x[1000~13ffff000] = 0x13ffff000 : using 0x1faff000(507 MiB)
1 : device size 0x1680000000 : own 0x[2000~167fffe000] = 0x167fffe000 : using 0xc0c3fe000(48 GiB)
2 : device size 0x1900000000 : own 0x[c00000000~100000000] = 0x100000000 : using 0x0(0 B)
db_statistics {
"rocksdb_compaction_statistics": "",
"": "",
"": "Compaction Stats [default] **",
"": "Level Files Size ...",
"": "----------------------- ...",
"": " L0 1/0 204.04 MB ...",
"": " L1 3/0 194.28 MB ...",
"": " L2 41/0 2.45 GB ...",
"": " L3 396/0 24.98 GB ...",
"": " L4 317/0 20.07 GB ..."

Relates to: http://tracker.ceph.com/issues/38745

Signed-off-by: Igor Fedotov ifedotov@suse.com

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

ifed01 added 2 commits July 10, 2019 16:01
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
@ifed01 ifed01 requested review from aclamk and liewegas July 10, 2019 13:34
@ifed01 ifed01 force-pushed the wip-ifed-flex-bluefs-size branch from ae3585d to 02d8e91 Compare July 10, 2019 13:38
@ifed01 ifed01 added the DNM label Jul 10, 2019
…r BlueFS.

It allows excessive space usage for higher DB levels.

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
@ifed01 ifed01 force-pushed the wip-ifed-flex-bluefs-size branch from 02d8e91 to 7f3e46e Compare July 10, 2019 17:29
@ifed01
Contributor Author

ifed01 commented Jul 11, 2019

One more example of L4 using DB space, this time with a spillover.
2019-07-11T14:52:36.393+0300 7f28ea1c6700 1 RocksDBBlueFSVolumeSelector: wal_total:5368709120, db_total:96636764160, slow_total:107374182400, db_cut_level:4, policy:1, max{1342177280,2684354560,26843545600} usage matrix:
LEVEL, WAL, DB, SLOW, ****, **, REAL
L0-1 0,418381824,0,0,0,417328942
L2 0,2658140160,0,0,0,2648314695
L3 0,26886537216,0,0,0,26785618406
L4+ 0,16862150656,31764512768,0,0,48369549967
WAL 530579456,1048576,0,0,0,526305973
UNSORTED 0,5242880,0,0,0,885839
TOTALS 530579456,46831501312,31764512768,0,0,0
MAXIMUMS:
0,5357174784,0,0,0,5333681549
0,18615369728,0,0,0,18545064748
0,55829331968,0,0,0,55551072776
0,16894656512,31764512768,0,0,48369549967
538968064,1048576,0,0,0,531337539
0,7340032,0,0,0,885839
538968064,73391931392,31764512768,0,0,0
2019-07-11T14:52:36.401+0300 7f28ea1c6700 1 bluestore(/home/if/ceph/build/dev/osd0) bluefs bdev sizes:
0 : device size 0x140000000 : own 0x[1000~13ffff000] = 0x13ffff000 : using 0x1faff000(507 MiB)
1 : device size 0x1680000000 : own 0x[2000~167fffe000] = 0x167fffe000 : using 0xaeb7fe000(44 GiB)
2 : device size 0x1900000000 : own 0x[100000~6a2000000,c00000000~100000000] = 0x7a2000000 : using 0x765500000(30 GiB)
db_statistics {
"rocksdb_compaction_statistics": "",
"": "",
"": "Compaction Stats [default] **",
"": "Level Files Size ...",
"": "----------------------- ...",
"": " L0 1/0 203.72 MB ...",
"": " L1 3/0 194.28 MB ...",
"": " L2 39/0 2.47 GB ...",
"": " L3 396/0 24.95 GB ...",
"": " L4 712/0 45.05 GB ..."

@aclamk
Contributor

aclamk commented Jul 22, 2019

@ifed01
I was looking at this PR and wondering if it is really better than simply using
virtual void WritableFile::SetWriteLifeTimeHint(Env::WriteLifeTimeHint hint)
which gives 4 values relating to levels 1, 2, 3, 4?
I mean, I fail to see the reason for the increased complexity.
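For context, the alternative being suggested is a per-file lifetime hint rather than a level hint. A self-contained sketch of how levels could map onto such hints (the enum mirrors rocksdb::Env::WriteLifeTimeHint from include/rocksdb/env.h; the mapping itself is illustrative, not taken from RocksDB or this PR):

```cpp
#include <cassert>

// Mirrors rocksdb::Env::WriteLifeTimeHint so the sketch compiles standalone.
enum WriteLifeTimeHint {
  WLTH_NOT_SET = 0,
  WLTH_NONE,
  WLTH_SHORT,
  WLTH_MEDIUM,
  WLTH_LONG,
  WLTH_EXTREME
};

// Illustrative mapping: the deeper the level, the longer its SST files tend
// to live before being rewritten by compaction.
WriteLifeTimeHint hint_for_level(int level) {
  switch (level) {
    case 0:  return WLTH_SHORT;
    case 1:  return WLTH_MEDIUM;
    case 2:  return WLTH_LONG;
    default: return WLTH_EXTREME;  // levels 3 and deeper
  }
}
```

Note the limitation ifed01 raises below: such a hint encodes expected data lifetime, not which device class the data should be placed on.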

@ifed01 ifed01 closed this Jul 23, 2019
@ifed01 ifed01 reopened this Jul 23, 2019
@ifed01
Contributor Author

ifed01 commented Jul 23, 2019

@aclamk - thanks a lot for pointing this function out, I wasn't aware of it.
Nevertheless I'm not sure it's the best choice - the provided hint doesn't strictly correlate with the target level (i.e. with data importance/priority). It is about how long the data should be preserved, not about how fast access to it is required.

@ifed01
Contributor Author

ifed01 commented Aug 16, 2019

Simplified version available at #29687

@ifed01 ifed01 closed this Aug 29, 2019
@ifed01
Contributor Author

ifed01 commented Aug 29, 2019

Closed in favour of #29687

