os/bluestore: v.2 framework for more intelligent DB space usage#29687

Merged
liewegas merged 7 commits into ceph:master from ifed01:wip-ifed-flex-bluefs-size-simplified
Dec 8, 2019

Conversation

@ifed01
Contributor

@ifed01 ifed01 commented Aug 15, 2019

This is an evolution and simplification of #28960. It neither uses a different naming model to learn SST file levels nor requires hints from RocksDB.
Instead it relies on the existing wal/db/db.slow paths to track space usage per WAL/DB/SLOW 'level' and per WAL/DB/SLOW device, and, depending on current and previous utilization, attempts to reuse extra DB space for the slow 'level' rather than putting that data unconditionally on the main device.
Similarly, WAL space could be reused for the DB 'level' if needed (not implemented).
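The core idea above can be sketched roughly as follows. This is a hypothetical, heavily simplified illustration, not the PR's actual implementation: the struct, field names, and thresholds are made up for clarity; only the method name `select_prefer_bdev` mirrors the real selector API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified sketch of the "use some extra" policy: data
// whose natural target is the SLOW level gets redirected to the DB device
// as long as tracked DB usage leaves enough spare capacity there.
enum Dev : uint8_t { DEV_WAL = 0, DEV_DB = 1, DEV_SLOW = 2 };

struct VolumeSelectorSketch {
  uint64_t db_total;  // bytes owned by BlueFS on the DB device
  uint64_t db_used;   // bytes currently used there (all levels combined)
  uint64_t reserved;  // space barred from the "use some extra" policy

  // Pick a device for a file whose natural target is 'hint'.
  uint8_t select_prefer_bdev(uint8_t hint) const {
    if (hint == DEV_SLOW && db_used + reserved < db_total) {
      return DEV_DB;  // reuse spare DB space for slow-level data
    }
    return hint;      // otherwise keep the natural target
  }
};
```

This is why the stats dumps below show "slow" data accounted against the DB column: the selector tracks usage per (level, device) pair and redirects slow-level writes while headroom remains.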

Internal/RocksDB stats dump (note the 44 GB of "slow" data placed on the DB volume for free):

bluestore(/home/if/ceph/build/dev/osd0)  bluefs bdev sizes:
0 : device size 0x140000000 : own 0x[1000\~13ffff000] = 0x13ffff000 : using 0x1f2ff000(499 MiB)
1 : device size 0x1900000000 : own 0x[2000\~18ffffe000] = 0x18ffffe000 : using 0x11dd8fe000(71 GiB)
2 : device size 0x1900000000 : own 0x[100000\~18efe00000,18f0000000\~4100000,18f4800000\~4100000,18f9000000\~4100000,18fd800000\~2800000] = 0x18fe900000 : using 0x18fb800000(100 GiB)

RocksDBBlueFSVolumeSelector: wal_total:5100273664, db_total:102005473280, slow_total:102005473280, db_avail:44291850240 usage matrix:
LEVEL, WAL, DB, SLOW, ****, ****, REAL
WAL 522190848,1048576,0,0,0,513338069
DB 0,31607226368,0,0,0,31461968188
SLOW 0,*44343230464*,107298684928,0,0,150927338859
UNSORTED 0,0,0,0,0,0
TOTALS 522190848,75951505408,107298684928,0,0,0
MAXIMUMS:
538968064,1048576,0,0,0,530301834
0,37870370816,0,0,0,37697560316
0,44343230464,107298684928,0,0,150927338859
0,0,0,0,0,0
538968064,79353085952,107298684928,0,0,0
db_statistics {
    "rocksdb_compaction_statistics": "",
    "": "",
    "": "** Compaction Stats [default] **",
    "": "Level    Files   Size     
    "": "----------------------------------------------------------------------------------------------------------------------------------------------------------------------------",
    "": "  L0      1/0   202.98 MB   
    "": "  L1      7/1   395.22 MB   
    "": "  L2     63/5    3.77 GB   
    "": "  L3    435/1   24.57 GB 
    "": "  L4   2246/0   140.56 GB 
    "": " Sum   2752/7   169.48 GB   

Another sample: 32 GB of L4 data (originally targeted at the slow device) on the DB volume:

bluestore(/home/if/ceph/build/dev/osd0)  bluefs bdev sizes:
0 : device size 0x140000000 : own 0x[1000\~13ffff000] = 0x13ffff000 : using 0x147ff000(328 MiB)
1 : device size 0x1e00000000 : own 0x[2000\~18ffffe000] = 0x18ffffe000 : using 0x7d58fe000(31 GiB)
2 : device size 0x1900000000 : own ... = 0x1675400000 : using 0x109100000(4.1 GiB)

RocksDBBlueFSVolumeSelector: wal_total:5100273664, db_total:122406567936, slow_total:102005473280, db_avail:64692944896 usage matrix:
LEVEL, WAL, DB, SLOW, ****, ****, REAL
WAL 342884352,1048576,0,0,0,338348352
DB 0,370147328,0,0,0,353936829
SLOW 0,*32027705344*,3994025984,0,0,35833722268
UNSORTED 0,0,0,0,0,0
TOTALS 342884352,32398901248,3994025984,0,0,0
MAXIMUMS:
522190848,1048576,0,0,0,513350357
0,33274462208,0,0,0,33122793939
0,64760053760,107298684928,0,0,168613750477
0,0,0,0,0,0
522190848,77617692672,107298684928,0,0,0
db_statistics {
    "rocksdb_compaction_statistics": "",
    "": "",
    "": "** Compaction Stats [default] **",
    "": "Level    Files   Size
    "": "--------------------
    "": "  L0      0/0    0.00 KB   
    "": "  L1      0/0    0.00 KB       
    "": "  L2      0/0    0.00 KB
    "": "  L3     21/0   336.46 MB   
    "": "  L4    556/0   33.37 GB   
    "": " Sum    577/0   33.70 GB

Signed-off-by: Igor Fedotov <ifedotov@suse.com>

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@ifed01 ifed01 force-pushed the wip-ifed-flex-bluefs-size-simplified branch from f237499 to 31ea145 on August 15, 2019 at 14:08
@ifed01 ifed01 requested review from liewegas and markhpc August 15, 2019 14:09
@ifed01 ifed01 force-pushed the wip-ifed-flex-bluefs-size-simplified branch 2 times, most recently from 7f22ce9 to 69e47d9 on August 16, 2019 at 14:42
@ifed01 ifed01 changed the title from "[DNM][RFC] os/bluestore: os/bluestore: v.2 framework for more intelligent DB space usage" to "os/bluestore: os/bluestore: v.2 framework for more intelligent DB space usage" Aug 19, 2019
@ifed01 ifed01 changed the title from "os/bluestore: os/bluestore: v.2 framework for more intelligent DB space usage" to "os/bluestore: v.2 framework for more intelligent DB space usage" Aug 19, 2019
Member

@liewegas liewegas left a comment


I like this. It seems like even if we had the explicit hinting, we'd need the 2d map for usage tracking to know where to put things--these are mostly orthogonal parts of the problem.

Can we add some basic tests that verify that we end up putting some of the higher-level data on db?

if (prefer_bdev != dev_target && devs_source.count(prefer_bdev)) {
prefer_bdev = dev_target_new;
}
}*/
Member


remove this?

if (prefer_bdev != dev_target && devs_source.count(prefer_bdev)) {
prefer_bdev = dev_target_new;
}
}*/
Member


remove

return 0;
}

int RocksDBStore::ParseOptionsFromStringStatic(
Member


instead of duplicating the function above, can we call the static version from the non-static, or just use the static one exclusively?
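The de-duplication the reviewer suggests is the usual delegate-to-static pattern. A hypothetical, minimal illustration (the struct, the `verbosity` field, and the trivial parsing body are invented for the example; only the idea of a static/non-static pair mirrors `RocksDBStore::ParseOptionsFromStringStatic`):

```cpp
#include <string>

// Hypothetical sketch: keep the parsing logic once in a static method,
// parameterized on whatever instance state it needs, and have the member
// method delegate to it instead of duplicating the body.
struct Store {
  int verbosity = 0;

  static int ParseOptionsStatic(const std::string& opts, int verbosity) {
    // ... shared parsing logic lives here, once ...
    (void)verbosity;
    return opts.empty() ? -1 : 0;  // placeholder result for the sketch
  }

  int ParseOptions(const std::string& opts) {
    return ParseOptionsStatic(opts, verbosity);  // delegate, don't duplicate
  }
};
```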

block_all(MAX_BDEV)
bdev(MAX_BDEV),
ioc(MAX_BDEV),
block_all(MAX_BDEV)
Member


this should stay aligned with cct(cct)?

Option("bluestore_volume_selection_reserved_factor", Option::TYPE_FLOAT, Option::LEVEL_ADVANCED)
.set_flag(Option::FLAG_STARTUP)
.set_default(2.0)
.set_description("Max DB level size multiplier. Togather they determine amount of space to preserve from usage due to 'use some extra' policy"),
Member


s/togather/together/

Do I understand correctly that this means that 2x of the estimated DB level(s) size will be reserved, and then what is left over after that will be used for subsequent levels?

Contributor Author


Right. An additional cap is the maximum amount of data previously observed at DB for the DB/WAL levels, so the actual reserved size is the max of the two.
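Putting the two statements together, the reservation could be computed roughly like this. This is a hedged sketch of the rule as described in the discussion, not the PR's code; the function name and parameters are invented:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch: 'reserved_factor' (default 2.0, per
// bluestore_volume_selection_reserved_factor) multiplies the estimated DB
// level size, and the historically observed maximum DB usage acts as an
// additional floor; the actual reservation is the max of the two.
uint64_t reserved_bytes(uint64_t estimated_db_level_size,
                        uint64_t max_observed_db_usage,
                        double reserved_factor = 2.0) {
  uint64_t by_factor =
      static_cast<uint64_t>(reserved_factor * estimated_db_level_size);
  return std::max(by_factor, max_observed_db_usage);
}
```

Whatever exceeds this reservation on the DB device is what the "use some extra" policy may hand to subsequent (slow) levels.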

Contributor

@anthonyeleven anthonyeleven Oct 1, 2019


Are there situations where this reservation might actually force a level to slow storage because there is no longer space on a faster device?

Contributor Author


@anthonyeleven - theoretically yes. The 2x multiplier is a bit speculative and doesn't have a strong rationale behind it. So in the worst case some lower-level data can go to the slow device while higher-level data sits on the fast device. I don't think this is very likely, though...

<< dendl;

// set volume selector
if (vselector == nullptr) {
Member


wouldn't it be simpler/safer to reset(nullptr) in umount and at end of mkfs, and assert vselector is null before you instantiate it here (and below in mount())?
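The lifecycle the reviewer proposes can be sketched as below. This is a hypothetical, minimal illustration under the assumption that the selector is owned via a smart pointer; the struct names and empty `VolumeSelector` body are invented, and only the reset-on-umount / assert-on-mount idea reflects the suggestion:

```cpp
#include <cassert>
#include <memory>

struct VolumeSelector { /* selector state elided for the sketch */ };

// Hypothetical sketch: reset the selector in umount() (and at the end of
// mkfs()), and assert it is null before instantiating it in mount().
// With this invariant there is no need for a separate restart() method.
struct BlueFSSketch {
  std::unique_ptr<VolumeSelector> vselector;

  void mount() {
    assert(!vselector);  // must have been cleared by umount()/mkfs()
    vselector = std::make_unique<VolumeSelector>();
  }

  void umount() {
    vselector.reset();   // drop state here instead of calling restart()
  }
};
```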

virtual uint8_t select_prefer_bdev(void* hint) = 0;
virtual void get_paths(const std::string& base, paths& res) const = 0;
virtual void dump(CephContext* cct) = 0;
virtual void restart() = 0;
Member


then you wouldn't need restart()...

@ifed01
Contributor Author

ifed01 commented Aug 27, 2019

> I like this. It seems like even if we had the explicit hinting, we'd need the 2d map for usage tracking to know where to put things--these are mostly orthogonal parts of the problem.

Correct.

> Can we add some basic tests that verify that we end up putting some of the higher-level data on db?

Will do.

@ifed01 ifed01 force-pushed the wip-ifed-flex-bluefs-size-simplified branch 2 times, most recently from 5aad709 to f4ffbfd on August 30, 2019 at 16:32
@ifed01
Contributor Author

ifed01 commented Aug 30, 2019

@liewegas - done, please review.

@ifed01 ifed01 force-pushed the wip-ifed-flex-bluefs-size-simplified branch from f4ffbfd to 97d7972 on September 9, 2019 at 17:36
@ifed01 ifed01 force-pushed the wip-ifed-flex-bluefs-size-simplified branch from 97d7972 to 44765db on September 26, 2019 at 15:50
@ifed01
Contributor Author

ifed01 commented Sep 27, 2019

jenkins retest this please

@ifed01 ifed01 force-pushed the wip-ifed-flex-bluefs-size-simplified branch from 44765db to c2cddbe on October 31, 2019 at 12:06
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
…r BlueFS.

It allows higher DB levels to use excess space.

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
This is an alternative (and straightforward) way to specify barred space not allowed for 'use
some extra' BlueFS volume selector policy.
Disabled by default since it should depend on RocksDB settings and actual volume size.

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
@ifed01 ifed01 force-pushed the wip-ifed-flex-bluefs-size-simplified branch from c2cddbe to 57c6a79 on November 26, 2019 at 19:51
@ifed01
Contributor Author

ifed01 commented Nov 26, 2019

@liewegas - ping? Any barriers to merging this?

Member

@liewegas liewegas left a comment


yay!

@tchaikov
Contributor

tchaikov commented Dec 5, 2019

@liewegas
Member

liewegas commented Dec 8, 2019

opened https://tracker.ceph.com/issues/43189 for the hung PGs; it occurs on master too, so it is not related to this PR. I think this can merge!

liewegas added a commit that referenced this pull request Dec 8, 2019
* refs/pull/29687/head:
	test/store_test: test coverage for anti-spillover framework.
	os/bluestore: introduce bluestore_volume_selection_reserved cfg option.
	os/bluestore: beautify RocksDBBlueFSVolumeSelector::dump output
	os/bluestore: introduce 'bluestore bluefs stats' admin socket command
	os/bluestore: implement more intelligent DB volume space management for BlueFS.
	os/kv: add static method to parse RocksDB options
	os/bluestore: introduce volume selector abstraction to BlueFS

Reviewed-by: Sage Weil <sage@redhat.com>
@liewegas liewegas merged commit 57c6a79 into ceph:master Dec 8, 2019
@ifed01 ifed01 deleted the wip-ifed-flex-bluefs-size-simplified branch April 1, 2020 10:52
ifed01 added a commit to ifed01/ceph that referenced this pull request Aug 21, 2020
It turned out that we didn't enable it when introducing this feature in
ceph#29687.

Fixes: https://tracker.ceph.com/issues/47053

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
neha-ojha pushed a commit to neha-ojha/ceph that referenced this pull request Sep 10, 2020
It turned out that we didn't enable it when introducing this feature in
ceph#29687.

Fixes: https://tracker.ceph.com/issues/47053

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
(cherry picked from commit 41823db)
neha-ojha pushed a commit to neha-ojha/ceph that referenced this pull request Sep 10, 2020
It turned out that we didn't enable it when introducing this feature in
ceph#29687.

Fixes: https://tracker.ceph.com/issues/47053

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
(cherry picked from commit 41823db)
guits pushed a commit to guits/ceph that referenced this pull request Sep 16, 2020
It turned out that we didn't enable it when introducing this feature in
ceph#29687.

Fixes: https://tracker.ceph.com/issues/47053

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
lixiaoy1 pushed a commit to lixiaoy1/ceph that referenced this pull request Sep 21, 2020
It turned out that we didn't enable it when introducing this feature in
ceph#29687.

Fixes: https://tracker.ceph.com/issues/47053

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
mattbenjamin pushed a commit to linuxbox2/ceph that referenced this pull request Nov 24, 2020
It turned out that we didn't enable it when introducing this feature in
ceph#29687.

Fixes: https://tracker.ceph.com/issues/47053

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
(cherry picked from commit 41823db)
(cherry picked from commit 6d37c6a)

Resolves: rhbz#1877910
ifed01 added a commit to ifed01/ceph that referenced this pull request Nov 30, 2020
It turned out that we didn't enable it when introducing this feature in
ceph#29687.

Fixes: https://tracker.ceph.com/issues/47053

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
(cherry picked from commit 41823db)
votdev pushed a commit to votdev/ceph that referenced this pull request Dec 1, 2020
It turned out that we didn't enable it when introducing this feature in
ceph#29687.

Fixes: https://tracker.ceph.com/issues/47053

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
(cherry picked from commit 41823db)