os/bluestore: v.2 framework for more intelligent DB space usage#29687
os/bluestore: v.2 framework for more intelligent DB space usage#29687liewegas merged 7 commits intoceph:masterfrom
Conversation
f237499 to
31ea145
Compare
7f22ce9 to
69e47d9
Compare
liewegas
left a comment
There was a problem hiding this comment.
I like this. It seems like even if we had the explicit hinting, we'd need the 2d map for usage tracking to know where to put things--these are mostly orthogonal parts of the problem.
Can we add some basic tests the verify that we end up putting some of the higher-level data on db?
src/os/bluestore/BlueFS.cc
Outdated
| if (prefer_bdev != dev_target && devs_source.count(prefer_bdev)) { | ||
| prefer_bdev = dev_target_new; | ||
| } | ||
| }*/ |
src/os/bluestore/BlueFS.cc
Outdated
| if (prefer_bdev != dev_target && devs_source.count(prefer_bdev)) { | ||
| prefer_bdev = dev_target_new; | ||
| } | ||
| }*/ |
| return 0; | ||
| } | ||
|
|
||
| int RocksDBStore::ParseOptionsFromStringStatic( |
There was a problem hiding this comment.
instead of duplicating the function above, can we call the static version from the non-static, or just use the static one exclusively?
src/os/bluestore/BlueFS.cc
Outdated
| block_all(MAX_BDEV) | ||
| bdev(MAX_BDEV), | ||
| ioc(MAX_BDEV), | ||
| block_all(MAX_BDEV) |
There was a problem hiding this comment.
this should stay aligned with cct(cct)?
src/common/options.cc
Outdated
| Option("bluestore_volume_selection_reserved_factor", Option::TYPE_FLOAT, Option::LEVEL_ADVANCED) | ||
| .set_flag(Option::FLAG_STARTUP) | ||
| .set_default(2.0) | ||
| .set_description("Max DB level size multiplier. Togather they determine amount of space to preserve from usage due to 'use some extra' policy"), |
There was a problem hiding this comment.
s/togather/together/
Do I understand correctly that this means that 2x of the estimated DB level(s) size will be reserved, and then what is left over after that will be used for subsequent levels?
There was a problem hiding this comment.
Right. And additional cap is maximum amount of data previously observed at DB for DB/WAL levels. So the actual reserved size is max of both.
There was a problem hiding this comment.
Are there situations where this reservation might actually force a level to slow storage because there is no longer space on a faster device?
There was a problem hiding this comment.
@anthonyeleven - theoretically yes. The 2x multiplier is a bit speculative and doesn't have any strong rationales behind. So in the worst scenario some lower level data can go to slow device while higher level one are at fast device. Don't think this is highly likely though...
| << dendl; | ||
|
|
||
| // set volume selector | ||
| if (vselector == nullptr) { |
There was a problem hiding this comment.
wouldn't it be simpler/safer to reset(nullptr) in umount and at end of mkfs, and assert vselector is null before you instantiate it here (and below in mount())?
src/os/bluestore/BlueFS.h
Outdated
| virtual uint8_t select_prefer_bdev(void* hint) = 0; | ||
| virtual void get_paths(const std::string& base, paths& res) const = 0; | ||
| virtual void dump(CephContext* cct) = 0; | ||
| virtual void restart() = 0; |
There was a problem hiding this comment.
then you wouldn't need restart()...
Will do. |
5aad709 to
f4ffbfd
Compare
|
@liewegas - done, please review. |
f4ffbfd to
97d7972
Compare
97d7972 to
44765db
Compare
|
jenkins retest this please |
44765db to
c2cddbe
Compare
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
…r BlueFS. It allows excessive space usage for higher DB levels. Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
This is an alternative (and straightforward) way to specify barred space not allowed for 'use some extra' BlueFS volume selector policy. Disabled by default since it should depend on RocksDB settings and actual volume size. Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
c2cddbe to
57c6a79
Compare
|
@liewegas - ping? any barriers for this to go? |
|
failures tracked by i suspect that https://tracker.ceph.com/issues/43131 was caused by this PR. @ifed01 could you help take a look? |
|
opened https://tracker.ceph.com/issues/43189 for the hung pgs.. it occurs on master too, so not related to this PR. I think this can merge! |
* refs/pull/29687/head: test/store_test: test coverage for anti-spillover framework. os/bluestore: introduce bluestore_volume_selection_reserved cfg option. os/bluestore: beautify RocksDBBlueFSVolumeSelector::dump output os/bluestore: introduce 'bluestore bluefs stats' admin socket command os/bluestore: implement more intelligent DB volume space managment for BlueFS. os/kv: add static method to parse RocksDB options os/bluestore: introduce volume selector abstraction to BlueFS Reviewed-by: Sage Weil <sage@redhat.com>
It turned out that we did't enable it when introduced this feature in ceph#29687. Fixes: https://tracker.ceph.com/issues/47053 Signed-off-by: Igor Fedotov <ifedotov@suse.com>
It turned out that we did't enable it when introduced this feature in ceph#29687. Fixes: https://tracker.ceph.com/issues/47053 Signed-off-by: Igor Fedotov <ifedotov@suse.com> (cherry picked from commit 41823db)
It turned out that we did't enable it when introduced this feature in ceph#29687. Fixes: https://tracker.ceph.com/issues/47053 Signed-off-by: Igor Fedotov <ifedotov@suse.com> (cherry picked from commit 41823db)
It turned out that we did't enable it when introduced this feature in ceph#29687. Fixes: https://tracker.ceph.com/issues/47053 Signed-off-by: Igor Fedotov <ifedotov@suse.com>
It turned out that we did't enable it when introduced this feature in ceph#29687. Fixes: https://tracker.ceph.com/issues/47053 Signed-off-by: Igor Fedotov <ifedotov@suse.com>
It turned out that we did't enable it when introduced this feature in ceph#29687. Fixes: https://tracker.ceph.com/issues/47053 Signed-off-by: Igor Fedotov <ifedotov@suse.com> (cherry picked from commit 41823db) (cherry picked from commit 6d37c6a) Resolves: rhbz#1877910
It turned out that we did't enable it when introduced this feature in ceph#29687. Fixes: https://tracker.ceph.com/issues/47053 Signed-off-by: Igor Fedotov <ifedotov@suse.com> (cherry picked from commit 41823db)
It turned out that we did't enable it when introduced this feature in ceph#29687. Fixes: https://tracker.ceph.com/issues/47053 Signed-off-by: Igor Fedotov <ifedotov@suse.com> (cherry picked from commit 41823db)
This is an evolution and simplification of #28960. It neither uses different naming model to learn sst file level nor requires some hints from RocksDB.
Instead it relies on existing wal/db/db.slow paths to track space usage per WAL/DB/SLOW 'levels' and WAL/DB/SLOW devices. And attempts to reuse extra DB space for slow "level" rather than put it unconditionally to main device. Depending on the current and previous utilization...
Similarly we can reuse WAL space for DB "level" if needed (not implemented).
Internal/RocksDB stats dump (Note 44GB of "slow" data put on DB volume for free:
Another sample - 32GB of L4 data (originally targeted to slow device) at DB volume:
Signed-off-by: Igor Fedotov ifedotov@suse.com
Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test docsjenkins render docs