os/bluestore: AVL-tree & extent-based space allocator #30897
xiexingguo merged 4 commits into ceph:master
Conversation
force-pushed from f94dbc5 to 5849317
@liewegas Shall we still consider this as an alternative allocator? I had a talk with kefu last Saturday, and he optimistically thought this might still be useful for the incoming seastore implementation.

retest this please
force-pushed from 5849317 to 0da9888
@xiexingguo i ported your work to boost::intrusive::avltree. i think it's nice even if it's just a simpler implementation, and its performance is on par with that of the bitmap allocator.

yeah, it has been working well in our cluster for the past two years ✌️
src/os/bluestore/AvlAllocator.cc (outdated):

```cpp
double AvlAllocator::get_fragmentation()
{
  std::lock_guard l(lock);
  #warn TODO
```
still need to figure out how to calculate the ratio of fragments.
i think it can be roughly evaluated by:
max-contiguous-segment-size / device-size
i.e., the smaller the max contiguous segment size, the more fragmented the disk space is.
@xiexingguo i am using
total_unusable_size / total_free_size
as the indicator. i think it's different from the indicator you suggest, but the idea is quite similar.
force-pushed from 6aca973 to 7883e1b
ifed01 left a comment:
Smoke benchmarking using unittest_alloc_bench shows this implementation is ~30% faster than the bitmap allocator, which is good!
On the other hand, it potentially has a much higher memory footprint. Not to mention that memory usage is volatile and depends on the allocate/release pattern.
A rough estimate for maximum memory consumption is as follows (4TB space and 4K alloc unit):
Max number of free entries = 4TB/4K / 2 = 512M. Entry size (as per the mempool stats provided by unittest_alloc_bench) is 80 bytes, so one might need up to 40 GB of RAM.
The same for bitmap is 4TB/4K/8 = 128MB (plus some minor overhead, around 4M, for additional data structures).
For sure this is a corner case, and under regular usage the numbers are much lower most of the time.
src/os/bluestore/AvlAllocator.cc (outdated):

```cpp
auto rs = range_tree.lower_bound(range_seg_t{start, end});
if (rs != range_tree.end() && rs->start <= start && rs->end >= end) {
  assert(0 == "freeing free segment");
```
replace with ceph_assert here and there?
i am inclined to use ceph_assert() for failures we cannot recover from. the assert() here is to verify that we don't have a logic error, and it should be optimized out in non-debug builds. so i'd rather use assert() here.
```cpp
  return num_free;
}

double AvlAllocator::get_fragmentation()
```
Not sure I understand the implementation here. Let's consider the case where we have two free extents: 0-4096 and 16K-32K.
IMO this function will return 0 for this case, which isn't valid. Moreover, it looks like it returns 0 for any number of extents if they are all aligned with the block size.
I think the proper calculation should do something like this:
For a given num_free, the maximum possible number of free extents in the tree is M = num_free / block_size, which means full fragmentation (= 1). The minimal number of free extents (m) is 1, which means fragmentation = 0.
The simplest (maybe not the most perfect) approach would be just to estimate fragmentation as f = (num_free_extents - 1) / M.
```cpp
{
  std::lock_guard l(lock);
  range_tree.clear_and_dispose(dispose_rs{});
  range_size_tree.clear();
```
wouldn't it be more correct to dispose ranges after cleaning range_size_tree?
@tchaikov Results are: AVL gives allocations that are split into more fragments than stupid (5.1% vs 1.4%), fragments free space comparably (0.77 vs 0.75), and is 6 times slower.
```cpp
     rs != t.end(); ++rs) {
  uint64_t offset = p2roundup(rs->start, align);
  if (offset + size <= rs->end) {
    *cursor = offset + size;
```
quote from https://github.com/ceph/ceph/pull/18187/files#r146657678
A general concern about such cursor behavior (already highlighted for the stupid allocator in #18494) is that it permanently increases, which causes excessive fragmentation. Here is a scenario I observed for the Stupid Allocator that is probably applicable here as well:
There is a 40Gb block device (coupled with 16Gb DB and 10Gb WAL devices, but that doesn't matter). It's sequentially pre-conditioned with 32Gb of data. Hence we have 8Gb free on the block device, and the allocator has a single continuous 8Gb free extent at this point. Then we start doing random 4K overwrites that cause some 4K extents to return to the allocator while new ones are allocated from that large 8Gb extent. Please note that the returned small extents aren't reused until we drain all of the 8Gb extent, since the cursor only ever goes up. After overwriting 8Gb of data we have the large continuous extent fully exhausted and 8Gb of free space spread randomly over the released 4K extents. There are good chances that the remaining continuous extents are pretty short (<1Mb). One day the bluefs rebalance procedure comes into action and requests long continuous extent(s) of 1Mb... The allocator fails to do that and we hit an assert. That's a real use case I faced recently...
Maybe we should do a cursor rewind on extent release?
force-pushed from 7883e1b to ccbca7d
changelog

thanks @aclamk. with your updated "test_alloc_fragmentation_max_chunk_8M", i have the following test result:
jenkins test make check
force-pushed from ccbca7d to 4ee0136
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: Kefu Chai <kchai@redhat.com>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
…ting free score.
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
force-pushed from 4ee0136 to 53aac52
@xiexingguo @ifed01 @aclamk mind taking another look?
merging. It should do no harm since bitmap is still the default allocator, and this is a good example of an extent-based space allocation strategy.
Local (very rough) tests show that the performance is comparable with Stupid (at least); further verification is needed, though.
Benefits: better memory monitoring, perhaps? Also simpler code logic, if you don't have to dig into the details of the AVL implementation...