os/bluestore: AVL-tree & extent-based space allocator #30897
xiexingguo merged 4 commits into ceph:master
Conversation
force-pushed from f94dbc5 to 5849317
@liewegas Shall we still consider this as an alternative allocator? I had a talk with kefu last Saturday, and he optimistically thought this might still be useful for the incoming seastore implementation.

retest this please
force-pushed from 5849317 to 0da9888
@xiexingguo i ported your work to boost::intrusive::avltree. i think it's nice even if it's just a simpler implementation, and its performance is on par with that of the bitmap allocator.

yeah, it has been working well in our cluster for the past two years ✌️
src/os/bluestore/AvlAllocator.cc (outdated):

```cpp
double AvlAllocator::get_fragmentation()
{
  std::lock_guard l(lock);
  #warn TODO
```
still need to figure out how to calculate the ratio of fragments.
i think it can be roughly evaluated by:
max-contiguous-segment-size / device-size
i.e., the smaller the max contiguous segment size, the more fragmented the disk space is.
@xiexingguo i am using
total_unusable_size / total_free_size
as the indicator. i think it's different from the indicator you suggest, but the idea is quite similar.
force-pushed from 6aca973 to 7883e1b
ifed01 left a comment:
Smoke benchmarking using unittest_alloc_bench shows this implementation is ~30% faster than the bitmap allocator, which is good!
On the other hand, it potentially has a much higher memory footprint. Not to mention that memory usage is volatile and depends on the allocate/release pattern.
A rough estimate for maximum memory consumption is as follows (4TB space and 4K alloc unit):
Max number of free entries = 4TB/4K / 2 = 512M. Entry size (as per the mempool stats provided by unittest_alloc_bench) is 80 bytes, so one might need up to 40 GB of RAM.
The same for bitmap is 4TB/4K/8 = 128MB (plus some minor overhead, around 4M, for additional data structures).
For sure this is a corner case, and under regular usage the numbers are much lower most of the time.
src/os/bluestore/AvlAllocator.cc (outdated):

```cpp
auto rs = range_tree.lower_bound(range_seg_t{start, end});
if (rs != range_tree.end() && rs->start <= start && rs->end >= end) {
  assert(0 == "freeing free segment");
```
replace with ceph_assert here and there?
i am inclined to use ceph_assert() for failures we cannot recover from. the assert() here is to verify that we don't have a logic error, and it should be optimized out in non-debug builds. so i'd rather use assert() here.
```cpp
  return num_free;
}

double AvlAllocator::get_fragmentation()
```
Not sure I understand the implementation here. Let's consider the case where we have two free extents: 0-4096 and 16K-32K.
IMO this function will return 0 for this case, which isn't valid. Moreover, it looks like it returns 0 for any number of extents if they are all aligned with the block size.
I think the proper calculation should do something like this:
For a given num_free, the maximum possible number of free extents in the tree is M = num_free / block_size, which means full fragmentation (= 1). The minimal number of free extents (m) is 1, which means fragmentation = 0.
The simplest (maybe not the most perfect) approach would be just to estimate fragmentation as f = (num_free_extents - 1) / M.
```cpp
{
  std::lock_guard l(lock);
  range_tree.clear_and_dispose(dispose_rs{});
  range_size_tree.clear();
```
wouldn't it be more correct to dispose ranges after cleaning range_size_tree?
@tchaikov Results are: AVL gives allocations that are split into more fragments than stupid (5.1% vs 1.4%), fragments free space comparably (0.77 vs 0.75), and is 6 times slower.
```cpp
     rs != t.end(); ++rs) {
  uint64_t offset = p2roundup(rs->start, align);
  if (offset + size <= rs->end) {
    *cursor = offset + size;
```
quote from https://github.com/ceph/ceph/pull/18187/files#r146657678
A general concern about such cursor behavior (already highlighted for the stupid allocator in #18494) is that it permanently increases, which causes excessive fragmentation. Here is a scenario I observed for the Stupid Allocator that is probably applicable here as well:
There is a 40Gb block device (coupled with 16Gb DB and 10Gb WAL devices, but that doesn't matter). It's sequentially pre-conditioned with 32Gb of data. Hence we have 8Gb free on the block device, and the allocator has a single continuous 8Gb free extent at this point. Then we start doing random 4K overwrites that cause some 4K extents to return to the allocator while new ones are allocated from that large 8Gb extent. Please note that the returned small extents aren't reused until we drain all of the 8Gb extent, since the cursor only ever goes up. After overwriting 8Gb of data we have the large continuous extent fully exhausted and 8Gb of free space spread randomly over the released 4K extents. There are good chances that the remaining continuous extents are pretty short (<1Mb). One day the bluefs rebalance procedure comes into action and requests long continuous extent(s) of 1Mb... The allocator fails to do that and we hit an assert. That's a real use case I faced recently...
Maybe we should do a cursor rewind on extent release?
force-pushed from 7883e1b to ccbca7d
changelog

thanks @aclamk. with your updated "test_alloc_fragmentation_max_chunk_8M", i have the following test result:
jenkins test make check
force-pushed from ccbca7d to 4ee0136
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: Kefu Chai <kchai@redhat.com>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
…ting free score.
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
force-pushed from 4ee0136 to 53aac52
@xiexingguo @ifed01 @aclamk mind taking another look?
merging. It should do no harm since bitmap is still the default allocator, and this is a good example of an extent-based space allocation strategy.
Local (very rough) tests show that the performance is comparable with Stupid (at least); further verification is needed, though.
Benefits: better memory monitoring, perhaps? Also simpler code logic, if you don't have to dig into the details of the AVL implementation...