os/bluestore: AVL-tree & extent-based space allocator #30897

Merged
xiexingguo merged 4 commits into ceph:master from tchaikov:wip-bluestore/avl-allocator
Oct 22, 2019

Conversation

@tchaikov
Contributor

Local (very rough) tests show that the performance is at least comparable with Stupid, though this needs further verification.
Benefits: better memory monitoring, perhaps? Also simpler code logic, since you don't have to dig into the details of the AVL implementation...

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard backend
  • jenkins test docs
  • jenkins render docs

@xiexingguo
Member

@liewegas Shall we still consider this as an alternative allocator? I had a talk with kefu last Saturday, and he was optimistic that this might still be useful for the upcoming seastore implementation..

@xiexingguo
Member

retest this please

@tchaikov force-pushed the wip-bluestore/avl-allocator branch from 5849317 to 0da9888 on October 15, 2019 00:30
@tchaikov
Contributor Author

@xiexingguo I ported your work to boost::intrusive::avltree. I think it's worthwhile even if it's just a simpler implementation whose performance is on par with that of the bitmap allocator.

@xiexingguo
Member

yeah, it has worked well in our cluster for the past two years ✌️

double AvlAllocator::get_fragmentation()
{
std::lock_guard l(lock);
#warn TODO
Contributor Author


still need to figure out how to calculate the ratio of fragments.

Member


i think it can be roughly evaluated by:
max-contiguous-segment-size / device-size

i.e., the smaller max-contiguous-segment-size is, the more fragmented the disk space is.

Contributor Author


@xiexingguo I am using

total_unusable_size / total_free_size

as the indicator. I think it's different from the indicator you suggest, but the idea is quite similar.

@tchaikov force-pushed the wip-bluestore/avl-allocator branch 2 times, most recently from 6aca973 to 7883e1b on October 15, 2019 05:26
Member

@xiexingguo left a comment


I like it. At least it is a better alternative compared to stupid.
Also I am wondering if we can kill stupid entirely and rename avl to extent, which will be less confusing..
But let's do that after this patch gets in and becomes stable..

Contributor

@ifed01 left a comment


Smoke benchmarking using unittest_alloc_bench shows this implementation is ~30% faster than the bitmap allocator. Which is good!
On the other hand, it might potentially have a much higher memory footprint. Not to mention that memory usage is volatile and depends on the allocate/release pattern.
A rough estimate for maximum memory consumption is as follows (4TB space and 4K alloc unit):
Max number of free entries = 4TB/4K / 2 = 512M. Entry size (as per mempool stats provided by unittest_alloc_bench) is 80 bytes. So one might need up to 40 GB of RAM.
The same for bitmap is 4TB/4K/8 = 128MB (plus some minor overhead, around 4M, for additional data structures).
For sure this is a corner case, and under regular usage the numbers are much lower most of the time.


auto rs = range_tree.lower_bound(range_seg_t{start, end});
if (rs != range_tree.end() && rs->start <= start && rs->end >= end) {
assert(0 == "freeing free segment");
Contributor


replace with ceph_assert here and there?

Contributor Author


I am inclined to use ceph_assert() for failures we cannot recover from. The assert() here is to verify that we don't have a logic error, and it should be optimized out in non-debug builds, so I'd rather use assert() here.

return num_free;
}

double AvlAllocator::get_fragmentation()
Contributor


Not sure I understand the implementation here. Let's consider the case where we have two free extents: 0-4096 and 16K-32K.
IMO this function will return 0 in this case, which isn't valid. Moreover, it looks like it returns 0 for any number of extents as long as they are all aligned to the block size.
I think the proper calculation should do something like this:
For the given num_free, the maximum possible number of free extents in the tree is M = num_free / block_size, which means full fragmentation = 1. The minimal number of free extents (m) is 1, which means fragmentation = 0.
The simplest (maybe not the most perfect) approach would be just to estimate fragmentation as f = (num free extents - 1) / M

Contributor Author


fixed.

{
std::lock_guard l(lock);
range_tree.clear_and_dispose(dispose_rs{});
range_size_tree.clear();
Contributor


wouldn't it be more correct to dispose ranges after cleaning range_size_tree?

Contributor Author


updated.

@aclamk
Contributor

aclamk commented Oct 16, 2019

@tchaikov
I tested it using unittest_alloc_aging, updated:
aclamk@e1e90b2

Results are:
Summary:
avl fragmented allocs=5.1326% #frags=2.02157 free_score=0.769476 time=184909ms
bitmap fragmented allocs=55.3145% #frags=2.60743 free_score=0.583628 time=8103.41ms
stupid fragmented allocs=1.40833% #frags=2.04938 free_score=0.749268 time=28889.8ms

So, AVL gives allocations that are split into more fragments than stupid (5.1% vs 1.4%), fragments free space comparably (0.77 vs 0.75), and is 6 times slower.

@xiexingguo
Member

@aclamk Leaving the fragmentation problem aside (I suspect we still have to figure out a more precise way to evaluate it), I wonder if we can run some tests against the original Sun Microsystems AVL implementation.

#18187 (probably need some rebase first, I can help if necessary)

rs != t.end(); ++rs) {
uint64_t offset = p2roundup(rs->start, align);
if (offset + size <= rs->end) {
*cursor = offset + size;
Contributor Author


quote from https://github.com/ceph/ceph/pull/18187/files#r146657678

A general concern about such cursor behavior (already highlighted for the stupid allocator in #18494) is that it permanently increases, which causes excessive fragmentation. Here is a scenario I observed for the Stupid Allocator that is probably applicable here as well:
There is a 40Gb block device (coupled with 16 Gb DB and 10 Gb WAL devices, but that doesn't matter). It's sequentially pre-conditioned with 32 Gb of data. Hence we have 8 Gb free on the block device, and the allocator has a single contiguous 8Gb free extent at this point. Then we start doing random 4K overwrites that cause some 4K extents to return to the allocator while new ones are allocated from that large 8Gb extent. Please note that the returned small extents aren't reused until we drain the whole 8Gb extent, since the cursor only ever moves up. After overwriting 8Gb of data we have the large contiguous extent fully exhausted and the free 8Gb spread randomly over released 4K extents. There are good chances that the remaining contiguous extents are pretty short (<1Mb). One day the bluefs rebalance procedure comes into action and requests long contiguous extent(s) of 1Mb... The allocator fails to provide them and we hit an assert. That's a real use case I faced recently...
Maybe we should do a cursor rewind on extent release?

@tchaikov force-pushed the wip-bluestore/avl-allocator branch from 7883e1b to ccbca7d on October 17, 2019 11:51
@tchaikov
Contributor Author

changelog

  • use a minimal version of range_seg_t for searching in the range trees
  • templatize the comparators so they can be used along with the minimal version of range_seg_t
  • fix the merge logic; it was buggy, as I was merging the wrong ranges
  • in AvlAllocator::_add_to_tree(), do not check if a new range is created before inserting it -- move this logic back into the if statements
  • do not use the generic avltree; use avl_set and avl_multiset for better readability
  • use insert_before() when possible for better performance
  • add "avl" to more allocator tests

Thanks @aclamk. With your updated "test_alloc_fragmentation_max_chunk_8M", I have the following test results:

allocator  fragmented allocs  #frags   free_score
stupid     0%                 2        0.731972
bitmap     65.432%            2.30069  0.582799
avl        0%                 2        0.747326

@tchaikov
Contributor Author

jenkins test make check

@tchaikov force-pushed the wip-bluestore/avl-allocator branch from ccbca7d to 4ee0136 on October 17, 2019 12:53
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: Kefu Chai <kchai@redhat.com>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
…ting free score.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
@tchaikov force-pushed the wip-bluestore/avl-allocator branch from 4ee0136 to 53aac52 on October 17, 2019 12:55
@tchaikov
Contributor Author

@xiexingguo @ifed01 @aclamk mind taking another look?

Member

@xiexingguo left a comment


lgtm

@xiexingguo
Member

Merging. It should do no harm, since bitmap is still the default allocator, and this is a good example of an extent-based space allocation strategy.

@xiexingguo merged commit ad9b7fe into ceph:master on Oct 22, 2019
@tchaikov deleted the wip-bluestore/avl-allocator branch on October 22, 2019 02:03