os/bluestore: AVL-tree & extent - based space allocator #18187
os/bluestore: AVL-tree & extent - based space allocator #18187xiexingguo wants to merge 5 commits intoceph:masterfrom
Conversation
|
@xiexingguo boost::intrusive contains a flexible avl-tree implementation with tree as well as set-oriented interfaces, that's already used in Ceph. It's not a critique of the submitted work at all, but I think I'd prefer to see us re-use this generic sort of code whenever possible. Matt |
|
@xiexingguo oh, sorry, you're adding the Sun Microsystems AVL implementation. That -is- high quality and I've re-used it in projects myself. There are then 2 points that apply:
|
|
@mattbenjamin Thanks for your comments. Actually I have no preference of Sun::AVL than Boost::AVL. I used to be very familiar with these (ZFS) codes, so I just pick the quickest way by copying it here... If this (AVL allocator) turns out to be a better option or at least an alternative option, I can definitely switch to Boost::AVL. |
|
i concur with @mattbenjamin. further more, i think we cannot include CDDL code in our tree in this way. as to my best knowledge, CDDL and LGPL 2.1 is not compatible. the source code of the derivative work (ceph binaries) of LGPL2.1 should be available under the license of LGPL2.1. but the avl implementation is licensed under CDDL. |
|
Can you summarize what the allocator does? It doesn't seem like the use of an avl tree instead of the rbtree in stupid is the imporant part.. |
That's true:-) |
|
Sounds good! We should switch over to the boost instrusive AVL code though :) |
4e7ba62 to
f2146c8
Compare
src/os/bluestore/AvlAllocator.cc
Outdated
|
|
||
| if (allocated) { | ||
| std::lock_guard<std::mutex> l(lock); | ||
| num_reserved -= allocated; |
There was a problem hiding this comment.
let's do that inside _allocate to avoid additional lock acquisition.
f2146c8 to
4fce767
Compare
src/os/bluestore/AvlAllocator.cc
Outdated
|
|
||
| /* Make sure we completely overlap with someone */ | ||
| if (rs == NULL) { | ||
| assert(0 == "freeing free segment"); |
There was a problem hiding this comment.
IMO that's not freeing free segment but rather allocating already allocated one...
src/os/bluestore/AvlAllocator.cc
Outdated
| rs = static_cast<range_seg_t *>(p); | ||
|
|
||
| if (rs != NULL && rs->rs_start <= start && rs->rs_end >= end) { | ||
| assert(0 == "allocating allocated segment"); |
There was a problem hiding this comment.
This seems to be "freeing free segment"...
| uint64_t offset = P2ROUNDUP(rs->rs_start, align); | ||
|
|
||
| if (offset + size <= rs->rs_end) { | ||
| *cursor = offset + size; |
There was a problem hiding this comment.
General concern about such cursor behavior (already highlighted for stupid alocator at #18494) is that it permanently increases that causes excessive fragmentation. Here is a scenario I observed for Stupid Allocator and that's probably is applicable here as well:
There is 40Gb block device (coupled with 16 Gb DB and 10 Gb WAL devices but that's doesn't matter). It's sequentially pre-condifitioned with 32 Gb of data. Hence we have 8 Gb free at block device and allocator has a single continuous 8Gb free extent at this point. Then we start doing random 4K overwrites that cause some 4K extents to return to allocator while new ones are allocated from that large 8Gb extent. Please note that returned small extents aren't reused until we drain all the 8Gb extent since cursor permanently goes up. After overwriting 8Gb of data we have large continuous extent fully exhausted and free 8Gb spread randomly over released 4K extents. There are good chances that existing continuous extents are pretty short (<1Mb). One day bluefs rebalance procedure comes into action and requests that long continuous extent(s) of 1Mb... Allocator fails to do that and we have an assert. That's a real use case I faced recently...
May be we should do a cursor rewind on extent release?
src/os/bluestore/AvlAllocator.h
Outdated
| * an allocation of this size then it switches to using more | ||
| * aggressive strategy (i.e search by size rather than offset). | ||
| */ | ||
| uint64_t range_size_alloc_threshold = 1ULL << 17; ///< 128K |
There was a problem hiding this comment.
Suggest to make externally configurable
src/os/bluestore/AvlAllocator.h
Outdated
| * Once the allocator's free space drops below this level we dynamically | ||
| * switch to using best-fit allocations. | ||
| */ | ||
| int range_size_alloc_free_pct = 4; |
4fce767 to
4685363
Compare
|
@xiexingguo @ifed01 Should we merge this as is? Abandon it? If we let it sit here in this PR it will just rot. |
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
…figurable Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
4685363 to
3c67522
Compare
@liewegas I'd suggest getting this one in as it is. It's a good alternative and we can do further testing and improvements once it gets in. |
|
@liewegas - there were some licensing concerns above.. I know almost nothing about that but perhaps it's better to replace with Boost implementation then. |
|
Oh yeah, I forgot about the CDDL issue. That needs to be fixed before we can merge this. |
Fixed: ceph#18187 when accept a CROS request, the request http origin shorter than the bucket's corsrule (eg. origin: http://s.com corsrule: <AllowedOrigin>*.verylongdomain.com</AllowedOrigin>). the rgw_cors.cc::is_string_in_set() will have a wrong index, the radosrgw server will abort. $ curl http://test.localhost:8000/app.data -H "Origin:http://s.com" 0> 2016-12-05 03:22:29.548138 7f6add05d700 -1 *** Caught signal (Aborted) ** in thread 7f6add05d700 thread_name:civetweb-worker ceph version 11.0.2-2168-gd2f8fb4 (d2f8fb4) 1: (()+0x50720a) [0x7f6b147c420a] 2: (()+0xf370) [0x7f6b09a33370] 3: (gsignal()+0x37) [0x7f6b081ca1d7] 4: (abort()+0x148) [0x7f6b081cb8c8] 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f6b08ace9d5] 6: (()+0x5e946) [0x7f6b08acc946] 7: (()+0x5e973) [0x7f6b08acc973] 8: (()+0x5eb93) [0x7f6b08accb93] 9: (std::__throw_out_of_range(char const*)+0x77) 0x7f6b08b21a17] 10: (()+0xbd97a) [0x7f6b08b2b97a] 11: (()+0x449c1e) [0x7f6b14706c1e] 12: (RGWCORSRule::is_origin_present(char const*)+0x48) [0x7f6b147073b8] 13: (RGWCORSConfiguration::host_name_rule(char const*)+0x37) [0x7f6b147074e7] 14: (RGWOp::generate_cors_headers(std::string&, std::string&, std::string&, std::string&, unsigned int*)+0xa3) [0x7f6b14593e63] 15: (dump_access_control(req_state*, RGWOp*)+0x61) [0x7f6b14653f91] Signed-off-by: LiuYang <yippeetry@gmail.com> (cherry picked from commit 67d4d9e)
Fixed: ceph#18187 when accept a CROS request, the request http origin shorter than the bucket's corsrule (eg. origin: http://s.com corsrule: <AllowedOrigin>*.verylongdomain.com</AllowedOrigin>). the rgw_cors.cc::is_string_in_set() will have a wrong index, the radosrgw server will abort. $ curl http://test.localhost:8000/app.data -H "Origin:http://s.com" 0> 2016-12-05 03:22:29.548138 7f6add05d700 -1 *** Caught signal (Aborted) ** in thread 7f6add05d700 thread_name:civetweb-worker ceph version 11.0.2-2168-gd2f8fb4 (d2f8fb4) 1: (()+0x50720a) [0x7f6b147c420a] 2: (()+0xf370) [0x7f6b09a33370] 3: (gsignal()+0x37) [0x7f6b081ca1d7] 4: (abort()+0x148) [0x7f6b081cb8c8] 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f6b08ace9d5] 6: (()+0x5e946) [0x7f6b08acc946] 7: (()+0x5e973) [0x7f6b08acc973] 8: (()+0x5eb93) [0x7f6b08accb93] 9: (std::__throw_out_of_range(char const*)+0x77) 0x7f6b08b21a17] 10: (()+0xbd97a) [0x7f6b08b2b97a] 11: (()+0x449c1e) [0x7f6b14706c1e] 12: (RGWCORSRule::is_origin_present(char const*)+0x48) [0x7f6b147073b8] 13: (RGWCORSConfiguration::host_name_rule(char const*)+0x37) [0x7f6b147074e7] 14: (RGWOp::generate_cors_headers(std::string&, std::string&, std::string&, std::string&, unsigned int*)+0xa3) [0x7f6b14593e63] 15: (dump_access_control(req_state*, RGWOp*)+0x61) [0x7f6b14653f91] Signed-off-by: LiuYang <yippeetry@gmail.com> (cherry picked from commit 67d4d9e)
|
@xiexingguo is this change still relevant ? |
|
adapted to boost::intrusive::avltree, see #30897 |
Steal from ZFS :-)
Local (very rough) tests show that the performance is comparable with Stupid(at least), need further verification, though.
Benefits: better memory monitoring, perhaps? Also simpler code logic if you don't have to dig into the details of AVL-implementation...