Skip to content

os/bluestore: AVL-tree & extent - based space allocator #18187

Closed
xiexingguo wants to merge 5 commits intoceph:masterfrom
xiexingguo:wip-avl-allocator
Closed

os/bluestore: AVL-tree & extent - based space allocator #18187
xiexingguo wants to merge 5 commits intoceph:masterfrom
xiexingguo:wip-avl-allocator

Conversation

@xiexingguo
Copy link
Member

@xiexingguo xiexingguo commented Oct 9, 2017

Steal from ZFS :-)
Local (very rough) tests show that the performance is comparable with Stupid(at least), need further verification, though.
Benefits: better memory monitoring, perhaps? Also simpler code logic if you don't have to dig into the details of AVL-implementation...

@mattbenjamin
Copy link
Contributor

@xiexingguo boost::intrusive contains a flexible avl-tree implementation with tree as well as set-oriented interfaces, that's already used in Ceph. It's not a critique of the submitted work at all, but I think I'd prefer to see us re-use this generic sort of code whenever possible.

Matt

@mattbenjamin
Copy link
Contributor

@xiexingguo oh, sorry, you're adding the Sun Microsystems AVL implementation. That -is- high quality and I've re-used it in projects myself. There are then 2 points that apply:

  1. is there a compelling reason not to use boost?
  2. CDDL--I don't know all the ins and out of that; there has been a claim that (L)GPL and CDDL licensed codes are not compatible; e.g., https://lwn.net/Articles/688683/; I don't know the current accepted opinion, to be honest.

@xiexingguo
Copy link
Member Author

@mattbenjamin Thanks for your comments.

Actually I have no preference of Sun::AVL than Boost::AVL. I used to be very familiar with these (ZFS) codes, so I just pick the quickest way by copying it here...

If this (AVL allocator) turns out to be a better option or at least an alternative option, I can definitely switch to Boost::AVL.

@tchaikov
Copy link
Contributor

tchaikov commented Oct 10, 2017

i concur with @mattbenjamin. further more, i think we cannot include CDDL code in our tree in this way. as to my best knowledge, CDDL and LGPL 2.1 is not compatible. the source code of the derivative work (ceph binaries) of LGPL2.1 should be available under the license of LGPL2.1. but the avl implementation is licensed under CDDL.

@liewegas
Copy link
Member

Can you summarize what the allocator does? It doesn't seem like the use of an avl tree instead of the rbtree in stupid is the imporant part..

@xiexingguo
Copy link
Member Author

xiexingguo commented Oct 10, 2017

It doesn't seem like the use of an avl tree instead of the rbtree in stupid is the imporant part..

That's true:-)
(1) We manage space in extent, each extent is of form [offset, offset+length]/[start, end);
(2) We use two avl trees to track all free(unallocated) extents, namely range-tree and range-size-tree.
range-tree arranges extents strictly by offset, while the range-size-tree arranges extents strictly by the size of each extent. All extents are linked and updated simutaneously in both trees so we won't double the memory consumpation.
(3) Each time we allocate some space, the corresponding range will be removed from a specific matched extent, and that extent will split if necessary.
(4) Each time we free some space, the corresponding range will be added back to both trees, and a merge operation will be automatically performed if necessary.
(5) We always try to search and allocate space from the range_tree (in term of First-Fit, which is very fast). But if the free space is low or the whole space map gets too fragmented, we then switch to the range_size_tree to use Best-Fit policy, which by definition, always try to find the extent that has the closest size of the requested amount of space.

@liewegas
Copy link
Member

Sounds good! We should switch over to the boost instrusive AVL code though :)


if (allocated) {
std::lock_guard<std::mutex> l(lock);
num_reserved -= allocated;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's do that inside _allocate to avoid additional lock acquisition.


/* Make sure we completely overlap with someone */
if (rs == NULL) {
assert(0 == "freeing free segment");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO that's not freeing free segment but rather allocating already allocated one...

rs = static_cast<range_seg_t *>(p);

if (rs != NULL && rs->rs_start <= start && rs->rs_end >= end) {
assert(0 == "allocating allocated segment");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be "freeing free segment"...

uint64_t offset = P2ROUNDUP(rs->rs_start, align);

if (offset + size <= rs->rs_end) {
*cursor = offset + size;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General concern about such cursor behavior (already highlighted for stupid alocator at #18494) is that it permanently increases that causes excessive fragmentation. Here is a scenario I observed for Stupid Allocator and that's probably is applicable here as well:
There is 40Gb block device (coupled with 16 Gb DB and 10 Gb WAL devices but that's doesn't matter). It's sequentially pre-condifitioned with 32 Gb of data. Hence we have 8 Gb free at block device and allocator has a single continuous 8Gb free extent at this point. Then we start doing random 4K overwrites that cause some 4K extents to return to allocator while new ones are allocated from that large 8Gb extent. Please note that returned small extents aren't reused until we drain all the 8Gb extent since cursor permanently goes up. After overwriting 8Gb of data we have large continuous extent fully exhausted and free 8Gb spread randomly over released 4K extents. There are good chances that existing continuous extents are pretty short (<1Mb). One day bluefs rebalance procedure comes into action and requests that long continuous extent(s) of 1Mb... Allocator fails to do that and we have an assert. That's a real use case I faced recently...
May be we should do a cursor rewind on extent release?

* an allocation of this size then it switches to using more
* aggressive strategy (i.e search by size rather than offset).
*/
uint64_t range_size_alloc_threshold = 1ULL << 17; ///< 128K
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to make externally configurable

* Once the allocator's free space drops below this level we dynamically
* switch to using best-fit allocations.
*/
int range_size_alloc_free_pct = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this one too.

@liewegas
Copy link
Member

liewegas commented Feb 9, 2018

@xiexingguo @ifed01 Should we merge this as is? Abandon it? If we let it sit here in this PR it will just rot.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
…figurable

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
@xiexingguo
Copy link
Member Author

xiexingguo commented Feb 11, 2018

@xiexingguo @ifed01 Should we merge this as is? Abandon it? If we let it sit here in this PR it will just rot.

@liewegas I'd suggest getting this one in as it is. It's a good alternative and we can do further testing and improvements once it gets in.

@ifed01
Copy link
Contributor

ifed01 commented Feb 12, 2018

@liewegas - there were some licensing concerns above.. I know almost nothing about that but perhaps it's better to replace with Boost implementation then.

@liewegas
Copy link
Member

Oh yeah, I forgot about the CDDL issue. That needs to be fixed before we can merge this.

tchaikov pushed a commit to tchaikov/ceph that referenced this pull request Mar 28, 2018
Fixed: ceph#18187

when accept a CROS request, the request http origin shorter than the bucket's corsrule
(eg. origin: http://s.com corsrule: <AllowedOrigin>*.verylongdomain.com</AllowedOrigin>).
the rgw_cors.cc::is_string_in_set() will have a wrong index, the radosrgw server will
abort.

$ curl http://test.localhost:8000/app.data -H "Origin:http://s.com"

 0> 2016-12-05 03:22:29.548138 7f6add05d700 -1 *** Caught signal (Aborted) **
 in thread 7f6add05d700 thread_name:civetweb-worker

 ceph version 11.0.2-2168-gd2f8fb4 (d2f8fb4)
 1: (()+0x50720a) [0x7f6b147c420a]
 2: (()+0xf370) [0x7f6b09a33370]
 3: (gsignal()+0x37) [0x7f6b081ca1d7]
 4: (abort()+0x148) [0x7f6b081cb8c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f6b08ace9d5]
 6: (()+0x5e946) [0x7f6b08acc946]
 7: (()+0x5e973) [0x7f6b08acc973]
 8: (()+0x5eb93) [0x7f6b08accb93]
 9: (std::__throw_out_of_range(char const*)+0x77) 0x7f6b08b21a17]
 10: (()+0xbd97a) [0x7f6b08b2b97a]
 11: (()+0x449c1e) [0x7f6b14706c1e]
 12: (RGWCORSRule::is_origin_present(char const*)+0x48) [0x7f6b147073b8]
 13: (RGWCORSConfiguration::host_name_rule(char const*)+0x37) [0x7f6b147074e7]
 14: (RGWOp::generate_cors_headers(std::string&, std::string&, std::string&, std::string&, unsigned int*)+0xa3) [0x7f6b14593e63]
 15: (dump_access_control(req_state*, RGWOp*)+0x61) [0x7f6b14653f91]

Signed-off-by: LiuYang <yippeetry@gmail.com>
(cherry picked from commit 67d4d9e)
guihecheng pushed a commit to guihecheng/ceph that referenced this pull request Sep 12, 2018
Fixed: ceph#18187

when accept a CROS request, the request http origin shorter than the bucket's corsrule
(eg. origin: http://s.com corsrule: <AllowedOrigin>*.verylongdomain.com</AllowedOrigin>).
the rgw_cors.cc::is_string_in_set() will have a wrong index, the radosrgw server will
abort.

$ curl http://test.localhost:8000/app.data -H "Origin:http://s.com"

 0> 2016-12-05 03:22:29.548138 7f6add05d700 -1 *** Caught signal (Aborted) **
 in thread 7f6add05d700 thread_name:civetweb-worker

 ceph version 11.0.2-2168-gd2f8fb4 (d2f8fb4)
 1: (()+0x50720a) [0x7f6b147c420a]
 2: (()+0xf370) [0x7f6b09a33370]
 3: (gsignal()+0x37) [0x7f6b081ca1d7]
 4: (abort()+0x148) [0x7f6b081cb8c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f6b08ace9d5]
 6: (()+0x5e946) [0x7f6b08acc946]
 7: (()+0x5e973) [0x7f6b08acc973]
 8: (()+0x5eb93) [0x7f6b08accb93]
 9: (std::__throw_out_of_range(char const*)+0x77) 0x7f6b08b21a17]
 10: (()+0xbd97a) [0x7f6b08b2b97a]
 11: (()+0x449c1e) [0x7f6b14706c1e]
 12: (RGWCORSRule::is_origin_present(char const*)+0x48) [0x7f6b147073b8]
 13: (RGWCORSConfiguration::host_name_rule(char const*)+0x37) [0x7f6b147074e7]
 14: (RGWOp::generate_cors_headers(std::string&, std::string&, std::string&, std::string&, unsigned int*)+0xa3) [0x7f6b14593e63]
 15: (dump_access_control(req_state*, RGWOp*)+0x61) [0x7f6b14653f91]

Signed-off-by: LiuYang <yippeetry@gmail.com>
(cherry picked from commit 67d4d9e)
@tchaikov
Copy link
Contributor

@xiexingguo is this change still relevant ?

@xiexingguo xiexingguo closed this Oct 15, 2018
@xiexingguo xiexingguo deleted the wip-avl-allocator branch October 15, 2018 09:55
@tchaikov
Copy link
Contributor

adapted to boost::intrusive::avltree, see #30897

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants