os/bluestore: fix unexpected ENOSPC in Avl/Hybrid allocators. #41369
tchaikov merged 2 commits into ceph:master
Conversation
```cpp
do {
  start = _block_picker(range_size_tree, cursor, size, unit);
  if (start != -1ULL || !force_range_size_alloc) {
  ...
  start = _block_picker(range_size_tree, &fake_cursor, size, unit);
```
i don't understand why we should not update the `*cursor` here, since AvlAllocator was inspired by zfs's dynamic fit block allocator. i am looking into its implementation, see https://github.com/openzfs/zfs/blob/60ffc1c460e4cdf3c3ca12c8840fd0675a98ed0d/module/zfs/metaslab.c#L1666 . IIUC, `lbas[cbits(align) - 1]` should be the end of the allocated extent to ease the search for an extent of the same alignment next time.
also, probably we can consolidate the code (not tested) like

```cpp
uint64_t *cursor = &lbas[cbits(align) - 1];
const int free_pct = num_free * 100 / device_size;
uint64_t start = 0;
if (force_range_size_alloc ||
    max_size < range_size_alloc_threshold ||
    free_pct < range_size_alloc_free_pct) {
  /*
   * If we're running low on space switch to using the size
   * sorted AVL tree (best-fit).
   */
  start = -1ULL;
} else {
  start = _block_picker(range_tree, cursor, size, unit);
}
if (start == -1ULL) {
  do {
    start = _block_picker(range_size_tree, cursor, size, unit);
    if (start != -1ULL) {
      break;
    }
    // try to collect smaller extents as we could fail to retrieve
    // that large block due to misaligned extents
    size = p2align(size >> 1, unit);
    align = size & -size;
    cursor = &lbas[cbits(align) - 1];
  } while (size >= unit);
}
```

what do you think?
@tchaikov I agree that preserving cursor behavior is important. I do not fully grasp how adherence to cursor hints reflects on fragmentation behavior, but I can clearly see that there are some emerging anti-fragmentation properties that make the avl allocator fragment very little, while stupid fragmented a lot.
I would suggest full synchronization between best and first fit:
```cpp
bool best_fit = force_range_size_alloc ||
                max_size < range_size_alloc_threshold ||
                free_pct < range_size_alloc_free_pct;
/*
 * If we're running low on space switch to using the size
 * sorted AVL tree (best-fit).
 */
do {
  if (best_fit) {
    start = _block_picker(range_size_tree, cursor, size, unit);
  } else {
    start = _block_picker(range_tree, cursor, size, unit);
  }
  if (start != -1ULL) {
    break;
  }
  // try to collect smaller extents as we could fail to retrieve
  // that large block due to misaligned extents
  size = p2align(size >> 1, unit);
} while (size >= unit);
```
> i don't understand why we should not update the `*cursor` here. since AvlAllocator was inspired by zfs's dynamic fit block allocator. i am looking into its implementation. see https://github.com/openzfs/zfs/blob/60ffc1c460e4cdf3c3ca12c8840fd0675a98ed0d/module/zfs/metaslab.c#L1666 . IIUC, `lbas[cbits(align) - 1]` should be the end of the allocated extent to ease the search of the extent of the same alignment next time.
> ...
> what do you think?
IMO the cursor makes sense in first-fit mode only, when we use range_tree (ordered by offset) to look up the proper chunk. In that case preserving the cursor and using it as a hint for the next "similar-sized" lookup allows allocating the nearest chunk. IIUC this might be beneficial from a fragmentation point of view, since similarly sized chunks tend to have better "geographic" locality. It's actually an open question whether we should increase the cursor on each allocation to get better geo locality then. Not to mention that one might want to start with different initial cursor positions (e.g. choose them randomly on the allocator's init, or use some fixed but DIFFERENT positions for all of them, and preserve this info between reboots!).
I think the above improvements would provide better geo locality. But definitely that's unrelated to this PR.
Additionally, cursor usage (and geo locality) might also improve spinning drive write performance when a bunch of similarly(!) sized writes is processed. But again, IMO it's an open question whether this is good when handling a mixture of writes with different block sizes: there are high chances of getting a fragmented resulting write, which isn't good for a spinning drive.
On the other hand, in best-fit mode the block picker runs through range_size_tree, which is sorted by size only. The chunk offset isn't used for the selection during the lookup at all, hence there is not much sense in preserving the cursor: only the next "first-fit"(!) request would benefit from it, which is very unlikely to happen soon enough (more space has to be released to turn that mode back on in the allocator), so no one would benefit from that cursor update.
Although it's rather a secondary question, it seemed preferable to me to make it visible this way that the cursor isn't used in best-fit mode.
Given the above - would you prefer this to be reverted?
@ifed01 best-fit mode sorts by size (first priority) and offset (second priority). So although limited, the cursor pointing to a location still plays a role.
I agree that for allocators dedicated for spinners we should actively promote locality.
> @ifed01 best-fit mode sorts by size (first priority) and offset (second priority). So although limited, the cursor pointing to a location still plays a role.

Yeah, missed that. Thanks!

> I agree that for allocators dedicated for spinners we should actively promote locality.
i agree that the cursor is not likely to help with the locality of allocation requests of the same alignment. but just one thing: i think what the cursor tries to do is to clump write requests with the same alignment into the same region, not the ones with similar size.
> Additionally cursor usage (and geo locality) might also improve spinning drive write performance when a bunch of similarly(!) sized writes is processed.
i was reading the documentation and source code and watching the talks by the ZFS authors. they helped convince me that it's designed to combine a large number of write requests at the block level or at the device level into a large one for better performance. it is specifically designed to improve the performance of spindles, but it could also benefit SSDs, as a sequential write is more performant than a bunch of small random writes.
also it would be great if this change could be accompanied by some root cause analysis in the commit message.
IIRC the cursor field is used to make sure all following allocations (of similar size) are allocated in the same direction under all circumstances, which is essential for spinning disks. I am not sure what the exact problem is, but as Kefu said, if this change can be accompanied with some root cause analysis in the commit message, that would be great.
I am curious: why is one direction so important? I understand why close allocations are important (same track, little time wasted on seeks), but the necessity of strongly preferring one direction eludes me.
force-pushed from d01ff6f to 57aafd7
Done, for the commit which actually fixes the issue
@ifed01 thanks, is there a reason why this issue has surfaced in 16.2.* versions (so far) and not in earlier versions using the hybrid allocator?
@neha-ojha - IMO this is just a coincidence, it doesn't look release-dependent.
Sorry, I mean we should keep subsequent allocations ordered, which plays an important role in the performance of spinning disks.
force-pushed from 57aafd7 to 3fec51e
@ifed01 needs rebase

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
force-pushed from 3fec51e to 2fc31e8
jenkins test make check

@ifed01 hi Igor, i want to take another look at your comment and the code tomorrow before merging it, so i can understand them better. sorry for the latency!

Sure, no problem
The Avl allocator was returning unexpected ENOSPC in first-fit mode if all size-matching available extents were unaligned, and applying the alignment made all of them shorter than required. Since there was no lookup retry with a smaller size, ENOSPC was returned.

Additionally, we should proceed with a lookup in best-fit mode even when the original size has been truncated to match the available size (force_range_size_alloc == true).

Fixes: https://tracker.ceph.com/issues/50656
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
force-pushed from 2fc31e8 to 0eed13a
Fixes: https://tracker.ceph.com/issues/50656
Signed-off-by: Igor Fedotov <ifedotov@suse.com>