os/bluestore: fix unexpected ENOSPC in Avl/Hybrid allocators. #41369
tchaikov merged 2 commits into ceph:master
Conversation
```cpp
do {
  start = _block_picker(range_size_tree, cursor, size, unit);
  if (start != -1ULL || !force_range_size_alloc) {
  ...
  start = _block_picker(range_size_tree, &fake_cursor, size, unit);
```
i don't understand why we should not update the `*cursor` here, since AvlAllocator was inspired by zfs's dynamic fit block allocator. i am looking into its implementation, see https://github.com/openzfs/zfs/blob/60ffc1c460e4cdf3c3ca12c8840fd0675a98ed0d/module/zfs/metaslab.c#L1666 . IIUC, `lbas[cbits(align) - 1]` should be the end of the allocated extent to ease the search for an extent of the same alignment next time.
also, probably we can consolidate the code (not tested) like

```cpp
uint64_t *cursor = &lbas[cbits(align) - 1];
const int free_pct = num_free * 100 / device_size;
uint64_t start = 0;
if (force_range_size_alloc ||
    max_size < range_size_alloc_threshold ||
    free_pct < range_size_alloc_free_pct) {
  /*
   * If we're running low on space switch to using the size
   * sorted AVL tree (best-fit).
   */
  start = -1ULL;
} else {
  start = _block_picker(range_tree, cursor, size, unit);
}
if (start == -1ULL) {
  do {
    start = _block_picker(range_size_tree, cursor, size, unit);
    if (start != -1ULL) {
      break;
    }
    // try to collect smaller extents as we could fail to retrieve
    // that large block due to misaligned extents
    size = p2align(size >> 1, unit);
    align = size & -size;
    cursor = &lbas[cbits(align) - 1];
  } while (size >= unit);
}
```

what do you think?
@tchaikov I agree that preserving cursor behavior is important. I do not fully grasp how adherence to cursor hints reflects on fragmentation behavior, but I can clearly see that there are some emerging anti-fragmentation properties that make the avl allocator fragment very little, while stupid fragmented a lot.
I would suggest full synchronization between best and first fit:
```cpp
bool best_fit = force_range_size_alloc ||
                max_size < range_size_alloc_threshold ||
                free_pct < range_size_alloc_free_pct;
/*
 * If we're running low on space switch to using the size
 * sorted AVL tree (best-fit).
 */
do {
  if (best_fit) {
    start = _block_picker(range_size_tree, cursor, size, unit);
  } else {
    start = _block_picker(range_tree, cursor, size, unit);
  }
  if (start != -1ULL) {
    break;
  }
  // try to collect smaller extents as we could fail to retrieve
  // that large block due to misaligned extents
  size = p2align(size >> 1, unit);
} while (size >= unit);
```
> i don't understand why we should not update the `*cursor` here. since AvlAllocator was inspired by zfs's dynamic fit block allocator. i am looking into its implementation. see https://github.com/openzfs/zfs/blob/60ffc1c460e4cdf3c3ca12c8840fd0675a98ed0d/module/zfs/metaslab.c#L1666 . IIUC, `lbas[cbits(align) - 1]` should be the end of the allocated extent to ease the search of the extent of the same alignment next time.
> ...
> what do you think?
IMO the cursor makes sense in first-fit mode only, when we use range_tree (ordered by offset) to look up the proper chunk. In that case preserving the cursor and using it as a hint for the next "similar-sized" lookup allows allocating the nearest chunk. IIUC this might be beneficial from a fragmentation point of view, since similarly sized chunks tend to have better "geographic" locality. It's actually an open question whether we should increase the cursor on each allocation to get better geo locality then. Not to mention that one might want to start with different initial cursor positions (e.g. choose them randomly on the allocator's init, or use some fixed but DIFFERENT positions for all of them, and preserve this info between reboots!).
I think the above improvements would provide better geo locality. But definitely that's unrelated to this PR.
Additionally, cursor usage (and geo locality) might also improve spinning drive write performance when a bunch of similarly(!) sized writes is processed. But again, IMO it's an open question whether this is good when handling a mixture of writes with different block sizes: there are high chances of getting a fragmented resulting write, which isn't good for a spinning drive.
On the other hand, in best-fit mode the block picker runs through range_size_tree, which is sorted by size only. The chunk offset isn't used for the selection during the lookup at all, hence there is not much sense in preserving the cursor: only the next "first-fit"(!) request would benefit from it, which is very unlikely to happen soon enough (more space has to be released to turn that mode back on in the allocator), so no one would benefit from that cursor update.
Although it's rather a secondary question, it seemed preferable to me to make it visible this way that the cursor isn't used in best-fit mode.
Given the above - would you prefer this to be reverted?
@ifed01 best-fit mode sorts by size (first priority) and offset (second priority). So although limited, the cursor pointing to a location still plays a role.
I agree that for allocators dedicated for spinners we should actively promote locality.
> @ifed01 best-fit mode sorts by size (first priority) and offset (second priority). So although limited, the cursor pointing to a location still plays a role.

Yeah, missed that. Thanks!

> I agree that for allocators dedicated for spinners we should actively promote locality.
i agree that the cursor is not likely to help with the locality of allocation requests of the same alignment. but just one thing: i think what the cursor tries to do is to clump write requests with the same alignment into the same region, not the ones with similar size.
> Additionally cursor usage (and geo locality) might also improve spinning drive write performance when a bunch of similarly(!) sized writes is processed.
i was reading the documentation and source code and watching the talks by the ZFS authors. they helped convince me that it's designed to combine a large number of write requests at the block level or at the device level into a large one for better performance. it is specifically designed to improve the performance of spindles, but it could also benefit SSDs, as a sequential write is more performant than a bunch of small random writes.
also it would be great if this change could be accompanied by some root cause analysis in the commit message.
IIRC the cursor field is used to make sure all following allocations (of similar size) are allocated in the same direction under all circumstances, which is essential for spinning disks. I am not sure what the exact problem is, but as Kefu said, if this change can be accompanied with some root cause analysis in the commit message, that would be great.
I am curious: why is one direction so important? I understand why close allocations are important (same track, little time wasted on seeks), but the necessity of strongly preferring one direction eludes me.
force-pushed from d01ff6f to 57aafd7
Done, for the commit which actually fixes the issue
@ifed01 thanks, is there a reason why this issue has surfaced in 16.2.* versions (so far) and not in earlier versions using the hybrid allocator?
@neha-ojha - IMO this is just a coincidence, it doesn't look release-dependent.
Sorry, I mean we should keep subsequent allocations ordered, which plays an important role in the performance of spinning disks.
force-pushed from 57aafd7 to 3fec51e
@ifed01 needs rebase

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
force-pushed from 3fec51e to 2fc31e8
jenkins test make check

@ifed01 hi Igor, i want to take another look at your comment and the code tomorrow before merging it, so i can understand them better. sorry for the latency!

Sure, no problem
The Avl allocator was returning unexpected ENOSPC in first-fit mode if all size-matching available extents were unaligned, and applying the alignment made all of them shorter than required. Since there was no lookup retry with a smaller size, ENOSPC was returned.

Additionally, we should proceed with a lookup in best-fit mode even when the original size has been truncated to match the available size (force_range_size_alloc == true).

Fixes: https://tracker.ceph.com/issues/50656
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
force-pushed from 2fc31e8 to 0eed13a
Fixes: https://tracker.ceph.com/issues/50656
Signed-off-by: Igor Fedotov <ifedotov@suse.com>