Skip to content

os/bluestore: fix unexpected ENOSPC in Avl/Hybrid allocators.#41369

Merged
tchaikov merged 2 commits intoceph:masterfrom
ifed01:wip-ifed-fix-avl-enospc2
Jun 1, 2021
Merged

os/bluestore: fix unexpected ENOSPC in Avl/Hybrid allocators.#41369
tchaikov merged 2 commits intoceph:masterfrom
ifed01:wip-ifed-fix-avl-enospc2

Conversation

@ifed01
Copy link
Contributor

@ifed01 ifed01 commented May 17, 2021

Fixes: https://tracker.ceph.com/issues/50656
Signed-off-by: Igor Fedotov ifedotov@suse.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

do {
start = _block_picker(range_size_tree, cursor, size, unit);
if (start != -1ULL || !force_range_size_alloc) {
start = _block_picker(range_size_tree, &fake_cursor, size, unit);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't understand why we should not update the *cursor here. since AvlAllocator was inspired by zfs's dynamic fit block allocator. i am looking into its implementation. see https://github.com/openzfs/zfs/blob/60ffc1c460e4cdf3c3ca12c8840fd0675a98ed0d/module/zfs/metaslab.c#L1666 . IIUC, lbas[cbits(align) - 1] should be the end of the allocated extent to ease the search of the extent of the same alignment next time.

also, probably we can consolidate the code (not tested) like

  uint64_t *cursor = &lbas[cbits(align) - 1];
  const int free_pct = num_free * 100 / device_size;
  uint64_t start = 0;
  if (force_range_size_alloc ||
      max_size < range_size_alloc_threshold ||
      free_pct < range_size_alloc_free_pct) {
    /*
     * If we're running low on space switch to using the size
     * sorted AVL tree (best-fit).
     */
    start = -1ULL;
  } else {
    start = _block_picker(range_tree, cursor, size, unit);
  }
  if (start == -1ULL) {
    do {
      start = _block_picker(range_size_tree, cursor, size, unit);
      if (start != -1ULL) {
        break;
      }
      // try to collect smaller extents as we could fail to retrieve
      // that large block due to misaligned extents
      size = p2align(size >> 1, unit);
      align = size & -size;
      cursor = &lbas[cbits(align) - 1];
    } while (size >= unit);
  }

what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tchaikov I agree that preserving cursor behavior is important. I do not fully grasp how adherence to cursor hints reflects on fragmentation behavior, but I can clearly see that there is some emerging anti-fragmentation properties that make avl fragment very little, while stupid did a lot.

I would suggest full synchronization between best and first fit:

bool best_fit = force_range_size_alloc ||
                  max_size < range_size_alloc_threshold ||
	          free_pct < range_size_alloc_free_pct;
  /*
   * If we're running low on space switch to using the size
   * sorted AVL tree (best-fit).
   */
  do {
    if (best_fit) {
      start = _block_picker(range_size_tree, cursor, size, unit);
    } else {
      start = _block_picker(range_tree, cursor, size, unit);
    }
    if (start != -1ULL) {
      break;
    }
    // try to collect smaller extents as we could fail to retrieve
    // that large block due to misaligned extents
    size = p2align(size >> 1, unit);
  } while (size >= unit);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't understand why we should not update the *cursor here. since AvlAllocator was inspired by zfs's dynamic fit block allocator. i am looking into its implementation. see https://github.com/openzfs/zfs/blob/60ffc1c460e4cdf3c3ca12c8840fd0675a98ed0d/module/zfs/metaslab.c#L1666 . IIUC, lbas[cbits(align) - 1] should be the end of the allocated extent to ease the search of the extent of the same alignment next time.
...
what do you think?

IMO cursor makes sense in first-fit mode only when we use range_tree (ordered by offset) to lookup the proper chunk. In that case preserving the cursor and using it as a hint for the next "similar-sized" lookup allows to allocate the nearest chunk. IIUC this might be beneficial from fragmentation point of view since similarly sized chunks tend to have better "geographic" locality. It's actually an open question whether we should increase cursor on each allocation to get better geo locality then. Not to mention that one might want to start with different initial cursor positions (e.g. choose them randomly on allocator's init or use some fixed but DIFFERENT positions for all of them. And preserve this info between reboots!).
I think the above improvements would provide better geo locality. But definitely that's unrelated to this PR.

Additionally cursor usage (and geo locality) might also improve spinning drive write performance when a bunch of similarly(!) sized writes is processed. But again IMO it's an open question whether this is good when handling a mixture of writes with different block sizes - high chances to get fragmented resulting write which isn't good for a spinning drive.

On the other hand in best-fit mode block picker runs through range_size_tree which is sorted by size only. Chunk offset isn't used for the selection during the lookup at all and hence no much sense in preserving the cursor - it's only the next "first-fit"(!) request which would benefit from that. Which is very unlikely to happen soon enough (as more space to be released to turn this mode back in the allocator) and hence no one would benefit from that cursor update.

Although it's rather a secondary question it seemed preferable to me to show that cursor isn't used in best-fit mode this way.
Given the above - would you prefer this to be reverted?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ifed01 best-fit mode sorts by size (first priority) and offset (second priority). So although limited, cursor pointing to location still play a role.

I agree that for allocators dedicated for spinners we should actively promote locality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ifed01 best-fit mode sorts by size (first priority) and offset (second priority). So although limited, cursor pointing to location still play a role.

Yeah, missed that. Thanks!

I agree that for allocators dedicated for spinners we should actively promote locality.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree that the cursor is not likely to help with the locality of allocation requests of the same alignment. but just one thing, i think what the cursor tries to do is to clump the write requests with the same alignment to the same region not with the similar size.

Additionally cursor usage (and geo locality) might also improve spinning drive write performance when a bunch of similarly(!) sized writes is processed.

i was reading the document, source code and watching the speaks by ZFS authors. they helped to convince me that it's designed to combine the large amount of write request at block level or at the device level to a large one for better performance. it is specifically designed to improve the performance of spindle, but it could also benefit the SSD. as a sequential write is more performant than a bunch of random small writes.

@tchaikov
Copy link
Contributor

also would be great if this change can be accompanied with some root cause analysis in the commit message.

@tchaikov tchaikov requested a review from xiexingguo May 19, 2021 02:20
@xiexingguo
Copy link
Member

IIRC the cursor field is used to make sure all following allocations (of the similar size) are allocated in the same direction under all circumstances, which is essential for spin disks.

I am not sure what's the exact problem it is but as Kefu said, if this change can be accompanied with some root cause analysis in the commit message, that would be great.

@aclamk
Copy link
Contributor

aclamk commented May 19, 2021

@xiexingguo

IIRC the cursor field is used to make sure all following allocations (of the similar size) are allocated in the same direction under all circumstances, which is essential for spin disks.

I am curious - why is one-direction so important? I understand why close allocations are important (same track, little time wasted for seek), but necessity for strongly preferring one direction elude me.

@ifed01 ifed01 force-pushed the wip-ifed-fix-avl-enospc2 branch from d01ff6f to 57aafd7 Compare May 19, 2021 12:53
@ifed01
Copy link
Contributor Author

ifed01 commented May 19, 2021

also would be great if this change can be accompanied with some root cause analysis in the commit message.

Done for the commit which actually fixes the issue

@neha-ojha
Copy link
Member

also would be great if this change can be accompanied with some root cause analysis in the commit message.

Done for the commit which actually fixes the issue

@ifed01 thanks, is there a reason why this issue has surfaced in 16.2.* versions (so far) and not in earlier versions using hybrid allocator?

@ifed01
Copy link
Contributor Author

ifed01 commented May 19, 2021

also would be great if this change can be accompanied with some root cause analysis in the commit message.

Done for the commit which actually fixes the issue

@ifed01 thanks, is there a reason why this issue has surfaced in 16.2.* versions (so far) and not in earlier versions using hybrid allocator?

@neha-ojha - IMO this is just a coincidence, it doesn't look release dependent.

@xiexingguo
Copy link
Member

I am curious - why is one-direction so important? I understand why close allocations are important (same track, little time wasted for seek), but necessity for strongly preferring one direction elude me.

Sorry, I mean we should keep subsequent allocations ordered, which plays a important role for performance of spin disks.

@tchaikov tchaikov self-requested a review May 27, 2021 07:29
Copy link
Contributor

@aclamk aclamk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine job.

@ifed01 ifed01 force-pushed the wip-ifed-fix-avl-enospc2 branch from 57aafd7 to 3fec51e Compare May 27, 2021 16:57
@tchaikov
Copy link
Contributor

@ifed01 needs rebase

@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
@ifed01 ifed01 force-pushed the wip-ifed-fix-avl-enospc2 branch from 3fec51e to 2fc31e8 Compare May 30, 2021 21:10
@ifed01
Copy link
Contributor Author

ifed01 commented May 31, 2021

jenkins test make check

@tchaikov
Copy link
Contributor

tchaikov commented May 31, 2021

@ifed01 hi Igor, i want to take another look at your comment and the code tomorrow before merging it. so i can better understand them. sorry for the latency!

@tchaikov tchaikov self-requested a review May 31, 2021 13:37
@ifed01
Copy link
Contributor Author

ifed01 commented May 31, 2021

@ifed01 hi Igor, i want to take another look at your comment and the code tomorrow before merging it. so i can better understand them. sorry for the latency!

Sure, no problem

Avl allocator mode was returning unexpected ENOSPC in first-fit mode if all size-
matching available extents were unaligned but applying the alignment made all of
them shorter than required. Since no lookup retry with smaller size -
ENOSPC is returned.
Additionally we should proceed with a lookup in best-fit mode even when
original size has been truncated to match the avail size.
(force_range_size_alloc==true)

Fixes: https://tracker.ceph.com/issues/50656
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
@ifed01 ifed01 force-pushed the wip-ifed-fix-avl-enospc2 branch from 2fc31e8 to 0eed13a Compare June 1, 2021 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants