
squid: a series of optimizations for KernelDevice discard #59065

Merged
SrinivasaBharath merged 12 commits into ceph:squid from YiteGu:wip-blk-discard-squid on Nov 4, 2024

Conversation

@YiteGu
Member

@YiteGu YiteGu commented Aug 7, 2024

  1. blk: multi-threaded discard support
  2. NCB fix for leaked space when bdev_async_discard is enabled
  3. React to bdev_enable_discard changes in handle_conf_change(); Fix several issues with stopping discard threads

backport tracker: https://tracker.ceph.com/issues/67139
backport tracker: https://tracker.ceph.com/issues/67404


backport of #55469, #56744, #58409
parent tracker: https://tracker.ceph.com/issues/65298, https://tracker.ceph.com/issues/66817

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

Matt1360 and others added 11 commits August 7, 2024 10:45
Signed-off-by: Matt Vandermeulen <matt@reenigne.net>
(cherry picked from commit 4ae47bd)
Signed-off-by: Matt Vandermeulen <matt@reenigne.net>
(cherry picked from commit d8815e1)
Signed-off-by: Matt Vandermeulen <matt@reenigne.net>
(cherry picked from commit 671e126)
Signed-off-by: Matt Vandermeulen <matt@reenigne.net>
(cherry picked from commit 5c4a234)
…bled

The fix calls bdev->discard_drain() before calling store_allocator() to make sure all freed space is reflected in the allocator before destaging it.
The fix sets a timeout on the drain call (500 msec); if it expires, the allocator is not stored (forcing a recovery on the next startup).
Fixes: https://tracker.ceph.com/issues/65298
Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>

(cherry picked from commit 3aa891d)
…toring the allocator.

On fast shutdown we will simply copy the discard queue entries to the allocator

Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
(cherry picked from commit 4762ffa)
On fast shutdown, take over the main discard queue, copying it to the allocator, and only wait for the threads to commit their small private discard queues

Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
(cherry picked from commit 1e10a9b)
Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
(cherry picked from commit b37080c)
…_change()

This fixes two issues that were introduced by 755f3e0:
1. After an OSD boots, discard threads were not stopped when
   bdev_enable_discard was set to false, whereas that was the intent of
   that commit.
2. If bdev_enable_discard or bdev_async_discard_threads are configured
   with a mask that can't be evaluated at OSD boot (e.g. a device
   class), then async discard won't be enabled until a later config
   change to bdev_async_discard_threads.

Fixes: https://tracker.ceph.com/issues/66817
Signed-off-by: Joshua Baergen <jbaergen@digitalocean.com>
(cherry picked from commit 8ffe35e)
1. In _discard_stop(), the wait for !discard_threads.empty() was there
   from a prior implementation where there could theoretically be a race
   between _discard_start() and _discard_stop(). If that race does
   exist, this check won't help; not only is _discard_stop() not called
   if discard_threads is empty, but discard_threads won't be populated
   until _discard_start() runs and thus this won't detect such a race.
2. Calling _discard_stop() from handle_conf_change() is a guaranteed
   deadlock because discard_lock is already held, so don't do that. Use
   the same flow whether we're stopping a subset of threads or all
   threads.
3. Asking a subset of discard threads to stop was not guaranteed to take
   effect, since if they continued to find contents in discard_queue
   then they would continue to run indefinitely. Add additional logic to
   _discard_thread() to have threads stop if they have been requested to
   stop and other threads exist to continue draining discard_queue.
4. Make the flow of _discard_stop() and handle_conf_change() more
   similar.

Fixes: https://tracker.ceph.com/issues/66817
Signed-off-by: Joshua Baergen <jbaergen@digitalocean.com>
(cherry picked from commit 3d4a899)
Instead of having _discard_start() and _discard_stop() partially or
completely duplicate functionality in handle_conf_change(), have a
single _discard_update_threads() that can handle all three. Loops are
tidied slightly, the unnecessary target_discard_threads class variable
has been removed, and now handle_conf_change() will respect
support_discard.

Signed-off-by: Joshua Baergen <jbaergen@digitalocean.com>
(cherry picked from commit 617c936)
@YiteGu YiteGu requested a review from a team as a code owner August 7, 2024 02:50
@github-actions github-actions bot added this to the squid milestone Aug 7, 2024
@YiteGu
Member Author

YiteGu commented Aug 7, 2024

jenkins test make check

@ljflores
Member

@YiteGu @ifed01 please see this error found in QA testing that looks related to this PR:
/a/skanta-2024-09-27_06:56:34-rados-wip-bharath14-testing-2024-09-26-2119-squid-distro-default-smithi/7921618

2024-09-27T09:22:38.858 INFO:tasks.rados.rados.0.smithi006.stdout:4227:  finishing copy_from to smithi00633089-523
2024-09-27T09:22:38.858 INFO:tasks.rados.rados.0.smithi006.stdout:update_object_version oid 523 v 1068 (ObjNum 1802 snap 167 seq_num 1802) dirty exists
2024-09-27T09:22:38.867 INFO:teuthology.orchestra.run.smithi179.stderr:error preparing db environment: (5) Input/output error
2024-09-27T09:22:38.870 DEBUG:teuthology.orchestra.run:got remote process result: 1
2024-09-27T09:22:38.872 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_396230dce0bbb423ab67ae5d3625348d61b9460c/qa/tasks/ceph_manager.py", line 192, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_396230dce0bbb423ab67ae5d3625348d61b9460c/qa/tasks/ceph_manager.py", line 1435, in _do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_396230dce0bbb423ab67ae5d3625348d61b9460c/qa/tasks/ceph_manager.py", line 1171, in test_bluestore_reshard
    self.test_bluestore_reshard_action()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_396230dce0bbb423ab67ae5d3625348d61b9460c/qa/tasks/ceph_manager.py", line 1151, in test_bluestore_reshard_action
    raise Exception("ceph-bluestore-tool resharding failed.")
Exception: ceph-bluestore-tool resharding failed.

Ref: https://tracker.ceph.com/issues/68294

@YiteGu
Member Author

YiteGu commented Oct 2, 2024

@YiteGu @ifed01 please see this error found in QA testing that looks related to this PR: /a/skanta-2024-09-27_06:56:34-rados-wip-bharath14-testing-2024-09-26-2119-squid-distro-default-smithi/7921618


At first glance, this backport does not modify any code related to the ceph-bluestore-tool reshard action. Where can I see the log file of the ceph-bluestore-tool reshard?

@ifed01 ifed01 self-requested a review October 14, 2024 13:42
Contributor

@ifed01 ifed01 left a comment

Please bring discard_stop initialization back.

@ifed01
Contributor

ifed01 commented Oct 14, 2024

@YiteGu @ifed01 please see this error found in QA testing that looks related to this PR:
/a/skanta-2024-09-27_06:56:34-rados-wip-bharath14-testing-2024-09-26-2119-squid-distro-default-smithi/7921618

I have a feeling we have two issues in this batch.
The first one is a valgrind error on an uninitialized variable (presumably discard_stop),
see https://qa-proxy.ceph.com/teuthology/skanta-2024-09-27_06:56:34-rados-wip-bharath14-testing-2024-09-26-2119-squid-distro-default-smithi/7921444/teuthology.log

description: rados/verify/{centos_latest ceph clusters/{fixed-2 openstack} d-thrash/none
  mon_election/classic msgr-failures/few msgr/async-v1only objectstore/bluestore-low-osd-mem-target
  rados tasks/rados_cls_all validater/valgrind}
duration: 914.6719253063202
failure_reason: 'valgrind error: UninitCondition

  KernelDevice::_discard_update_threads()

  KernelDevice::open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&)

  BlueStore::read_meta(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*)'
flavor: default
owner: scheduled_skanta@teuthology
sentry_event: https://sentry.ceph.com/organizations/ceph/?query=1feadf9efd354f1590bb35efa2e49b1e
status: fail
success: false

This is definitely to be addressed within this PR's scope.

The second one is "ceph-bluestore-tool resharding failed" and it needs further investigation. Not sure if it's related to this PR.

Value discard_stop could be uninitialized.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
(cherry picked from commit bdcc7da)
@YiteGu
Member Author

YiteGu commented Oct 15, 2024

Please bring discard_stop initialization back.

done

@YiteGu
Member Author

YiteGu commented Oct 15, 2024

jenkins test make check

@ifed01 ifed01 self-requested a review October 15, 2024 10:07
@ifed01
Contributor

ifed01 commented Oct 15, 2024

The second one "ceph-bluestore-tool resharding failed" and it needs further investigation. Not sure if it's related to this PR.

So after checking the logs and talking to Adam I'm pretty sure that my hypothesis is valid - there were two independent issues with the last QA run. The second one is tracked by https://tracker.ceph.com/issues/67911

@aclamk
Contributor

aclamk commented Oct 15, 2024

@ljflores The issue in #59065 (comment) is caused by:
https://tracker.ceph.com/issues/67911

@ifed01
Contributor

ifed01 commented Oct 15, 2024

So @ljflores - please include #59969 along with this PR to get rid of the above issues.

@ljflores
Member

ljflores commented Nov 6, 2024

@YiteGu @ifed01 This PR was merged prematurely before it got tested. Can you please review the revert PR (#60641) and re-raise this PR?

@parth-gr
Contributor

@ljflores @SrinivasaBharath This might be the reason for the BlueStore osd-expansion failing: https://github.com/rook/rook/pull/15251#issuecomment-2582895014

cc @travisn

@travisn
Member

travisn commented Jan 28, 2025

Looks like the revert PR #60641 has not been merged yet. So if this is related, we should be able to see the latest squid devel image failing in the resize testing: quay.ceph.io/ceph-ci/ceph:squid

@parth-gr
Contributor

Yes, it's failing with quay.ceph.io/ceph-ci/ceph:squid

@solidDoWant

Hey Ceph maintainers, this change removed bdev_async_discard, which is a breaking change.

While this is mentioned in the changelog, it only includes the PR title which isn't very descriptive. Can this change be rolled back until v20, or be made compatible with the current bdev_async_discard field, or can a large warning be added to the v19.2.1 changelog?

@solidDoWant

This change causes a pretty major performance regression when bdev_async_discard_threads is greater than 1:

[screenshot]

[screenshot]

The change from ~0 to ~12 is when I changed from 1 to 4.

I'm not the only one seeing this. Here's a few screenshots from some other users:

[screenshot]
Before setting it, after raising it to 4, and after setting it to 1

[screenshot]
Before setting it, after raising it to 4, and after dropping to 2

I can't file an issue on the official bugtracker because account creation requires manual approval. @YiteGu @ifed01 Sorry for the ping but I don't know how else to get this in front of somebody.

@YiteGu
Member Author

YiteGu commented Mar 4, 2025

Hey Ceph maintainers, this change removed bdev_async_discard, which is a breaking change.

While this is mentioned in the changelog, it only includes the PR title which isn't very descriptive. Can this change be rolled back until v20, or be made compatible with the current bdev_async_discard field, or can a large warning be added to the v19.2.1 changelog?

The purpose of bdev_async_discard_threads is to enable concurrent completion of discard ops using more threads. If it is set to 1, there is no change from the existing behaviour. At this point, the bdev_async_discard option becomes redundant.

@solidDoWant

I completely get that the bdev_async_discard field is now redundant, and setting it to 1 preserves the current behavior. However, because this requires ceph admins to take action to preserve the current behavior, this is by definition a breaking change.

Unless I am missing something here, ceph admins who have previously set ceph config set global bdev_async_discard 1, or have a config file with

[global]
bdev_async_discard = 1

will suddenly, without any notice, have async discards disabled on their clusters.

Additionally, running ceph config set global bdev_async_discard 1 now fails on v19.2.1 forward, which breaks automation and other tools that integrate with ceph, like Rook.

This change is breaking compatibility with v19 stable, and without backwards compatibility should have been delayed until v20.
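For admins affected by the removal, the migration discussed in this thread would look roughly like the fragment below. The option names are the ones mentioned here (bdev_enable_discard, bdev_async_discard_threads); the exact values and syntax are an assumption to be checked against your release's documentation:

```ini
# Old (pre-19.2.1) fragment -- the bdev_async_discard option was removed:
#[global]
#bdev_async_discard = 1

# Replacement: enable discard and pick a thread count; 1 preserves the
# old single-threaded async-discard behaviour.
[global]
bdev_enable_discard = true
bdev_async_discard_threads = 1
```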

@YiteGu
Member Author

YiteGu commented Mar 6, 2025

I completely get that the bdev_async_discard field is now redundant, and setting it to 1 preserves the current behavior. However, because this requires ceph admins to take action to preserve the current behavior, this is by definition a breaking change.


I completely understand how you feel. The transition period, while new configuration options are introduced and old ones are removed, is indeed a bit troublesome.

@YiteGu
Member Author

YiteGu commented Mar 6, 2025

This change causes a pretty major performance regression when bdev_async_discard_threads is greater than 1: [screenshot]

[…]

I can't file an issue on the official bugtracker because account creation requires manual approval. @YiteGu @ifed01 Sorry for the ping but I don't know how else to get this in front of somebody.

19.2.1 is missing #59529, which causes excessive CPU load during multi-threaded discard. Please keep bdev_async_discard_threads = 1 on 19.2.1.

JackMyers001 added a commit to JackMyers001/ducknet-ops that referenced this pull request May 30, 2025
I LOVE BREAKING CHANGES! See ceph/ceph#59065 for
details
