blk: threaded discard support#55469

Merged
yuriw merged 4 commits into ceph:main from Matt1360:main
Mar 14, 2024

Conversation

@Matt1360
Member

@Matt1360 Matt1360 commented Feb 6, 2024

We have encountered some drives that need discards enabled in order to stay performant; however, they aren't very quick at acting on the discard queue. I've turned the async discard functionality into a thread pool whose size can be tuned as needed; with a pool size of one, behaviour is unchanged from the existing implementation.

We're currently testing this in our lab (though against Pacific), and if there's appetite for this, I'll also backport it to Reef (selfishly, so we don't have to carry the patch). Note that because we're developing against Pacific, I might have missed something in the config here; the yaml file is new to me, so please let me know if I've missed anything there (I assume the build system does the appropriate generation).
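For readers curious what turning the async discard into a thread pool can look like, here is a minimal standalone sketch. It is not the PR's actual code: the class name `DiscardPool`, the `queue_discard`/`drain` API, and the completion callback are all invented for illustration. The idea it shows is the one described above: N worker threads drain one shared queue of discard regions, so a pool of size one degenerates to the existing single-threaded behaviour.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a discard thread pool: N workers pop
// (offset, length) regions from one shared queue and run a completion
// callback (in BlueStore terms, the point where the allocator would be
// told the space is free again).
class DiscardPool {
public:
  explicit DiscardPool(unsigned nthreads) {
    for (unsigned i = 0; i < nthreads; ++i)
      workers.emplace_back([this] { run(); });
  }
  ~DiscardPool() {
    {
      std::lock_guard<std::mutex> l(lock);
      stopping = true;
    }
    cond.notify_all();
    for (auto& t : workers) t.join();
  }
  // Queue one region; on_done fires after the (simulated) discard.
  void queue_discard(uint64_t offset, uint64_t length,
                     std::function<void()> on_done) {
    {
      std::lock_guard<std::mutex> l(lock);
      q.push_back({offset, length, std::move(on_done)});
    }
    cond.notify_one();
  }
  // Block until the queue is empty and no worker is mid-discard.
  void drain() {
    std::unique_lock<std::mutex> l(lock);
    drained.wait(l, [this] { return q.empty() && in_flight == 0; });
  }

private:
  struct Item { uint64_t offset, length; std::function<void()> on_done; };
  void run() {
    std::unique_lock<std::mutex> l(lock);
    while (true) {
      cond.wait(l, [this] { return stopping || !q.empty(); });
      if (stopping && q.empty()) return;
      Item item = std::move(q.front());
      q.pop_front();
      ++in_flight;
      l.unlock();
      // A real implementation would issue BLKDISCARD / NVMe Deallocate here.
      item.on_done();
      l.lock();
      --in_flight;
      if (q.empty() && in_flight == 0) drained.notify_all();
    }
  }
  std::mutex lock;
  std::condition_variable cond, drained;
  std::deque<Item> q;
  std::vector<std::thread> workers;
  unsigned in_flight = 0;
  bool stopping = false;
};
```

The tunable is simply the number of workers handed to the constructor, which mirrors how a pool-size config option would map onto slow-to-trim drives.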

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@Matt1360 Matt1360 requested a review from a team as a code owner February 6, 2024 15:27
@Matt1360 Matt1360 force-pushed the main branch 4 times, most recently from fbe5179 to cc8339f Compare February 6, 2024 16:30
Signed-off-by: Matt Vandermeulen <matt@reenigne.net>
Signed-off-by: Matt Vandermeulen <matt@reenigne.net>
Signed-off-by: Matt Vandermeulen <matt@reenigne.net>
Signed-off-by: Matt Vandermeulen <matt@reenigne.net>
Contributor

@ifed01 ifed01 left a comment

Good now!

@Matt1360
Member Author

jenkins test make check

@interestingyong

A few questions; I look forward to a reply.

  1. release_alloc_txc: a write op on osr->q may execute earlier than the async discard.
  2. A newly allocated (offset, len) may overlap with the discard_queue; I can't find any search or filter on discard_queue.
  3. Discarding/trimming the NVMe area [offset, len] would then lose data.

Thanks.

@ifed01
Contributor

ifed01 commented Jul 8, 2025

A few questions; I look forward to a reply.

  1. release_alloc_txc: a write op on osr->q may execute earlier than the async discard.
  2. A newly allocated (offset, len) may overlap with the discard_queue; I can't find any search or filter on discard_queue.
  3. Discarding/trimming the NVMe area [offset, len] would then lose data.

Thanks.

Please see

void BlueStore::_txc_release_alloc(TransContext *txc)
{
  bool discard_queued = false;
  // it's expected we're called with lazy_release_lock already taken!
  if (unlikely(cct->_conf->bluestore_debug_no_reuse_blocks ||
               txc->released.size() == 0 ||
               !alloc)) {
      goto out;
  }
  discard_queued = bdev->try_discard(txc->released);
  // if async discard succeeded, will do alloc->release when discard callback
  // else we should release here
  if (!discard_queued) {
      dout(10) << __func__ << "(sync) " << txc << " " << std::hex
               << txc->released << std::dec << dendl;
      alloc->release(txc->released);
  }

out:
  txc->released.clear();
}

BlueStore doesn't release extents immediately if discards are enabled. Instead, it postpones the release until the relevant discard op has completed:

void BlueStore::handle_discard(interval_set<uint64_t>& to_release)
{
  dout(10) << __func__ << dendl;
  ceph_assert(alloc);
  alloc->release(to_release);
}

Hence there is no way to allocate an extent while it's being discarded.
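The deferred-release invariant described above can be modelled in a few lines. This is a toy sketch, not BlueStore code: `ToyAllocator`, its `discarding` set, and the method names are invented for illustration. Extents handed to an in-flight discard sit in a pending set, and only the discard completion callback (the analogue of handle_discard) moves them back to the free set.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>

// Toy model of the invariant: space being discarded is neither free nor
// allocated, so a new allocation can never overlap an in-flight discard.
using Extent = std::pair<uint64_t, uint64_t>;  // (offset, length)

struct ToyAllocator {
  std::set<Extent> free_set;    // allocatable space
  std::set<Extent> discarding;  // released, but discard still in flight

  // _txc_release_alloc analogue: if a discard was queued, park the extent.
  void release(Extent e, bool discard_queued) {
    if (discard_queued)
      discarding.insert(e);     // the discard callback will free it later
    else
      free_set.insert(e);       // synchronous path: free immediately
  }
  // handle_discard analogue: runs when the discard op completes.
  void on_discard_done(Extent e) {
    discarding.erase(e);
    free_set.insert(e);
  }
  bool allocatable(Extent e) const { return free_set.count(e) != 0; }
};
```

While an extent sits in `discarding` it cannot be handed out, which is why a new allocation cannot race with a pending trim regardless of how many discard threads are running.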

@interestingyong


Thanks, I got it.
BlueFS and BlueStore reset the bitmap for the release_set immediately only when discard_queued is false. Otherwise, the space is queued for discard and the bitmap reset is performed asynchronously by a background thread.


6 participants