
osd: optimize PG removal (part1&2 of 2) #37496

Closed
ifed01 wants to merge 7 commits into ceph:main from ifed01:wip-ifed-faster-rm-p2

Conversation

@ifed01
Contributor

@ifed01 ifed01 commented Sep 30, 2020

This patch reworks bulk PG removal by exploiting the fact that both the onode and OMAP naming schemes are "PG-grouped" now.
Hence one can apply a single-shot ranged delete op (well, two ranged delete ops in fact) to remove everything related to a PG from RocksDB. The tricky part is that the disk space occupied by the deleted onodes still has to be released, so the bulk removal is preceded by a space reclamation stage that enumerates all the onodes and releases their extents. Given RocksDB's implementation this looks beneficial, as we replace multiple DB deletes with multiple DB updates; the former are known to cause DB performance degradation when applied in bulk. Unfortunately the above-mentioned ranged deletes aren't ideal in this respect either, so they are followed by async ranged compactions to eliminate potential DB degradation.
Finally the PG removal scenario is as follows:

  1. reclaim space for every onode (i.e. update onode metadata and release the occupied space)
  2. invoke ranged deletes for the PG's relevant onodes and OMAPs
  3. queue asynchronous DB compaction tasks for the ranges affected in step 2
  4. perform the async compactions in the background
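The four steps above can be sketched against a toy ordered key-value store, with a std::map standing in for RocksDB; `delete_range`, `remove_pg`, and the key layout below are hypothetical stand-ins for illustration, not BlueStore's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>

// Toy ordered KV store standing in for RocksDB. Keys are prefixed with a
// PG id, so everything belonging to one PG forms a contiguous key range.
using KVStore = std::map<std::string, std::string>;

// Erase every key in [begin, end) in one pass, mirroring a single-shot
// ranged delete (RocksDB's DeleteRange). Returns the number of keys removed.
size_t delete_range(KVStore& kv, const std::string& begin, const std::string& end) {
  auto first = kv.lower_bound(begin);
  auto last = kv.lower_bound(end);
  size_t n = std::distance(first, last);
  kv.erase(first, last);
  return n;
}

// Sketch of the removal flow: (1) enumerate the PG's onodes to reclaim
// their extents, (2) one ranged delete each for the onode and OMAP
// namespaces. Steps 3/4 (async ranged compaction) have no analogue here.
size_t remove_pg(KVStore& kv, const std::string& pg) {
  const std::string onodes = "onode/" + pg + "/";
  const std::string omaps  = "omap/"  + pg + "/";
  // 1. space reclamation pass over the PG's onodes.
  for (auto it = kv.lower_bound(onodes);
       it != kv.end() && it->first.compare(0, onodes.size(), onodes) == 0;
       ++it) {
    // stand-in for releasing this onode's extents back to the allocator
  }
  // 2. single-shot ranged deletes for the onode and OMAP key ranges.
  size_t n = 0;
  n += delete_range(kv, onodes, onodes + "\xff");
  n += delete_range(kv, omaps,  omaps  + "\xff");
  // 3./4. a real implementation would now queue async DB compactions
  // for the two ranges just deleted.
  return n;
}
```

For a PG holding two onodes and one OMAP key, remove_pg drops all three entries in two ranged deletes while leaving other PGs' keys untouched.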

Performance numbers are available at: https://docs.google.com/spreadsheets/d/17V2mXUDEMAFVmSC67o1rQtrWNnAm_itBgzdxh1vRJMY/edit?usp=sharing

Fixes: https://tracker.ceph.com/issues/47174 and related

Signed-off-by: Igor Fedotov ifedotov@suse.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@ifed01 ifed01 requested a review from a team as a code owner September 30, 2020 13:53
@ifed01
Contributor Author

ifed01 commented Oct 1, 2020

Here is an overview on the issue and proposed fixes: https://docs.google.com/presentation/d/1Qid__UuHmE5PhVmFT8aviZADuiLp32zzbhq7dNwaGsA/edit?usp=sharing

@ifed01
Contributor Author

ifed01 commented Oct 1, 2020

jenkins test classic perf

@tchaikov
Contributor

2020-10-18T06:50:18.334 INFO:tasks.ceph.osd.1.smithi003.stderr:2020-10-18T06:50:18.327+0000 7f8d2fc21700 -1 /build/ceph-16.0.0-6424-gf0cd5b47/src/os/bluestore/BlueStore.cc: In function 'BlueStore::_reap_collections()::<lambda(BlueStore::OnodeRef)>' thread 7f8d2fc21700 time 2020-10-18T06:50:18.329996+0000
2020-10-18T06:50:18.334 INFO:tasks.ceph.osd.1.smithi003.stderr:/build/ceph-16.0.0-6424-gf0cd5b47/src/os/bluestore/BlueStore.cc: 9263: FAILED ceph_assert(!o->exists)
2020-10-18T06:50:18.334 INFO:tasks.ceph.osd.1.smithi003.stderr:
2020-10-18T06:50:18.334 INFO:tasks.ceph.osd.1.smithi003.stderr: ceph version 16.0.0-6424-gf0cd5b47 (f0cd5b4777f785d153f0dbf8daef9f547879be3b) pacific (dev)
2020-10-18T06:50:18.335 INFO:tasks.ceph.osd.1.smithi003.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x55a21a512bc7]
2020-10-18T06:50:18.335 INFO:tasks.ceph.osd.1.smithi003.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55a21a512da2]
2020-10-18T06:50:18.335 INFO:tasks.ceph.osd.1.smithi003.stderr: 3: ceph-osd(+0xf6d383) [0x55a21aa73383]
2020-10-18T06:50:18.335 INFO:tasks.ceph.osd.1.smithi003.stderr: 4: (BlueStore::OnodeSpace::map_any(std::function<bool (boost::intrusive_ptr<BlueStore::Onode>)>)+0x98) [0x55a21aa4ce78]
2020-10-18T06:50:18.335 INFO:tasks.ceph.osd.1.smithi003.stderr: 5: (BlueStore::_reap_collections()+0x19b) [0x55a21aa72b8b]
2020-10-18T06:50:18.335 INFO:tasks.ceph.osd.1.smithi003.stderr: 6: (BlueStore::_kv_finalize_thread()+0x6ce) [0x55a21aad0b5e]
2020-10-18T06:50:18.335 INFO:tasks.ceph.osd.1.smithi003.stderr: 7: (BlueStore::KVFinalizeThread::entry()+0xd) [0x55a21ab01bfd]
2020-10-18T06:50:18.336 INFO:tasks.ceph.osd.1.smithi003.stderr: 8: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f8d3eb366db]
2020-10-18T06:50:18.336 INFO:tasks.ceph.osd.1.smithi003.stderr: 9: clone()
2020-10-18T06:50:18.336 INFO:tasks.ceph.osd.1.smithi003.stderr:

https://pulpito.ceph.com/kchai-2020-10-18_06:35:20-rados-wip-kefu-testing-2020-10-17-1517-distro-basic-smithi/5534413/

see also https://pulpito.ceph.com/kchai-2020-10-18_06:35:20-rados-wip-kefu-testing-2020-10-17-1517-distro-basic-smithi/

Contributor

@tchaikov tchaikov left a comment

@tchaikov
Contributor

tchaikov commented Dec 6, 2020

@ifed01 needs rebase

@tchaikov
Contributor

jenkins test api

@tchaikov
Contributor

2020-12-10T05:18:02.066 INFO:tasks.ceph.osd.1.smithi074.stderr: ceph version 16.0.0-8120-g72b5978a (72b5978a1c8f4271f00af6f934635ec4e1278339) pacific (dev)
2020-12-10T05:18:02.066 INFO:tasks.ceph.osd.1.smithi074.stderr: 1: /lib64/libpthread.so.0(+0x12dc0) [0x7f3080b79dc0]
2020-12-10T05:18:02.066 INFO:tasks.ceph.osd.1.smithi074.stderr: 2: gsignal()
2020-12-10T05:18:02.067 INFO:tasks.ceph.osd.1.smithi074.stderr: 3: abort()
2020-12-10T05:18:02.067 INFO:tasks.ceph.osd.1.smithi074.stderr: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x56184aeed4eb]
2020-12-10T05:18:02.067 INFO:tasks.ceph.osd.1.smithi074.stderr: 5: ceph-osd(+0x5636b4) [0x56184aeed6b4]
2020-12-10T05:18:02.067 INFO:tasks.ceph.osd.1.smithi074.stderr: 6: ceph-osd(+0xb5cf19) [0x56184b4e6f19]
2020-12-10T05:18:02.067 INFO:tasks.ceph.osd.1.smithi074.stderr: 7: (BlueStore::OnodeSpace::map_any(std::function<bool (boost::intrusive_ptr<BlueStore::Onode>)>)+0x98) [0x56184b4c5b58]
2020-12-10T05:18:02.068 INFO:tasks.ceph.osd.1.smithi074.stderr: 8: (BlueStore::_reap_collections()+0x2b9) [0x56184b4e6879]
2020-12-10T05:18:02.068 INFO:tasks.ceph.osd.1.smithi074.stderr: 9: (BlueStore::_kv_finalize_thread()+0x77b) [0x56184b52548b]
2020-12-10T05:18:02.068 INFO:tasks.ceph.osd.1.smithi074.stderr: 10: (BlueStore::KVFinalizeThread::entry()+0x11) [0x56184b5795d1]
2020-12-10T05:18:02.068 INFO:tasks.ceph.osd.1.smithi074.stderr: 11: /lib64/libpthread.so.0(+0x82de) [0x7f3080b6f2de]
2020-12-10T05:18:02.068 INFO:tasks.ceph.osd.1.smithi074.stderr: 12: clone()
2020-12-10T05:18:02.069 INFO:tasks.ceph.osd.1.smithi074.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

see https://pulpito.ceph.com/kchai-2020-12-10_03:58:13-rados-wip-kefu-testing-2020-12-10-0947-distro-basic-smithi/5696996/

@tchaikov tchaikov removed the needs-qa label Dec 10, 2020
@ifed01 ifed01 force-pushed the wip-ifed-faster-rm-p2 branch from 10d797c to 36159ca Compare August 2, 2021 11:32
@tchaikov
Contributor

tchaikov commented Aug 3, 2021

Contributor

@tchaikov tchaikov left a comment

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@sseshasa
Contributor

Pulpito Run Rados Suite Results :

Only this PR was included in the following run:
https://pulpito.ceph.com/yuriw-2021-08-10_20:58:01-rados-wip-yuri5-testing-2021-08-10-1143-distro-basic-smithi/

All Runs & Failures Tracked here: https://pad.ceph.com/p/master_runs_tracking

Related Failures Summary (JobIDs):

  1. 6332069 and 6332501
  2. 6332125
     • qa/standalone/divergent-priors.sh:622: TEST_divergent_2: return 1
     • Test actually passed. At the end of the test, the step to kill the osd daemons failed.
  3. 6332188 and 6332635
     • rados/thrash: Assertion failure in BlueStore.cc (osd.7): src/os/bluestore/BlueStore.cc: 11414: FAILED ceph_assert(o->onode.exists())
  4. 6332680
     • rados/thrash: Abort in BlueStore.cc (osd.3): src/os/bluestore/BlueStore.cc: 13424: ceph_abort_msg("unexpected error")
  5. 6332725
     • qa/standalone/osd-backfill/osd-backfill-prio.sh:315: TEST_backfill_priority: return 1
     • Failure in kill_daemons() at the end of TEST_backfill_priority() test.

@zhscn
Member

zhscn commented Aug 31, 2021

@ifed01

bulk removal not supported for KV on a spinner drive

It seems that the optimization in this pr targets only scenarios in which non-rotational device is used as the backend, is this correct? If so, why not optimize rotational device scenarios? It looks to me that the same approach should also work for those scenarios.

Thanks:-)

@ifed01
Contributor Author

ifed01 commented Sep 1, 2021

@ifed01

bulk removal not supported for KV on a spinner drive

It seems that the optimization in this pr targets only scenarios in which non-rotational device is used as the backend, is this correct?

right, this isn't supported for rotational drives for now.

If so, why not optimize rotational device scenarios? It looks to me that the same approach should also work for those scenarios.

Thanks:-)

Technically this should be doable, but we haven't tested it on those drives. And I'm hesitant to support more complex scenarios for RocksDB on top of spinning drives. RocksDB is mainly aimed at fast(!) devices, and IMO it makes sense to discourage its usage on spinning drives as much as possible. We've seen a lot of bad user experience in these scenarios...

@aclamk
Contributor

aclamk commented Sep 1, 2021

@ifed01 @zhscn

Function _use_db_rotational_settings is defined but never used; instead, _use_rotational_settings is used for the BULK check.
Is this correct?

@zhscn
Member

zhscn commented Sep 1, 2021

  if (_use_rotational_settings()) {
    dout(10) << __func__
             << " bulk removal not supported for KV on a spinner drive"
             << dendl;
    return -ENOTSUP;
  }

It looks like a typo; _use_db_rotational_settings should be used instead.
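To illustrate the suggested fix, here is a minimal sketch of the intended guard, with the two device probes stubbed out as plain booleans (hypothetical stand-ins for the real BlueStore members, which consult the actual devices):

```cpp
#include <cassert>

// Stubbed device probes; in BlueStore these would reflect the devices
// actually backing RocksDB and the main data store.
bool db_device_rotational   = false;  // device under RocksDB (e.g. flash)
bool main_device_rotational = true;   // main data device (e.g. a spinner)

bool _use_db_rotational_settings() { return db_device_rotational; }
bool _use_rotational_settings()    { return main_device_rotational; }

// Per the suggested fix, bulk removal should be gated on the DB device:
// a spinning main device with its DB on flash can still take the fast path.
bool bulk_removal_supported() {
  return !_use_db_rotational_settings();
}
```

With the original `_use_rotational_settings()` check, this hybrid layout (spinner for data, flash for the DB) would incorrectly hit the -ENOTSUP branch.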

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@neha-ojha
Member

@ifed01 is this ready for another round of teuthology testing?

@ifed01
Contributor Author

ifed01 commented Nov 23, 2021

@ifed01 is this ready for another round of teuthology testing?

yeah, planning to give it a try once shaman completes the build

@github-actions

github-actions bot commented Dec 5, 2021

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

2 similar comments
@github-actions

github-actions bot commented Dec 9, 2021

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions

github-actions bot commented Jan 8, 2022

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
The rationale is to provide a faster space-reclaiming operation for BlueStore that performs as few DB operations as possible. Additionally, the onode is trimmed from the cache afterwards to ensure smoother collection reaping, if any.

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
This includes the ability to read (single-key mode / prefix bulk mode) and clear omaps.

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Due to the improved PG removal performance one can get false-positive error detection on the above keyphrase when running the ec-inconsistent-hinfo task.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@ghost

ghost commented Sep 4, 2022

Hi,
I'm suffering from an issue: we went to production without any limit on osd_max_pg_log_entries and osd_min_pg_log_entries. The problem is that OSD memory usage got very high with many objects and a large number of PGs. In a similar case I saw a recommendation for the following:
osd_min_pg_log_entries=500
osd_max_pg_log_entries=500
bluefs_buffered_io=false
https://tracker.ceph.com/issues/53729
but I don't understand which parameters this is based on: OSD device size? block.db size? Can someone shed more light on this, and on whether your improvement will somehow help?

@markhpc
Member

markhpc commented Sep 8, 2022

@ShimTanny this PR is probably not the right place for discussing existing release code. Maybe try the users mailing list?

@github-actions

github-actions bot commented Nov 7, 2022

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions

github-actions bot commented Dec 7, 2022

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!


9 participants