
osd: optimize PG removal (part1&2 of 2) #37496

Closed
ifed01 wants to merge 7 commits into ceph:main from ifed01:wip-ifed-faster-rm-p2

Conversation

@ifed01
Contributor

@ifed01 ifed01 commented Sep 30, 2020

This patch reworks bulk PG removal by exploiting the fact that both the onode and OMAP naming schemes are "PG-grouped" now.
Hence one can apply a single-shot ranged delete op (well, two ranged delete ops in fact) to remove everything related to a PG from RocksDB. The tricky part is that the disk space occupied by the deleted onodes still has to be released, so the bulk removal is preceded by a space reclamation stage that enumerates all the onodes and releases their extents. Given RocksDB's implementation this looks beneficial, as we replace multiple DB deletes with multiple DB updates; the former are known to cause DB performance degradation when applied in bulk. Unfortunately the above-mentioned ranged deletes aren't ideal in this respect either, so they are followed by async ranged compactions to eliminate potential DB degradation.
Finally the PG removal scenario is as follows:

  1. reclaim space for every onode (i.e. update onode metadata and release the occupied space)
  2. invoke ranged deletes for the PG's relevant onodes and OMAPs
  3. queue asynchronous DB compaction tasks for the ranges affected in step 2
  4. perform the async compactions in the background
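The four steps above can be sketched against a toy ordered key-value store, with a std::map standing in for RocksDB; `delete_range`, `remove_pg`, and the key layout below are hypothetical stand-ins for illustration, not BlueStore's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>

// Toy ordered KV store standing in for RocksDB. Keys are prefixed with a
// PG id, so everything belonging to one PG forms a contiguous key range.
using KVStore = std::map<std::string, std::string>;

// Erase every key in [begin, end) in one pass, mirroring a single-shot
// ranged delete (RocksDB's DeleteRange). Returns the number of keys removed.
size_t delete_range(KVStore& kv, const std::string& begin, const std::string& end) {
  auto first = kv.lower_bound(begin);
  auto last = kv.lower_bound(end);
  size_t n = std::distance(first, last);
  kv.erase(first, last);
  return n;
}

// Sketch of the removal flow: (1) enumerate the PG's onodes to reclaim
// their extents, (2) one ranged delete each for the onode and OMAP
// namespaces. Steps 3/4 (async ranged compaction) have no analogue here.
size_t remove_pg(KVStore& kv, const std::string& pg) {
  const std::string onodes = "onode/" + pg + "/";
  const std::string omaps  = "omap/"  + pg + "/";
  // 1. space reclamation pass over the PG's onodes.
  for (auto it = kv.lower_bound(onodes);
       it != kv.end() && it->first.compare(0, onodes.size(), onodes) == 0;
       ++it) {
    // stand-in for releasing this onode's extents back to the allocator
  }
  // 2. single-shot ranged deletes for the onode and OMAP key ranges.
  size_t n = 0;
  n += delete_range(kv, onodes, onodes + "\xff");
  n += delete_range(kv, omaps,  omaps  + "\xff");
  // 3./4. a real implementation would now queue async DB compactions
  // for the two ranges just deleted.
  return n;
}
```

For a PG holding two onodes and one OMAP key, remove_pg drops all three entries in two ranged deletes while leaving other PGs' keys untouched.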

Performance numbers are available at: https://docs.google.com/spreadsheets/d/17V2mXUDEMAFVmSC67o1rQtrWNnAm_itBgzdxh1vRJMY/edit?usp=sharing

Fixes: https://tracker.ceph.com/issues/47174 and related

Signed-off-by: Igor Fedotov ifedotov@suse.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@ifed01 ifed01 requested a review from a team as a code owner September 30, 2020 13:53
@ifed01
Contributor Author

ifed01 commented Oct 1, 2020

Here is an overview on the issue and proposed fixes: https://docs.google.com/presentation/d/1Qid__UuHmE5PhVmFT8aviZADuiLp32zzbhq7dNwaGsA/edit?usp=sharing

@ifed01
Contributor Author

ifed01 commented Oct 1, 2020

jenkins test classic perf

@tchaikov
Contributor

2020-10-18T06:50:18.334 INFO:tasks.ceph.osd.1.smithi003.stderr:2020-10-18T06:50:18.327+0000 7f8d2fc21700 -1 /build/ceph-16.0.0-6424-gf0cd5b47/src/os/bluestore/BlueStore.cc: In function 'BlueStore::_reap_collections()::<lambda(BlueStore::OnodeRef)>' thread 7f8d2fc21700 time 2020-10-18T06:50:18.329996+0000
2020-10-18T06:50:18.334 INFO:tasks.ceph.osd.1.smithi003.stderr:/build/ceph-16.0.0-6424-gf0cd5b47/src/os/bluestore/BlueStore.cc: 9263: FAILED ceph_assert(!o->exists)
2020-10-18T06:50:18.334 INFO:tasks.ceph.osd.1.smithi003.stderr:
2020-10-18T06:50:18.334 INFO:tasks.ceph.osd.1.smithi003.stderr: ceph version 16.0.0-6424-gf0cd5b47 (f0cd5b4777f785d153f0dbf8daef9f547879be3b) pacific (dev)
2020-10-18T06:50:18.335 INFO:tasks.ceph.osd.1.smithi003.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x55a21a512bc7]
2020-10-18T06:50:18.335 INFO:tasks.ceph.osd.1.smithi003.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55a21a512da2]
2020-10-18T06:50:18.335 INFO:tasks.ceph.osd.1.smithi003.stderr: 3: ceph-osd(+0xf6d383) [0x55a21aa73383]
2020-10-18T06:50:18.335 INFO:tasks.ceph.osd.1.smithi003.stderr: 4: (BlueStore::OnodeSpace::map_any(std::function<bool (boost::intrusive_ptr<BlueStore::Onode>)>)+0x98) [0x55a21aa4ce78]
2020-10-18T06:50:18.335 INFO:tasks.ceph.osd.1.smithi003.stderr: 5: (BlueStore::_reap_collections()+0x19b) [0x55a21aa72b8b]
2020-10-18T06:50:18.335 INFO:tasks.ceph.osd.1.smithi003.stderr: 6: (BlueStore::_kv_finalize_thread()+0x6ce) [0x55a21aad0b5e]
2020-10-18T06:50:18.335 INFO:tasks.ceph.osd.1.smithi003.stderr: 7: (BlueStore::KVFinalizeThread::entry()+0xd) [0x55a21ab01bfd]
2020-10-18T06:50:18.336 INFO:tasks.ceph.osd.1.smithi003.stderr: 8: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f8d3eb366db]
2020-10-18T06:50:18.336 INFO:tasks.ceph.osd.1.smithi003.stderr: 9: clone()
2020-10-18T06:50:18.336 INFO:tasks.ceph.osd.1.smithi003.stderr:

https://pulpito.ceph.com/kchai-2020-10-18_06:35:20-rados-wip-kefu-testing-2020-10-17-1517-distro-basic-smithi/5534413/

see also https://pulpito.ceph.com/kchai-2020-10-18_06:35:20-rados-wip-kefu-testing-2020-10-17-1517-distro-basic-smithi/

Contributor

@tchaikov tchaikov left a comment

@tchaikov
Contributor

tchaikov commented Dec 6, 2020

@ifed01 needs rebase

@tchaikov
Contributor

jenkins test api

@tchaikov
Contributor

2020-12-10T05:18:02.066 INFO:tasks.ceph.osd.1.smithi074.stderr: ceph version 16.0.0-8120-g72b5978a (72b5978a1c8f4271f00af6f934635ec4e1278339) pacific (dev)
2020-12-10T05:18:02.066 INFO:tasks.ceph.osd.1.smithi074.stderr: 1: /lib64/libpthread.so.0(+0x12dc0) [0x7f3080b79dc0]
2020-12-10T05:18:02.066 INFO:tasks.ceph.osd.1.smithi074.stderr: 2: gsignal()
2020-12-10T05:18:02.067 INFO:tasks.ceph.osd.1.smithi074.stderr: 3: abort()
2020-12-10T05:18:02.067 INFO:tasks.ceph.osd.1.smithi074.stderr: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x56184aeed4eb]
2020-12-10T05:18:02.067 INFO:tasks.ceph.osd.1.smithi074.stderr: 5: ceph-osd(+0x5636b4) [0x56184aeed6b4]
2020-12-10T05:18:02.067 INFO:tasks.ceph.osd.1.smithi074.stderr: 6: ceph-osd(+0xb5cf19) [0x56184b4e6f19]
2020-12-10T05:18:02.067 INFO:tasks.ceph.osd.1.smithi074.stderr: 7: (BlueStore::OnodeSpace::map_any(std::function<bool (boost::intrusive_ptr<BlueStore::Onode>)>)+0x98) [0x56184b4c5b58]
2020-12-10T05:18:02.068 INFO:tasks.ceph.osd.1.smithi074.stderr: 8: (BlueStore::_reap_collections()+0x2b9) [0x56184b4e6879]
2020-12-10T05:18:02.068 INFO:tasks.ceph.osd.1.smithi074.stderr: 9: (BlueStore::_kv_finalize_thread()+0x77b) [0x56184b52548b]
2020-12-10T05:18:02.068 INFO:tasks.ceph.osd.1.smithi074.stderr: 10: (BlueStore::KVFinalizeThread::entry()+0x11) [0x56184b5795d1]
2020-12-10T05:18:02.068 INFO:tasks.ceph.osd.1.smithi074.stderr: 11: /lib64/libpthread.so.0(+0x82de) [0x7f3080b6f2de]
2020-12-10T05:18:02.068 INFO:tasks.ceph.osd.1.smithi074.stderr: 12: clone()
2020-12-10T05:18:02.069 INFO:tasks.ceph.osd.1.smithi074.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

see https://pulpito.ceph.com/kchai-2020-12-10_03:58:13-rados-wip-kefu-testing-2020-12-10-0947-distro-basic-smithi/5696996/

@tchaikov tchaikov removed the needs-qa label Dec 10, 2020
@ifed01 ifed01 force-pushed the wip-ifed-faster-rm-p2 branch from 10d797c to 36159ca Compare August 2, 2021 11:32
@tchaikov
Contributor

tchaikov commented Aug 3, 2021

Contributor

@tchaikov tchaikov left a comment

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@sseshasa
Contributor

Pulpito Run Rados Suite Results :

Only this PR was included in the following run:
https://pulpito.ceph.com/yuriw-2021-08-10_20:58:01-rados-wip-yuri5-testing-2021-08-10-1143-distro-basic-smithi/

All Runs & Failures Tracked here: https://pad.ceph.com/p/master_runs_tracking

Related Failures Summary (JobIDs):

  1. 6332069 and 6332501
  2. 6332125
     • qa/standalone/divergent-priors.sh:622: TEST_divergent_2: return 1
     • Test actually passed. At the end of the test, the step to kill the osd daemons failed.
  3. 6332188 and 6332635
     • rados/thrash: Assertion failure in BlueStore.cc (osd.7): src/os/bluestore/BlueStore.cc: 11414: FAILED ceph_assert(o->onode.exists())
  4. 6332680
     • rados/thrash: Abort in BlueStore.cc (osd.3): src/os/bluestore/BlueStore.cc: 13424: ceph_abort_msg("unexpected error")
  5. 6332725
     • qa/standalone/osd-backfill/osd-backfill-prio.sh:315: TEST_backfill_priority: return 1
     • Failure in kill_daemons() at the end of TEST_backfill_priority() test.

@zhscn
Member

zhscn commented Aug 31, 2021

@ifed01

bulk removal not supported for KV on a spinner drive

It seems that the optimization in this pr targets only scenarios in which non-rotational device is used as the backend, is this correct? If so, why not optimize rotational device scenarios? It looks to me that the same approach should also work for those scenarios.

Thanks:-)

@ifed01
Contributor Author

ifed01 commented Sep 1, 2021

@ifed01

bulk removal not supported for KV on a spinner drive

It seems that the optimization in this pr targets only scenarios in which non-rotational device is used as the backend, is this correct?

right, this isn't supported for rotational drives for now.

If so, why not optimize rotational device scenarios? It looks to me that the same approach should also work for those scenarios.

Thanks:-)

Technically this should be doable, but we haven't tested it on those drives. And I'm hesitant to support more complex scenarios for RocksDB on top of spinning drives. RocksDB is mainly aimed at fast(!) devices, and IMO it makes sense to discourage its usage on spinning drives as much as possible. We've seen a lot of bad user experience in these scenarios...

@aclamk
Contributor

aclamk commented Sep 1, 2021

@ifed01 @zhscn

Function _use_db_rotational_settings is defined but never used; instead, _use_rotational_settings is used for the BULK check.
Is this correct?

@zhscn
Member

zhscn commented Sep 1, 2021

  if (_use_rotational_settings()) {
    dout(10) << __func__
             << " bulk removal not supported for KV on a spinner drive"
             << dendl;
    return -ENOTSUP;
  }

It looks like a typo; _use_db_rotational_settings should be used instead.
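To illustrate the suggested fix, here is a minimal sketch of the intended guard, with the two device probes stubbed out as plain booleans (hypothetical stand-ins for the real BlueStore members, which consult the actual devices):

```cpp
#include <cassert>

// Stubbed device probes; in BlueStore these would reflect the devices
// actually backing RocksDB and the main data store.
bool db_device_rotational   = false;  // device under RocksDB (e.g. flash)
bool main_device_rotational = true;   // main data device (e.g. a spinner)

bool _use_db_rotational_settings() { return db_device_rotational; }
bool _use_rotational_settings()    { return main_device_rotational; }

// Per the suggested fix, bulk removal should be gated on the DB device:
// a spinning main device with its DB on flash can still take the fast path.
bool bulk_removal_supported() {
  return !_use_db_rotational_settings();
}
```

With the original `_use_rotational_settings()` check, this hybrid layout (spinner for data, flash for the DB) would incorrectly hit the -ENOTSUP branch.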

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@neha-ojha
Member

@ifed01 is this ready for another round of teuthology testing?

@ifed01
Contributor Author

ifed01 commented Nov 23, 2021

@ifed01 is this ready for another round of teuthology testing?

yeah, planning to give it a try once shaman completes the build

@github-actions

github-actions bot commented Dec 5, 2021

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

2 similar comments
@github-actions

github-actions bot commented Dec 9, 2021

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions

github-actions bot commented Jan 8, 2022

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
The rationale is to provide a faster space-reclaiming operation for BlueStore that performs as few DB operations as possible. Additionally, the onode is trimmed from the cache afterwards to ensure smoother collection reaping, if any.

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
This includes the ability to read (single-key mode / prefix bulk mode) and clear omaps.

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Due to the improved PG removal performance one can get false-positive error detection on the above keyphrase when running the ec-inconsistent-hinfo task.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@ghost

ghost commented Sep 4, 2022

Hi,
I'm suffering from an issue: we went to production without any limit on osd_max_pg_log_entries and osd_min_pg_log_entries. The problem is that OSD memory usage got very high with many objects and a large number of PGs. In a similar case I saw a recommendation for the following:
osd_min_pg_log_entries=500
osd_max_pg_log_entries=500
bluefs_buffered_io=false
https://tracker.ceph.com/issues/53729
but I don't understand which parameters this is based on: OSD device size? block.db size? Can someone shed more light on this, and on whether your improvement will somehow help?

@markhpc
Member

markhpc commented Sep 8, 2022

@ShimTanny this PR is probably not the right place for discussing existing release code. Maybe try the users mailing list?

@github-actions

github-actions bot commented Nov 7, 2022

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions

github-actions bot commented Dec 7, 2022

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!


9 participants