
os/bluestore: get rid off statfs update on each txc #46036

Merged
ljflores merged 10 commits into ceph:main from ifed01:wip-ifed-new-statfs-update on Nov 10, 2022

Conversation


ifed01 (Contributor) commented Apr 26, 2022

This removes the statfs update on each transaction; instead it relies on the NCB machinery to recover bluestore stats in case of a non-graceful shutdown.
This functionality is a prerequisite for the upcoming new WAL implementation. It might also provide some performance improvement, since the DB gets less load.
What's more, the redesigned NCB recovery is itself more performant.

Signed-off-by: Igor Fedotov igor.fedotov@croit.io

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows


ifed01 commented Apr 26, 2022

jenkins test make check

aclamk (Contributor) left a comment

This is great stuff.
I think refactor of decode_some should be made differently, but I am open to discussion.

} catch (ceph::buffer::error& e) {
  derr << "fsck error: failed to decode Pool StatFS record"
       << pretty_binary_string(key) << dendl;
string key;
aclamk (Contributor):

I feel like we can make check/repair of statfs much simpler.
We could build a map of {key -> stat} from the stats we just scanned.
This should ideally be reflected by the state of the DB:
If the DB is missing an entry - we complain and add it (in repair mode).
If the DB has a different value - we complain and fix it (in repair mode).
If the DB has an extra key - we complain and delete it (in repair mode).

ifed01 (Contributor Author):

The tricky thing is that now we might have statfs persisted in the DB in two different ways, depending on whether the previous shutdown was graceful or not. This implementation checks the in-memory stats (loaded either from the relevant DB records or via recovery) against the stats collected by fsck.
For your suggestion we'd need to perform a DB update immediately after the recovery and then deserialize it back for fsck (and then finally update the DB once again on shutdown), which doesn't make much sense to me...

@ifed01 ifed01 force-pushed the wip-ifed-new-statfs-update branch 4 times, most recently from 7032f8a to 3301422 Compare April 28, 2022 20:36
@ifed01 ifed01 force-pushed the wip-ifed-new-statfs-update branch from 3301422 to d088026 Compare May 12, 2022 12:54

ifed01 commented May 12, 2022

@aclamk - I think I resolved most of your comments and responded to the rest; would you take another look?

@djgalloway djgalloway changed the base branch from master to main May 25, 2022 20:00
aclamk (Contributor) left a comment

Strangely, I found no errors.
I wonder how many times you rechecked it yourself, Igor.

@ifed01 ifed01 force-pushed the wip-ifed-new-statfs-update branch from d088026 to 54ec858 Compare June 17, 2022 09:35
@ifed01 ifed01 requested a review from a team as a code owner June 17, 2022 09:35

ifed01 commented Jun 17, 2022

jenkins test make check

3 similar comments

ifed01 commented Jun 17, 2022

jenkins test make check


ifed01 commented Jun 20, 2022

jenkins test make check


ifed01 commented Jun 20, 2022

jenkins test make check

@ifed01 ifed01 force-pushed the wip-ifed-new-statfs-update branch 2 times, most recently from 7286b73 to 350bbf0 Compare July 4, 2022 11:49

ifed01 commented Jul 8, 2022

jenkins test make check


ifed01 commented Aug 11, 2022

@aclamk - mind taking another look?

aclamk (Contributor) left a comment

I thought I already approved this.
I already have some cleanup work that depends on this one.


sseshasa commented Sep 5, 2022

Teuthology Test Result
http://pulpito.front.sepia.ceph.com/?branch=wip-yuri5-testing-2022-08-18-0812

@ifed01 Across both the runs from the link above, the following failed jobIDs appear to be related to this PR.

The tests are thrashosds-related, and in each case the failure is that an OSD does not come up after a restart because of fsck errors reported by _fsck_check_statfs(). Could you please take a look?

Some logs from JobID: 6978863

...
2022-08-18T18:28:09.247+0000 7f83ac952440 10 bluefs _read h 0x5580b7c03500 0x7faa~8000 from file(ino 117 size 0xb79d mtime 2022-08-18T18:27:19.033067+0000 allocated 10000 alloc_commit 10000 extents [1:0x60000~10000]) prefetch
2022-08-18T18:28:09.247+0000 7f83ac952440 20 bluefs _read reaching (or past) eof, len clipped to 0x37f3
2022-08-18T18:28:09.247+0000 7f83ac952440 20 bluefs _read left 0x4056 len 0x37f3
2022-08-18T18:28:09.247+0000 7f83ac952440 20 bluefs _read got 14323
2022-08-18T18:28:09.248+0000 7f83902d9700 20 bluestore.MempoolThread(0x5580b7c26b38) _resize_shards cache_size: 563370167 kv_alloc: 197132288 kv_used: 204864 kv_onode_alloc: 142606336 kv_onode_used: 7497280 meta_alloc: 130023424 meta_used: 136 data_alloc: 75497472 data_used: 0
2022-08-18T18:28:09.256+0000 7f83ac952440  1 bluestore(/var/lib/ceph/osd/ceph-3) _fsck_on_open checking pool_statfs
2022-08-18T18:28:09.256+0000 7f83ac952440 -1 bluestore(/var/lib/ceph/osd/ceph-3) _fsck_check_statfs::fsck error: pool 3 has got no statfs to match against: store_statfs(0x0/0x0/0x0, data 0xeb9d000/0x975a000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)

...
...

2022-08-18T18:28:09.264+0000 7f83ac952440  1 bluestore(/var/lib/ceph/osd/ceph-3) _fsck_on_open checking deferred events
2022-08-18T18:28:09.264+0000 7f83ac952440  2 bluestore(/var/lib/ceph/osd/ceph-3) _fsck_on_open 464 objects, 0 of them sharded.
2022-08-18T18:28:09.264+0000 7f83ac952440  2 bluestore(/var/lib/ceph/osd/ceph-3) _fsck_on_open 692 extents to 692 blobs, 0 spanning, 0 shared.
2022-08-18T18:28:09.264+0000 7f83ac952440  1 bluestore(/var/lib/ceph/osd/ceph-3) _fsck_on_open <<<FINISH>>> with 1 errors, 0 warnings, 0 repaired, 1 remaining in 0.036685 seconds

...

2022-08-18T18:28:09.265+0000 7f83ac952440  1 bluefs umount
2022-08-18T18:28:09.265+0000 7f83ac952440 10 bluefs sync_metadata - no pending log events
2022-08-18T18:28:09.265+0000 7f83ac952440 10 bluefs _drain_writer 0x5580b7c40600 type 0
2022-08-18T18:28:09.265+0000 7f83ac952440 20 bluefs _stop_alloc
2022-08-18T18:28:09.265+0000 7f83ac952440  1 bdev(0x5580b89a1c00 /var/lib/ceph/osd/ceph-3/block) close
2022-08-18T18:28:09.462+0000 7f83ac952440 10 bluestore(/var/lib/ceph/osd/ceph-3) _close_fm
2022-08-18T18:28:09.462+0000 7f83ac952440  1 freelist shutdown
2022-08-18T18:28:09.462+0000 7f83ac952440  1 bdev(0x5580b89a0400 /var/lib/ceph/osd/ceph-3/block) close
2022-08-18T18:28:09.681+0000 7f83ac952440 -1 bluestore(/var/lib/ceph/osd/ceph-3) _mount fsck found 1 errors
2022-08-18T18:28:09.681+0000 7f83ac952440 -1 osd.3 0 OSD:init: unable to mount object store
2022-08-18T18:28:09.681+0000 7f83ac952440 -1 ^[[0;31m ** ERROR: osd init failed: (5) Input/output error^[[0m


yuriw commented Sep 5, 2022

Per @sseshasa

"@yuriweinstein I looked into the pending jobIDs (as mentioned by @Nehaojha ) and all the other failures reported in the re-run (http://pulpito.front.sepia.ceph.com/lflores-2022-08-19_21:39:29-rados-wip-yuri5-testing-2022-08-18-0812-distro-default-smithi/).

Except for PR: os/bluestore: get rid off statfs update on each txc, the rest
of the PRs are Rados approved."

ifed01 and others added 10 commits October 3, 2022 16:09
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
This implements a basis for statfs recovery from persistent Onode metadata,
plus some redesign to make the procedure more lightweight and performant
by avoiding a full Onode rebuild.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Refine the actions taken in close_db_environment. Its role is to close the
DB handle and environment when the DB was used in special modes (repair/reshard)
and is not actually open for typical r/w.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
Signed-off-by: Igor Fedotov <ifedotov@croit.io>
@ifed01 ifed01 force-pushed the wip-ifed-new-statfs-update branch from bef157e to 18cc766 Compare October 4, 2022 13:33
@ifed01 ifed01 removed the TESTED label Oct 4, 2022
ljflores (Member):

Rados suite review: http://pulpito.front.sepia.ceph.com/?branch=wip-yuri7-testing-2022-10-17-0814

Failures (unrelated):
1. https://tracker.ceph.com/issues/57311
2. https://tracker.ceph.com/issues/52321
3. https://tracker.ceph.com/issues/52657
4. https://tracker.ceph.com/issues/57935

Details:
1. rook: ensure CRDs are installed first - Ceph - Orchestrator
2. qa/tasks/rook times out: 'check osd count' reached maximum tries (90) after waiting for 900 seconds - Ceph - Orchestrator
2. rados/thrash-erasure-code: wait_for_recovery timeout due to "active+clean+remapped+laggy" pgs - Ceph - RADOS
3. MOSDPGLog::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_NAUTILUS)' - Ceph - RADOS
4. all test jobs get stuck at "Running task ansible.cephlab..." - Infrastructure - Sepia

@ljflores ljflores merged commit b0d73c5 into ceph:main Nov 10, 2022
@ifed01 ifed01 deleted the wip-ifed-new-statfs-update branch November 10, 2022 21:01
ifed01 added a commit to ifed01/ceph that referenced this pull request Feb 10, 2023
readability.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 3df4a8d)

 Conflicts:
	src/os/bluestore/BlueStore.cc
	src/os/bluestore/BlueStore.h
 <lack of ceph#46036 backporting>
ifed01 added a commit to ifed01/ceph that referenced this pull request Feb 10, 2023
This should eliminate duplicate onode releases that could happen before.
Additionally onode pinning is performed during cache trimming not onode
ref count increment.

[Hopefully] fixes: https://tracker.ceph.com/issues/53002

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit a3057f4)

 Conflicts:
	src/os/bluestore/BlueStore.cc
	src/os/bluestore/BlueStore.h
 <lack of ceph#46036 and
  ceph#43299 backporting>
mkogan1 pushed a commit to mkogan1/ceph that referenced this pull request Mar 25, 2024
readability.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 3df4a8d)

 Conflicts:
	src/os/bluestore/BlueStore.cc
	src/os/bluestore/BlueStore.h
 <lack of ceph#46036 backporting>

(cherry picked from commit 816b4fc)

Resolves: rhbz#2218445
mkogan1 pushed a commit to mkogan1/ceph that referenced this pull request Mar 25, 2024
This should eliminate duplicate onode releases that could happen before.
Additionally onode pinning is performed during cache trimming not onode
ref count increment.

[Hopefully] fixes: https://tracker.ceph.com/issues/53002

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit a3057f4)

 Conflicts:
	src/os/bluestore/BlueStore.cc
	src/os/bluestore/BlueStore.h
 <lack of ceph#46036 and
  ceph#43299 backporting>

(cherry picked from commit 4a80641)

Resolves: rhbz#2218445