
os/bluestore: replicate bdev label #53095

Closed
pereman2 wants to merge 13 commits into ceph:main from rhcs-dashboard:bluestore-replicate-label

Conversation

@pereman2
Contributor

Currently this is WIP: I'm working on replicating BlueStore's label without disturbing existing allocations at the offsets where the replicated label will be placed, in case of backports.

The idea is to replicate BlueStore's label at several offsets: 0, 1 GiB, 10 GiB and 1 TiB. Currently only /block is being replicated, but the plan is to extend this to BlueFS if possible.

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@pereman2 pereman2 requested a review from aclamk August 23, 2023 14:35
@pereman2 pereman2 marked this pull request as ready for review August 28, 2023 12:32
@pereman2 pereman2 requested a review from a team as a code owner August 28, 2023 12:32
@github-actions

github-actions bot commented Sep 6, 2023

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

pereman2 and others added 10 commits September 13, 2023 14:44
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Instead of replicating on every `_write_bdev_label` write, we
replicate on mkfs and on mount.

`_replicate_bdev_label` will try to replicate the label, first checking
that the allocator hasn't allocated anything at the offset we want to
replicate to.

`read_bdev_label` will read all offsets where labels might be found, as
it expects at least one to be OK. If the label wasn't replicated and we
read an offset other than 0, we assume the read will fail the CRC check.
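The read-side fallback described in this commit message can be sketched as follows. This is a simplified illustration, not the actual BlueStore code: `Label`, `label_csum`, and `read_first_valid_label` are stand-ins (the real label carries more fields and is checksummed with crc32c over its encoding).

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Simplified stand-in for bluestore_bdev_label_t (assumption: the real
// label has more fields and uses ceph's crc32c over the encoded bytes).
struct Label {
  uint64_t size = 0;
  uint32_t crc = 0;
};

// Toy checksum standing in for crc32c of the encoded label.
inline uint32_t label_csum(const Label& l) {
  return static_cast<uint32_t>(l.size * 2654435761u + 0x9e3779b9u);
}

// Mirrors the described behaviour of read_bdev_label: try every candidate
// offset and accept the first label whose checksum verifies. On upgraded
// OSDs the extra offsets may hold old data, so a crc mismatch there is
// expected rather than fatal.
std::optional<Label> read_first_valid_label(
    const std::vector<std::pair<uint64_t, Label>>& on_disk) {
  for (const auto& [offset, label] : on_disk) {
    if (label.crc == label_csum(label)) {
      return label;  // at least one replica decoded and passed crc
    }
  }
  return std::nullopt;  // every candidate failed: treat as a read error
}
```

The design consequence, discussed later in this thread, is that only the copy at offset 0 is authoritative in the committed code; the sketch above just shows the "accept the first replica that verifies" fallback.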

Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Always run init_rm_free on the expected offsets at mount and fsck so that
the allocator doesn't try to use those chunks.

Signed-off-by: Pere Diaz Bou <pdiabou@redhat.com>
@pereman2 pereman2 force-pushed the bluestore-replicate-label branch from 3ff2ac5 to ce05964 on September 14, 2023 13:04
Signed-off-by: Pere Diaz Bou <pdiabou@redhat.com>
}
}

int BlueStore::_replicate_bdev_label(const string &path, bluestore_bdev_label_t label, bool alloc_available) {
Contributor

Why is _replicate_bdev_label separate from _write_bdev_label?
All of them should always contain the same data.
I think the additional label locations should be handled inside _write_bdev_label().

Contributor Author

_write_bdev_label is static, in contrast to _replicate_bdev_label, which isn't, because it uses alloc.

Contributor

IMO it would be better, then, to have _write_bdev_label be responsible specifically for writing labels to multiple destinations, and to introduce another function to tag the label's extents as unallocated.

Signed-off-by: Pere Diaz Bou <pdiabou@redhat.com>
@pereman2
Contributor Author

jenkins test make check


// Offsets where the label may be replicated. On previous versions these
// offsets may already have been allocated, so labels won't necessarily exist there.
const vector<uint64_t> bdev_label_offsets = {0, 1073741824, 107374182400, 1099511627776};
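As a side note on readability, the hard-coded literals decode to 0, 1 GiB, 100 GiB and 1 TiB; writing them as shifts makes that explicit. (Note that the PR description says "10GiB", but the third literal is actually 100 GiB.) A sketch:

```cpp
#include <cstdint>
#include <vector>

// Same values as the reviewed bdev_label_offsets, expressed as shifts
// so the intended magnitudes are visible at a glance.
constexpr uint64_t GiB = 1ull << 30;
constexpr uint64_t TiB = 1ull << 40;
const std::vector<uint64_t> bdev_label_offsets = {0, 1 * GiB, 100 * GiB, 1 * TiB};
```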
Contributor

what's about making this configurable through regular ceph settings?

Contributor Author

@aclamk What do you think about this?

Contributor

Ceph settings can change, but to locate bdev label copies it is best to have them at fixed, predictable locations.
I don't think making this configurable brings additional value.

@aclamk
Contributor

aclamk commented Sep 27, 2023

I feel like there will be a problem when we upgrade from a previous version.
We should have something in the bdev label that tells us that writing multiple label copies is expected.
That way, upgraded OSDs will work as before and only new deployments will have the multiple copies.

Signed-off-by: Pere Diaz Bou <pdiabou@redhat.com>
@pereman2
Contributor Author

pereman2 commented Oct 6, 2023

jenkins test make check

@ifed01
Contributor

ifed01 commented Dec 13, 2023

@pereman2 - could you please provide some design overview behind your implementation so we can discuss the approach.

I've been thinking about something like the following but as far as I can see your approach is rather different:

  1. Introduce epoch field to bdev label starting at 1. Epoch is just a counter incremented on each writing to bdev. Legacy labels to have zero epoch which is constant.
  2. on mkfs - labels are written to multiple predefined locations with epoch = 1
  3. on (first?) label reading - we read the label0 at offset 0, if epoch is zero - fall back to legacy single label mode. If label has got a non-zero epoch or label0 is corrupted (undecodable/bad crc/wrong cluster id/disk read error/etc) - read labels from all the other potential locations and note valid/corrupted (as per above) ones. Among the valid labels the latest one is selected as active given epoch field and all further reads occur against it. If multilabel mode has been discovered - all the corrupted labels to be reported to log and possibly raise an OSD alert. If no(!) valid labels have been found - OSD fails to start.
  4. Valid active label to be kept in RAM rather than re-read from disk each time, which is different from what we have had so far IIRC. Alternatively, we can repeat the above procedure on each read request, but IMO this could spoil the OSD log in case of any troubles.
  5. on mount - if multilabel mode is discovered - unconditionally tag all label locations as non-free in allocator.
  6. on (or possibly after) umount - if multilabel mode has been in use - mark label locations as free as we do for BlueFS and persist allocator map in this state.
  7. on label write - if new multilabel mode has been determined - write multiple labels using the active label instance (+updates) values and incremented epoch. Disk write errors to be reported to OSD log (and may be even as OSD alert) but shouldn't impact OSD functioning unless all the labels are unwritable. In legacy mode - just a single label0 to be written with zero epoch.

Alternatively (and apparently more in line with your approach) we can fail at step 3 if some label is invalid and request fsck to fix it, but there is a drawback to this approach: some label locations could be permanently damaged due to e.g. disk errors, which will effectively prevent the OSD from starting as the loop (read - error - fsck - read - error - ...) occurs.
@aclamk - your thoughts are welcome too ;)
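Step 3 of this proposal (pick the latest valid replica by epoch) can be sketched as below. The types are illustrative stand-ins, not the real bluestore_bdev_label_t; `valid` condenses "decodable, crc ok, cluster id matches, no disk read error":

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical label with the epoch field proposed in step 1;
// epoch 0 would mark a legacy single-label deployment.
struct EpochLabel {
  uint64_t epoch = 0;
  bool valid = false;  // stands in for "decodable, crc ok, uuid matches"
};

// Among all valid replicas, the one with the highest epoch becomes the
// active label; if none is valid the OSD refuses to start (nullopt here).
std::optional<EpochLabel> select_active_label(
    const std::vector<EpochLabel>& replicas) {
  std::optional<EpochLabel> best;
  for (const auto& l : replicas) {
    if (!l.valid) continue;  // corrupted replica: report/alert, keep going
    if (!best || l.epoch > best->epoch) best = l;
  }
  return best;
}
```

The key property versus the committed approach is that no single offset (including offset 0) is a single point of failure: any surviving replica with the highest epoch wins.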

void BlueStore::_unalloc_bdev_label_offset(uint64_t offset) {
ceph_assert(alloc);
bool existing_allocation_in_offset = true;
alloc->foreach([&](uint64_t extent_offset, uint64_t extent_length) {
Contributor

This could be quite a time-consuming operation on a highly fragmented modern HDD.
I suggest introducing a virtual bool Allocator::is_allocated(offset [, len]) method and implementing it in the descendants to perform this efficiently.
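A sketch of the suggested interface. `is_allocated` is hypothetical (it does not exist on `Allocator` in this PR), and `MapAllocator` is a toy descendant backed by an ordered free-extent map that answers in O(log n) instead of iterating every extent via `foreach()`:

```cpp
#include <cstdint>
#include <map>

// Hypothetical addition to the Allocator interface suggested above.
class Allocator {
public:
  virtual ~Allocator() = default;
  virtual bool is_allocated(uint64_t offset, uint64_t len) = 0;
};

class MapAllocator : public Allocator {
  // free extents: offset -> length
  std::map<uint64_t, uint64_t> free_;
public:
  void add_free(uint64_t off, uint64_t len) { free_[off] = len; }
  bool is_allocated(uint64_t offset, uint64_t len) override {
    // find the free extent starting at or before `offset`
    auto it = free_.upper_bound(offset);
    if (it == free_.begin()) return true;  // no free extent covers it
    --it;
    // allocated unless [offset, offset+len) lies wholly inside a free extent
    return !(it->first <= offset && offset + len <= it->first + it->second);
  }
};
```

Real allocators (stupid, bitmap, avl, hybrid) keep their free space in different structures, which is why the reviewers note below that maintaining this in every descendant takes effort.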

Contributor

And most likely we can avoid this altogether if we rely on the bdev content to determine the new multilabel mode and unconditionally mark the relevant locations as non-free on mount.

Contributor Author

Agree, I missed a procedure like that. How hard would it be to implement in all allocators? Are there some that we could skip?

Contributor

This is doable, but it certainly requires some effort to maintain this functionality in every allocator (and to cover it with proper testing). Let's postpone this until we agree on the overall design; maybe we won't need it at all.

@pereman2
Contributor Author

> @pereman2 - could you please provide some design overview behind your implementation so we can discuss the approach. [quotes the full design proposal above]

In general, what you are proposing looks more or less like the first of our many ideas. The key idea behind the committed changes is to only deal with multiple labels on upgraded OSDs. If you upgrade and mount again, the replicated_label boolean is checked (without rerunning mkfs) to see whether the OSD was mounted with multiple labels -- we could change that to an epoch just fine.

Right now, when we mount, we read the first label (replicated_label == 1 || epoch == 1); if it's corrupted we just don't mount and report it, so that a later fsck repair can take care of fixing the corrupted labels. In summary, we decided not to mount BlueStore if the first label is corrupted.

Contributor

@ifed01 ifed01 left a comment


I'm still skeptical about the overall design - primarily, I don't like the extent at offset zero being a keystone: a persistent disk error there will make the OSD unrecoverable, which IMO is a design flaw. And that's an issue I have actually observed a few times in the field.

Persistent disk errors at the other label locations aren't well handled for now either, IMO. That could be fixed by a proper implementation, though.

Apart from that, there are a few label corruption/recovery scenarios which aren't properly implemented either - an undecodable label or mismatched label fields are examples. Definitely more thorough unit testing is required.

And finally I'd really like to have more options to control the feature:

  1. Enable/disable label replication on mkfs. This will help both in working around potential issues in the field and in getting better testing coverage - e.g. we can't implement "upgrade" test cases in store_test right now.
  2. Enable/disable label replication in _write_bdev_label/_replicate_bdev_label even when it was originally enabled on mkfs. Might be helpful to work around potential implementation issues in the field.

if (recovery_mode) {
for (auto &label1 : labels) {
for (auto &label2 : labels) {
if (label1.osd_uuid != label2.osd_uuid) {
Contributor

Let's have a bluestore_bdev_label_t::compare method or something to compare labels, just to keep this logic close to the class definition; otherwise, changes to the label content will be error-prone.
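One way to follow this suggestion is an equality operator on the type itself, so that adding a field forces the comparison to be revisited in one place. This is a sketch with an illustrative field subset, not the full bluestore_bdev_label_t:

```cpp
#include <cstdint>
#include <string>
#include <tuple>

// Illustrative subset of bluestore_bdev_label_t fields; the real struct
// has more members (btime, meta map, ...) which would join the std::tie.
struct bluestore_bdev_label_t {
  std::string osd_uuid;
  uint64_t size = 0;
  std::string description;

  bool operator==(const bluestore_bdev_label_t& o) const {
    return std::tie(osd_uuid, size, description) ==
           std::tie(o.osd_uuid, o.size, o.description);
  }
};
```

The recovery loop in the reviewed hunk could then compare whole labels with `!(label1 == label2)` instead of checking osd_uuid alone.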

ASSERT_EQ(0, r);
}

TEST_P(StoreTest, CorruptedLabelFsck) {
Contributor

This is a very limited scenario for testing the functionality: the provided overwrite causes a checksum mismatch in the first label only.
Making a label undecodable, or making label comparison fail (with proper checksums), will cause completely different behavior - most likely an unrecoverable OSD state.
E.g. try increasing the bl size by calling:
bl.append_zero(100);

return r;
}
} else {
int r = _read_bdev_label(cct, p, &label, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the error conditions in this function. Once one of them is met, no recovery will happen, as you return a few lines below. Please also see my comment on the new test case in store_test.cc - it simply doesn't cover all the possible cases, hence you miss the error in this path.

int r = 0;
for (uint64_t offset : bdev_label_offsets) {
r = bl.write_fd(fd, offset);
if (r < 0) {
Contributor

So failing to write out a specific label (e.g. due to a persistent disk block error) will make it impossible to "fix" an OSD and e.g. use it for data recovery. I don't think that's our expectation for this new feature.
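A more tolerant policy, in the spirit of this comment, would record per-offset failures and only fail overall when every replica is unwritable. This sketch is not the PR's code; `write_at` is a hypothetical callback standing in for `bl.write_fd(fd, offset)`:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Write the label to every candidate offset; tolerate individual
// failures and fail overall only if no replica could be written.
int write_all_labels(const std::vector<uint64_t>& offsets,
                     const std::function<int(uint64_t)>& write_at) {
  int last_err = 0;
  std::size_t ok = 0;
  for (uint64_t offset : offsets) {
    int r = write_at(offset);
    if (r < 0) {
      last_err = r;  // real code would also log / raise an OSD alert here
      continue;      // keep trying the remaining replicas
    }
    ++ok;
  }
  return ok > 0 ? 0 : last_err;  // fail only if no replica was written
}
```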

string p = path + "/block";
bluestore_bdev_label_t label;
_read_bdev_label(cct, p, &label);
_replicate_bdev_label(p, label);
Contributor

The DB is in read-write mode at this point, hence there is no guarantee that there were no allocations on the main device which could be corrupted by label replication. Additionally, this unconditional replication is in fact pretty error-prone: future changes to the mkfs procedure could cause the selected locations to be occupied, and this function would silently corrupt the data there.

if (offset >= bdev->get_size()) {
break;
}
_unalloc_bdev_label_offset(offset);
Contributor

Similarly to mkfs - it's too late to tag label extents as allocated at this point; the DB has already been opened R/W and someone could have written data there. As a result the labels will be corrupted, and one will need to repair them at the next OSD start, causing new corruption of DB data or something else.

string value;
int r = read_meta("replicated_label", &value);
if (r == 0 && value == "yes") {
int r = _replicate_bdev_label(p, label, false);
Contributor

Please clarify which label we use here as the source for the recovery. Presumably the one we managed to read successfully above. But what if there exist two different but both valid labels, e.g. after an incomplete previous replication - which one do we choose? IMO that's another design flaw. We really need an epoch/timestamp/seqno in the label to distinguish the latest replica.

@github-actions

github-actions bot commented Feb 6, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions

github-actions bot commented Apr 6, 2024

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Apr 6, 2024
@github-actions

github-actions bot commented May 6, 2024

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this May 6, 2024