
os/bluestore: replicate bdev label #53095

Closed
pereman2 wants to merge 13 commits into ceph:main from rhcs-dashboard:bluestore-replicate-label

Conversation

@pereman2
Contributor

Currently this is WIP: I'm working on replicating BlueStore's label without disturbing existing allocations at the offsets where the replicated label will be placed, in case of backports.

The idea is to replicate BlueStore's label at several offsets: 0, 1 GiB, 10 GiB and 1 TiB. Currently only /block is being replicated, but the plan is to extend this to BlueFS if possible.

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@pereman2 pereman2 requested a review from aclamk August 23, 2023 14:35
@pereman2 pereman2 marked this pull request as ready for review August 28, 2023 12:32
@pereman2 pereman2 requested a review from a team as a code owner August 28, 2023 12:32
@github-actions

github-actions bot commented Sep 6, 2023

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

pereman2 and others added 10 commits September 13, 2023 14:44
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Instead of replicating on every `_write_bdev_label` write, we
replicate on mkfs and on mount.

`_replicate_bdev_label` will try to replicate the label, first checking
that the allocator hasn't allocated anything at the offset we want to
replicate to.

`read_bdev_label` will read all offsets where labels might be found, as
it expects at least one to be OK. If the label wasn't replicated and we
read an offset other than 0, we assume the read will fail the CRC check.
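The read-side fallback described in this commit message can be sketched as follows. This is a simplified illustration, not the actual BlueStore code: `Label`, `label_csum`, and `read_first_valid_label` are stand-ins (the real label carries more fields and is checksummed with crc32c over its encoding).

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Simplified stand-in for bluestore_bdev_label_t (assumption: the real
// label has more fields and uses ceph's crc32c over the encoded bytes).
struct Label {
  uint64_t size = 0;
  uint32_t crc = 0;
};

// Toy checksum standing in for crc32c of the encoded label.
inline uint32_t label_csum(const Label& l) {
  return static_cast<uint32_t>(l.size * 2654435761u + 0x9e3779b9u);
}

// Mirrors the described behaviour of read_bdev_label: try every candidate
// offset and accept the first label whose checksum verifies. On upgraded
// OSDs the extra offsets may hold old data, so a crc mismatch there is
// expected rather than fatal.
std::optional<Label> read_first_valid_label(
    const std::vector<std::pair<uint64_t, Label>>& on_disk) {
  for (const auto& [offset, label] : on_disk) {
    if (label.crc == label_csum(label)) {
      return label;  // at least one replica decoded and passed crc
    }
  }
  return std::nullopt;  // every candidate failed: treat as a read error
}
```

The design consequence, discussed later in this thread, is that only the copy at offset 0 is authoritative in the committed code; the sketch above just shows the "accept the first replica that verifies" fallback.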

Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Always run init_rm_free on the expected offsets at mount and fsck so that
the allocator doesn't try to use those chunks.

Signed-off-by: Pere Diaz Bou <pdiabou@redhat.com>
@pereman2 pereman2 force-pushed the bluestore-replicate-label branch from 3ff2ac5 to ce05964 on September 14, 2023 13:04
Signed-off-by: Pere Diaz Bou <pdiabou@redhat.com>
}
}

int BlueStore::_replicate_bdev_label(const string &path, bluestore_bdev_label_t label, bool alloc_available) {
Contributor

Why is _replicate_bdev_label separate from _write_bdev_label?
All of them should always contain the same data.
I think the additional label locations should be handled inside _write_bdev_label().

Contributor Author

_write_bdev_label is static, in contrast to _replicate_bdev_label, which isn't, because it uses alloc.

Contributor

IMO it would be better, then, to have _write_bdev_label be responsible specifically for writing labels to multiple destinations, and to introduce another function to tag the label's extents as unallocated.

Signed-off-by: Pere Diaz Bou <pdiabou@redhat.com>
@pereman2
Contributor Author

jenkins test make check


// Offsets where the label may be replicated. On previous versions these
// offsets may already have been allocated, so labels won't necessarily exist there.
const vector<uint64_t> bdev_label_offsets = {0, 1073741824, 107374182400, 1099511627776};
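As a side note on readability, the hard-coded literals decode to 0, 1 GiB, 100 GiB and 1 TiB; writing them as shifts makes that explicit. (Note that the PR description says "10GiB", but the third literal is actually 100 GiB.) A sketch:

```cpp
#include <cstdint>
#include <vector>

// Same values as the reviewed bdev_label_offsets, expressed as shifts
// so the intended magnitudes are visible at a glance.
constexpr uint64_t GiB = 1ull << 30;
constexpr uint64_t TiB = 1ull << 40;
const std::vector<uint64_t> bdev_label_offsets = {0, 1 * GiB, 100 * GiB, 1 * TiB};
```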
Contributor

what's about making this configurable through regular ceph settings?

Contributor Author

@aclamk What do you think about this?

Contributor

Ceph settings can change, but to locate bdev label copies it is best to have them at fixed, predictable locations.
I don't think making this configurable brings additional value.

@aclamk
Contributor

aclamk commented Sep 27, 2023

I feel like there will be a problem when we upgrade from a previous version.
We should have something in the bdev label that tells us that writing multiple label copies is expected.
That way, upgraded OSDs will work as before and only new deployments will have the multiple copies.

Signed-off-by: Pere Diaz Bou <pdiabou@redhat.com>
@pereman2
Contributor Author

pereman2 commented Oct 6, 2023

jenkins test make check

@ifed01
Contributor

ifed01 commented Dec 13, 2023

@pereman2 - could you please provide some design overview behind your implementation so we can discuss the approach.

I've been thinking about something like the following but as far as I can see your approach is rather different:

  1. Introduce epoch field to bdev label starting at 1. Epoch is just a counter incremented on each writing to bdev. Legacy labels to have zero epoch which is constant.
  2. on mkfs - labels are written to multiple predefined locations with epoch = 1
  3. on (first?) label reading - we read the label0 at offset 0, if epoch is zero - fall back to legacy single label mode. If label has got a non-zero epoch or label0 is corrupted (undecodable/bad crc/wrong cluster id/disk read error/etc) - read labels from all the other potential locations and note valid/corrupted (as per above) ones. Among the valid labels the latest one is selected as active given epoch field and all further reads occur against it. If multilabel mode has been discovered - all the corrupted labels to be reported to log and possibly raise an OSD alert. If no(!) valid labels have been found - OSD fails to start.
  4. Valid active label to be kept in RAM rather than re-read from disk each time, which is different from what we have had so far IIRC. Alternatively, we can repeat the above procedure on each read request, but IMO this could spoil the OSD log in case of any troubles.
  5. on mount - if multilabel mode is discovered - unconditionally tag all label locations as non-free in allocator.
  6. on (or possibly after) umount - if multilabel mode has been in use - mark label locations as free as we do for BlueFS and persist allocator map in this state.
  7. on label write - if new multilabel mode has been determined - write multiple labels using the active label instance (+updates) values and incremented epoch. Disk write errors to be reported to OSD log (and may be even as OSD alert) but shouldn't impact OSD functioning unless all the labels are unwritable. In legacy mode - just a single label0 to be written with zero epoch.

Alternatively (and apparently more in line with your approach) we can fail at step 3 if some label is invalid and request fsck to fix it, but there is a drawback to this approach: some label locations could be permanently damaged due to e.g. disk errors, which will effectively prevent the OSD from starting as the loop (read - error - fsck - read - error - ...) occurs.
@aclamk - your thoughts are welcome too ;)
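Step 3 of this proposal (pick the latest valid replica by epoch) can be sketched as below. The types are illustrative stand-ins, not the real bluestore_bdev_label_t; `valid` condenses "decodable, crc ok, cluster id matches, no disk read error":

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical label with the epoch field proposed in step 1;
// epoch 0 would mark a legacy single-label deployment.
struct EpochLabel {
  uint64_t epoch = 0;
  bool valid = false;  // stands in for "decodable, crc ok, uuid matches"
};

// Among all valid replicas, the one with the highest epoch becomes the
// active label; if none is valid the OSD refuses to start (nullopt here).
std::optional<EpochLabel> select_active_label(
    const std::vector<EpochLabel>& replicas) {
  std::optional<EpochLabel> best;
  for (const auto& l : replicas) {
    if (!l.valid) continue;  // corrupted replica: report/alert, keep going
    if (!best || l.epoch > best->epoch) best = l;
  }
  return best;
}
```

The key property versus the committed approach is that no single offset (including offset 0) is a single point of failure: any surviving replica with the highest epoch wins.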

void BlueStore::_unalloc_bdev_label_offset(uint64_t offset) {
ceph_assert(alloc);
bool existing_allocation_in_offset = true;
alloc->foreach([&](uint64_t extent_offset, uint64_t extent_length) {
Contributor

This could be quite a time-consuming operation on a highly fragmented modern HDD.
I suggest introducing a virtual bool Allocator::is_allocated(offset [, len]) method and implementing it in the descendants to perform this efficiently.
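A sketch of the suggested interface. `is_allocated` is hypothetical (it does not exist on `Allocator` in this PR), and `MapAllocator` is a toy descendant backed by an ordered free-extent map that answers in O(log n) instead of iterating every extent via `foreach()`:

```cpp
#include <cstdint>
#include <map>

// Hypothetical addition to the Allocator interface suggested above.
class Allocator {
public:
  virtual ~Allocator() = default;
  virtual bool is_allocated(uint64_t offset, uint64_t len) = 0;
};

class MapAllocator : public Allocator {
  // free extents: offset -> length
  std::map<uint64_t, uint64_t> free_;
public:
  void add_free(uint64_t off, uint64_t len) { free_[off] = len; }
  bool is_allocated(uint64_t offset, uint64_t len) override {
    // find the free extent starting at or before `offset`
    auto it = free_.upper_bound(offset);
    if (it == free_.begin()) return true;  // no free extent covers it
    --it;
    // allocated unless [offset, offset+len) lies wholly inside a free extent
    return !(it->first <= offset && offset + len <= it->first + it->second);
  }
};
```

Real allocators (stupid, bitmap, avl, hybrid) keep their free space in different structures, which is why the reviewers note below that maintaining this in every descendant takes effort.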

Contributor

And most likely we can avoid this altogether if we rely on the bdev content to determine the new multilabel mode and unconditionally mark the relevant locations as non-free on mount.

Contributor Author

Agree, I missed a procedure like that. How hard would it be to implement in all allocators? Are there some that we could skip?

Contributor

This is doable, but it certainly requires some effort to maintain this functionality in every allocator (and to cover it with proper testing). Let's postpone this until we agree on the overall design; maybe we won't need it at all.

@pereman2
Contributor Author

> @pereman2 - could you please provide some design overview behind your implementation so we can discuss the approach. [quotes the full design proposal above]

In general, what you are proposing looks more or less like the first of our many ideas. The key idea behind the committed changes is to only deal with multiple labels on upgraded OSDs. If you upgrade and mount again, the replicated_label boolean is checked (without rerunning mkfs) to see whether the OSD was mounted with multiple labels -- we could change that to an epoch just fine.

Right now, when we mount, we read the first label (replicated_label == 1 || epoch == 1); if it's corrupted we just don't mount and report it, so that a later fsck repair can take care of fixing the corrupted labels. In summary, we decided not to mount BlueStore if the first label is corrupted.

Contributor

@ifed01 ifed01 left a comment


I'm still skeptical about the overall design - primarily, I don't like the extent at offset zero being a keystone: a persistent disk error there will make the OSD unrecoverable, which IMO is a design flaw. And that's an issue I have actually observed a few times in the field.

Persistent disk errors at the other label locations aren't well handled for now either, IMO. That could be fixed by a proper implementation, though.

Apart from that, there are a few label corruption/recovery scenarios which aren't properly implemented either - an undecodable label or mismatched label fields are examples. Definitely more thorough unit testing is required.

And finally I'd really like to have more options to control the feature:

  1. Enable/disable label replication on mkfs. This will help both in working around potential issues in the field and in getting better testing coverage - e.g. we can't implement "upgrade" test cases in store_test right now.
  2. Enable/disable label replication in _write_bdev_label/_replicate_bdev_label even when it was originally enabled on mkfs. Might be helpful to work around potential implementation issues in the field.

if (recovery_mode) {
for (auto &label1 : labels) {
for (auto &label2 : labels) {
if (label1.osd_uuid != label2.osd_uuid) {
Contributor

Let's have a bluestore_bdev_label_t::compare method or something to compare labels, just to keep this logic close to the class definition; otherwise, changes to the label content will be error-prone.
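One way to follow this suggestion is an equality operator on the type itself, so that adding a field forces the comparison to be revisited in one place. This is a sketch with an illustrative field subset, not the full bluestore_bdev_label_t:

```cpp
#include <cstdint>
#include <string>
#include <tuple>

// Illustrative subset of bluestore_bdev_label_t fields; the real struct
// has more members (btime, meta map, ...) which would join the std::tie.
struct bluestore_bdev_label_t {
  std::string osd_uuid;
  uint64_t size = 0;
  std::string description;

  bool operator==(const bluestore_bdev_label_t& o) const {
    return std::tie(osd_uuid, size, description) ==
           std::tie(o.osd_uuid, o.size, o.description);
  }
};
```

The recovery loop in the reviewed hunk could then compare whole labels with `!(label1 == label2)` instead of checking osd_uuid alone.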

ASSERT_EQ(0, r);
}

TEST_P(StoreTest, CorruptedLabelFsck) {
Contributor

This is a very limited scenario for testing the functionality: the provided overwrite causes a checksum mismatch in the first label only.
Making a label undecodable, or making label comparison fail (with proper checksums), will cause completely different behavior - most likely an unrecoverable OSD state.
E.g. try increasing the bl size by calling:
bl.append_zero(100);

return r;
}
} else {
int r = _read_bdev_label(cct, p, &label, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the error conditions in this function. Once one of them is met, no recovery will happen, as you return a few lines below. Please also see my comment on the new test case in store_test.cc - it simply doesn't cover all the possible cases, hence you miss the error in this path.

int r = 0;
for (uint64_t offset : bdev_label_offsets) {
r = bl.write_fd(fd, offset);
if (r < 0) {
Contributor

So failing to write out a specific label (e.g. due to a persistent disk block error) will make it impossible to "fix" an OSD and e.g. use it for data recovery. I don't think that's our expectation for this new feature.
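A more tolerant policy, in the spirit of this comment, would record per-offset failures and only fail overall when every replica is unwritable. This sketch is not the PR's code; `write_at` is a hypothetical callback standing in for `bl.write_fd(fd, offset)`:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Write the label to every candidate offset; tolerate individual
// failures and fail overall only if no replica could be written.
int write_all_labels(const std::vector<uint64_t>& offsets,
                     const std::function<int(uint64_t)>& write_at) {
  int last_err = 0;
  std::size_t ok = 0;
  for (uint64_t offset : offsets) {
    int r = write_at(offset);
    if (r < 0) {
      last_err = r;  // real code would also log / raise an OSD alert here
      continue;      // keep trying the remaining replicas
    }
    ++ok;
  }
  return ok > 0 ? 0 : last_err;  // fail only if no replica was written
}
```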

string p = path + "/block";
bluestore_bdev_label_t label;
_read_bdev_label(cct, p, &label);
_replicate_bdev_label(p, label);
Contributor

The DB is in read-write mode at this point, hence there is no guarantee that there were no allocations on the main device which could be corrupted by label replication. Additionally, this unconditional replication is in fact pretty error-prone: future changes to the mkfs procedure could cause the selected locations to be occupied, and this function would silently corrupt the data there.

if (offset >= bdev->get_size()) {
break;
}
_unalloc_bdev_label_offset(offset);
Contributor

Similarly to mkfs - it's too late to tag label extents as allocated at this point; the DB has already been opened R/W and someone could have written data there. As a result the labels will be corrupted, and one will need to repair them at the next OSD start, causing new corruption of DB data or something else.

string value;
int r = read_meta("replicated_label", &value);
if (r == 0 && value == "yes") {
int r = _replicate_bdev_label(p, label, false);
Contributor

Please clarify which label we use here as the source for the recovery. Presumably the one we managed to read successfully above. But what if there exist two different but both valid labels, e.g. after an incomplete previous replication - which one do we choose? IMO that's another design flaw. We really need an epoch/timestamp/seqno in the label to distinguish the latest replica.

@github-actions

github-actions bot commented Feb 6, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions

github-actions bot commented Apr 6, 2024

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Apr 6, 2024
@github-actions

github-actions bot commented May 6, 2024

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this May 6, 2024