
os/bluestore: Multiple bdev labels on main block device#55374

Merged
aclamk merged 45 commits into ceph:main from aclamk:wip-aclamk-bs-multi-label
Aug 7, 2024

Conversation

aclamk (Contributor) commented Jan 30, 2024

Corruption of the bdev label makes it very hard to recover an OSD.
This PR copies the bdev label to 4 additional potential replica locations on the device:
0 (original), 1GB, 10GB, 100GB, 1000GB.

If the replicas do not all match, the system refuses to start.
Fsck (repair mode) is the way to fix it.

This is an alternative version for #53095. Borrows many concepts and some code from it.
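The replica placement above depends on device size: a copy is only written where its offset actually fits on the device. A minimal sketch of that selection logic (hypothetical helper names, not the PR's actual code; assumes GiB-based offsets and a 4 KiB label footprint):

```cpp
#include <cstdint>
#include <vector>

// Candidate bdev label offsets used by this PR: 0, 1G, 10G, 100G, 1000G.
// Illustrative sketch; the real logic lives in BlueStore's label handling.
inline std::vector<uint64_t> candidate_label_offsets() {
  const uint64_t G = uint64_t(1) << 30;
  return {0, 1 * G, 10 * G, 100 * G, 1000 * G};
}

// Keep only the offsets where a label copy fits on the device.
inline std::vector<uint64_t> valid_label_offsets(uint64_t dev_size) {
  const uint64_t label_size = 4096;  // assumed label footprint
  std::vector<uint64_t> out;
  for (uint64_t off : candidate_label_offsets())
    if (off + label_size <= dev_size)
      out.push_back(off);
  return out;
}
```

For example, a 100 GiB device can hold only the copies at 0, 1 GiB, and 10 GiB, while a 4 TiB device holds all five.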

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@aclamk aclamk requested a review from a team as a code owner January 30, 2024 13:33
@aclamk aclamk requested review from pereman2 and removed request for a team January 30, 2024 13:33
int _check_or_set_bdev_label(std::string path, uint64_t size, std::string desc,
                             bool create);
int _check_or_set_main_bdev_label(
    std::string path,
A reviewer (Contributor) commented:

Suggested change
std::string path,
std::string& path,

aclamk (Contributor, Author) replied:

done.

Comment on lines +5038 to 6034
if (bdev_label_valid_locations.empty()) {
_read_main_bdev_label(cct, p, &bdev_label,
&bdev_label_valid_locations, &bdev_label_multi, &bdev_label_epoch);
}
if (!bdev_label_valid_locations.empty()) {
bdev_label.meta[key] = value;
if (bdev_label_multi) {
bdev_label_epoch++;
bdev_label.meta["epoch"] = std::to_string(bdev_label_epoch);
}
int r = _write_bdev_label(cct, p, bdev_label, bdev_label_valid_locations);
ceph_assert(r == 0);
}
label.meta[key] = value;
r = _write_bdev_label(cct, p, label);
ceph_assert(r == 0);
return ObjectStore::write_meta(key, value);
}
A reviewer (Contributor) commented:

What if bdev_label_valid_locations.empty() == true?

aclamk (Contributor, Author) replied:

We need to skip writing the bdev label if no bdev label was available, because mkfs writes "type=bluestore" before the bdev label is created.
Granted, it's weird.
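The flow discussed in this thread can be sketched as follows (illustrative names only, not the PR's actual code): skip the write entirely when no valid label location is known yet (the mkfs case), otherwise set the meta key and, for multi-copy labels, bump the epoch so the newest replicas can be told apart from stale ones.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the on-disk label; names are illustrative.
struct BdevLabel {
  std::map<std::string, std::string> meta;
};

// Returns false when there is no label on disk yet (nothing to update);
// otherwise updates the key and bumps the epoch for multi-copy labels.
bool update_label_meta(BdevLabel& label,
                       const std::vector<uint64_t>& valid_locations,
                       bool multi, uint64_t& epoch,
                       const std::string& key, const std::string& value) {
  if (valid_locations.empty())
    return false;  // mkfs case: label not created yet, skip the write
  label.meta[key] = value;
  if (multi) {
    ++epoch;
    label.meta["epoch"] = std::to_string(epoch);
  }
  return true;  // caller would now rewrite all valid replica locations
}
```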


aclamk commented Jan 31, 2024

jenkins test make check


github-actions bot commented Feb 2, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@aclamk aclamk requested a review from pereman2 February 4, 2024 10:26
@aclamk aclamk force-pushed the wip-aclamk-bs-multi-label branch from 5d754bc to a0530b1 Compare February 5, 2024 12:27
@aclamk aclamk force-pushed the wip-aclamk-bs-multi-label branch from a0530b1 to b5356ba Compare February 5, 2024 23:38

github-actions bot commented Feb 6, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

}
}
// Mark bits or locations of all bdev labels.
for (size_t i = 0; i < bdev_label_positions.size(); i++) {
A reviewer (Contributor) commented:

shouldn't this go over bdev_labels_in_repair instead?

aclamk (Contributor, Author) replied:

No, we can have bdev label locations that hold proper data but collide with some object.

The reviewer (Contributor) replied:

Well, the comment above is a bit confusing then: in fact you mark all possible bdev label locations here, irrespective of their BlueFS usage.

aclamk (Contributor, Author) replied:

I have a completely different comment here now. Let's check after the push.
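The point of the thread above is that repair marks every candidate label position as reserved, whether or not that particular copy is currently valid, since even an invalid location may collide with object data. A minimal sketch of that idea over a per-allocation-unit bitmap (names and the bitmap type are illustrative, not the PR's actual code):

```cpp
#include <cstdint>
#include <vector>

// Mark every candidate bdev label position as used in a simple bitmap
// over allocation units, regardless of which label copies are valid.
void mark_label_positions(std::vector<bool>& used_au,
                          uint64_t au_size,
                          const std::vector<uint64_t>& label_positions,
                          uint64_t label_size) {
  for (uint64_t pos : label_positions) {
    uint64_t first = pos / au_size;
    uint64_t last = (pos + label_size - 1) / au_size;
    for (uint64_t au = first; au <= last && au < used_au.size(); ++au)
      used_au[au] = true;
  }
}
```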

@aclamk aclamk force-pushed the wip-aclamk-bs-multi-label branch from df3e95d to 06bfcba Compare February 8, 2024 23:55

aclamk commented Feb 9, 2024

jenkins test make check

@aclamk aclamk force-pushed the wip-aclamk-bs-multi-label branch 2 times, most recently from 603ec0e to aad1fb8 Compare February 13, 2024 14:32

aclamk commented Feb 20, 2024

github-actions bot commented:

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@aclamk aclamk force-pushed the wip-aclamk-bs-multi-label branch from 4ea3ecb to 811f297 Compare July 22, 2024 12:45
@aclamk aclamk added the aclamk-testing-phoebe bluestore testing label Jul 22, 2024


yuriw commented Aug 4, 2024

jenkins test this please


yuriw commented Aug 4, 2024

@aclamk this was tested and approved, see: https://tracker.ceph.com/issues/67266
Please merge at will.


aclamk commented Aug 6, 2024

jenkins test api

(2 similar "jenkins test api" comments followed.)

@aclamk aclamk merged commit 7bcb68e into ceph:main Aug 7, 2024
guits added a commit to guits/ceph that referenced this pull request Sep 11, 2024
BlueStore now writes its metadata at multiple offsets on devices [1].
This means `ceph-volume lvm zap` no longer removes the BlueStore signature entirely.
This can confuse ceph-volume when redeploying an OSD on a previously
zapped device, because old BlueStore metadata is still present on it.

ceph-volume should call `ceph-bluestore-tool zap-device` [2]
in addition to the existing calls when wiping a device.

[1] ceph#55374
[2] ceph#59632

Fixes: https://tracker.ceph.com/issues/68035

Signed-off-by: Guillaume Abrioux <gabrioux@ibm.com>
guits added a commit to guits/ceph that referenced this pull request Sep 11, 2024
guits added a commit to guits/ceph that referenced this pull request Sep 13, 2024
guits added a commit to guits/ceph that referenced this pull request Sep 13, 2024
guits added a commit to guits/ceph that referenced this pull request Sep 13, 2024
guits added a commit to guits/ceph that referenced this pull request Sep 25, 2024 (cherry picked from commit dcf7439)
Naveenaidu pushed a commit to Naveenaidu/ceph that referenced this pull request Oct 3, 2024
oshrey16 pushed a commit to oshrey16/ceph that referenced this pull request Oct 13, 2024

All of the above carry the same commit message as the first.
sergiuiacob1 commented:
Hello! What is the proper way to remove these labels for ceph v20.2.0 and up? I'm especially targeting a ceph v18 -> v20 migration. I'm currently using this logic:

  # Check if the device's cluster FSID matches the current cluster
  if [ -n "$DEVICE_CLUSTER_FSID" ] && [ "$DEVICE_CLUSTER_FSID" != "$CLUSTER_FSID" ]; then
    echo "WARNING: FSID MISMATCH DETECTED!"
    echo "  Device has ceph_fsid: $DEVICE_CLUSTER_FSID"
    echo "  Current cluster FSID: $CLUSTER_FSID"
    echo "  The device contains an OSD from a DIFFERENT cluster."
    echo "  Wiping the device to prepare for this cluster..."
    
    # Wipe BlueStore labels at all known locations
    # BlueStore stores label copies at offsets 0, 1 GB, 10 GB, 100 GB, and 1000 GB
    echo "Wiping BlueStore labels..."
    dd if=/dev/zero of=/dev/ceph-volume bs=1M count=100 conv=fsync 2>/dev/null || true
    dd if=/dev/zero of=/dev/ceph-volume bs=1M count=10 seek=1024 conv=fsync 2>/dev/null || true
    dd if=/dev/zero of=/dev/ceph-volume bs=1M count=10 seek=10240 conv=fsync 2>/dev/null || true
    dd if=/dev/zero of=/dev/ceph-volume bs=1M count=10 seek=102400 conv=fsync 2>/dev/null || true
    dd if=/dev/zero of=/dev/ceph-volume bs=1M count=10 seek=1024000 conv=fsync 2>/dev/null || true
    # Alternatively, `ceph-bluestore-tool zap-device` clears all label copies
    
    # Also try wipefs if available
    wipefs -af /dev/ceph-volume 2>/dev/null || true
    
    echo "Device wiped."
    DEVICE_NEEDS_PREPARE=true
  else
    echo "FSID matches current cluster, reusing existing OSD."
  fi

This needs to work both on VMs (ceph as a docker container) and on K8s clusters (ceph as a pod).
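As a cross-check for scripts like the one above, the MiB-granularity `dd` seek values for the five fixed label offsets can be derived programmatically. This is a sketch assuming the offsets from this PR (0, 1 GiB, 10 GiB, 100 GiB, 1000 GiB); the function name is hypothetical:

```cpp
#include <cstdint>
#include <vector>

// MiB-granularity dd seek values for the bdev label replica offsets
// from this PR (0, 1 GiB, 10 GiB, 100 GiB, 1000 GiB). Sketch only.
std::vector<uint64_t> label_seek_mib() {
  const uint64_t MIB_PER_GIB = 1024;
  std::vector<uint64_t> seeks;
  for (uint64_t gib : {uint64_t(0), uint64_t(1), uint64_t(10),
                       uint64_t(100), uint64_t(1000)})
    seeks.push_back(gib * MIB_PER_GIB);  // e.g. 1000 GiB -> seek=1024000
  return seeks;
}
```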

Projects: None yet

7 participants