Skip to content

Mon: Add new health warning for non prime w+1 in blaum-roth EC profiles#66129

Merged
rzarzynski merged 2 commits intoceph:mainfrom
Tom-Sollers:fixing_blaum_roth_for_ec_profiles
Dec 15, 2025
Merged

Mon: Add new health warning for non prime w+1 in blaum-roth EC profiles#66129
rzarzynski merged 2 commits intoceph:mainfrom
Tom-Sollers:fixing_blaum_roth_for_ec_profiles

Conversation

@Tom-Sollers
Copy link
Contributor

@Tom-Sollers Tom-Sollers commented Nov 5, 2025

This PR is the second attempt of the original PR, which was merged to early and reverted in #66114

This PR adds a health warning for when a Erasure code profile uses the blaum-roth technique and has a w value such that w+1 is not prime, which can cause data corruption. It also improves the validation for when the user attempts to create a new erasure code profile by throwing a more descriptive warning when an invalid w value is used, and allows the user to override this validation using the --yes-i-really-mean-it flag. In addition to the previous changes this PR also adds a new test to /qa/standalone/erasure-code/test-erasure-code-plugins.sh to check that the health warning is functioning correctly.

Fixes: http://tracker.ceph.com/issues/64419
Signed-off-by: Tom Sollers tom.sollers@ibm.com

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands

You must only issue one Jenkins command per-comment. Jenkins does not understand
comments with more than one command.

@github-actions
Copy link

github-actions bot commented Nov 5, 2025

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@Tom-Sollers Tom-Sollers marked this pull request as draft November 5, 2025 10:47
@Tom-Sollers Tom-Sollers force-pushed the fixing_blaum_roth_for_ec_profiles branch from b96a598 to 9299700 Compare November 5, 2025 13:37
An EC pool is using the ``blaum_roth`` technique and ``w + 1`` is not a prime number.
This can result in data corruption.

To check the list of Erasure Code Profiles use the command:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lowercase


ceph osd erasure-code-profile ls

Then to check the ``w`` value for a particular profile use the command:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Then//

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a command of the following form:

Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you locate the cause of the QA error in

#65940 (comment)

@batrick batrick removed the needs-qa label Nov 5, 2025
@Tom-Sollers
Copy link
Contributor Author

Did you locate the cause of the QA error in

#65940 (comment)

Yeah so in HealthMonitor.cc line 1332. I have added a check for if the "technique" attribute exists in the metadata for the EC profile. The error was caused by line 1333 checking the value of "technique" when there wasn't one, this is now fixed.

@Tom-Sollers Tom-Sollers force-pushed the fixing_blaum_roth_for_ec_profiles branch from 15ba199 to d91344b Compare November 11, 2025 11:52
@Tom-Sollers
Copy link
Contributor Author

jenkins test make check

@Tom-Sollers
Copy link
Contributor Author

jenkins test make check arm64

@Tom-Sollers
Copy link
Contributor Author

jenkins test

@Tom-Sollers Tom-Sollers force-pushed the fixing_blaum_roth_for_ec_profiles branch 3 times, most recently from e7609ec to a10026c Compare November 13, 2025 13:57
@Tom-Sollers
Copy link
Contributor Author

To avoid making the same mistake from the last PR, I have ran a teuthology test running a subset of runs from the RADOS tests suite to make sure there are no new regressions caused by the change.
https://pulpito.ceph.com/tsollers-2025-11-11_14:14:01-rados-main-distro-default-smithi/
I also ran 2 tests that contained the same configuration that broke the previous PR to demonstrate that the issue is fixed.
https://pulpito.ceph.com/tsollers-2025-11-13_09:21:25-rados-main-distro-default-smithi/

@Tom-Sollers Tom-Sollers marked this pull request as ready for review November 13, 2025 14:20
@Tom-Sollers Tom-Sollers force-pushed the fixing_blaum_roth_for_ec_profiles branch from a10026c to d71d61f Compare November 13, 2025 16:19
Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwise LGTM

@Tom-Sollers Tom-Sollers force-pushed the fixing_blaum_roth_for_ec_profiles branch from d71d61f to 74a8ab3 Compare November 14, 2025 13:53
…th a w+1 that is not prime

This commit adds a new health warning for when a user has an erasure code
profile using the blaum-roth technique which has a w+1 value that is not
prime.

Fixes: http://tracker.ceph.com/issues/64419
Signed-off-by: Tom Sollers <tom.sollers@ibm.com>
…alth warn

This commit adds a new test to test-erasure-code-plugins.sh that
tests for the health warning caused by having a erasure-code-profile
with the blaum-roth technique and a w+1 value that is not prime.

Fixes: http://tracker.ceph.com/issues/64419
Signed-off-by: Tom Sollers <tom.sollers@ibm.com>
@Tom-Sollers Tom-Sollers force-pushed the fixing_blaum_roth_for_ec_profiles branch from 74a8ab3 to e61450c Compare November 18, 2025 16:01
@Tom-Sollers
Copy link
Contributor Author

Pushed what should hopefully be the final version of the PR

Copy link
Contributor

@bill-scales bill-scales left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Once Patrick approves we can set the needs-qa flag and get a full run of the rados suite to double check that there are no regressions.

@Naveenaidu
Copy link
Contributor

RADOS Approved: https://tracker.ceph.com/issues/73968#note-3

@rzarzynski rzarzynski merged commit 551ac2b into ceph:main Dec 15, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants