Skip to content

Mon: Add health warning for blaum-roth EC profile when w+1 is not prime#65940

Merged
Tom-Sollers merged 2 commits intoceph:mainfrom
Tom-Sollers:fixing_blaum_roth_for_ec_profiles
Oct 24, 2025
Merged

Mon: Add health warning for blaum-roth EC profile when w+1 is not prime#65940
Tom-Sollers merged 2 commits intoceph:mainfrom
Tom-Sollers:fixing_blaum_roth_for_ec_profiles

Conversation

@Tom-Sollers
Copy link
Contributor

@Tom-Sollers Tom-Sollers commented Oct 14, 2025

This PR adds a health warning for when a Erasure code profile uses the blaum-roth technique and has a w value such that w+1 is not prime, which can cause data corruption. It also improves the validation for when the user attempts to create a new erasure code profile by throwing a more descriptive warning when an invalid w value is used, and allows the user to override this validation using the --yes-i-really-mean-it flag. In addition to the previous changes this PR also adds a new test to /qa/standalone/erasure-code/test-erasure-code-plugins.sh to check that the health warning is functioning correctly.

Fixes: http://tracker.ceph.com/issues/64419
Signed-off-by: Tom Sollers tom.sollers@ibm.com

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands

You must only issue one Jenkins command per-comment. Jenkins does not understand
comments with more than one command.

Copy link
Contributor

@anthonyeleven anthonyeleven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left several suggestions. Please also change the commit/PR titles to be a present-tense imperative, i.e.

"mon: Add health warning for blaum-roth EC profile when w+1 is not prime"

Also, might it be possible to enhance the ceph osd erasure-code-profile set command to not create such a profile unless --yes-i-really-mean-it and -force are passed?

We would want to keep the health warning still to cover existing profiles/pools and cases where someone modifies existing defs via other means.

@Tom-Sollers Tom-Sollers changed the title Mon: Added health warning for blaum-roth EC profile when w+1 is not prime Mon: Add health warning for blaum-roth EC profile when w+1 is not prime Oct 15, 2025
Copy link
Contributor

@bill-scales bill-scales left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes are looking good. This review is just very minor issues - mainly not following the coding standard that I've highlighted to get you into good habits :-)

dout(20) << erasure_code_profile << dendl;

//This will look at the erasure code profiles technique is blaum_roth and will check that the w key exists
if (erasure_code_profile.second.at("technique") == "blaum_roth" && erasure_code_profile.second.count("w") == 1){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: coding standard has space before {
nit: coding standard is 80 character lines (a few exceptions are ok if its hard to break the line) - you could break the line after && here. There are several places in this PR where you need to fix line lengths

@bluikko
Copy link
Contributor

bluikko commented Oct 16, 2025

The commit message signed-off-by lines are missing < and > around the e-mail address which makes the "Signed-off-by" job fail. I might be mistaken but I believe it might be a requirement to have those lines correct.

@Tom-Sollers Tom-Sollers force-pushed the fixing_blaum_roth_for_ec_profiles branch from a071968 to 228a244 Compare October 16, 2025 14:45
@Tom-Sollers
Copy link
Contributor Author

Fixed all the commit messages so they now are all in the present tense and the email tom.sollers@ibm.com is now inside <>

@Tom-Sollers Tom-Sollers force-pushed the fixing_blaum_roth_for_ec_profiles branch from 228a244 to 36dfa83 Compare October 16, 2025 14:56
@Tom-Sollers
Copy link
Contributor Author

Amended a commit with a change that should have been a part of it

@bill-scales
Copy link
Contributor

jenkins test make check

@bill-scales
Copy link
Contributor

jenkins test make check arm64

@Tom-Sollers Tom-Sollers force-pushed the fixing_blaum_roth_for_ec_profiles branch from 36dfa83 to 11f1bc3 Compare October 20, 2025 10:08
@Tom-Sollers
Copy link
Contributor Author

Updated all the commit messages again to remove src from the title and to remove the use of I.

@Tom-Sollers Tom-Sollers force-pushed the fixing_blaum_roth_for_ec_profiles branch from 11f1bc3 to c88965d Compare October 21, 2025 15:40
…th a w+1 that is not prime

This commit adds a new health warning for when a user has an erasure code
profile using the blaum-roth technique which has a w+1 value that is not
prime.

Fixes: http://tracker.ceph.com/issues/64419
Signed-off-by: Tom Sollers <tom.sollers@ibm.com>
…alth warn

This commit adds a new test to test-erasure-code-plugins.sh that
tests for the health warning caused by having a erasure-code-profile
with the blaum-roth technique and a w+1 value that is not prime.

Fixes: http://tracker.ceph.com/issues/64419
Signed-off-by: Tom Sollers <tom.sollers@ibm.com>
@Tom-Sollers Tom-Sollers force-pushed the fixing_blaum_roth_for_ec_profiles branch from c88965d to 256e277 Compare October 21, 2025 15:59
@Tom-Sollers
Copy link
Contributor Author

jenkins test make check arm64

1 similar comment
@Tom-Sollers
Copy link
Contributor Author

jenkins test make check arm64

@Tom-Sollers
Copy link
Contributor Author

jenkins test make check arm64

@bill-scales
Copy link
Contributor

jenkins test make check arm64

1 similar comment
@Tom-Sollers
Copy link
Contributor Author

jenkins test make check arm64

@Tom-Sollers Tom-Sollers merged commit a222787 into ceph:main Oct 24, 2025
13 checks passed
@github-actions
Copy link

This is an automated message by src/script/redmine-upkeep.py.

I have resolved the following tracker ticket due to the merge of this PR:

No backports are pending for the ticket. If this is incorrect, please update the tracker
ticket and reset to Pending Backport state.

Update Log: https://github.com/ceph/ceph/actions/runs/18777419898

@NitzanMordhai
Copy link
Contributor

NitzanMordhai commented Oct 28, 2025

@Tom-Sollers that PR was merged before it went to QA, there are some issue that causing ceph-mon to segfault:

    -9> 2025-10-27T21:03:56.953+0000 7fa2013b5640 20 mon.a@0(leader).paxosservice(mgrstat 1..11) maybe_trim 1~6
    -8> 2025-10-27T21:03:56.953+0000 7fa2013b5640 10 mon.a@0(leader).paxosservice(mgrstat 1..11) maybe_trim trim_to 6 would only trim 5 < paxos_service_trim_min 250
    -7> 2025-10-27T21:03:56.953+0000 7fa2013b5640 10 mon.a@0(leader).health tick
    -6> 2025-10-27T21:03:56.953+0000 7fa2013b5640 20 mon.a@0(leader).health check_member_health
    -5> 2025-10-27T21:03:56.953+0000 7fa2013b5640 10 mon.a@0(leader).health check_member_health avail 99% total 15 GiB, used 72 MiB, avail 15 GiB
    -4> 2025-10-27T21:03:56.953+0000 7fa2013b5640 20 mon.a@0(leader).health check_leader_health
    -3> 2025-10-27T21:03:56.953+0000 7fa2013b5640 20 mon.a@0(leader).health check_netsplit
    -2> 2025-10-27T21:03:56.953+0000 7fa2013b5640 20 mon.a@0(leader).health check_erasure_code_profilescheckingdefault,{crush-failure-domain=osd,k=2,m=1,plugin=isa,technique=reed_sol_van}
    -1> 2025-10-27T21:03:56.953+0000 7fa2013b5640 20 mon.a@0(leader).health check_erasure_code_profilescheckinglrcprofile,{crush-device-class=,crush-failure-domain=osd,crush-num-failure-domains=0,crush-osds-per-failure-domain=0,crush-root=default,k=4,l=3,m=2,name=lrcprofile,plugin=lrc}
     0> 2025-10-27T21:03:56.957+0000 7fa2013b5640 -1 *** Caught signal (Aborted) **
 in thread 7fa2013b5640 thread_name:safe_timer

 ceph version 20.3.0-3797-g3375e026 (3375e026de8afb84f410a96280fab68dfd698f08) tentacle (dev - None)
 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fa206f75520]
 2: pthread_kill()
 3: raise()
 4: abort()
 5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e) [0x7fa207305b9e]
 6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7fa20731120c]
 7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7fa207311277]
 8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8) [0x7fa2073114d8]
 9: (std::__throw_out_of_range(char const*)+0x40) [0x7fa2073084a0]
 10: (HealthMonitor::check_erasure_code_profiles(health_check_map_t*)+0x16f) [0x55ff3653a25f]
 11: (HealthMonitor::check_leader_health()+0x27d) [0x55ff3653ae3d]
 12: (HealthMonitor::tick()+0x90) [0x55ff36525970]
 13: (Monitor::tick()+0xc0) [0x55ff36496d80]
 14: ceph-mon(+0x27b6ed) [0x55ff3643f6ed]
 15: (CommonSafeTimer<std::mutex>::timer_thread()+0x12f) [0x7fa2079f94ff]
 16: /usr/lib/ceph/libceph-common.so.2(+0x2b5fd1) [0x7fa2079f9fd1]
 17: /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fa206fc7ac3]
 18: /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fa207059850]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

please check: /a/nmordech-2025-10-27_18:05:22-rados-wip-sjust-rocky10-mgr-serialize-distro-default-smithi/8570732
cc: @rzarzynski


//This is a loop that will go through all the erasure code profiles
for (auto& erasure_code_profile : mon.osdmon()->osdmap.get_erasure_code_profiles()) {
dout(20) << "check_erasure_code_profiles" << "checking" << erasure_code_profile << dendl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add space so the string won't looks like one word:

2025-10-27T21:03:56.953+0000 7fa2013b5640 20 mon.a@0(leader).health check_erasure_code_profilescheckingdefault,{crush-failure-domain=osd,k=2,m=1,plugin=isa,technique=reed_sol_van}

dout(20) << "check_erasure_code_profiles" << "checking" << erasure_code_profile << dendl;

//This will look at the erasure code profiles technique is blaum_roth and will check that the w key exists
if (erasure_code_profile.second.at("technique") == "blaum_roth" &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We didn't have a technique in lrcprofile, that caused the fail:

2025-10-27T21:03:56.953+0000 7fa2013b5640 20 mon.a@0(leader).health check_erasure_code_profilescheckinglrcprofile,{crush-device-class=,crush-failure-domain=osd,crush-num-failure-domains=0,crush-osds-per-failure-domain=0,crush-root=default,k=4,l=3,m=2,name=lrcprofile,plugin=lrc}

yaarith pushed a commit to ceph/ceph-ci that referenced this pull request Oct 29, 2025
…files with a w+1 that is not prime"

This reverts commit b9aca15.

See: ceph/ceph#65940 (comment)
yaarith pushed a commit to ceph/ceph-ci that referenced this pull request Oct 29, 2025
…r new health warn"

This reverts commit 256e277.

See: ceph/ceph#65940 (comment)

DISTROS: rocky10
ARCHS: x86_64
FLAVORS: default
CEPH-BUILD-BRANCH: r10-container
yaarith pushed a commit to ceph/ceph-ci that referenced this pull request Oct 30, 2025
batrick added a commit to batrick/ceph that referenced this pull request Nov 3, 2025
…oth_for_ec_profiles"

This reverts commit a222787, reversing
changes made to cf11d9d.

This was not QA'd or reviewed by SME for mon code changes.

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
@batrick
Copy link
Member

batrick commented Nov 3, 2025

I am reverting this PR in #66114. This should have been QA'd by the core team and reviewed by a Monitor SME.

Please do not merge your own PRs without consulting with a PTL.

batrick added a commit that referenced this pull request Nov 4, 2025
* refs/pull/66114/head:
	Revert "Merge pull request #65940 from Tom-Sollers/fixing_blaum_roth_for_ec_profiles"

Reviewed-by: Anthony D Atri <anthony.datri@gmail.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
harriscr pushed a commit to harriscr/ceph that referenced this pull request Nov 6, 2025
…oth_for_ec_profiles"

This reverts commit a222787, reversing
changes made to cf11d9d.

This was not QA'd or reviewed by SME for mon code changes.

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
harriscr pushed a commit to harriscr/ceph that referenced this pull request Nov 6, 2025
…oth_for_ec_profiles"

This reverts commit a222787, reversing
changes made to cf11d9d.

This was not QA'd or reviewed by SME for mon code changes.

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
ideepika pushed a commit to ideepika/ceph that referenced this pull request Nov 25, 2025
…oth_for_ec_profiles"

This reverts commit a222787, reversing
changes made to cf11d9d.

This was not QA'd or reviewed by SME for mon code changes.

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants