osd: fix for "data digests are inconsistent"#65623
osd: fix for "data digests are inconsistent"#65623SrinivasaBharath merged 2 commits intoceph:mainfrom
Conversation
4c0c4f0 to
e2b3571
Compare
|
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
e2b3571 to
8b5e9e6
Compare
8b5e9e6 to
c143b5f
Compare
|
Make checks timed out. Will retrigger |
|
jenkins test make check |
|
jenkins test make check arm64 |
|
(I hope to complete the review on my next work day - Sunday) |
src/osd/scrubber/scrub_backend.cc
Outdated
| const hobject_t& ho) { | ||
| this_chunk->m_ec_digest_map.clear(); | ||
|
|
||
| if (auth_selection.auth_oi.version != eversion_t() && !m_is_replicated && |
There was a problem hiding this comment.
Do we really need to repeat all the checks for m_is_replicated in a function names setup_ec_digest_map? Perhaps asserting once would be enough?
There was a problem hiding this comment.
my suggested changes in #65497
introduced 'm_ec_digest_map_size' that saved us from having to repeat the check.
Might be usable here
There was a problem hiding this comment.
I can assert that m_is_replicated is false at the start instead and move the check to outside the function, that would be more logical. The m_ec_digest_map_size does look like it will improve things here in the future too.
ronen-fr
left a comment
There was a problem hiding this comment.
LGTM, assuming comments (esp. Radek's) are addressed.
c143b5f to
cbfe3da
Compare
|
All comments should now be addressed/responded to |
|
jenkins test make check arm64 |
| std::string extract_crcs_from_map(const shard_id_map<bufferlist>& map); | ||
| std::string extract_crc_from_bufferlist(const bufferlist& crc_buffer); | ||
| char retrieve_byte(uint32_t value, uint32_t index) { | ||
| return value >> (8 * index) & 0xFF; |
There was a problem hiding this comment.
Hmm, it's surprising we don't such a primitive in include/intarith.h.
| clog.error() << candidates_errors.str(); | ||
| } | ||
|
|
||
| if (!m_is_replicated) { |
|
|
||
| for (uint32_t i = 0; i < sizeof(zero_data_crc); i++) { | ||
| bl.c_str()[i] = bl[i] ^ ((zero_data_crc >> (8 * i)) & 0xff); | ||
| bl.c_str()[i] ^= retrieve_byte(zero_data_crc, i); |
src/osd/scrubber/scrub_backend.cc
Outdated
| shard_id_set available_shards; | ||
|
|
||
| for (const auto& [srd, smap] : this_chunk->received_maps) { | ||
| if (!m_is_replicated && m_pg.get_ec_supports_crc_encode_decode() && |
There was a problem hiding this comment.
Tiny nit: there is no (shouldn't be) way m_is_replicated can change here.
|
|
||
| void ScrubBackend::setup_ec_digest_map(auth_selection_t& auth_selection, | ||
| const hobject_t& ho) { | ||
| ceph_assert(!m_is_replicated); |
There was a problem hiding this comment.
Hey @JonBailey1993, the teuthology tests still turned up some instances of this error. I ensured the commit was present in these tests. Can you take a look?
/a/skanta-2025-10-09_23:38:36-rados-wip-bharath7-testing-2025-10-09-2128-distro-default-smithi/8543999
/a/skanta-2025-10-09_23:38:36-rados-wip-bharath7-testing-2025-10-09-2128-distro-default-smithi/8543947
/a/skanta-2025-10-09_23:38:36-rados-wip-bharath7-testing-2025-10-09-2128-distro-default-smithi/8543988
/a/skanta-2025-10-09_23:38:36-rados-wip-bharath7-testing-2025-10-09-2128-distro-default-smithi/8544022
/a/skanta-2025-10-09_23:38:36-rados-wip-bharath7-testing-2025-10-09-2128-distro-default-smithi/8544087
/a/skanta-2025-10-09_23:38:36-rados-wip-bharath7-testing-2025-10-09-2128-distro-default-smithi/8543867
/a/skanta-2025-10-09_23:38:36-rados-wip-bharath7-testing-2025-10-09-2128-distro-default-smithi/8543939
/a/skanta-2025-10-09_23:38:36-rados-wip-bharath7-testing-2025-10-09-2128-distro-default-smithi/8543942
/a/skanta-2025-10-09_23:38:36-rados-wip-bharath7-testing-2025-10-09-2128-distro-default-smithi/8544099
Testing ref: https://tracker.ceph.com/issues/73477
|
Hey Laura, I'll take a look at this and see what has gone wrong |
|
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
|
@JonBailey1993: it also needs a rebase. |
|
I will do the rebase with my fixes, thank you @rzarzynski |
It was possible to see "data digests are inconsistent" being output to the logs at incorrect times due to multiple bugs. This code reorganises some of the deep scrubbing code and fixes the issues. The root cause of the issue that is being fixed here is: * We were comparing crc buffers beyond the end of the crcs * There was a double call to logical_to_ondisk_size when creating the crcs for zero buffers, causing them to be mis-sized * The code was not padding smaller shards as its a requirement for shards to be the same sized when used for parity comparison. All the above are fixed in this commit Signed-off-by: Jon Bailey <jonathan.bailey1@ibm.com>
…sistency of erasure coded pools Signed-off-by: Jon Bailey <jonathan.bailey1@ibm.com>
cbfe3da to
20623d6
Compare
|
The second issue appeared to be the changes were not guarded against running on shallow scrub. I've run some tests on teuthology and no longer see this error erroneously occurring as before. I've rebased on the latest now as well. |
|
jenkins test make check |
1 similar comment
|
jenkins test make check |
|
Apologies for the delay on approval; I found an issue with a separate PR in the batch. We're rerunning tests and should have fresh results soon. |
|
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
|
Not stale. Currently in QA. |
|
This is an automated message by src/script/redmine-upkeep.py. I have resolved the following tracker ticket due to the merge of this PR: No backports are pending for the ticket. If this is incorrect, please update the tracker Update Log: https://github.com/ceph/ceph/actions/runs/21731423168 |
Fix for "data digests are inconsistent"
It was possible to see "data digests are inconsistent" being output to the logs at incorrect times due to multiple bugs. This code reorganises some of the deep scrubbing code and fixes the issues. The root cause of the issue that is being fixed here is:
All the above are fixed in this commit.
Tracker ticket: https://tracker.ceph.com/issues/72945
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an
xbetween the brackets:[x]. Spaces and capitalization matter when checking off items this way.Checklist
Show available Jenkins commands
jenkins test classic perfJenkins Job | Jenkins Job Definitionjenkins test crimson perfJenkins Job | Jenkins Job Definitionjenkins test signedJenkins Job | Jenkins Job Definitionjenkins test make checkJenkins Job | Jenkins Job Definitionjenkins test make check arm64Jenkins Job | Jenkins Job Definitionjenkins test submodulesJenkins Job | Jenkins Job Definitionjenkins test dashboardJenkins Job | Jenkins Job Definitionjenkins test dashboard cephadmJenkins Job | Jenkins Job Definitionjenkins test apiJenkins Job | Jenkins Job Definitionjenkins test docsReadTheDocs | Github Workflow Definitionjenkins test ceph-volume allJenkins Jobs | Jenkins Jobs Definitionjenkins test windowsJenkins Job | Jenkins Job Definitionjenkins test rook e2eJenkins Job | Jenkins Job DefinitionYou must only issue one Jenkins command per-comment. Jenkins does not understand
comments with more than one command.