Bug #72945

closed

Data digests are inconsistent during scrubbing

Added by Laura Flores 6 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
High
Category:
-
Target version:
-
% Done:

0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
Fixed In:
v20.3.0-5156-gdfa42f1600
Released In:
Upkeep Timestamp:
2026-02-05T22:49:06+00:00

Description

/a/teuthology-2025-09-07_20:00:25-rados-main-distro-default-smithi/8485449

From the mon log:

2025-09-07T20:53:55.158+0000 7f6644c59640  7 mon.a@0(leader).log v978 update_from_paxos applying incremental log 978 2025-09-07T20:46:41.837545+0000 osd.3 (osd.3) 337 : cluster [DBG] 3.b0 scrub starts
2025-09-07T20:53:55.158+0000 7f6644c59640  7 mon.a@0(leader).log v978 update_from_paxos applying incremental log 978 2025-09-07T20:46:41.839526+0000 osd.3 (osd.3) 338 : cluster [DBG] 3.b0 scrub ok
2025-09-07T20:53:55.158+0000 7f6644c59640  7 mon.a@0(leader).log v978 update_from_paxos applying incremental log 978 2025-09-07T20:46:43.825545+0000 osd.3 (osd.3) 339 : cluster [DBG] 3.b3 scrub starts
2025-09-07T20:53:55.158+0000 7f6644c59640  7 mon.a@0(leader).log v978 update_from_paxos applying incremental log 978 2025-09-07T20:46:43.829306+0000 osd.3 (osd.3) 340 : cluster [DBG] 3.b3 scrub ok
2025-09-07T20:53:55.158+0000 7f6644c59640  7 mon.a@0(leader).log v978 update_from_paxos applying incremental log 978 2025-09-07T20:46:44.836992+0000 osd.3 (osd.3) 341 : cluster [DBG] 3.48 scrub starts
2025-09-07T20:53:55.158+0000 7f6644c59640  7 mon.a@0(leader).log v978 update_from_paxos applying incremental log 978 2025-09-07T20:46:44.847499+0000 osd.3 (osd.3) 342 : cluster [ERR] 3.48s0 3:12813970:::smithi14520280-46:4adata digests are inconsistent
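For anyone triaging similar runs, the error entry can be pulled out of the mon log mechanically. This is a minimal sketch, assuming the cluster-log portion of each line has the layout seen in the excerpt above (timestamp, daemon, sequence number, `cluster [LEVEL]`, message). The regex and the helper name `cluster_entries` are illustrative and are not part of any Ceph tooling.

```python
import re

# Assumed layout, inferred only from the excerpt above:
# "<stamp> osd.N (osd.N) <seq> : cluster [LEVEL] <message>"
CLUSTER_ENTRY = re.compile(
    r"(?P<stamp>\S+) (?P<daemon>osd\.\d+) \(osd\.\d+\) (?P<seq>\d+) : "
    r"cluster \[(?P<level>[A-Z]+)\] (?P<message>.*)$"
)


def cluster_entries(lines, level=None):
    """Yield one dict per cluster log entry, optionally filtered by level."""
    for line in lines:
        match = CLUSTER_ENTRY.search(line)
        if match is None:
            continue  # not a cluster log entry in the assumed layout
        entry = match.groupdict()
        entry["seq"] = int(entry["seq"])
        if level is None or entry["level"] == level:
            yield entry


# Two lines copied from the mon log excerpt above.
sample = [
    "2025-09-07T20:53:55.158+0000 7f6644c59640  7 mon.a@0(leader).log v978 "
    "update_from_paxos applying incremental log 978 "
    "2025-09-07T20:46:44.836992+0000 osd.3 (osd.3) 341 : cluster [DBG] 3.48 scrub starts",
    "2025-09-07T20:53:55.158+0000 7f6644c59640  7 mon.a@0(leader).log v978 "
    "update_from_paxos applying incremental log 978 "
    "2025-09-07T20:46:44.847499+0000 osd.3 (osd.3) 342 : cluster [ERR] 3.48s0 "
    "3:12813970:::smithi14520280-46:4adata digests are inconsistent",
]

errors = list(cluster_entries(sample, level="ERR"))
for entry in errors:
    print(entry["stamp"], entry["daemon"], entry["message"])
```

Filtering on `level="ERR"` keeps only the inconsistent-digest entry and drops the `scrub starts` / `scrub ok` debug lines.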

In checking between a good commit and a bad commit, this set came up:

$ git log --pretty=oneline --no-merges 346846543c6bfc93a360476c739580cd2344fec0..f96567578976c2c84a31d366a04c28fb95ceb0d9 src/osd
aaa198692734666459ef4110c0ebf26b8499707f osd/scrub: clear m_ec_digest_map between objects
6b85e4d453f829c69f6441007bc3a6893b6b3d99 osd/scrub: reinstate one-warning-per-chunk behaviour
5e59c521f8dcd3a5d86ee5cf1f0576a7be6c274e osd/scrub: modify OMAP stats collection
547d13f7f88652e5a96f2a432f5c53358cd07cf3 osd/scrub: avoid using moved-from auth_n_errs
100c20b7d6588295f539208a2812ba7fd3fb5222 osd/scrub: fix heap-buffer-overflow when checking digest emptiness
b6f50d5f89b66188d3fafcf58a535dc43aecae9c osd: add missing includes

I suspect it's coming from one of these.
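One way to rank the candidates is to filter the commit subjects by keyword. This is a minimal sketch that parses the `git log --pretty=oneline` output quoted above; the helper names `parse_oneline` and `mentioning` are hypothetical, and the keyword match is only a heuristic for ordering the review, not how the culprit was actually identified.

```python
def parse_oneline(log_text):
    """Split `git log --pretty=oneline` output into (sha, subject) pairs."""
    commits = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        sha, _, subject = line.partition(" ")
        commits.append((sha, subject))
    return commits


def mentioning(commits, keyword):
    """Return the commits whose subject contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [(sha, subject) for sha, subject in commits if needle in subject.lower()]


# Output copied from the git log command above.
LOG = """\
aaa198692734666459ef4110c0ebf26b8499707f osd/scrub: clear m_ec_digest_map between objects
6b85e4d453f829c69f6441007bc3a6893b6b3d99 osd/scrub: reinstate one-warning-per-chunk behaviour
5e59c521f8dcd3a5d86ee5cf1f0576a7be6c274e osd/scrub: modify OMAP stats collection
547d13f7f88652e5a96f2a432f5c53358cd07cf3 osd/scrub: avoid using moved-from auth_n_errs
100c20b7d6588295f539208a2812ba7fd3fb5222 osd/scrub: fix heap-buffer-overflow when checking digest emptiness
b6f50d5f89b66188d3fafcf58a535dc43aecae9c osd: add missing includes
"""

suspects = mentioning(parse_oneline(LOG), "digest")
for sha, subject in suspects:
    print(sha[:12], subject)
```

With the keyword `digest`, two of the six commits are selected: the `m_ec_digest_map` change and the digest-emptiness fix.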

Actions #1

Updated by Ronen Friedman 6 months ago

(update: currently being investigated by Jonathan Bailey)

Actions #2

Updated by Shraddha Agrawal 6 months ago

/a/skanta-2025-09-11_16:30:11-rados-wip-bharath7-testing-2025-09-11-1359-distro-default-smithi/8494747

Actions #3

Updated by Laura Flores 6 months ago

  • Assignee changed from Ronen Friedman to Jonathan Bailey

Actions #4

Updated by Radoslaw Zarzynski 6 months ago

@Jonathan Bailey: would you mind taking a look and judging whether it's EC-related?

Actions #5

Updated by Laura Flores 6 months ago

/a/yuriw-2025-09-12_19:42:42-rados-wip-yuri3-testing-2025-09-12-0906-distro-default-smithi/8496787

Several more on this run.

Actions #6

Updated by Jonathan Bailey 6 months ago · Edited

@Radoslaw Zarzynski I am already investigating. From what I have seen, this is isolated to runs that have EC optimizations turned on. I am trying to put together a fix and am adding more logging to pin down the cause of the failure.

Actions #7

Updated by Jonathan Bailey 6 months ago

To expand further, this bug appears to occur only with profiles that use the ISA plugin and have erasure coding optimizations enabled.

This should be isolated to main, as the code for EC checking during scrubbing is not part of the code going into Tentacle.

Actions #8

Updated by Jonathan Bailey 6 months ago

I believe I've found the root of the issue. My findings are as follows:
  • We were comparing CRC buffers beyond the end of the CRCs.
  • There was a double call to logical_to_ondisk_size when creating the CRCs for zero buffers, causing them to be mis-sized.
  • The code was not padding smaller shards, even though shards must be the same size when used for parity comparison.

I'm currently running the code through some testing to make sure these are all the issues, and I will tidy up my currently very messy and verbose code before I create a PR to fix the issue.
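The padding point in the findings above can be illustrated in isolation. This is a minimal sketch, not the Ceph implementation: it assumes a simple XOR parity over the data shards and uses `zlib.crc32` as a stand-in digest (a different CRC from the crc32c that Ceph uses). The helper names `pad_to`, `xor_parity` and `parity_matches` are hypothetical. It shows only why shorter shards need zero padding before a parity comparison; it does not model the `logical_to_ondisk_size` sizing issue.

```python
import zlib


def pad_to(buf: bytes, size: int) -> bytes:
    """Right-pad buf with zero bytes up to size; refuse to truncate."""
    if len(buf) > size:
        raise ValueError("buffer is longer than the target size")
    return buf + bytes(size - len(buf))


def xor_parity(shards) -> bytes:
    """XOR equal-length shards byte by byte (stand-in for real EC parity)."""
    if not shards:
        raise ValueError("at least one shard is required")
    size = len(shards[0])
    if any(len(shard) != size for shard in shards):
        raise ValueError("shards must be the same size")
    parity = bytearray(size)
    for shard in shards:
        for i, byte in enumerate(shard):
            parity[i] ^= byte
    return bytes(parity)


def parity_matches(data_shards, stored_parity: bytes) -> bool:
    """Pad every buffer to the longest one, then compare parity digests."""
    size = max([len(stored_parity)] + [len(shard) for shard in data_shards])
    padded = [pad_to(shard, size) for shard in data_shards]
    expected = xor_parity(padded)
    # Both digests are computed over buffers of exactly `size` bytes.
    return zlib.crc32(expected) == zlib.crc32(pad_to(stored_parity, size))


# The second data shard is shorter than the first.
data = [b"abcd", b"ef"]
stored = xor_parity([b"abcd", pad_to(b"ef", 4)])

print(parity_matches(data, stored))
```

Calling `xor_parity(data)` directly on the unpadded shards raises `ValueError`, which is the same-size requirement made explicit. Padding inside `parity_matches` keeps both digests computed over buffers of identical length.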

Actions #9

Updated by Jonathan Bailey 6 months ago

Created a PR with a fix here: https://github.com/ceph/ceph/pull/65623

Actions #10

Updated by Laura Flores 6 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 65623

Actions #11

Updated by Radoslaw Zarzynski 6 months ago

The EC scrubbing optimizations aren't in Tentacle, so likely we don't need to backport the fix.

Actions #12

Updated by Jonathan Bailey 6 months ago

Agreed. Just to further expand, the proposed PR to fix this only changes parts of the code that are in main and nothing that is in Tentacle.

Actions #13

Updated by Laura Flores 6 months ago · Edited

/a/skanta-2025-09-07_23:32:26-rados-wip-bharath2-testing-2025-09-07-1916-distro-default-smithi/8486564

Actions #14

Updated by Laura Flores 6 months ago

PR needs review.

Actions #15

Updated by Aishwarya Mathuria 5 months ago

/a/skanta-2025-10-07_22:45:50-rados-wip-bharath1-testing-2025-10-06-2038-distro-default-smithi/8540411
8540414, 8540415, 8540416, 8540422, 8540423, 8540426, 8540428, 8540430

Actions #16

Updated by Kamoltat (Junior) Sirivadhna 5 months ago

/a/skanta-2025-09-08_23:33:07-rados-wip-bharath2-testing-2025-09-07-1916-distro-default-smithi/

[8488706, 8488712, 8488718, 8488719, 8488723, 8488727, 8488729]

Actions #17

Updated by Kamoltat (Junior) Sirivadhna 5 months ago

suite watcher: this tracker is in progress, currently being tested in teuthology

Actions #18

Updated by Radoslaw Zarzynski 5 months ago

scrub note: in QA.

Actions #19

Updated by Jaya Prakash 5 months ago

Updates from Rados Watcher:
yuriw-2025-10-22_23:56:36-rados-wip-yuri5-testing-2025-10-22-1314-distro-default-smithi
10 jobs: ['8566156', '8566144', '8566000', '8565920', '8566079', '8565992', '8566056', '8565995', '8566045', '8566089']
teuthology-2025-10-26_20:00:25-rados-main-distro-default-smithi
11 jobs: ['8569767', '8569680', '8569595', '8569620', '8569558', '8569590', '8569605', '8569711', '8569662', '8569763', '8569512']

Actions #20

Updated by Aishwarya Mathuria 5 months ago

/a/skanta-2025-11-01_01:03:27-rados-wip-bharath1-testing-2025-10-31-0445-distro-default-smithi/
10 jobs: ['8578565', '8578556', '8578574', '8578566', '8578580', '8578579', '8578567', '8578573', '8578564', '8578576']

Actions #21

Updated by Radoslaw Zarzynski 5 months ago

The PR needs a rebase and maybe some rework.

Actions #22

Updated by Jonathan Bailey 5 months ago

I am currently looking into the failures and will update the PR with a fix once I have done so.

Actions #23

Updated by Kamoltat (Junior) Sirivadhna 4 months ago

RADOS bug scrub: bump (waiting for rebasing)

Actions #24

Updated by Aishwarya Mathuria 4 months ago

RADOS main watcher update:
Quite a few failures in:
/a/teuthology-2025-11-09_20:00:24-rados-main-distro-default-smithi
/a/teuthology-2025-11-16_20:00:21-rados-main-distro-default-smithi

The PR has been rebased and the needs-qa label has been added again.

Actions #25

Updated by Radoslaw Zarzynski 4 months ago

scrub note: ACK, waiting for QA to pick it up!

Actions #26

Updated by Sridhar Seshasayee 4 months ago

/a/skanta-2025-11-13_10:26:04-rados-wip-bharath3-testing-2025-11-12-2038-distro-default-smithi/
[8601365, 8601366, 8601367, 8601371, 8601372, 8601381, 8601387]

Actions #27

Updated by Laura Flores 4 months ago

/a/lflores-2025-11-19_18:47:12-rados-wip-lflores-testing-2-2025-11-19-1711-distro-default-smithi

['8613505', '8613391', '8613310', '8613307', '8613316', '8613454', '8613357', '8613468', '8613367']

Actions #28

Updated by Kamoltat (Junior) Sirivadhna 4 months ago

/a/skanta-2025-11-01_02:37:10-rados-wip-bharath4-testing-2025-10-31-1459-distro-default-smithi/
[8578618, 8578627, 8578628, 8578637, 8578638, 8578641, 8578645, 8578646]

Actions #29

Updated by Laura Flores 4 months ago

QA ticket in progress here: https://tracker.ceph.com/issues/73898

Actions #30

Updated by Radoslaw Zarzynski 4 months ago

scrub note: QA results under analysis, should be ready soon.

Actions #31

Updated by Laura Flores 4 months ago

/a/lflores-2025-12-02_17:29:40-rados-wip-lflores-testing-4-2025-12-01-1527-distro-default-smithi/8636005

Actions #32

Updated by Aishwarya Mathuria 4 months ago

/a/yuriw-2025-12-03_15:44:36-rados-wip-yuri5-testing-2025-12-02-1256-distro-default-smithi/8639534

Actions #33

Updated by Laura Flores 3 months ago

Note from bug scrub: In second round of testing.

Actions #34

Updated by Naveen Naidu 3 months ago

/a/skanta-2025-11-21_10:17:34-rados-wip-bharath11-testing-2025-11-21-0531-distro-default-smithi

7 jobs: ['8617903', '8617816', '8617840', '8617756', '8617917', '8617806', '8617958']

Actions #35

Updated by Kamoltat (Junior) Sirivadhna 3 months ago

Rados suite watcher: bump

Actions #36

Updated by Sridhar Seshasayee 3 months ago

/a/skanta-2025-12-03_02:50:04-rados-wip-bharath5-testing-2025-12-02-1511-distro-default-smithi
7 Jobs
[8638385, 8638391, 8638392, 8638397, 8638401, 8638404, 8638410]

Actions #37

Updated by Radoslaw Zarzynski 3 months ago

scrub note: an unrelated failure is delaying the merge after the lab migration (see https://tracker.ceph.com/issues/73898).

Actions #38

Updated by Laura Flores 2 months ago · Edited

Scrub note: QA in progress (delays due to lab migration)

Actions #39

Updated by Sridhar Seshasayee about 2 months ago

/a/skanta-2026-01-27_05:35:03-rados-wip-bharath1-testing-2026-01-26-1242-distro-default-trial/
7 jobs: ['19749', '19748', '19759', '19766', '19781', '19752', '19757']

Actions #40

Updated by Aishwarya Mathuria about 2 months ago

/a/skanta-2026-01-30_23:46:16-rados-wip-bharath7-testing-2026-01-29-2016-distro-default-trial
['28583', '28567', '28560', '28592', '28561', '28573', '28568', '28586', '28564', '28571']

Actions #41

Updated by Connor Fawcett about 1 month ago

/a/skanta-2026-01-27_07:02:07-rados-wip-bharath3-testing-2026-01-26-1323-distro-default-trial/19851
/a/skanta-2026-01-27_07:02:07-rados-wip-bharath3-testing-2026-01-26-1323-distro-default-trial/19860
/a/skanta-2026-01-27_07:02:07-rados-wip-bharath3-testing-2026-01-26-1323-distro-default-trial/19887
/a/skanta-2026-01-27_07:02:07-rados-wip-bharath3-testing-2026-01-26-1323-distro-default-trial/19865
/a/skanta-2026-01-27_07:02:07-rados-wip-bharath3-testing-2026-01-26-1323-distro-default-trial/19850
/a/skanta-2026-01-27_07:02:07-rados-wip-bharath3-testing-2026-01-26-1323-distro-default-trial/19879
/a/skanta-2026-01-27_07:02:07-rados-wip-bharath3-testing-2026-01-26-1323-distro-default-trial/19875
/a/skanta-2026-01-27_07:02:07-rados-wip-bharath3-testing-2026-01-26-1323-distro-default-trial/19859
/a/skanta-2026-01-27_07:02:07-rados-wip-bharath3-testing-2026-01-26-1323-distro-default-trial/19853

Actions #42

Updated by Upkeep Bot about 1 month ago

  • Status changed from Fix Under Review to Resolved
  • Merge Commit set to dfa42f16005f96c47ba21da048edf9c5294b3871
  • Fixed In set to v20.3.0-5156-gdfa42f1600
  • Upkeep Timestamp set to 2026-02-05T22:49:06+00:00

Actions #43

Updated by Lee Sanders about 1 month ago

/a/skanta-2026-01-29_02:19:11-rados-wip-bharath5-testing-2026-01-28-2018-distro-default-trial/
['24696', '24636', '24645', '24720', '24834', '24686', '24797', '24639']

Actions #44

Updated by Lee Sanders about 1 month ago

/a/skanta-2026-01-29_13:05:02-rados-wip-bharath5-testing-2026-01-28-2018-distro-default-trial/
['25719', '25730', '25732', '25712', '25705', '25704', '25707', '25711', '25739']

Actions #45

Updated by Jaya Prakash about 1 month ago

8 jobs: ['38087', '38172', '38056', '38157', '38084', '38101', '38045', '38174']
jayaprakash-2026-02-06_12:54:34-rados-jaya-bs-testing-05-02-2026-distro-default-trial

Actions #46

Updated by Aishwarya Mathuria about 1 month ago

/a/skanta-2026-02-05_03:38:32-rados-wip-bharath2-testing-2026-02-03-0542-distro-default-trial
['35643', '35655', '35651', '35644', '35674', '35668', '35646', '35650']

Actions #47

Updated by Naveen Naidu about 1 month ago

/a/skanta-2026-01-26_08:54:40-rados-wip-bharath4-testing-2026-01-26-1300-distro-default-trial/
9 jobs: ['17847', '17759', '17686', '17736', '17695', '17770', '17884', '17689', '17746']
