mds: mark the scrub passed if dirfrag is dirty #57953
Conversation
Scheduled a QA run with fs:workload, multi-MDS, with the dynamic balancer or ephemeral random pinning enabled.

I still see the error after refreshing the PR with the fix for remote_dirfrag dirtiness. I will investigate further and update the PR.
@vshankar @leonid-s-usov This particular scrub failure is with the fix. Both the directory inode and the corresponding dirfrag are not dirty, yet the rstat is mismatched, so something else is going on and needs to be figured out. The test YAML config is:

fs:workload/{0-centos_9.stream begin/{0-install 1-cephadm 2-logrotate 3-modules} clusters/1a11s-mds-1c-client-3node conf/{client mds mgr mon osd} mount/kclient/{base/{mount-syntax/{v1} mount overrides/{distro/stock/{centos_9.stream k-stock} ms-die-on-skipped}} ms_mode/crc wsync/yes} objectstore-ec/bluestore-comp-ec-root omap_limit/10000 overrides/{cephsqlite-timeout frag ignorelist_health ignorelist_wrongly_marked_down osd-asserts pg_health session_timeout} ranks/multi/{balancer/random export-check n/3 replication/default} standby-replay tasks/{0-subvolume/{with-namespace-isolated} 1-check-counter 2-scrub/yes 3-snaps/yes 4-flush/yes 5-quiesce/with-quiesce 6-workunit/fs/misc}}
The scrub error in the cluster log is "2024-06-13T09:00:29.607433+0000 mds.e (mds.2) 711 : cluster [WRN] Scrub error on inode 0x10000012f02 (/volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17) see mds.e log and `damage ls` output for details".

Related mds.e.log (inode auth of the directory). We can see from the logs below that the inode is not dirty. ...
2024-06-13T09:00:29.602+0000 7f4b55282640 10 mds.2.scrubstack handle_scrub mds_scrub(queue_dir_ack 0x10000012f02 fragset_t(*) d0e8b015-afbe-4362-93f9-ce15cda58bb8) from mds.0
2024-06-13T09:00:29.602+0000 7f4b55282640 20 mds.2.scrubstack kick_off_scrubs: state=RUNNING
2024-06-13T09:00:29.602+0000 7f4b55282640 20 mds.2.scrubstack kick_off_scrubs entering with 4 in progress and 7992 in the stack
2024-06-13T09:00:29.602+0000 7f4b55282640 20 mds.2.scrubstack kick_off_scrubs examining [inode 0x10000012f02 [...2a,head] /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ auth{0=1} v952 RANDEPHEMERALPIN f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1) (inest mix) mcw={0=AsLsXsFs} | dirtyscattered=0 request=0 lock=0 importingcaps=0 dirfrag=1 caps=0 stickydirs=0 dirtyparent=0 dirwaiter=0 replicated=1 dirty=0 waiter=0 authpin=0 discoverbase=0 scrubqueue=1 randepin 0x5564d8188100]
2024-06-13T09:00:29.602+0000 7f4b55282640 10 mds.2.scrubstack scrub_dir_inode [inode 0x10000012f02 [...2a,head] /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ auth{0=1} v952 RANDEPHEMERALPIN f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1) (inest mix) mcw={0=AsLsXsFs} | dirtyscattered=0 request=0 lock=0 importingcaps=0 dirfrag=1 caps=0 stickydirs=0 dirtyparent=0 dirwaiter=0 replicated=1 dirty=0 waiter=0 authpin=0 discoverbase=0 scrubqueue=1 randepin 0x5564d8188100]
2024-06-13T09:00:29.602+0000 7f4b55282640 20 mds.2.scrubstack scrub_dir_inode recursive mode, frags [*]
2024-06-13T09:00:29.602+0000 7f4b55282640 20 mds.2.scrubstack scrub_dir_inode_final [inode 0x10000012f02 [...2a,head] /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ auth{0=1} v952 RANDEPHEMERALPIN f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1) (inest mix) mcw={0=AsLsXsFs} | dirtyscattered=0 request=0 lock=0 importingcaps=0 dirfrag=1 caps=0 stickydirs=0 dirtyparent=0 dirwaiter=0 replicated=1 dirty=0 waiter=0 authpin=0 discoverbase=0 scrubqueue=1 randepin 0x5564d8188100]
2024-06-13T09:00:29.602+0000 7f4b55282640 10 mds.2.cache.ino(0x10000012f02) scrub starting validate_disk_state on [inode 0x10000012f02 [...2a,head] /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ auth{0=1} v952 RANDEPHEMERALPIN f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1) (inest mix) mcw={0=AsLsXsFs} | dirtyscattered=0 request=0 lock=0 importingcaps=0 dirfrag=1 caps=0 stickydirs=0 dirtyparent=0 dirwaiter=0 replicated=1 dirty=0 waiter=0 authpin=0 discoverbase=0 scrubqueue=1 randepin 0x5564d8188100]
....
....
2024-06-13T09:00:29.602+0000 7f4b55282640 10 mds.2.scrubstack scrub_dir_inode done
2024-06-13T09:00:29.602+0000 7f4b55282640 20 mds.2.scrubstack kick_off_scrubs dir inode, done
2024-06-13T09:00:29.602+0000 7f4b55282640 20 mds.2.scrubstack dequeue [inode 0x10000012f02 [...2a,head] /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ auth{0=1} v952 ap=1 RANDEPHEMERALPIN f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1) (inest mix) mcw={0=AsLsXsFs} | dirtyscattered=0 request=0 lock=0 importingcaps=0 dirfrag=1 caps=0 stickydirs=0 dirtyparent=0 dirwaiter=0 replicated=1 dirty=0 waiter=0 authpin=1 discoverbase=0 scrubqueue=1 randepin 0x5564d8188100] from ScrubStack
...
...
2024-06-13T09:00:29.604+0000 7f4b4ea75640 20 mds.2.cache.ino(0x10000012f02) ondisk_read_retval: 0
2024-06-13T09:00:29.604+0000 7f4b4ea75640 10 mds.2.cache.ino(0x10000012f02) decoded 399 bytes of backtrace successfully
2024-06-13T09:00:29.604+0000 7f4b4ea75640 10 mds.2.cache.ino(0x10000012f02) scrub: inotable ino = 0x10000012f02
2024-06-13T09:00:29.604+0000 7f4b4ea75640 10 mds.2.cache.ino(0x10000012f02) scrub: inotable free says 0
2024-06-13T09:00:29.604+0000 7f4b4ea75640 7 mds.2.cache request_start_internal request(mds.2:94019 nref=2) op 5383
...
...
2024-06-13T09:00:29.604+0000 7f4b4ea75640 10 mds.2.cache rdlock_dirfrags_stats_work [inode 0x10000012f02 [...2a,head] /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ auth{0=1} v952 ap=2 RANDEPHEMERALPIN f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1) (inest mix) mcw={0=AsLsXsFs} | dirtyscattered=0 request=0 lock=0 importingcaps=0 dirfrag=1 caps=0 stickydirs=0 dirtyparent=0 dirwaiter=0 replicated=1 dirty=0 waiter=0 authpin=1 discoverbase=0 scrubqueue=0 randepin 0x5564d8188100]
....
....
2024-06-13T09:00:29.606+0000 7f4b55282640 7 mds.2.locker handle_file_lock a=syncack on (inest mix->sync g=0) from mds.0 [inode 0x10000012f02 [...2a,head] /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ auth{0=1} v952 ap=3 RANDEPHEMERALPIN f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1) (inest mix->sync g=0) (ifile sync r=1) mcw={0=AsLsXsFs} | dirtyscattered=0 request=0 lock=1 importingcaps=0 dirfrag=1 caps=0 stickydirs=0 dirtyparent=0 dirwaiter=0 replicated=1 dirty=0 waiter=1 authpin=1 discoverbase=0 scrubqueue=0 randepin 0x5564d8188100]
2024-06-13T09:00:29.606+0000 7f4b55282640 10 mds.2.cache.ino(0x10000012f02) decode_lock_inest * [2,head]
2024-06-13T09:00:29.606+0000 7f4b55282640 10 mds.2.cache.ino(0x10000012f02) decode_lock_inest * rstat n(v1 rc2024-06-13T08:55:10.383452+0000 b347 8=8+0)
2024-06-13T09:00:29.606+0000 7f4b55282640 10 mds.2.cache.ino(0x10000012f02) decode_lock_inest * accounted_rstat n(v1 rc2024-06-13T08:55:10.383452+0000 b347 8=8+0)
2024-06-13T09:00:29.606+0000 7f4b55282640 10 mds.2.cache.ino(0x10000012f02) decode_lock_inest * dirty_old_rstat {}
2024-06-13T09:00:29.606+0000 7f4b55282640 10 mds.2.cache.ino(0x10000012f02) * first 2 -> 2 on [dir 0x10000012f02 /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ [2,head] rep@0.1 dir_auth=0 state=0 f(v0 m2024-06-13T08:55:10.381452+0000 8=8+0)/f(v0 m2024-06-13T08:55:04.955560+0000 3=3+0) n(v1 rc2024-06-13T08:55:10.383452+0000 b347 8=8+0) hs=0+0,ss=0+0 | subtree=1 importbound=0 sticky=0 0x5564d789e900]
....
....
2024-06-13T09:00:29.606+0000 7f4b55282640 0 log_channel(cluster) log [WRN] : Scrub error on inode 0x10000012f02 (/volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17) see mds.e log and `damage ls` output for details
2024-06-13T09:00:29.606+0000 7f4b55282640 -1 mds.2.scrubstack _validate_inode_done scrub error on inode [inode 0x10000012f02 [...2a,head] /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ auth{0=1} v952 ap=2 RANDEPHEMERALPIN f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1) (inest sync->mix g=0) mcw={0=AsLsXsFs} | dirtyscattered=0 request=0 lock=0 importingcaps=0 dirfrag=1 caps=0 stickydirs=0 dirtyparent=0 dirwaiter=0 replicated=1 dirty=0 waiter=0 authpin=1 discoverbase=0 scrubqueue=0 randepin 0x5564d8188100]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":true,"read_ret_val":0,"ondisk_value":"(2)0x10000012f02:[<0x10000012ec8/17 v952>,<0x10000012ec6/.build-id v6429>,<0x10000012ec5/multiple_rsync_payload.228227 v748>,<0x1000000c579/payload.2 v714>,<0x10000000005/tmp v1831>,<0x100000001fb/client.0 v1617>,<0x100000001fa/257f5e61-1d1a-4ee3-8dfe-78182c700535 v1572>,<0x10000000001/sv_1 v1540>,<0x10000000000/qa v1503>,<0x1/volumes v1450>]//[]","memoryvalue":"(2)0x10000012f02:[<0x10000012ec8/17 v952>,<0x10000012ec6/.build-id v6429>,<0x10000012ec5/multiple_rsync_payload.228227 v849>,<0x1000000c579/payload.2 v736>,<0x10000000005/tmp v1873>,<0x100000001fb/client.0 v1655>,<0x100000001fa/257f5e61-1d1a-4ee3-8dfe-78182c700535 v1612>,<0x10000000001/sv_1 v1576>,<0x10000000000/qa v1539>,<0x1/volumes v1486>]//[]","error_str":""},"raw_stats":{"checked":true,"passed":false,"read_ret_val":0,"ondisk_value.dirstat":"f(v0 m2024-06-13T08:55:04.955560+0000 3=3+0)","ondisk_value.rstat":"n(v0 rc2024-06-13T08:55:10.383452+0000 b347 9=8+1)","memory_value.dirstat":"f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0)","memory_value.rstat":"n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1)","error_str":"freshly-calculated rstats don't match existing ones"},"return_code":0}
2024-06-13T09:00:29.606+0000 7f4b55282640 20 mds.2.cache.ino(0x10000012f02) scrub_finished
.....
.....
Related mds.b.log (dirfrag auth of the directory). We can see from the logs below that the dirfrag is not dirty.

2024-06-13T09:00:29.601+0000 7f2aa23d8640 10 mds.0.scrubstack handle_scrub mds_scrub(queue_dir 0x10000012f02 fragset_t(*) d0e8b015-afbe-4362-93f9-ce15cda58bb8 force recursive) from mds.2
2024-06-13T09:00:29.601+0000 7f2aa23d8640 10 mds.0.scrubstack _enqueue with {[dir 0x10000012f02 /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ [2,head] auth{2=1} v=35 cv=35/35 dir_auth=0 state=1074266113|complete|auxsubtree f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:10.383452+0000 b347 8=8+0) hs=8+0,ss=0+0 | child=1 subtree=1 exportbound=0 replicated=1 dirty=0 authpin=0 scrubqueue=0 0x55884cd10d00]}, top=1
2024-06-13T09:00:29.601+0000 7f2aa23d8640 10 mds.0.cache.dir(0x10000012f02) auth_pin by 0x55883619f600 on [dir 0x10000012f02 /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ [2,head] auth{2=1} v=35 cv=35/35 dir_auth=0 ap=1+0 state=1074266113|complete|auxsubtree f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:10.383452+0000 b347 8=8+0) hs=8+0,ss=0+0 | child=1 subtree=1 exportbound=0 replicated=1 dirty=0 authpin=1 scrubqueue=0 0x55884cd10d00] count now 1
2024-06-13T09:00:29.601+0000 7f2aa23d8640 20 mds.0.scrubstack enqueue [dir 0x10000012f02 /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ [2,head] auth{2=1} v=35 cv=35/35 dir_auth=0 ap=1+0 state=1074266113|complete|auxsubtree f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:10.383452+0000 b347 8=8+0) hs=8+0,ss=0+0 | child=1 subtree=1 exportbound=0 replicated=1 dirty=0 authpin=1 scrubqueue=0 0x55884cd10d00] to top of ScrubStack
2024-06-13T09:00:29.601+0000 7f2aa23d8640 20 mds.0.scrubstack kick_off_scrubs: state=RUNNING
2024-06-13T09:00:29.601+0000 7f2aa23d8640 20 mds.0.scrubstack kick_off_scrubs entering with 5 in progress and 8306 in the stack
2024-06-13T09:00:29.601+0000 7f2aa23d8640 1 -- [v2:172.21.15.123:6838/2791669125,v1:172.21.15.123:6839/2791669125] send_to--> mds [v2:172.21.15.123:6834/2496736045,v1:172.21.15.123:6835/2496736045] -- mds_scrub(queue_dir_ack 0x10000012f02 fragset_t(*) d0e8b015-afbe-4362-93f9-ce15cda58bb8) -- ?+0 0x558844fcae00
2024-06-13T09:00:29.601+0000 7f2aa23d8640 1 -- [v2:172.21.15.123:6838/2791669125,v1:172.21.15.123:6839/2791669125] --> [v2:172.21.15.123:6834/2496736045,v1:172.21.15.123:6835/2496736045] -- mds_scrub(queue_dir_ack 0x10000012f02 fragset_t(*) d0e8b015-afbe-4362-93f9-ce15cda58bb8) -- 0x558844fcae00 con 0x558836451c00
2024-06-13T09:00:29.601+0000 7f2aa23d8640 1 -- [v2:172.21.15.123:6838/2791669125,v1:172.21.15.123:6839/2791669125] <== mds.1 v2:172.21.15.123:6836/3089364143 11418 ==== mds_scrub(queue_dir 0x2000000052c fragset_t(*) d0e8b015-afbe-4362-93f9-ce15cda58bb8 force recursive) ==== 72+0+0 (crc 0 0 0) 0x55883c410800 con 0x558836451800
leonid-s-usov left a comment
This is a valid approach, but there should be a way to avoid adding a new argument to so many methods. Please consider using the CInode::validated_data structure to store and access the information about the remote scrub ack.
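For illustration, a minimal self-contained sketch of what this suggestion might look like. `validated_data` below is a stand-in for the real `CInode::validated_data` (whose `performed_validation`/`passed_validation` members appear in the scrub JSON above); the `remote_dirfrag_dirty` member and both helper functions are hypothetical names invented for this sketch, not Ceph's actual code:

```cpp
// Hedged sketch only. The idea: rather than threading a new bool
// parameter through many methods, record the remote scrub-ack state once
// on the per-inode validation result and read it where raw stats are
// checked.
struct validated_data {                 // stand-in for CInode::validated_data
  bool performed_validation = false;    // real members, per the scrub JSON
  bool passed_validation = false;
  bool remote_dirfrag_dirty = false;    // hypothetical new member
};

// Hypothetical: the queue_dir_ack handler records the flag once...
void on_remote_scrub_ack(validated_data& results, bool dirty_from_ack) {
  results.remote_dirfrag_dirty = dirty_from_ack;
}

// ...and the raw-stats validation consults it with no extra parameters.
bool raw_stats_should_pass(const validated_data& results, bool stats_match,
                           bool local_dirfrag_dirty) {
  return stats_match || local_dirfrag_dirty || results.remote_dirfrag_dirty;
}
```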
@kotreshhr mentioned that this change doesn't fix the issue. Some explanation is in https://tracker.ceph.com/issues/65020#note-34

I understand this differently: there is another issue that this change isn't supposed to fix. Kotresh states that in that new issue both the inode and the dirfrag are clean, so it's out of the scope of the original problem. The fix will be different, too.
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
@kotreshhr I remember this change fixing one part of the issue. Did we RCA the other part (incorrect rfiles)?

There are two separate issues. I think we can take this and work on the other separately.

ACK.
@kotreshhr I think this PR needs to be updated as per comment https://tracker.ceph.com/issues/65020#note-36, yes?

@rishabh-d-dave PTAL once @kotreshhr pushes an update.

Sure.
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.

@kotreshhr Would it be possible to push an update this week?
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.

@kotreshhr please rebase.

jenkins test make check arm64
@vshankar Rebased and simplified the commit abf2800. Sample run with fs:workload with scrub/yes: https://pulpito.ceph.com/khiremat-2025-05-16_05:33:04-fs:workload-wip-khiremat-57953-scrub-error-2-distro-default-smithi/

Grepping the teuthology logs did show the code is exercised:

[khiremat@vossi04 khiremat-2025-05-16_05:33:04-fs:workload-wip-khiremat-57953-scrub-error-2-distro-default-smithi]$ find . | grep mds.*.log.gz | xargs zgrep "raw stats most likely wont match since"
@rishabh-d-dave PTAL.

Rebased and addressed an issue: the scrubber was not using the dirfrag dirty flag sent in the scrub ack message.
Force-pushed from ddafd5c to c665b5e.
The in-memory and on-disk stats might not match on a directory inode if any of the local or remote dirfrags is dirty. So don't fail the scrub and mark it as passed if the local or remote dirfrag is dirty.

Fixes: https://tracker.ceph.com/issues/65020
Signed-off-by: Kotresh HR <khiremat@redhat.com>
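As a rough, self-contained illustration of the rule this commit message describes (all type and function names below are stand-ins invented for this sketch, not Ceph's actual MDS types):

```cpp
#include <iostream>
#include <vector>

// Stand-in type for this sketch; the real logic lives in the MDS
// scrub/validation path (CInode, ScrubStack), not in this struct.
struct DirFrag {
  bool dirty = false;  // local dirfrag dirty flag
};

// If freshly calculated rstats disagree with the stored ones but a local
// dirfrag is dirty (or a remote dirfrag auth reported dirtiness in its
// scrub ack), the mismatch is expected -- the dirty frag simply has not
// been propagated yet -- so the scrub is marked passed instead of
// recording damage.
bool raw_stats_passed(bool stats_match,
                      const std::vector<DirFrag>& frags,
                      bool remote_dirfrag_dirty) {
  if (stats_match)
    return true;
  bool any_dirty = remote_dirfrag_dirty;
  for (const auto& f : frags)
    any_dirty = any_dirty || f.dirty;
  if (any_dirty) {
    // In the spirit of the log line grepped for in the QA run above.
    std::cout << "raw stats most likely wont match since a dirfrag is dirty;"
                 " marking scrub passed\n";
    return true;
  }
  return false;  // genuine mismatch: report a scrub error
}

int main() {
  std::vector<DirFrag> frags{{/*dirty=*/true}};
  std::cout << raw_stats_passed(false, frags, false) << "\n";  // prints 1
}
```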
jenkins make test arm64

This PR is under test in https://tracker.ceph.com/issues/72073.
Leonid is no longer working on the Ceph project.
jenkins test make check arm64
The in-memory and on-disk stats might not match on a directory inode if any of the dirfrags is dirty. So don't fail the scrub and mark it as passed if the dirfrag is dirty.
Fixes: https://tracker.ceph.com/issues/65020