
mds: mark the scrub passed if dirfrag is dirty #57953

Merged
vshankar merged 1 commit into ceph:main from kotreshhr:scrub_error
Jul 21, 2025

Conversation

@kotreshhr
Contributor

@kotreshhr kotreshhr commented Jun 10, 2024

The in-memory and on-disk stats might not match on a directory inode if any of its dirfrags is dirty. So instead of failing the scrub, mark it as passed if a dirfrag is dirty.

Fixes: https://tracker.ceph.com/issues/65020
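
For illustration, here is a minimal, self-contained sketch of the decision this change makes in the MDS scrub path (around the raw-stats check); the struct and helper below are illustrative stand-ins, not the actual Ceph CInode/CDir types:

#include <vector>

// Illustrative stand-ins for the directory inode and its dirfrags.
struct FragState {
  bool dirty = false;         // local dirfrag has unflushed fragstat/rstat
  bool remote_dirty = false;  // dirty flag reported by the dirfrag-auth MDS in its scrub ack
};

struct DirInodeState {
  bool inode_dirty = false;   // the directory inode itself has unjournaled changes
  std::vector<FragState> frags;
};

// Return true if a raw-stats mismatch should be tolerated (scrub marked as
// passed) because the stats simply have not been flushed or propagated yet.
bool raw_stats_should_pass(const DirInodeState& in) {
  if (in.inode_dirty)
    return true;
  for (const auto& f : in.frags)
    if (f.dirty || f.remote_dirty)
      return true;
  return false;  // everything is clean, so a mismatch is a real scrub error
}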

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@github-actions github-actions bot added the cephfs Ceph File System label Jun 10, 2024
@kotreshhr
Contributor Author

Scheduled a QA run with fs:workload, multi-MDS, with the dynamic balancer or ephemeral random pinning enabled.

https://pulpito.ceph.com/khiremat-2024-06-11_07:27:30-fs:workload-wip-khiremat-57953-scrub-error-0-distro-default-smithi/

@kotreshhr
Contributor Author

kotreshhr commented Jun 13, 2024

I still see the error after refreshing the PR with the fix for remote_dirfrag dirtiness. I will investigate further and update the PR
https://pulpito.ceph.com/khiremat-2024-06-13_08:02:02-fs:workload-wip-khiremat-57953-scrub-error-1-distro-default-smithi/

@kotreshhr
Contributor Author

I still see the error after refreshing the PR with the fix for remote_dirfrag dirtiness. I will investigate further and update the PR https://pulpito.ceph.com/khiremat-2024-06-13_08:02:02-fs:workload-wip-khiremat-57953-scrub-error-1-distro-default-smithi/

@vshankar @leonid-s-usov This particular scrub failure is with the fix. I see that both the directory inode and the corresponding dirfrag are not dirty, and still the rstat is mismatched. The test yaml config is as below. So something else is going on. Need to figure it out.

fs:workload/{0-centos_9.stream begin/{0-install 1-cephadm 2-logrotate 3-modules} clusters/1a11s-mds-1c-client-3node conf/{client mds mgr mon osd} mount/kclient/{base/{mount-syntax/{v1} mount overrides/{distro/stock/{centos_9.stream k-stock} ms-die-on-skipped}} ms_mode/crc wsync/yes} objectstore-ec/bluestore-comp-ec-root omap_limit/10000 overrides/{cephsqlite-timeout frag ignorelist_health ignorelist_wrongly_marked_down osd-asserts pg_health session_timeout} ranks/multi/{balancer/random export-check n/3 replication/default} standby-replay tasks/{0-subvolume/{with-namespace-isolated} 1-check-counter 2-scrub/yes 3-snaps/yes 4-flush/yes 5-quiesce/with-quiesce 6-workunit/fs/misc}}

The scrub error is on

"2024-06-13T09:00:29.607433+0000 mds.e (mds.2) 711 : cluster [WRN] Scrub error on inode 0x10000012f02 (/volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17) see mds.e log and `damage ls` output for details" in cluster log

Related mds.e.log (inode auth of the directory). We can see that the inode is not dirty.

...
2024-06-13T09:00:29.602+0000 7f4b55282640 10 mds.2.scrubstack handle_scrub mds_scrub(queue_dir_ack 0x10000012f02 fragset_t(*) d0e8b015-afbe-4362-93f9-ce15cda58bb8) from mds.0
2024-06-13T09:00:29.602+0000 7f4b55282640 20 mds.2.scrubstack kick_off_scrubs: state=RUNNING
2024-06-13T09:00:29.602+0000 7f4b55282640 20 mds.2.scrubstack kick_off_scrubs entering with 4 in progress and 7992 in the stack
2024-06-13T09:00:29.602+0000 7f4b55282640 20 mds.2.scrubstack kick_off_scrubs examining [inode 0x10000012f02 [...2a,head] /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ auth{0=1} v952 RANDEPHEMERALPIN f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1) (inest mix) mcw={0=AsLsXsFs} | dirtyscattered=0 request=0 lock=0 importingcaps=0 dirfrag=1 caps=0 stickydirs=0 dirtyparent=0 dirwaiter=0 replicated=1 dirty=0 waiter=0 authpin=0 discoverbase=0 scrubqueue=1 randepin 0x5564d8188100]
2024-06-13T09:00:29.602+0000 7f4b55282640 10 mds.2.scrubstack scrub_dir_inode [inode 0x10000012f02 [...2a,head] /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ auth{0=1} v952 RANDEPHEMERALPIN f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1) (inest mix) mcw={0=AsLsXsFs} | dirtyscattered=0 request=0 lock=0 importingcaps=0 dirfrag=1 caps=0 stickydirs=0 dirtyparent=0 dirwaiter=0 replicated=1 dirty=0 waiter=0 authpin=0 discoverbase=0 scrubqueue=1 randepin 0x5564d8188100]
2024-06-13T09:00:29.602+0000 7f4b55282640 20 mds.2.scrubstack scrub_dir_inode recursive mode, frags [*]
2024-06-13T09:00:29.602+0000 7f4b55282640 20 mds.2.scrubstack scrub_dir_inode_final [inode 0x10000012f02 [...2a,head] /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ auth{0=1} v952 RANDEPHEMERALPIN f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1) (inest mix) mcw={0=AsLsXsFs} | dirtyscattered=0 request=0 lock=0 importingcaps=0 dirfrag=1 caps=0 stickydirs=0 dirtyparent=0 dirwaiter=0 replicated=1 dirty=0 waiter=0 authpin=0 discoverbase=0 scrubqueue=1 randepin 0x5564d8188100]
2024-06-13T09:00:29.602+0000 7f4b55282640 10 mds.2.cache.ino(0x10000012f02) scrub starting validate_disk_state on [inode 0x10000012f02 [...2a,head] /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ auth{0=1} v952 RANDEPHEMERALPIN f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1) (inest mix) mcw={0=AsLsXsFs} | dirtyscattered=0 request=0 lock=0 importingcaps=0 dirfrag=1 caps=0 stickydirs=0 dirtyparent=0 dirwaiter=0 replicated=1 dirty=0 waiter=0 authpin=0 discoverbase=0 scrubqueue=1 randepin 0x5564d8188100]
....
....
2024-06-13T09:00:29.602+0000 7f4b55282640 10 mds.2.scrubstack scrub_dir_inode done
2024-06-13T09:00:29.602+0000 7f4b55282640 20 mds.2.scrubstack kick_off_scrubs dir inode, done
2024-06-13T09:00:29.602+0000 7f4b55282640 20 mds.2.scrubstack dequeue [inode 0x10000012f02 [...2a,head] /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ auth{0=1} v952 ap=1 RANDEPHEMERALPIN f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1) (inest mix) mcw={0=AsLsXsFs} | dirtyscattered=0 request=0 lock=0 importingcaps=0 dirfrag=1 caps=0 stickydirs=0 dirtyparent=0 dirwaiter=0 replicated=1 dirty=0 waiter=0 authpin=1 discoverbase=0 scrubqueue=1 randepin 0x5564d8188100] from ScrubStack
...
...
2024-06-13T09:00:29.604+0000 7f4b4ea75640 20 mds.2.cache.ino(0x10000012f02) ondisk_read_retval: 0
2024-06-13T09:00:29.604+0000 7f4b4ea75640 10 mds.2.cache.ino(0x10000012f02) decoded 399 bytes of backtrace successfully
2024-06-13T09:00:29.604+0000 7f4b4ea75640 10 mds.2.cache.ino(0x10000012f02) scrub: inotable ino = 0x10000012f02
2024-06-13T09:00:29.604+0000 7f4b4ea75640 10 mds.2.cache.ino(0x10000012f02) scrub: inotable free says 0
2024-06-13T09:00:29.604+0000 7f4b4ea75640  7 mds.2.cache request_start_internal request(mds.2:94019 nref=2) op 5383
...
...
2024-06-13T09:00:29.604+0000 7f4b4ea75640 10 mds.2.cache rdlock_dirfrags_stats_work [inode 0x10000012f02 [...2a,head] /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ auth{0=1} v952 ap=2 RANDEPHEMERALPIN f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1) (inest mix) mcw={0=AsLsXsFs} | dirtyscattered=0 request=0 lock=0 importingcaps=0 dirfrag=1 caps=0 stickydirs=0 dirtyparent=0 dirwaiter=0 replicated=1 dirty=0 waiter=0 authpin=1 discoverbase=0 scrubqueue=0 randepin 0x5564d8188100]
....
....
2024-06-13T09:00:29.606+0000 7f4b55282640  7 mds.2.locker handle_file_lock a=syncack on (inest mix->sync g=0) from mds.0 [inode 0x10000012f02 [...2a,head] /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ auth{0=1} v952 ap=3 RANDEPHEMERALPIN f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1) (inest mix->sync g=0) (ifile sync r=1) mcw={0=AsLsXsFs} | dirtyscattered=0 request=0 lock=1 importingcaps=0 dirfrag=1 caps=0 stickydirs=0 dirtyparent=0 dirwaiter=0 replicated=1 dirty=0 waiter=1 authpin=1 discoverbase=0 scrubqueue=0 randepin 0x5564d8188100]
2024-06-13T09:00:29.606+0000 7f4b55282640 10 mds.2.cache.ino(0x10000012f02) decode_lock_inest * [2,head]
2024-06-13T09:00:29.606+0000 7f4b55282640 10 mds.2.cache.ino(0x10000012f02) decode_lock_inest * rstat n(v1 rc2024-06-13T08:55:10.383452+0000 b347 8=8+0)
2024-06-13T09:00:29.606+0000 7f4b55282640 10 mds.2.cache.ino(0x10000012f02) decode_lock_inest * accounted_rstat n(v1 rc2024-06-13T08:55:10.383452+0000 b347 8=8+0)
2024-06-13T09:00:29.606+0000 7f4b55282640 10 mds.2.cache.ino(0x10000012f02) decode_lock_inest * dirty_old_rstat {}
2024-06-13T09:00:29.606+0000 7f4b55282640 10 mds.2.cache.ino(0x10000012f02) * first 2 -> 2 on [dir 0x10000012f02 /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ [2,head] rep@0.1 dir_auth=0 state=0 f(v0 m2024-06-13T08:55:10.381452+0000 8=8+0)/f(v0 m2024-06-13T08:55:04.955560+0000 3=3+0) n(v1 rc2024-06-13T08:55:10.383452+0000 b347 8=8+0) hs=0+0,ss=0+0 | subtree=1 importbound=0 sticky=0 0x5564d789e900]
....
....
2024-06-13T09:00:29.606+0000 7f4b55282640  0 log_channel(cluster) log [WRN] : Scrub error on inode 0x10000012f02 (/volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/
17) see mds.e log and `damage ls` output for details
2024-06-13T09:00:29.606+0000 7f4b55282640 -1 mds.2.scrubstack _validate_inode_done scrub error on inode [inode 0x10000012f02 [...2a,head] /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ auth{0=1} v952 ap=2 RANDEPHEMERALPIN f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1) (inest sync->mix g=0) mcw={0=AsLsXsFs} | dirtyscattered=0 request=0 lock=0 importingcaps=0 dirfrag=1 caps=0 stickydirs=0 dirtyparent=0 dirwaiter=0 replicated=1 dirty=0 waiter=0 authpin=1 discoverbase=0 scrubqueue=0 randepin 0x5564d8188100]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":true,"read_ret_val":0,"ondisk_value":"(2)0x10000012f02:[<0x10000012ec8/17 v952>,<0x10000012ec6/.build-id v6429>,<0x10000012ec5/multiple_rsync_payload.228227 v748>,<0x1000000c579/payload.2 v714>,<0x10000000005/tmp v1831>,<0x100000001fb/client.0 v1617>,<0x100000001fa/257f5e61-1d1a-4ee3-8dfe-78182c700535 v1572>,<0x10000000001/sv_1 v1540>,<0x10000000000/qa v1503>,<0x1/volumes v1450>]//[]","memoryvalue":"(2)0x10000012f02:[<0x10000012ec8/17 v952>,<0x10000012ec6/.build-id v6429>,<0x10000012ec5/multiple_rsync_payload.228227 v849>,<0x1000000c579/payload.2 v736>,<0x10000000005/tmp v1873>,<0x100000001fb/client.0 v1655>,<0x100000001fa/257f5e61-1d1a-4ee3-8dfe-78182c700535 v1612>,<0x10000000001/sv_1 v1576>,<0x10000000000/qa v1539>,<0x1/volumes v1486>]//[]","error_str":""},"raw_stats":{"checked":true,"passed":false,"read_ret_val":0,"ondisk_value.dirstat":"f(v0 m2024-06-13T08:55:04.955560+0000 3=3+0)","ondisk_value.rstat":"n(v0 rc2024-06-13T08:55:10.383452+0000 b347 9=8+1)","memory_value.dirstat":"f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0)","memory_value.rstat":"n(v1 rc2024-06-13T08:55:16.501331+0000 b347 9=8+1)","error_str":"freshly-calculated rstats don't match existing ones"},"return_code":0}
2024-06-13T09:00:29.606+0000 7f4b55282640 20 mds.2.cache.ino(0x10000012f02) scrub_finished
.....
.....

Related mds.b.log (dirfrag auth of the directory). We can see from the logs below that the dirfrag is not dirty, so the remote_dirfrag_dirty flag is not set. And still the test failed.

2024-06-13T09:00:29.601+0000 7f2aa23d8640 10 mds.0.scrubstack handle_scrub mds_scrub(queue_dir 0x10000012f02 fragset_t(*) d0e8b015-afbe-4362-93f9-ce15cda58bb8 force recursive) from mds.2
2024-06-13T09:00:29.601+0000 7f2aa23d8640 10 mds.0.scrubstack _enqueue with {[dir 0x10000012f02 /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ [2,head] auth{2=1} v=35 cv=35/35 dir_auth=0 state=1074266113|complete|auxsubtree f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:10.383452+0000 b347 8=8+0) hs=8+0,ss=0+0 | child=1 subtree=1 exportbound=0 replicated=1 dirty=0 authpin=0 scrubqueue=0 0x55884cd10d00]}, top=1
2024-06-13T09:00:29.601+0000 7f2aa23d8640 10 mds.0.cache.dir(0x10000012f02) auth_pin by 0x55883619f600 on [dir 0x10000012f02 /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ [2,head] auth{2=1} v=35 cv=35/35 dir_auth=0 ap=1+0 state=1074266113|complete|auxsubtree f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:10.383452+0000 b347 8=8+0) hs=8+0,ss=0+0 | child=1 subtree=1 exportbound=0 replicated=1 dirty=0 authpin=1 scrubqueue=0 0x55884cd10d00] count now 1
2024-06-13T09:00:29.601+0000 7f2aa23d8640 20 mds.0.scrubstack enqueue [dir 0x10000012f02 /volumes/qa/sv_1/257f5e61-1d1a-4ee3-8dfe-78182c700535/client.0/tmp/payload.2/multiple_rsync_payload.228227/.build-id/17/ [2,head] auth{2=1} v=35 cv=35/35 dir_auth=0 ap=1+0 state=1074266113|complete|auxsubtree f(v1 m2024-06-13T08:55:10.381452+0000 8=8+0) n(v1 rc2024-06-13T08:55:10.383452+0000 b347 8=8+0) hs=8+0,ss=0+0 | child=1 subtree=1 exportbound=0 replicated=1 dirty=0 authpin=1 scrubqueue=0 0x55884cd10d00] to top of ScrubStack
2024-06-13T09:00:29.601+0000 7f2aa23d8640 20 mds.0.scrubstack kick_off_scrubs: state=RUNNING
2024-06-13T09:00:29.601+0000 7f2aa23d8640 20 mds.0.scrubstack kick_off_scrubs entering with 5 in progress and 8306 in the stack
2024-06-13T09:00:29.601+0000 7f2aa23d8640  1 -- [v2:172.21.15.123:6838/2791669125,v1:172.21.15.123:6839/2791669125] send_to--> mds [v2:172.21.15.123:6834/2496736045,v1:172.21.15.123:6835/2496736045] -- mds_scrub(queue_dir_ack 0x10000012f02 fragset_t(*) d0e8b015-afbe-4362-93f9-ce15cda58bb8) -- ?+0 0x558844fcae00
2024-06-13T09:00:29.601+0000 7f2aa23d8640  1 -- [v2:172.21.15.123:6838/2791669125,v1:172.21.15.123:6839/2791669125] --> [v2:172.21.15.123:6834/2496736045,v1:172.21.15.123:6835/2496736045] -- mds_scrub(queue_dir_ack 0x10000012f02 fragset_t(*) d0e8b015-afbe-4362-93f9-ce15cda58bb8) -- 0x558844fcae00 con 0x558836451c00
2024-06-13T09:00:29.601+0000 7f2aa23d8640  1 -- [v2:172.21.15.123:6838/2791669125,v1:172.21.15.123:6839/2791669125] <== mds.1 v2:172.21.15.123:6836/3089364143 11418 ==== mds_scrub(queue_dir 0x2000000052c fragset_t(*) d0e8b015-afbe-4362-93f9-ce15cda58bb8 force recursive) ==== 72+0+0 (crc 0 0 0) 0x55883c410800 con 0x558836451800

Contributor

@leonid-s-usov leonid-s-usov left a comment


This is a valid approach, but there should be a way to avoid adding a new argument to so many methods. Please consider using the CInode::validated_data structure to store and access the information about the remote scrub ack.
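
As a rough, hedged sketch of that suggestion (the layout below is illustrative and does not reflect the actual CInode::validated_data definition):

#include <string>

// Illustrative sketch only; not the real CInode::validated_data.
struct validated_data {
  struct member_status {
    bool checked = false;
    bool passed = false;
    std::string error_str;
  };
  member_status backtrace;
  member_status raw_stats;

  // Suggested addition: record the remote scrub-ack information here once,
  // instead of threading a new argument through many method signatures.
  bool remote_dirfrag_dirty = false;
};

// Later, wherever the raw-stats comparison runs (sketch):
//   if (results->remote_dirfrag_dirty)
//     results->raw_stats.passed = true;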

@vshankar
Contributor

This is a valid approach, but there should be a way to avoid adding a new argument to so many methods. Please consider using the CInode::validated_data structure to store and access the information about the remote scrub ack.

@kotreshhr mentioned that this change doesn't fix the issue. Some explanation in https://tracker.ceph.com/issues/65020#note-34

@leonid-s-usov
Contributor

This is a valid approach, but there should be a way to avoid adding a new argument to so many methods. Please consider using the CInode::validated_data structure to store and access the information about the remote scrub ack.

@kotreshhr mentioned that this change doesn't fix the issue. Some explanation in https://tracker.ceph.com/issues/65020#note-34

I understand this differently - there is another issue that this change isn't supposed to fix. Kotresh states that in that new issue both the inode and the dirfrag are clean, so it's out of the scope of what the original problem was about. The fix will be different, too.

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Aug 18, 2024
@vshankar vshankar removed the stale label Sep 17, 2024
@vshankar
Contributor

@kotreshhr I remember this change fixing one part of the issue. Did we RCA the other part (incorrect rfiles)?

@kotreshhr
Contributor Author

@kotreshhr I remember this change fixing one part of the issue. Did we RCA the other part (incorrect rfiles)?

There are two separate issues. I think we can take this and work on the other separately.

@vshankar
Contributor

@kotreshhr I remember this change fixing one part of the issue. Did we RCA the other part (incorrect rfiles)?

There are two separate issues. I think we can take this and work on the other separately.

ACK.

@vshankar vshankar requested a review from a team September 30, 2024 07:44
@vshankar vshankar changed the title mds: Mark the scrub passed if dirfrag is dirty mds: mark the scrub passed if dirfrag is dirty Sep 30, 2024
@vshankar
Contributor

@kotreshhr I think this PR needs to be updated as per comment https://tracker.ceph.com/issues/65020#note-36, yes?

@vshankar
Contributor

@rishabh-d-dave PTAL once @kotreshhr pushes an update.

@rishabh-d-dave
Contributor

@rishabh-d-dave PTAL once @kotreshhr pushes an update.

Sure.

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Jan 11, 2025
@vshankar vshankar removed the stale label Jan 13, 2025
@vshankar
Contributor

@kotreshhr I think this PR needs to be updated as per comment https://tracker.ceph.com/issues/65020#note-36, yes?

@kotreshhr Would it be possible to push an update this week?

@github-actions

github-actions bot commented Feb 6, 2025

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions

github-actions bot commented Apr 7, 2025

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Apr 7, 2025
@dparmar18 dparmar18 reopened this May 8, 2025
@dparmar18 dparmar18 removed the stale label May 8, 2025
@vshankar
Contributor

@kotreshhr please rebase.

@kotreshhr
Contributor Author

jenkins test make check arm64

@kotreshhr
Contributor Author

@kotreshhr please rebase.

@vshankar rebased and simplified the commit abf2800

Sample run with fs:workload with scrub/yes - https://pulpito.ceph.com/khiremat-2025-05-16_05:33:04-fs:workload-wip-khiremat-57953-scrub-error-2-distro-default-smithi/

@kotreshhr
Contributor Author

@kotreshhr please rebase.

@vshankar rebased and simplified the commit abf2800

Sample run with fs:workload with scrub/yes - https://pulpito.ceph.com/khiremat-2025-05-16_05:33:04-fs:workload-wip-khiremat-57953-scrub-error-2-distro-default-smithi/

Grepping the teuthology logs shows that the code is exercised.

[khiremat@vossi04 khiremat-2025-05-16_05:33:04-fs:workload-wip-khiremat-57953-scrub-error-2-distro-default-smithi]$ find . | grep mds.*.log.gz | xargs zgrep "raw stats most likely wont match since"

./8286703/remote/smithi063/log/94b78f04-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:21:07.694+0000 7f75c3676640 20 mds.0.cache.ino(0x10000003836) raw stats most likely wont match since it's a directory inode and a remote dirfrag is dirty; please rerun scrub when system is stable; assuming passed for now;
./8286703/remote/smithi063/log/94b78f04-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:22:34.333+0000 7f75c3676640 20 mds.0.cache.ino(0x10000003593) raw stats most likely wont match since it's a directory inode and a remote dirfrag is dirty; please rerun scrub when system is stable; assuming passed for now;
./8286703/remote/smithi063/log/94b78f04-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:22:34.368+0000 7f75c3676640 20 mds.0.cache.ino(0x10000001f57) raw stats most likely wont match since it's a directory inode and a remote dirfrag is dirty; please rerun scrub when system is stable; assuming passed for now;
./8286703/remote/smithi063/log/94b78f04-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:22:34.587+0000 7f75c3676640 20 mds.0.cache.ino(0x10000000fd7) raw stats most likely wont match since it's a directory inode and a remote dirfrag is dirty; please rerun scrub when system is stable; assuming passed for now;
./8286703/remote/smithi063/log/94b78f04-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:22:34.670+0000 7f75c3676640 20 mds.0.cache.ino(0x10000004e85) raw stats most likely wont match since it's a directory inode and a remote dirfrag is dirty; please rerun scrub when system is stable; assuming passed for now;
./8286703/remote/smithi063/log/94b78f04-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:22:34.724+0000 7f75c3676640 20 mds.0.cache.ino(0x10000001fd7) raw stats most likely wont match since it's a directory inode and a remote dirfrag is dirty; please rerun scrub when system is stable; assuming passed for now;
./8286703/remote/smithi063/log/94b78f04-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:22:34.835+0000 7f75c3676640 20 mds.0.cache.ino(0x10000003836) raw stats most likely wont match since it's a directory inode and a remote dirfrag is dirty; please rerun scrub when system is stable; assuming passed for now;
./8286703/remote/smithi063/log/94b78f04-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:27:02.230+0000 7f75c3676640 20 mds.0.cache.ino(0x10000000fd7) raw stats most likely wont match since it's a directory inode and a remote dirfrag is dirty; please rerun scrub when system is stable; assuming passed for now;
./8286703/remote/smithi063/log/94b78f04-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:27:02.304+0000 7f75c3676640 20 mds.0.cache.ino(0x10000004e85) raw stats most likely wont match since it's a directory inode and a remote dirfrag is dirty; please rerun scrub when system is stable; assuming passed for now;
./8286703/remote/smithi063/log/94b78f04-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:27:02.387+0000 7f75c3676640 20 mds.0.cache.ino(0x10000001fd7) raw stats most likely wont match since it's a directory inode and a remote dirfrag is dirty; please rerun scrub when system is stable; assuming passed for now;
./8286703/remote/smithi063/log/94b78f04-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:27:02.440+0000 7f75c3676640 20 mds.0.cache.ino(0x10000003836) raw stats most likely wont match since it's a directory inode and a remote dirfrag is dirty; please rerun scrub when system is stable; assuming passed for now;
./8286697/remote/smithi177/log/8f61231c-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:15:55.053+0000 7f60ad71e640 20 mds.0.cache.ino(0x10000000006) raw stats most likely wont match since inode is dirty; please rerun scrub when system is stable; assuming passed for now;
...
...
./8286698/remote/smithi055/log/960251dc-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:18:27.928+0000 7fb630317640 20 mds.1.cache.ino(0x10000000701) raw stats most likely wont match since it's a directory inode and a local dirfrag is dirty; please rerun scrub when system is stable; assuming passed for now;
./8286698/remote/smithi055/log/960251dc-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:18:27.936+0000 7fb630317640 20 mds.1.cache.ino(0x1000000065a) raw stats most likely wont match since it's a directory inode and a remote dirfrag is dirty; please rerun scrub when system is stable; assuming passed for now;
./8286698/remote/smithi055/log/960251dc-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:18:27.952+0000 7fb630317640 20 mds.1.cache.ino(0x10000000d06) raw stats most likely wont match since it's a directory inode and a remote dirfrag is dirty; please rerun scrub when system is stable; assuming passed for now;
./8286698/remote/smithi055/log/960251dc-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:18:27.955+0000 7fb630317640 20 mds.1.cache.ino(0x1000000064a) raw stats most likely wont match since it's a directory inode and a remote dirfrag is dirty; please rerun scrub when system is stable; assuming passed for now;
./8286698/remote/smithi055/log/960251dc-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:18:27.957+0000 7fb630317640 20 mds.1.cache.ino(0x10000001986) raw stats most likely wont match since it's a directory inode and a remote dirfrag is dirty; please rerun scrub when system is stable; assuming passed for now;
./8286698/remote/smithi055/log/960251dc-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:18:27.970+0000 7fb630317640 20 mds.1.cache.ino(0x10000000de0) raw stats most likely wont match since it's a directory inode and a remote dirfrag is dirty; please rerun scrub when system is stable; assuming passed for now;
./8286698/remote/smithi055/log/960251dc-321a-11f0-86fb-adfe0268badd/ceph-mds.b.log.gz:2025-05-16T06:19:15.116+0000 7fb630317640 20 mds.1.cache.ino(0x100000002d9) raw stats most likely wont match since inode is dirty; please rerun scrub when system is stable; assuming passed for now;


@kotreshhr kotreshhr requested review from batrick and vshankar May 19, 2025 12:35
@vshankar
Contributor

vshankar commented Jun 3, 2025

@rishabh-d-dave PTAL.

@kotreshhr
Contributor Author

Rebased and addressed an issue: the scrubber was not using the dirfrag dirty flag sent in the scrub ack message.
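
As a rough illustration of that fix (hypothetical names, not the actual ScrubStack/MMDSScrub code), the ack handling needs to fold the reported per-frag dirty flags into the state that the raw-stats check later consults:

#include <cstddef>
#include <vector>

// Illustrative stand-ins, matching the sketch in the PR description above.
struct FragState { bool dirty = false; bool remote_dirty = false; };
struct DirInodeState { bool inode_dirty = false; std::vector<FragState> frags; };

// When the queue_dir ack arrives from the dirfrag-auth MDS, remember any
// per-frag dirty flags it reported so the later raw-stats comparison can
// treat a mismatch as expected rather than as damage.
void apply_scrub_ack_dirty_flags(DirInodeState& in,
                                 const std::vector<bool>& acked_dirty) {
  for (std::size_t i = 0; i < acked_dirty.size() && i < in.frags.size(); ++i) {
    if (acked_dirty[i])
      in.frags[i].remote_dirty = true;
  }
}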

Contributor

@vshankar vshankar left a comment


Otherwise LGTM.

@kotreshhr kotreshhr force-pushed the scrub_error branch 2 times, most recently from ddafd5c to c665b5e Compare July 9, 2025 13:56
The in-memory and on-disk stats might not match on a
directory inode if any of its local or remote dirfrags
is dirty. So instead of failing the scrub, mark it as
passed if a local or remote dirfrag is dirty.

Signed-off-by: Kotresh HR <khiremat@redhat.com>
Fixes: https://tracker.ceph.com/issues/65020
@kotreshhr
Contributor Author

jenkins test make check arm64

@vshankar
Contributor

This PR is under test in https://tracker.ceph.com/issues/72073.

@kotreshhr kotreshhr requested a review from leonid-s-usov July 11, 2025 03:57
@kotreshhr kotreshhr dismissed leonid-s-usov’s stale review July 14, 2025 08:12

Leonid is no longer working on the Ceph project

@kotreshhr
Contributor Author

jenkins test make check arm64

Contributor

@rishabh-d-dave rishabh-d-dave left a comment


Looks good.

Contributor

@vshankar vshankar left a comment


@vshankar vshankar merged commit e1caea2 into ceph:main Jul 21, 2025
13 checks passed

Labels

cephfs Ceph File System

6 participants