
mds: nudge log for unstable locks after early reply#64229

Merged
vshankar merged 4 commits into ceph:main from batrick:i71876
Jul 16, 2025

Conversation

@batrick
Member

@batrick batrick commented Jun 27, 2025

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@github-actions

Config Diff Tool Output

+ added: mds_allow_batched_ops (mds.yaml.in)

The above configuration changes are found in the PR. Please update the relevant release documentation if necessary.
Ignore this comment if docs are already updated. To make the "Check ceph config changes" CI check pass, please comment /config check ok and re-run the test.
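For reference, MDS options such as this are declared in src/common/options/mds.yaml.in. A hypothetical sketch of what such an entry might look like (the field values and wording below are illustrative guesses, not the actual committed entry):

```yaml
# Illustrative sketch only -- not the actual mds.yaml.in entry from this PR.
- name: mds_allow_batched_ops
  type: bool
  level: advanced
  default: true
  desc: allow batching of similar client requests (e.g. lookup/getattr)
  long_desc: Disabling batching is intended as an emergency debugging measure
    and may reduce performance under many concurrent similar requests.
  services:
  - mds
```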

@batrick batrick force-pushed the i71876 branch 4 times, most recently from b71510a to 9ea4a6e on June 27, 2025 at 20:01
@batrick batrick marked this pull request as ready for review June 27, 2025 20:01
Member

@gregsfortytwo gregsfortytwo left a comment


This looks okay to me at a read through, but I don't pretend to understand op batching and how it interplays with caps and the mds journal so take that with a large pinch of salt.

}
if (changed.count("mds_allow_batched_ops")) {
allow_batched_ops = g_conf().get_val<bool>("mds_allow_batched_ops");
}
Member


It's unlikely to actually cause trouble, but I think changing this config could technically allow ops to be reordered incorrectly? So maybe we don't make it changeable live?

Member Author

@batrick batrick Jun 27, 2025


I do not think so. The Locker would finish contexts in order so the ops before the new "batch head" would get to lock first.

In any case, the MDS does not promise that requests will not be reordered between clients or even for the same client. (Unless I'm forgetting some case where it matters but this constraint makes no sense to me now that you ask about it.) Edit: for client-replay it matters though.

Member


I was thinking more

  1. batching is turned on. Many ops are queued
  2. batching is turned off
  3. new op comes in that would have been batched

I guess at that point the new op gets put on a waitlist, which means it is queued behind the batch head? I was worried that it might be able to grab locks immediately and go first. (Which may also not violate any rules, given they're ops in flight simultaneously, but locking changes make me nervous!)

Member Author


I was thinking more

  1. batching is turned on. Many ops are queued
  2. batching is turned off
  3. new op comes in that would have been batched

I guess at that point the new op gets put on a waitlist, which means it is queued behind the batch head?

Yes

I was worried that it might be able to grab locks immediately and go first. (Which may also not violate any rules, given they're ops in flight simultaneously, but locking changes make me nervous!)

I do not think this should be possible. I also think if it did that it would not be incorrect.
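The queuing behavior discussed above can be illustrated with a toy model (a simplified sketch with invented names, not Ceph's actual Locker or batch-op code): an op submitted after batching is turned off still lands on the waitlist behind the batch head, so it cannot jump ahead of already-queued ops.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Toy model (not Ceph code): a "batch head" holds the lock and collects
// compatible ops; ops arriving after batching is disabled join the waitlist
// behind the head rather than grabbing locks immediately.
struct Op {
  int id;
  bool completed = false;
};

struct BatchHead {
  std::vector<Op*> batch;    // ops gathered while batching was on
  std::deque<Op*> waitlist;  // ops queued behind the head

  void submit(Op* op, bool batching_enabled) {
    if (batching_enabled) {
      batch.push_back(op);     // joins the current batch
    } else {
      waitlist.push_back(op);  // still queued behind the head, never ahead
    }
  }

  // When the head finishes, contexts complete in order: the batch first,
  // then the waitlist.
  std::vector<int> finish() {
    std::vector<int> order;
    for (auto* op : batch)    { op->completed = true; order.push_back(op->id); }
    for (auto* op : waitlist) { op->completed = true; order.push_back(op->id); }
    return order;
  }
};
```

In this model, toggling batching off mid-flight only changes which queue a new op joins; completion order is still governed by the batch head.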

@batrick batrick force-pushed the i71876 branch 2 times, most recently from f0cefd8 to 01280a9 on June 28, 2025 at 01:38
@batrick
Member Author

batrick commented Jun 28, 2025

jenkins test make check arm64

@batrick batrick force-pushed the i71876 branch 2 times, most recently from 1cd7c62 to 1756fcb on June 28, 2025 at 17:35
* triggered an unstable state change. We need to nudge the log now to
* move things along.
*/
nudged = nudge_log(lock);
Contributor


Copying here our discussion from private chat

Does it suffice to nudge once and only for the first nudgable lock? ISTM the mdlog flush is the underlying requirement, yes?

Yes. There's no reason to send multiple flushes; it may only create more work for the flush thread.

Moreover, in the past, such things have been addressed by nudging the lock during the actual state transition for a lock. E.g.: #45243

getattr does nudge the lock but the code wrongly assumes that whatever request holds the lock has already early replied.
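The nudge-once behavior agreed on above can be sketched abstractly as follows (a toy model with invented names, not the actual MDS Locker code): scan the request's locks after early reply and flush the journal at most once, for the first lock found in an unstable state.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model (not Ceph code): a single journal flush suffices to move every
// unstable lock along; repeated flushes would only create more work for the
// flush thread.
struct SimpleLock {
  std::string name;
  bool stable;
};

struct MDLog {
  int flushes = 0;
  void flush() { ++flushes; }  // stands in for the real journal flush
};

// Returns true if a nudge was issued.
bool nudge_unstable_locks(const std::vector<SimpleLock>& locks, MDLog& log) {
  for (const auto& l : locks) {
    if (!l.stable) {
      log.flush();  // one flush covers all unstable locks
      return true;  // no need to nudge again for the remaining locks
    }
  }
  return false;
}
```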

@batrick
Member Author

batrick commented Jun 30, 2025

jenkins test api

@vshankar
Contributor

Change looks fine to me. Haven't seen the batch ops disabling change, but @gregsfortytwo has already looked at those.

Contributor

@vshankar vshankar left a comment


Mostly looks good. @batrick Could we additionally document when users should disable batching?

@batrick
Member Author

batrick commented Jun 30, 2025

Mostly looks good. @batrick Could we additionally document when users should disable batching?

I view the config as an emergency debugging measure. We don't usually document those, no?

@vshankar
Contributor

Mostly looks good. @batrick Could we additionally document when users should disable batching?

I view the config as an emergency debugging measure. We don't usually document those, no?

Someone is going to fiddle with it someday and ask why it's there or complain about perf drops. We should at least mention why it's there, even if it's a simple one-liner.

@vshankar
Contributor

This PR is under test in https://tracker.ceph.com/issues/71910.

@batrick batrick requested a review from a team as a code owner July 1, 2025 01:28
@batrick
Member Author

batrick commented Jul 1, 2025

@vshankar does the doc commit I just added address your concern?

@vshankar
Contributor

vshankar commented Jul 1, 2025

@vshankar does the doc commit I just added address your concern?

LGTM. If we can add a bit more description in the mds config yaml about this, that would get referenced in the docs and that would be good enough. I have built this change in my test branch, so just the conf changes are enough to merge this if nothing shows up in testing (which I'm pretty sure will be fine).

vshankar added a commit to vshankar/ceph that referenced this pull request Jul 1, 2025
* refs/pull/64229/head:
	mds: nudge log for unstable locks after early_reply
	mds: allow disabling batch ops
@batrick
Member Author

batrick commented Jul 1, 2025

@vshankar does the doc commit I just added address your concern?

LGTM. If we can add a bit more description in the mds config yaml about this, that would get referenced in the docs and that would be good enough. I have built this change in my test branch, so just the conf changes are enough to merge this if nothing shows up in testing (which I'm pretty sure will be fine).

description updated

@batrick batrick force-pushed the i71876 branch 2 times, most recently from b78cee7 to 9bb3cd8 on July 1, 2025 at 19:33
batrick added 4 commits July 1, 2025 15:41
If there are trailing newlines in a string (like long_desc), then the generated
C++ code is invalid (because the newline will not be escaped).

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
This addresses a bug (and future ones) where batching lookup/getattr does not help "kick" the MDS into switching state more quickly (e.g. by flushing the MDS journal).

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
A getattr/lookup can cause a wrlock or xlock to become unstable after a request
(like rename) acquires it but before early reply. The MDS will not nudge the
log in this situation and the getattr/lookup will need to wait for the eventual
journal flush before the lock is released.

The log now looks like:

    2025-06-27T19:41:32.043+0000 7f11d21a9640  5 mds.0.log _submit_thread 25185408~2845 : EUpdate rename [metablob 0x1, 3 dirs]
    2025-06-27T19:41:32.043+0000 7f11d99b8640  1 -- [v2:172.21.10.4:6868/56297870,v1:172.21.10.4:6869/56297870] --> [v2:172.21.10.4:6818/999439823,v1:172.21.10.4:6819/999439823] -- osd_op(unknown.0.23:135 2.e 2:7bf7e7b5:::200.00000006:head [write 19584~2865 [fadvise_dontneed] in=2865b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e72) -- 0x563e9b730000 con 0x563e9a674800
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (dn xlock x=1 by request(client.4393:20702 nref=6 cr=0x563e9b83ae00)) on [dentry #0x1/a/file [2,head] auth (dn xlock x=1 by request(client.4393:20702 nref=6 cr=0x563e9b83ae00)) (dversion lock w=1 last_client=4393) pv=14558 v=14556 ap=2 ino=0x10000000002 remote_ino=0x0 referent_inode_ptr=(nil) referent_ino=0x0 state=1610612736 | request=1 lock=2 inodepin=1 dirty=1 waiter=0 authpin=1 0x563e95629900]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (dn xlock x=1 by request(client.4393:20702 nref=6 cr=0x563e9b83ae00)) on [dentry #0x1/b/file [2,head] auth NULL (dn xlock x=1 by request(client.4393:20702 nref=6 cr=0x563e9b83ae00)) (dversion lock w=1 last_client=4393) pv=14557 v=14555 ap=2 ino=(nil) remote_ino=0x0 referent_inode_ptr=(nil) referent_ino=0x0 state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=0 authpin=1 0x563e95629b80]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (dversion lock w=1 last_client=4393) on [dentry #0x1/a/file [2,head] auth (dn xlock x=1 by request(client.4393:20702 nref=6 cr=0x563e9b83ae00)) (dversion lock w=1 last_client=4393) pv=14558 v=14556 ap=2 ino=0x10000000002 remote_ino=0x0 referent_inode_ptr=(nil) referent_ino=0x0 state=1610612736 | request=1 lock=2 inodepin=1 dirty=1 waiter=0 authpin=1 0x563e95629900]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (dversion lock w=1 last_client=4393) on [dentry #0x1/b/file [2,head] auth NULL (dn xlock x=1 by request(client.4393:20702 nref=6 cr=0x563e9b83ae00)) (dversion lock w=1 last_client=4393) pv=14557 v=14555 ap=2 ino=(nil) remote_ino=0x0 referent_inode_ptr=(nil) referent_ino=0x0 state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=0 authpin=1 0x563e95629b80]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (ifile excl w=1) on [inode 0x10000000001 [...2,head] /b/ auth v29159 pv29165 ap=1 DIRTYPARENT f(v0 m2025-06-27T19:41:32.017971+0000) n(v0 rc2025-06-27T19:41:32.017971+0000 1=0+1) (isnap sync r=1) (inest lock w=1) (ifile excl w=1) (iversion lock w=1 last_client=4393) caps={4393=pAsLsXsFsx/-@955},l=4393 | request=0 lock=4 dirfrag=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x563e9a6f5080]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (inest lock w=1 dirty) on [inode 0x1 [...2,head] / auth v179 snaprealm=0x563e9a6ce6c0 f(v0 m2025-06-27T19:09:29.187695+0000 2=0+2) n(v45 rc2025-06-27T19:41:31.249722+0000 3=0+3)/n(v0 rc2025-06-27T19:08:43.024940+0000 1=0+1) (isnap sync r=2) (inest lock w=1 dirty) caps={4359=pAsLsXsFs/-@83,4393=pAsLsXsFs/-@1245} | dirtyscattered=1 request=0 lock=2 dirfrag=1 caps=1 openingsnapparents=0 dirty=1 authpin=0 0x563e9a6f4b00]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (inest lock w=1) on [inode 0x10000000001 [...2,head] /b/ auth v29159 pv29165 ap=1 DIRTYPARENT f(v0 m2025-06-27T19:41:32.017971+0000) n(v0 rc2025-06-27T19:41:32.017971+0000 1=0+1) (isnap sync r=1) (inest lock w=1) (ifile excl w=1) (iversion lock w=1 last_client=4393) caps={4393=pAsLsXsFsx/-@955},l=4393 | request=0 lock=4 dirfrag=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x563e9a6f5080]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (iquiesce lock w=1 last_client=4393) on [inode 0x10000000000 [...2,head] /a/ auth v29161 pv29163 ap=3 DIRTYPARENT f(v0 m2025-06-27T19:41:32.017971+0000 1=1+0) n(v0 rc2025-06-27T19:41:32.017971+0000 2=1+1) (isnap sync r=2) (inest lock w=1) (ifile lock->sync w=1) (iversion lock w=1 last_client=4393) (iquiesce lock w=1 last_client=4393) caps={4359=pAsLsXs/-@109,4393=pAsLsXs/-@955} | request=1 lock=5 dirfrag=1 caps=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x563e9a6f5600]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (isnap sync r=1) on [inode 0x10000000001 [...2,head] /b/ auth v29159 pv29165 ap=1 DIRTYPARENT f(v0 m2025-06-27T19:41:32.017971+0000) n(v0 rc2025-06-27T19:41:32.017971+0000 1=0+1) (isnap sync r=1) (inest lock w=1) (ifile excl w=1) (iversion lock w=1 last_client=4393) caps={4393=pAsLsXsFsx/-@955},l=4393 | request=0 lock=4 dirfrag=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x563e9a6f5080]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (isnap sync r=2) on [inode 0x1 [...2,head] / auth v179 snaprealm=0x563e9a6ce6c0 f(v0 m2025-06-27T19:09:29.187695+0000 2=0+2) n(v45 rc2025-06-27T19:41:31.249722+0000 3=0+3)/n(v0 rc2025-06-27T19:08:43.024940+0000 1=0+1) (isnap sync r=2) (inest lock w=1 dirty) caps={4359=pAsLsXsFs/-@83,4393=pAsLsXsFs/-@1245} | dirtyscattered=1 request=0 lock=2 dirfrag=1 caps=1 openingsnapparents=0 dirty=1 authpin=0 0x563e9a6f4b00]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (iversion lock w=1 last_client=4393) on [inode 0x10000000000 [...2,head] /a/ auth v29161 pv29163 ap=3 DIRTYPARENT f(v0 m2025-06-27T19:41:32.017971+0000 1=1+0) n(v0 rc2025-06-27T19:41:32.017971+0000 2=1+1) (isnap sync r=2) (inest lock w=1) (ifile lock->sync w=1) (iversion lock w=1 last_client=4393) (iquiesce lock w=1 last_client=4393) caps={4359=pAsLsXs/-@109,4393=pAsLsXs/-@955} | request=1 lock=5 dirfrag=1 caps=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x563e9a6f5600]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (iversion lock w=1 last_client=4393) on [inode 0x10000000001 [...2,head] /b/ auth v29159 pv29165 ap=1 DIRTYPARENT f(v0 m2025-06-27T19:41:32.017971+0000) n(v0 rc2025-06-27T19:41:32.017971+0000 1=0+1) (isnap sync r=1) (inest lock w=1) (ifile excl w=1) (iversion lock w=1 last_client=4393) caps={4393=pAsLsXsFsx/-@955},l=4393 | request=0 lock=4 dirfrag=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x563e9a6f5080]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log YES (ifile lock->sync w=1) on [inode 0x10000000000 [...2,head] /a/ auth v29161 pv29163 ap=3 DIRTYPARENT f(v0 m2025-06-27T19:41:32.017971+0000 1=1+0) n(v0 rc2025-06-27T19:41:32.017971+0000 2=1+1) (isnap sync r=2) (inest lock w=1) (ifile lock->sync w=1) (iversion lock w=1 last_client=4393) (iquiesce lock w=1 last_client=4393) caps={4359=pAsLsXs/-@109,4393=pAsLsXs/-@955} | request=1 lock=5 dirfrag=1 caps=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x563e9a6f5600]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 20 mds.0.locker : request(client.4393:20702 nref=5 cr=0x563e9b83ae00)
    2025-06-27T19:41:32.043+0000 7f11d99b8640 20 mds.0.log _submit_entry EUpdate rename [metablob 0x1, 3 dirs]

Easily reproducible with two ceph-fuse clients. One session doing:

    $ mkdir a b
    $ touch a/file
    $ while true; do mv -v a/file b/file; mv -v b/file a/file; done

and the other

    $ while true; do stat a/file; done

You can observe the rename/stat stalls (up to 5 seconds; MDS tick interval) without this patch.

Fixes: https://tracker.ceph.com/issues/71876
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
@vshankar
Contributor

vshankar commented Jul 7, 2025

This is mostly ready to be merged. (waiting on run wiki update)

Contributor

@vshankar vshankar left a comment

