
mds: nudge log for unstable locks after early reply#64229

Merged
vshankar merged 4 commits into ceph:main from batrick:i71876
Jul 16, 2025

Conversation

@batrick
Member

@batrick batrick commented Jun 27, 2025

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@github-actions

Config Diff Tool Output

+ added: mds_allow_batched_ops (mds.yaml.in)

The above configuration changes are found in the PR. Please update the relevant release documentation if necessary.
Ignore this comment if docs are already updated. To make the "Check ceph config changes" CI check pass, please comment /config check ok and re-run the test.
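For reference, MDS options such as this are declared in src/common/options/mds.yaml.in. A hypothetical sketch of what such an entry might look like (the field values and wording below are illustrative guesses, not the actual committed entry):

```yaml
# Illustrative sketch only -- not the actual mds.yaml.in entry from this PR.
- name: mds_allow_batched_ops
  type: bool
  level: advanced
  default: true
  desc: allow batching of similar client requests (e.g. lookup/getattr)
  long_desc: Disabling batching is intended as an emergency debugging measure
    and may reduce performance under many concurrent similar requests.
  services:
  - mds
```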

@batrick batrick force-pushed the i71876 branch 4 times, most recently from b71510a to 9ea4a6e on June 27, 2025 at 20:01
@batrick batrick marked this pull request as ready for review June 27, 2025 20:01
Member

@gregsfortytwo gregsfortytwo left a comment


This looks okay to me at a read through, but I don't pretend to understand op batching and how it interplays with caps and the mds journal so take that with a large pinch of salt.

}
if (changed.count("mds_allow_batched_ops")) {
allow_batched_ops = g_conf().get_val<bool>("mds_allow_batched_ops");
}
Member


It's unlikely to actually cause trouble, but I think changing this config could technically allow ops to be reordered incorrectly? So maybe we don't make it changeable live?

Member Author

@batrick batrick Jun 27, 2025


I do not think so. The Locker would finish contexts in order so the ops before the new "batch head" would get to lock first.

In any case, the MDS does not promise that requests will not be reordered between clients or even for the same client. (Unless I'm forgetting some case where it matters but this constraint makes no sense to me now that you ask about it.) Edit: for client-replay it matters though.

Member


I was thinking more

  1. batching is turned on. Many ops are queued
  2. batching is turned off
  3. new op comes in that would have been batched

I guess at that point the new op gets put on a waitlist, which means it is queued behind the batch head? I was worried that it might be able to grab locks immediately and go first. (Which may also not violate any rules, given they're ops in flight simultaneously, but locking changes make me nervous!)

Member Author


I was thinking more

  1. batching is turned on. Many ops are queued
  2. batching is turned off
  3. new op comes in that would have been batched

I guess at that point the new op gets put on a waitlist, which means it is queued behind the batch head?

Yes

I was worried that it might be able to grab locks immediately and go first. (Which may also not violate any rules, given they're ops in flight simultaneously, but locking changes make me nervous!)

I do not think this should be possible. I also think if it did that it would not be incorrect.
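The queuing behavior discussed above can be illustrated with a toy model (a simplified sketch with invented names, not Ceph's actual Locker or batch-op code): an op submitted after batching is turned off still lands on the waitlist behind the batch head, so it cannot jump ahead of already-queued ops.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Toy model (not Ceph code): a "batch head" holds the lock and collects
// compatible ops; ops arriving after batching is disabled join the waitlist
// behind the head rather than grabbing locks immediately.
struct Op {
  int id;
  bool completed = false;
};

struct BatchHead {
  std::vector<Op*> batch;    // ops gathered while batching was on
  std::deque<Op*> waitlist;  // ops queued behind the head

  void submit(Op* op, bool batching_enabled) {
    if (batching_enabled) {
      batch.push_back(op);     // joins the current batch
    } else {
      waitlist.push_back(op);  // still queued behind the head, never ahead
    }
  }

  // When the head finishes, contexts complete in order: the batch first,
  // then the waitlist.
  std::vector<int> finish() {
    std::vector<int> order;
    for (auto* op : batch)    { op->completed = true; order.push_back(op->id); }
    for (auto* op : waitlist) { op->completed = true; order.push_back(op->id); }
    return order;
  }
};
```

In this model, toggling batching off mid-flight only changes which queue a new op joins; completion order is still governed by the batch head.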

@batrick batrick force-pushed the i71876 branch 2 times, most recently from f0cefd8 to 01280a9 on June 28, 2025 at 01:38
@batrick
Member Author

batrick commented Jun 28, 2025

jenkins test make check arm64

@batrick batrick force-pushed the i71876 branch 2 times, most recently from 1cd7c62 to 1756fcb on June 28, 2025 at 17:35
* triggered an unstable state change. We need to nudge the log now to
* move things along.
*/
nudged = nudge_log(lock);
Contributor


Copying here our discussion from private chat

Does it suffice to nudge once and only for the first nudgable lock? ISTM the mdlog flush is the underlying requirement, yes?

Yes. There's no reason to send multiple flushes; it may only create more work for the flush thread.

Moreover, in the past, such things have been addressed by nudging the lock during the actual state transition for a lock. E.g.: #45243

getattr does nudge the lock but the code wrongly assumes that whatever request holds the lock has already early replied.
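The nudge-once behavior agreed on above can be sketched abstractly as follows (a toy model with invented names, not the actual MDS Locker code): scan the request's locks after early reply and flush the journal at most once, for the first lock found in an unstable state.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model (not Ceph code): a single journal flush suffices to move every
// unstable lock along; repeated flushes would only create more work for the
// flush thread.
struct SimpleLock {
  std::string name;
  bool stable;
};

struct MDLog {
  int flushes = 0;
  void flush() { ++flushes; }  // stands in for the real journal flush
};

// Returns true if a nudge was issued.
bool nudge_unstable_locks(const std::vector<SimpleLock>& locks, MDLog& log) {
  for (const auto& l : locks) {
    if (!l.stable) {
      log.flush();  // one flush covers all unstable locks
      return true;  // no need to nudge again for the remaining locks
    }
  }
  return false;
}
```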

@batrick
Member Author

batrick commented Jun 30, 2025

jenkins test api

@vshankar
Contributor

Change looks fine to me. Haven't seen the batch ops disabling change, but @gregsfortytwo has already looked at those.

Contributor

@vshankar vshankar left a comment


Mostly looks good. @batrick Could we additionally document when users should disable batching?

@batrick
Member Author

batrick commented Jun 30, 2025

Mostly looks good. @batrick Could we additionally document when users should disable batching?

I view the config as an emergency debugging measure. We don't usually document those, no?

@vshankar
Contributor

Mostly looks good. @batrick Could we additionally document when users should disable batching?

I view the config as an emergency debugging measure. We don't usually document those, no?

Someone is going to fiddle with it someday and ask why it's there or complain about perf drops. We should at least mention why it's there, even if it's a simple one-liner.

@vshankar
Contributor

This PR is under test in https://tracker.ceph.com/issues/71910.

@batrick batrick requested a review from a team as a code owner July 1, 2025 01:28
@batrick
Member Author

batrick commented Jul 1, 2025

@vshankar does the doc commit I just added address your concern?

@vshankar
Contributor

vshankar commented Jul 1, 2025

@vshankar does the doc commit I just added address your concern?

LGTM. If we can add a bit more description in the mds config yaml about this, that would get referenced in the docs and that would be good enough. I have built this change in my test branch, so just the conf changes are enough to merge this if nothing shows up in testing (which I'm pretty sure will be fine).

vshankar added a commit to vshankar/ceph that referenced this pull request Jul 1, 2025
* refs/pull/64229/head:
	mds: nudge log for unstable locks after early_reply
	mds: allow disabling batch ops
@batrick
Member Author

batrick commented Jul 1, 2025

@vshankar does the doc commit I just added address your concern?

LGTM. If we can add a bit more description in the mds config yaml about this, that would get referenced in the docs and that would be good enough. I have built this change in my test branch, so just the conf changes are enough to merge this if nothing shows up in testing (which I'm pretty sure will be fine).

description updated

@batrick batrick force-pushed the i71876 branch 2 times, most recently from b78cee7 to 9bb3cd8 on July 1, 2025 at 19:33
batrick added 4 commits July 1, 2025 15:41
If there are trailing newlines in a string (like long_desc), then the generated
C++ code is invalid (because the newline will not be escaped).

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
This addresses a bug (and future ones) where batching lookup/getattr does not help "kick" the MDS into switching state more quickly (e.g. by flushing the MDS journal).

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
A getattr/lookup can cause a wrlock or xlock to become unstable after a request
(like rename) acquires it but before early reply. The MDS will not nudge the
log in this situation and the getattr/lookup will need to wait for the eventual
journal flush before the lock is released.

The log now looks like:

    2025-06-27T19:41:32.043+0000 7f11d21a9640  5 mds.0.log _submit_thread 25185408~2845 : EUpdate rename [metablob 0x1, 3 dirs]
    2025-06-27T19:41:32.043+0000 7f11d99b8640  1 -- [v2:172.21.10.4:6868/56297870,v1:172.21.10.4:6869/56297870] --> [v2:172.21.10.4:6818/999439823,v1:172.21.10.4:6819/999439823] -- osd_op(unknown.0.23:135 2.e 2:7bf7e7b5:::200.00000006:head [write 19584~2865 [fadvise_dontneed] in=2865b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e72) -- 0x563e9b730000 con 0x563e9a674800
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (dn xlock x=1 by request(client.4393:20702 nref=6 cr=0x563e9b83ae00)) on [dentry #0x1/a/file [2,head] auth (dn xlock x=1 by request(client.4393:20702 nref=6 cr=0x563e9b83ae00)) (dversion lock w=1 last_client=4393) pv=14558 v=14556 ap=2 ino=0x10000000002 remote_ino=0x0 referent_inode_ptr=(nil) referent_ino=0x0 state=1610612736 | request=1 lock=2 inodepin=1 dirty=1 waiter=0 authpin=1 0x563e95629900]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (dn xlock x=1 by request(client.4393:20702 nref=6 cr=0x563e9b83ae00)) on [dentry #0x1/b/file [2,head] auth NULL (dn xlock x=1 by request(client.4393:20702 nref=6 cr=0x563e9b83ae00)) (dversion lock w=1 last_client=4393) pv=14557 v=14555 ap=2 ino=(nil) remote_ino=0x0 referent_inode_ptr=(nil) referent_ino=0x0 state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=0 authpin=1 0x563e95629b80]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (dversion lock w=1 last_client=4393) on [dentry #0x1/a/file [2,head] auth (dn xlock x=1 by request(client.4393:20702 nref=6 cr=0x563e9b83ae00)) (dversion lock w=1 last_client=4393) pv=14558 v=14556 ap=2 ino=0x10000000002 remote_ino=0x0 referent_inode_ptr=(nil) referent_ino=0x0 state=1610612736 | request=1 lock=2 inodepin=1 dirty=1 waiter=0 authpin=1 0x563e95629900]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (dversion lock w=1 last_client=4393) on [dentry #0x1/b/file [2,head] auth NULL (dn xlock x=1 by request(client.4393:20702 nref=6 cr=0x563e9b83ae00)) (dversion lock w=1 last_client=4393) pv=14557 v=14555 ap=2 ino=(nil) remote_ino=0x0 referent_inode_ptr=(nil) referent_ino=0x0 state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=0 authpin=1 0x563e95629b80]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (ifile excl w=1) on [inode 0x10000000001 [...2,head] /b/ auth v29159 pv29165 ap=1 DIRTYPARENT f(v0 m2025-06-27T19:41:32.017971+0000) n(v0 rc2025-06-27T19:41:32.017971+0000 1=0+1) (isnap sync r=1) (inest lock w=1) (ifile excl w=1) (iversion lock w=1 last_client=4393) caps={4393=pAsLsXsFsx/-@955},l=4393 | request=0 lock=4 dirfrag=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x563e9a6f5080]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (inest lock w=1 dirty) on [inode 0x1 [...2,head] / auth v179 snaprealm=0x563e9a6ce6c0 f(v0 m2025-06-27T19:09:29.187695+0000 2=0+2) n(v45 rc2025-06-27T19:41:31.249722+0000 3=0+3)/n(v0 rc2025-06-27T19:08:43.024940+0000 1=0+1) (isnap sync r=2) (inest lock w=1 dirty) caps={4359=pAsLsXsFs/-@83,4393=pAsLsXsFs/-@1245} | dirtyscattered=1 request=0 lock=2 dirfrag=1 caps=1 openingsnapparents=0 dirty=1 authpin=0 0x563e9a6f4b00]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (inest lock w=1) on [inode 0x10000000001 [...2,head] /b/ auth v29159 pv29165 ap=1 DIRTYPARENT f(v0 m2025-06-27T19:41:32.017971+0000) n(v0 rc2025-06-27T19:41:32.017971+0000 1=0+1) (isnap sync r=1) (inest lock w=1) (ifile excl w=1) (iversion lock w=1 last_client=4393) caps={4393=pAsLsXsFsx/-@955},l=4393 | request=0 lock=4 dirfrag=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x563e9a6f5080]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (iquiesce lock w=1 last_client=4393) on [inode 0x10000000000 [...2,head] /a/ auth v29161 pv29163 ap=3 DIRTYPARENT f(v0 m2025-06-27T19:41:32.017971+0000 1=1+0) n(v0 rc2025-06-27T19:41:32.017971+0000 2=1+1) (isnap sync r=2) (inest lock w=1) (ifile lock->sync w=1) (iversion lock w=1 last_client=4393) (iquiesce lock w=1 last_client=4393) caps={4359=pAsLsXs/-@109,4393=pAsLsXs/-@955} | request=1 lock=5 dirfrag=1 caps=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x563e9a6f5600]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (isnap sync r=1) on [inode 0x10000000001 [...2,head] /b/ auth v29159 pv29165 ap=1 DIRTYPARENT f(v0 m2025-06-27T19:41:32.017971+0000) n(v0 rc2025-06-27T19:41:32.017971+0000 1=0+1) (isnap sync r=1) (inest lock w=1) (ifile excl w=1) (iversion lock w=1 last_client=4393) caps={4393=pAsLsXsFsx/-@955},l=4393 | request=0 lock=4 dirfrag=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x563e9a6f5080]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (isnap sync r=2) on [inode 0x1 [...2,head] / auth v179 snaprealm=0x563e9a6ce6c0 f(v0 m2025-06-27T19:09:29.187695+0000 2=0+2) n(v45 rc2025-06-27T19:41:31.249722+0000 3=0+3)/n(v0 rc2025-06-27T19:08:43.024940+0000 1=0+1) (isnap sync r=2) (inest lock w=1 dirty) caps={4359=pAsLsXsFs/-@83,4393=pAsLsXsFs/-@1245} | dirtyscattered=1 request=0 lock=2 dirfrag=1 caps=1 openingsnapparents=0 dirty=1 authpin=0 0x563e9a6f4b00]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (iversion lock w=1 last_client=4393) on [inode 0x10000000000 [...2,head] /a/ auth v29161 pv29163 ap=3 DIRTYPARENT f(v0 m2025-06-27T19:41:32.017971+0000 1=1+0) n(v0 rc2025-06-27T19:41:32.017971+0000 2=1+1) (isnap sync r=2) (inest lock w=1) (ifile lock->sync w=1) (iversion lock w=1 last_client=4393) (iquiesce lock w=1 last_client=4393) caps={4359=pAsLsXs/-@109,4393=pAsLsXs/-@955} | request=1 lock=5 dirfrag=1 caps=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x563e9a6f5600]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log NO (iversion lock w=1 last_client=4393) on [inode 0x10000000001 [...2,head] /b/ auth v29159 pv29165 ap=1 DIRTYPARENT f(v0 m2025-06-27T19:41:32.017971+0000) n(v0 rc2025-06-27T19:41:32.017971+0000 1=0+1) (isnap sync r=1) (inest lock w=1) (ifile excl w=1) (iversion lock w=1 last_client=4393) caps={4393=pAsLsXsFsx/-@955},l=4393 | request=0 lock=4 dirfrag=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x563e9a6f5080]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 10 mds.0.locker nudge_log YES (ifile lock->sync w=1) on [inode 0x10000000000 [...2,head] /a/ auth v29161 pv29163 ap=3 DIRTYPARENT f(v0 m2025-06-27T19:41:32.017971+0000 1=1+0) n(v0 rc2025-06-27T19:41:32.017971+0000 2=1+1) (isnap sync r=2) (inest lock w=1) (ifile lock->sync w=1) (iversion lock w=1 last_client=4393) (iquiesce lock w=1 last_client=4393) caps={4359=pAsLsXs/-@109,4393=pAsLsXs/-@955} | request=1 lock=5 dirfrag=1 caps=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x563e9a6f5600]
    2025-06-27T19:41:32.043+0000 7f11d99b8640 20 mds.0.locker : request(client.4393:20702 nref=5 cr=0x563e9b83ae00)
    2025-06-27T19:41:32.043+0000 7f11d99b8640 20 mds.0.log _submit_entry EUpdate rename [metablob 0x1, 3 dirs]

Easily reproducible with two ceph-fuse clients. One session doing:

    $ mkdir a b
    $ touch a/file
    $ while true; do mv -v a/file b/file; mv -v b/file a/file; done

and the other

    $ while true; do stat a/file; done

You can observe the rename/stat stalls (up to 5 seconds; MDS tick interval) without this patch.

Fixes: https://tracker.ceph.com/issues/71876
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
@vshankar
Contributor

vshankar commented Jul 7, 2025

This is mostly ready to be merged. (waiting on run wiki update)

Contributor

@vshankar vshankar left a comment

