Skip to content

mds: only authpin on wrlock when not a locallock#58861

Merged
vshankar merged 1 commit intoceph:mainfrom
batrick:i65851
Aug 8, 2024
Merged

mds: only authpin on wrlock when not a locallock#58861
vshankar merged 1 commit intoceph:mainfrom
batrick:i65851

Conversation

@batrick
Copy link
Member

@batrick batrick commented Jul 25, 2024

For example:

2024-07-22T21:48:18.372+0000 7f4751a3d700  7 mds.6.server dispatch_client_request client_request(client.4748:62187 create owner_uid=1000, owner_gid=1000 #0x1000000148d/file.mdtest.145.31073 2024-07-22T21:48:18.371416+0000 caller_uid=1000, caller_gid=1000{})
2024-07-22T21:48:18.372+0000 7f4751a3d700  7 mds.6.server open w/ O_CREAT on #0x1000000148d/file.mdtest.145.31073
2024-07-22T21:48:18.372+0000 7f4751a3d700 10 mds.6.server rdlock_path_xlock_dentry request(client.4748:62187 nref=2 cr=0x55649752bc00) #0x1000000148d/file.mdtest.145.31073
2024-07-22T21:48:18.372+0000 7f4751a3d700  7 mds.6.cache traverse: opening base ino 0x1000000148d snap head
2024-07-22T21:48:18.372+0000 7f4751a3d700 10 mds.6.locker try_rdlock_snap_layout request(client.4748:62187 nref=3 cr=0x55649752bc00) [inode 0x1000000148d [...2,head] /io500/io500/datafiles/2024.07.22-21.44.41/mdtest-easy/test-dir.0-0/mdtest_tree.205.0/ rep@0.1 fragtree_t(*^3) v101539 f(v20 m2024-07-22T21:48:14.075500+0000 30468=30468+0) n(v20 rc2024-07-22T21:48:14.075500+0000 30469=30468+1)/n(v0 rc2024-07-22T21:44:53.839831+0000 155=154+1) (idft lock) (isnap sync r=53) (inest mix w=101 dirty) (ipolicy sync r=53) (ifile mix w=52 dirty) (iquiesce lock w=52 last_client=4748) caps={4748=pAsLsXs/-@30914} | dirtyscattered=2 request=53 lock=5 importing=0 dirfrag=8 caps=1 waiter=0 export_pin=6 0x556494a71b80]
2024-07-22T21:48:18.372+0000 7f4751a3d700 12 mds.6.cache traverse: path seg depth 0 'file.mdtest.145.31073' snapid head
2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.cache.dir(0x1000000148d.011*) lookup (file.mdtest.145.31073, 'head')
2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.cache.dir(0x1000000148d.011*)   hit -> (file.mdtest.145.31073,head)
2024-07-22T21:48:18.372+0000 7f4751a3d700 10 mds.6.locker acquire_locks request(client.4748:62187 nref=3 cr=0x55649752bc00)
2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.locker  lov = [LockOp(l=(ifile mix w=52 dirty),f=0x2),LockOp(l=(inest mix w=101 dirty),f=0x2),LockOp(l=(iauth sync),f=0x1),LockOp(l=(dn sync),f=0x4)]
2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.locker  auth_pin_nonblocking=0
2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.locker  must wrlock (ifile mix w=52 dirty) [inode 0x1000000148d [...2,head] /io500/io500/datafiles/2024.07.22-21.44.41/mdtest-easy/test-dir.0-0/mdtest_tree.205.0/ rep@0.1 fragtree_t(*^3) v101539 f(v20 m2024-07-22T21:48:14.075500+0000 30468=30468+0) n(v20 rc2024-07-22T21:48:14.075500+0000 30469=30468+1)/n(v0 rc2024-07-22T21:44:53.839831+0000 155=154+1) (idft lock) (isnap sync r=53) (inest mix w=101 dirty) (ipolicy sync r=53) (ifile mix w=52 dirty) (iquiesce lock w=52 last_client=4748) caps={4748=pAsLsXs/-@30914} | dirtyscattered=2 request=53 lock=5 importing=0 dirfrag=8 caps=1 waiter=0 export_pin=6 0x556494a71b80]
2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.locker need shared quiesce lock for LockOp(l=(ifile mix w=52 dirty),f=0x2) on ifile of 0x556494a71b80
2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.locker  must wrlock (iquiesce lock w=52 last_client=4748) [inode 0x1000000148d [...2,head] /io500/io500/datafiles/2024.07.22-21.44.41/mdtest-easy/test-dir.0-0/mdtest_tree.205.0/ rep@0.1 fragtree_t(*^3) v101539 f(v20 m2024-07-22T21:48:14.075500+0000 30468=30468+0) n(v20 rc2024-07-22T21:48:14.075500+0000 30469=30468+1)/n(v0 rc2024-07-22T21:44:53.839831+0000 155=154+1) (idft lock) (isnap sync r=53) (inest mix w=101 dirty) (ipolicy sync r=53) (ifile mix w=52 dirty) (iquiesce lock w=52 last_client=4748) caps={4748=pAsLsXs/-@30914} | dirtyscattered=2 request=53 lock=5 importing=0 dirfrag=8 caps=1 waiter=0 export_pin=6 0x556494a71b80]
2024-07-22T21:48:18.372+0000 7f4751a3d700 15 mds.6.locker  will also auth_pin [inode 0x1000000148d [...2,head] /io500/io500/datafiles/2024.07.22-21.44.41/mdtest-easy/test-dir.0-0/mdtest_tree.205.0/ rep@0.1 fragtree_t(*^3) v101539 f(v20 m2024-07-22T21:48:14.075500+0000 30468=30468+0) n(v20 rc2024-07-22T21:48:14.075500+0000 30469=30468+1)/n(v0 rc2024-07-22T21:44:53.839831+0000 155=154+1) (idft lock) (isnap sync r=53) (inest mix w=101 dirty) (ipolicy sync r=53) (ifile mix w=52 dirty) (iquiesce lock w=52 last_client=4748) caps={4748=pAsLsXs/-@30914} | dirtyscattered=2 request=53 lock=5 importing=0 dirfrag=8 caps=1 waiter=0 export_pin=6 0x556494a71b80] in case we need to request a scatter

Adding the wrlock on ifile adds iquiesce. Consequently, examining the wrlock on iquiesce will add an authpin for the inode.

The code for adding authpins on a wrlock should normally no-op because the xlock which adds the versionlock (dn or inode) already has added the authpin (and the MDS would be auth for the metadata too). In the case of the quiescelock, we may not be auth so an authpin can be very expensive and unnecessary. We only require an authpin for an xlock on the quiescelock when auth.

Fixes: https://tracker.ceph.com/issues/65851

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

For example:

    2024-07-22T21:48:18.372+0000 7f4751a3d700  7 mds.6.server dispatch_client_request client_request(client.4748:62187 create owner_uid=1000, owner_gid=1000 #0x1000000148d/file.mdtest.145.31073 2024-07-22T21:48:18.371416+0000 caller_uid=1000, caller_gid=1000{})
    2024-07-22T21:48:18.372+0000 7f4751a3d700  7 mds.6.server open w/ O_CREAT on #0x1000000148d/file.mdtest.145.31073
    2024-07-22T21:48:18.372+0000 7f4751a3d700 10 mds.6.server rdlock_path_xlock_dentry request(client.4748:62187 nref=2 cr=0x55649752bc00) #0x1000000148d/file.mdtest.145.31073
    2024-07-22T21:48:18.372+0000 7f4751a3d700  7 mds.6.cache traverse: opening base ino 0x1000000148d snap head
    2024-07-22T21:48:18.372+0000 7f4751a3d700 10 mds.6.locker try_rdlock_snap_layout request(client.4748:62187 nref=3 cr=0x55649752bc00) [inode 0x1000000148d [...2,head] /io500/io500/datafiles/2024.07.22-21.44.41/mdtest-easy/test-dir.0-0/mdtest_tree.205.0/ rep@0.1 fragtree_t(*^3) v101539 f(v20 m2024-07-22T21:48:14.075500+0000 30468=30468+0) n(v20 rc2024-07-22T21:48:14.075500+0000 30469=30468+1)/n(v0 rc2024-07-22T21:44:53.839831+0000 155=154+1) (idft lock) (isnap sync r=53) (inest mix w=101 dirty) (ipolicy sync r=53) (ifile mix w=52 dirty) (iquiesce lock w=52 last_client=4748) caps={4748=pAsLsXs/-@30914} | dirtyscattered=2 request=53 lock=5 importing=0 dirfrag=8 caps=1 waiter=0 export_pin=6 0x556494a71b80]
    2024-07-22T21:48:18.372+0000 7f4751a3d700 12 mds.6.cache traverse: path seg depth 0 'file.mdtest.145.31073' snapid head
    2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.cache.dir(0x1000000148d.011*) lookup (file.mdtest.145.31073, 'head')
    2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.cache.dir(0x1000000148d.011*)   hit -> (file.mdtest.145.31073,head)
    2024-07-22T21:48:18.372+0000 7f4751a3d700 10 mds.6.locker acquire_locks request(client.4748:62187 nref=3 cr=0x55649752bc00)
    2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.locker  lov = [LockOp(l=(ifile mix w=52 dirty),f=0x2),LockOp(l=(inest mix w=101 dirty),f=0x2),LockOp(l=(iauth sync),f=0x1),LockOp(l=(dn sync),f=0x4)]
    2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.locker  auth_pin_nonblocking=0
    2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.locker  must wrlock (ifile mix w=52 dirty) [inode 0x1000000148d [...2,head] /io500/io500/datafiles/2024.07.22-21.44.41/mdtest-easy/test-dir.0-0/mdtest_tree.205.0/ rep@0.1 fragtree_t(*^3) v101539 f(v20 m2024-07-22T21:48:14.075500+0000 30468=30468+0) n(v20 rc2024-07-22T21:48:14.075500+0000 30469=30468+1)/n(v0 rc2024-07-22T21:44:53.839831+0000 155=154+1) (idft lock) (isnap sync r=53) (inest mix w=101 dirty) (ipolicy sync r=53) (ifile mix w=52 dirty) (iquiesce lock w=52 last_client=4748) caps={4748=pAsLsXs/-@30914} | dirtyscattered=2 request=53 lock=5 importing=0 dirfrag=8 caps=1 waiter=0 export_pin=6 0x556494a71b80]
    2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.locker need shared quiesce lock for LockOp(l=(ifile mix w=52 dirty),f=0x2) on ifile of 0x556494a71b80
    2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.locker  must wrlock (iquiesce lock w=52 last_client=4748) [inode 0x1000000148d [...2,head] /io500/io500/datafiles/2024.07.22-21.44.41/mdtest-easy/test-dir.0-0/mdtest_tree.205.0/ rep@0.1 fragtree_t(*^3) v101539 f(v20 m2024-07-22T21:48:14.075500+0000 30468=30468+0) n(v20 rc2024-07-22T21:48:14.075500+0000 30469=30468+1)/n(v0 rc2024-07-22T21:44:53.839831+0000 155=154+1) (idft lock) (isnap sync r=53) (inest mix w=101 dirty) (ipolicy sync r=53) (ifile mix w=52 dirty) (iquiesce lock w=52 last_client=4748) caps={4748=pAsLsXs/-@30914} | dirtyscattered=2 request=53 lock=5 importing=0 dirfrag=8 caps=1 waiter=0 export_pin=6 0x556494a71b80]
    2024-07-22T21:48:18.372+0000 7f4751a3d700 15 mds.6.locker  will also auth_pin [inode 0x1000000148d [...2,head] /io500/io500/datafiles/2024.07.22-21.44.41/mdtest-easy/test-dir.0-0/mdtest_tree.205.0/ rep@0.1 fragtree_t(*^3) v101539 f(v20 m2024-07-22T21:48:14.075500+0000 30468=30468+0) n(v20 rc2024-07-22T21:48:14.075500+0000 30469=30468+1)/n(v0 rc2024-07-22T21:44:53.839831+0000 155=154+1) (idft lock) (isnap sync r=53) (inest mix w=101 dirty) (ipolicy sync r=53) (ifile mix w=52 dirty) (iquiesce lock w=52 last_client=4748) caps={4748=pAsLsXs/-@30914} | dirtyscattered=2 request=53 lock=5 importing=0 dirfrag=8 caps=1 waiter=0 export_pin=6 0x556494a71b80] in case we need to request a scatter

Adding the wrlock on ifile adds iquiesce. Consequently, examining the wrlock on
iquiesce will add an authpin for the inode.

The code for adding authpins on a wrlock should normally no-op because the
xlock which adds the versionlock (dn or inode) already has added the authpin
(and the MDS would be auth for the metadata too). In the case of the
quiescelock, we may not be auth so an authpin can be very expensive and
unnecessary. We only require an authpin for an xlock on the quiescelock when
auth.

Fixes: https://tracker.ceph.com/issues/65851
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
@batrick batrick requested review from a team and markhpc July 25, 2024 18:57
@markhpc
Copy link
Member

markhpc commented Jul 25, 2024

Good news and bad news. The good news is that in an isolated test of mdtest easy write, this restored performance to at least previous levels (ie we are definitely on the right track here):

[RESULT]    mdtest-easy-write      130.980640 kIOPS : time 340.898 seconds

The bad news is that when I tried to do a full io500 run, io completely stalled at some point. I saw this message in several mds logs:

2024-07-25T20:22:20.512+0000 7fdca33fe700  3 quiesce.mds.8 <quiesce_dispatch> error (-116) submitting q-db[v:(45:0) sets:0/0] from 4365

and this in rank 0:

2024-07-25T20:19:20.975+0000 7f0be98f2700 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
2024-07-25T20:22:15.981+0000 7f0be98f2700 -1 mds.pinger is_rank_lagging: rank=5 was never sent ping request.
2024-07-25T20:22:20.982+0000 7f0be98f2700 -1 mds.pinger is_rank_lagging: rank=10 was never sent ping request.
2024-07-25T20:22:25.982+0000 7f0be98f2700 -1 mds.pinger is_rank_lagging: rank=15 was never sent ping request.
2024-07-25T20:22:30.982+0000 7f0be98f2700 -1 mds.pinger is_rank_lagging: rank=20 was never sent ping request.
2024-07-25T20:39:51.666+0000 7f0bee0fb700  0 --1- [v2:172.21.67.18:6860/445211041,v1:172.21.67.18:6861/445211041] >> v1:172.21.67.16:6855/1197928650 conn(0x564a1b11ac00 0x564a1bb64000 :6861 s=OPENED pgs=14 cs=1 l=0).fault initiating reconnect
2024-07-25T20:39:56.198+0000 7f0bef8fe700  0 --1- [v2:172.21.67.18:6860/445211041,v1:172.21.67.18:6861/445211041] >> v1:172.21.67.17:6859/3016582852 conn(0x564a1b858000 0x564a1b81b800 :-1 s=OPENED pgs=17 cs=1 l=0).fault initiating reconnect

I'll start up a new cluster and try again just to see if the behavior repeats.

@markhpc
Copy link
Member

markhpc commented Jul 25, 2024

On a second run, I see <quiesce_dispatch> error (-116) even prior to the start of the mdtest easy write phase, so it's possible that's a red herring.

Edit: I was able to get through a full run this time. Now doing some comparisons with prior builds.

@batrick
Copy link
Member Author

batrick commented Jul 25, 2024

I think that -116 error can be expected when adjusting max_mds.

The I/O stall is probably unrelated although worth looking into.

@batrick
Copy link
Member Author

batrick commented Jul 26, 2024

jenkins test api

@batrick
Copy link
Member Author

batrick commented Jul 26, 2024

jenkins test make check arm64

@markhpc
Copy link
Member

markhpc commented Jul 26, 2024

Some results (not directly comparable to earlier results due to a combination of HW failure and kernel upgrades):

Test Measurement 18.2.2 19.0.0-1762 19.0.0-1807 19.0.0-3623 + PR#58861
ior-easy-write GiB/s 15.06 15.11 15.10 15.33
mdtest-easy-write kIOPS 126.87 130.74 38.67 133.79
timestamp kIOPS 0.00 0.00 0.00 0.00
ior-hard-write GiB/s 5.47 6.29 5.87 6.13
mdtest-hard-write kIOPS 23.80 6.28 6.52 6.11
find kIOPS 299.73 565.14 357.25 570.77
ior-easy-read GiB/s 23.85 24.51 24.87 24.58
mdtest-easy-stat kIOPS 281.24 262.78 357.25 248.37
ior-hard-read GiB/s 8.72 9.58 10.49 9.54
mdtest-hard-stat kIOPS 184.92 35.39 36.07 36.58
mdtest-easy-delete kIOPS 57.01 66.89 31.12 68.53
mdtest-hard-read kIOPS 39.50 35.45 35.03 33.81
mdtest-hard-delete kIOPS 26.34 5.73 5.64 5.80
Bandwidth GiB/s 11.44 12.22 12.33 12.18
IOPS kIOPS 85.26 52.60 38.69 52.42
Total Score 31.23 25.36 21.84 25.27

@batrick
Copy link
Member Author

batrick commented Jul 26, 2024

Let's open a tracker for mdtest-hard-write and mdtest-hard-stat. Those are worth looking into further but it's good we resolved the major regression.

@batrick
Copy link
Member Author

batrick commented Jul 26, 2024

This PR is under test in https://tracker.ceph.com/issues/67214.

@batrick
Copy link
Member Author

batrick commented Aug 6, 2024

This PR is under test in https://tracker.ceph.com/issues/67214.

Ready for merge. cc @vshankar

Copy link
Member

@markhpc markhpc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@vshankar vshankar merged commit e656af9 into ceph:main Aug 8, 2024
@batrick batrick deleted the i65851 branch September 18, 2024 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants