Bug #65851: MDS Squid Metadata Performance Regression (Closed)
Description
Found during 21-MDS IO500 runs comparing v18.2.2 to v19.0.0.
| Ceph Version | Measurement | v18.2.2 | v18.2.2 | v19.0.0-2502 | v19.0.0-2502 |
| ---------------------------- | --------------- | --------------- | --------------- | --------------- | --------------- |
| mdtest-easy pinning strategy | | n-1 round-robin | n-1 round-robin | n-1 round-robin | n-1 round-robin |
| API | | kernel | libcephfs | kernel | libcephfs |
| LazyIO | | no | yes | no | yes |
| Msgr Threads | | 3 | 3 | 3 | 3 |
| Test | | Result | Result | Result | Result |
| ior-easy-write | GiB/s | 11.53 | 29.22 | 11.71 | 28.97 |
| mdtest-easy-write | kIOPS | 152.44 | 127.72 | 47.41 | 38.00 |
| timestamp | kIOPS | 0.00 | 0.00 | 0.00 | 0.00 |
| ior-hard-write | GiB/s | 0.39 | 7.49 | 0.40 | 8.82 |
| mdtest-hard-write | kIOPS | 9.69 | 22.09 | 7.05 | 6.09 |
| find | kIOPS | 478.83 | 344.13 | 232.55 | 278.90 |
| ior-easy-read | GiB/s | 27.53 | 33.65 | 31.84 | 33.40 |
| mdtest-easy-stat | kIOPS | 234.07 | 260.74 | 232.42 | 340.22 |
| ior-hard-read | GiB/s | 3.50 | 10.94 | 3.18 | 12.97 |
| mdtest-hard-stat | kIOPS | 112.25 | 119.67 | 36.51 | 35.68 |
| mdtest-easy-delete | kIOPS | 52.00 | 49.41 | 29.20 | 30.02 |
| mdtest-hard-read | kIOPS | 44.67 | 42.09 | 7.19 | 35.57 |
| mdtest-hard-delete | kIOPS | 10.04 | 36.11 | 5.56 | 5.58 |
| Bandwidth | GiB/s | 4.56 | 16.85 | 4.66 | 18.24 |
| IOPS | kIOPS | 67.55 | 83.11 | 30.61 | 38.34 |
| Total | Score | 17.55 | 37.42 | 11.95 | 26.45 |
A bisection of the mdtest-easy results indicates https://github.com/ceph/ceph/commit/a8cbcde03951adabe13442c2e8b244794ebd86dd as the likely culprit.
| Version | Step | mdtest-easy-write (kIOPS) |
| ----------------------- | ---- | ------------------------- |
| 18.2.2 | 1 | 127.72 |
| v19.0.0-898 (1112689d) | 2 | 130.89 |
| v19.0.0-1700 (a499fbd) | 3 | 136.30 |
| v19.0.0-1762 (0618afe6) | 7 | 133.60 |
| v19.0.0-1794 (7bdc74ec) | 6 | stalled (fast?) |
| v19.0.0-1806 (7dacd283) | 11 | stalled (fast?) |
| v19.0.0-1807 (a8cbcde0) | 14 | 39.03 |
| v19.0.0-1809 (45c18be1) | 13 | 38.92 |
| v19.0.0-1811 (36d314c0) | 12 | 38.59 |
| v19.0.0-1815 (5fdf9c56) | 10 | 38.06 |
| v19.0.0-1816 (9d7f5b04) | 9 | 38.59 |
| v19.0.0-1825 (2f8b96de) | 8 | 37.94 |
| v19.0.0-1893 (cc7ac758) | 5 | 38.27 |
| v19.0.0-2099 (c531a773) | 4 | 38.92 |
| v19.0.0-2502 (742a2451) | 0 | 38.00 |
Updated by Patrick Donnelly almost 2 years ago
- Assignee set to Patrick Donnelly
- Priority changed from Normal to Urgent
- Target version changed from v19.1.0 to v20.0.0
- Source set to Development
- Backport set to squid
- Affected Versions v19.0.0 added
- Component(FS) MDS, quiesce added
Updated by Patrick Donnelly over 1 year ago
- Status changed from New to In Progress
Updated by Mark Nelson over 1 year ago
To further debug this, I spent some time today re-testing a8cbcde0.
One interesting bit: after the benchmarks completed, I see each MDS spinning at about 40% CPU usage on one core.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
140508 root 20 0 3019000 165944 17692 R 40.0 0.1 25:35.78 quiesce_db_mgr
and in uwpmp, here's what that thread is doing:
Thread 140520 (quiesce_db_mgr) - 1000 samples
+ 100.00% clone
 + 100.00% start_thread
  + 100.00% QuiesceDbManager::quiesce_db_thread_main()
   + 89.80% pthread_cond_timedwait
   + 6.40% __pthread_mutex_unlock_usercnt
   |+ 6.40% __lll_unlock_wake
   + 1.30% std::chrono::_V2::steady_clock::now()
   |+ 1.20% __clock_gettime
   | + 1.10% __clock_gettime
   |  + 1.10% ???
   + 0.40% QuiesceDbManager::complete_requests()
   |+ 0.30% __memset_avx2_unaligned
   |+ 0.10% __memset_avx2_unaligned_erms
   + 0.40% tc_new
   + 0.20% std::chrono::_V2::system_clock::now()
   |+ 0.10% __clock_gettime
   | + 0.10% __clock_gettime
   |  + 0.10% ???
   + 0.10% QuiesceDbManager::membership_upkeep()
   + 0.10% QuiesceDbManager::db_thread_has_work() const
   + 0.10% __clock_gettime
during the mdtest-easy-write run, I see the async messenger threads working very hard on one MDS (the authoritative MDS for the parent directory):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
139300 root 20 0 10.6g 464264 17552 R 87.4 0.4 22:58.79 msgr-worker-1
139301 root 20 0 10.6g 464264 17552 R 87.4 0.4 23:16.55 msgr-worker-2
139299 root 20 0 10.6g 464264 17552 R 86.0 0.4 23:52.50 msgr-worker-0
139305 root 20 0 10.6g 464264 17552 R 80.7 0.4 25:58.89 ms_dispatch
139425 root 20 0 10.6g 464264 17552 R 35.9 0.4 29:42.90 quiesce_db_mgr
Finally, attached is a full wallclock profile of that MDS during the mdtest-easy-write workload. It looks like considerable time is spent in lock contention. See lines 370-373, 457, 477, etc.
Updated by Patrick Donnelly over 1 year ago
(Just noting here in response to Mark's last comment that he was probably running into #65276. We discussed offline.)
Updated by Mark Nelson over 1 year ago
Just a side note, that 40% CPU usage appears to have indeed been caused by #65276 and is fixed in squid HEAD, but did not resolve the performance regression.
Updated by Patrick Donnelly over 1 year ago
2024-07-22T21:48:18.372+0000 7f4751a3d700 7 mds.6.server dispatch_client_request client_request(client.4748:62187 create owner_uid=1000, owner_gid=1000 #0x1000000148d/file.mdtest.145.31073 2024-07-22T21:48:18.371416+0000 caller_uid=1000, caller_gid=1000{})
2024-07-22T21:48:18.372+0000 7f4751a3d700 7 mds.6.server open w/ O_CREAT on #0x1000000148d/file.mdtest.145.31073
2024-07-22T21:48:18.372+0000 7f4751a3d700 10 mds.6.server rdlock_path_xlock_dentry request(client.4748:62187 nref=2 cr=0x55649752bc00) #0x1000000148d/file.mdtest.145.31073
2024-07-22T21:48:18.372+0000 7f4751a3d700 7 mds.6.cache traverse: opening base ino 0x1000000148d snap head
2024-07-22T21:48:18.372+0000 7f4751a3d700 10 mds.6.locker try_rdlock_snap_layout request(client.4748:62187 nref=3 cr=0x55649752bc00) [inode 0x1000000148d [...2,head] /io500/io500/datafiles/2024.07.22-21.44.41/mdtest-easy/test-dir.0-0/mdtest_tree.205.0/ rep@0.1 fragtree_t(*^3) v101539 f(v20 m2024-07-22T21:48:14.075500+0000 30468=30468+0) n(v20 rc2024-07-22T21:48:14.075500+0000 30469=30468+1)/n(v0 rc2024-07-22T21:44:53.839831+0000 155=154+1) (idft lock) (isnap sync r=53) (inest mix w=101 dirty) (ipolicy sync r=53) (ifile mix w=52 dirty) (iquiesce lock w=52 last_client=4748) caps={4748=pAsLsXs/-@30914} | dirtyscattered=2 request=53 lock=5 importing=0 dirfrag=8 caps=1 waiter=0 export_pin=6 0x556494a71b80]
2024-07-22T21:48:18.372+0000 7f4751a3d700 12 mds.6.cache traverse: path seg depth 0 'file.mdtest.145.31073' snapid head
2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.cache.dir(0x1000000148d.011*) lookup (file.mdtest.145.31073, 'head')
2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.cache.dir(0x1000000148d.011*) hit -> (file.mdtest.145.31073,head)
2024-07-22T21:48:18.372+0000 7f4751a3d700 10 mds.6.locker acquire_locks request(client.4748:62187 nref=3 cr=0x55649752bc00)
2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.locker lov = [LockOp(l=(ifile mix w=52 dirty),f=0x2),LockOp(l=(inest mix w=101 dirty),f=0x2),LockOp(l=(iauth sync),f=0x1),LockOp(l=(dn sync),f=0x4)]
2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.locker auth_pin_nonblocking=0
2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.locker must wrlock (ifile mix w=52 dirty) [inode 0x1000000148d [...2,head] /io500/io500/datafiles/2024.07.22-21.44.41/mdtest-easy/test-dir.0-0/mdtest_tree.205.0/ rep@0.1 fragtree_t(*^3) v101539 f(v20 m2024-07-22T21:48:14.075500+0000 30468=30468+0) n(v20 rc2024-07-22T21:48:14.075500+0000 30469=30468+1)/n(v0 rc2024-07-22T21:44:53.839831+0000 155=154+1) (idft lock) (isnap sync r=53) (inest mix w=101 dirty) (ipolicy sync r=53) (ifile mix w=52 dirty) (iquiesce lock w=52 last_client=4748) caps={4748=pAsLsXs/-@30914} | dirtyscattered=2 request=53 lock=5 importing=0 dirfrag=8 caps=1 waiter=0 export_pin=6 0x556494a71b80]
2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.locker need shared quiesce lock for LockOp(l=(ifile mix w=52 dirty),f=0x2) on ifile of 0x556494a71b80
2024-07-22T21:48:18.372+0000 7f4751a3d700 20 mds.6.locker must wrlock (iquiesce lock w=52 last_client=4748) [inode 0x1000000148d [...2,head] /io500/io500/datafiles/2024.07.22-21.44.41/mdtest-easy/test-dir.0-0/mdtest_tree.205.0/ rep@0.1 fragtree_t(*^3) v101539 f(v20 m2024-07-22T21:48:14.075500+0000 30468=30468+0) n(v20 rc2024-07-22T21:48:14.075500+0000 30469=30468+1)/n(v0 rc2024-07-22T21:44:53.839831+0000 155=154+1) (idft lock) (isnap sync r=53) (inest mix w=101 dirty) (ipolicy sync r=53) (ifile mix w=52 dirty) (iquiesce lock w=52 last_client=4748) caps={4748=pAsLsXs/-@30914} | dirtyscattered=2 request=53 lock=5 importing=0 dirfrag=8 caps=1 waiter=0 export_pin=6 0x556494a71b80]
2024-07-22T21:48:18.372+0000 7f4751a3d700 15 mds.6.locker will also auth_pin [inode 0x1000000148d [...2,head] /io500/io500/datafiles/2024.07.22-21.44.41/mdtest-easy/test-dir.0-0/mdtest_tree.205.0/ rep@0.1 fragtree_t(*^3) v101539 f(v20 m2024-07-22T21:48:14.075500+0000 30468=30468+0) n(v20 rc2024-07-22T21:48:14.075500+0000 30469=30468+1)/n(v0 rc2024-07-22T21:44:53.839831+0000 155=154+1) (idft lock) (isnap sync r=53) (inest mix w=101 dirty) (ipolicy sync r=53) (ifile mix w=52 dirty) (iquiesce lock w=52 last_client=4748) caps={4748=pAsLsXs/-@30914} | dirtyscattered=2 request=53 lock=5 importing=0 dirfrag=8 caps=1 waiter=0 export_pin=6 0x556494a71b80] in case we need to request a scatter
Adding the wrlock on ifile adds iquiesce. The code for adding authpins on a wrlock would normally no-op, because the xlock which adds the versionlock (dn or inode) has already added the authpin (and the MDS would be auth for that metadata too). In the case of the quiescelock, we may not be auth, so an authpin can be very expensive and unnecessary. We only require an authpin for an xlock on the quiescelock when auth.
Updated by Patrick Donnelly over 1 year ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 58861
Updated by Venky Shankar over 1 year ago
- Status changed from Fix Under Review to Pending Backport
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #67438: squid: MDS Squid Metadata Performance Regression added
Updated by Upkeep Bot over 1 year ago
- Tags (freeform) set to backport_processed
Updated by Upkeep Bot 9 months ago
- Status changed from Pending Backport to Resolved
- Upkeep Timestamp set to 2025-07-08T18:46:17+00:00
Updated by Upkeep Bot 8 months ago
- Merge Commit set to e656af98d809f641cb0c7971e9ceb4d81c4a1658
- Fixed In set to v19.3.0-4150-ge656af98d8
- Upkeep Timestamp changed from 2025-07-08T18:46:17+00:00 to 2025-08-02T04:51:25+00:00
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~2260
- Upkeep Timestamp changed from 2025-08-02T04:51:25+00:00 to 2025-11-01T01:36:14+00:00