Actions
Bug #66219
closed[quiesce timeout] QuiesceAgent may send an async QUIESCED ack before the QuiesceManager does the sync QUIESCING ack, which causes the QUIESCED ack to be lost
% Done:
0%
Source:
Q/A
Backport:
squid
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
quiesce
Labels (FS):
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
v19.3.0-2435-g25e4ee2fa7
Released In:
v20.2.0~2814
Upkeep Timestamp:
2025-11-01T01:11:28+00:00
Description
A lone outstanding op:
7724309/teuthology.log:2024-05-24T06:39:00.254 DEBUG:tasks.quiescer.fs.[cephfs]:Outstanding op at rank 0 (a) for 870dafb6-1: 'client_request(client.4773:10 mkdir owner_uid=1000, owner_gid=1316 #0x10000000000/tmp 2024-05-24T06:38:26.594414+0000 caller_uid=1000, caller_gid=1316{6,36,1000,1316,})'
points to a lost ack.
Looking at the db leader (mds.a):
2024-05-24T06:39:43.042+0000 7f82c6df6640 20 quiesce.mgr.4296 <check_peer_reports> [870dafb6@9,file:/] up_to_date_peers: 5 min_reported_state: QS_QUIESCING max_reported_state: QS_QUIESCED peer_acks: {{QS_QUIESCING=4293,(16:9),QS_QUIESCED=4288,(16:9),QS_QUIESCED=4290,(16:9),QS_QUIESCED=4294,(16:9),QS_QUIESCED=4296,(16:9)}}
gid 4293 is mds.e:
2024-05-24T06:38:13.042+0000 7fc08dad8640 20 quiesce.mgr.4293 <quiesce_db_thread_main> notifying agent with db version (16:9) 2024-05-24T06:38:13.042+0000 7fc08dad8640 20 quiesce.agt <db_update> got a db update version (16:9) with 1 roots 2024-05-24T06:38:13.042+0000 7fc08e2d9640 20 quiesce.agt <agent_thread_main> old = (16:8), current = (16:9) ... 2024-05-24T06:38:13.042+0000 7fc08e2d9640 20 quiesce.agt <agent_thread_main> asyncrhonous ack for a new version: q-map[v:(16:9) roots:1/0] 2024-05-24T06:38:13.042+0000 7fc08e2d9640 10 quiesce.mds.2 <operator()> sending ack q-map[v:(16:9) roots:1/0] to the leader 4296 ... 2024-05-24T06:38:13.042+0000 7fc08dad8640 20 quiesce.mgr.4293 <quiesce_db_thread_main> synchronous agent ack: q-map[v:(16:9) roots:0/0] 2024-05-24T06:38:13.042+0000 7fc08dad8640 10 quiesce.mds.2 <operator()> sending ack q-map[v:(16:9) roots:0/0] to the leader 4296
we see that the agent main thread took over and completed the quiesce, sending an `async ack` to the leader before the quiesce.mgr got chance to return from the agent update and then send its outdated `sync ack`
Updated by Leonid Usov almost 2 years ago
https://pulpito.ceph.com/leonidus-2024-05-24_04:50:13-fs-wip-lusov-quiesce-distro-default-smithi/7724310/ is the same issue.
Now that we know what to look for, we can see this from the leader log:
=== recording 4308 as QUIESCED:
2024-05-24T06:37:26.466+0000 7f9670644640 20 quiesce.mgr.4311 <leader_record_ack> ack q-map[v:(17:5) roots:1/0] from peer 4308
2024-05-24T06:37:26.466+0000 7f9670644640 20 quiesce.mgr.4311 <check_peer_reports> [5c1026b8@5,file:/] up_to_date_peers: 5 min_reported_state: QS_QUIESCING max_reported_state: QS_QUIESCED peer_acks: {{QS_QUIESCING=4294,(17:5),QS_QUIESCING=4311,(17:5),QS_QUIESCED=4287,(17:5),QS_QUIESCED=4293,(17:5),QS_QUIESCED=4308,(17:5)}}
2024-05-24T06:37:26.466+0000 7f9670644640 20 quiesce.mgr.4311 <check_peer_reports> [5c1026b8@5,file:/] up_to_date_peers: 5 min_reported_state: QS_QUIESCING max_reported_state: QS_QUIESCED peer_acks: {{QS_QUIESCING=4294,(17:5),QS_QUIESCING=4311,(17:5),QS_QUIESCED=4287,(17:5),QS_QUIESCED=4293,(17:5),QS_QUIESCED=4308,(17:5)}}
=== overwriting the report as QUIESCING:
2024-05-24T06:37:26.466+0000 7f9670644640 20 quiesce.mgr.4311 <leader_record_ack> ack q-map[v:(17:5) roots:0/0] from peer 4308
2024-05-24T06:37:26.466+0000 7f9670644640 20 quiesce.mgr.4311 <check_peer_reports> [5c1026b8@5,file:/] up_to_date_peers: 5 min_reported_state: QS_QUIESCING max_reported_state: QS_QUIESCED peer_acks: {{QS_QUIESCING=4294,(17:5),QS_QUIESCING=4308,(17:5),QS_QUIESCING=4311,(17:5),QS_QUIESCED=4287,(17:5),QS_QUIESCED=4293,(17:5)}}
Updated by Leonid Usov almost 2 years ago
- Subject changed from QuiesceAgent may send an async QUIESCED ack before the QuiesceManager does the sync QUIESCING ack, which causes the QUIESCED ack to be lost to [quiesce timeout] QuiesceAgent may send an async QUIESCED ack before the QuiesceManager does the sync QUIESCING ack, which causes the QUIESCED ack to be lost
Updated by Leonid Usov almost 2 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 57579
Updated by Patrick Donnelly almost 2 years ago
- Status changed from Fix Under Review to Pending Backport
- Target version set to v20.0.0
- Source set to Q/A
Updated by Upkeep Bot almost 2 years ago
- Copied to Backport #66256: squid: [quiesce timeout] QuiesceAgent may send an async QUIESCED ack before the QuiesceManager does the sync QUIESCING ack, which causes the QUIESCED ack to be lost added
Updated by Leonid Usov almost 2 years ago
- Status changed from Pending Backport to Resolved
Updated by Upkeep Bot 9 months ago
- Merge Commit set to 25e4ee2fa7e8913bac09af9e43706ddeba1cd14a
- Fixed In set to v19.3.0-2435-g25e4ee2fa7e
- Upkeep Timestamp set to 2025-06-26T20:14:47+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v19.3.0-2435-g25e4ee2fa7e to v19.3.0-2435-g25e4ee2fa7
- Upkeep Timestamp changed from 2025-06-26T20:14:47+00:00 to 2025-07-14T16:44:53+00:00
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~2814
- Upkeep Timestamp changed from 2025-07-14T16:44:53+00:00 to 2025-11-01T01:11:28+00:00
Actions