Project

General

Profile

Actions

Bug #66219

closed

[quiesce timeout] QuiesceAgent may send an async QUIESCED ack before the QuiesceManager does the sync QUIESCING ack, which causes the QUIESCED ack to be lost

Added by Leonid Usov almost 2 years ago. Updated 5 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Backport:
squid
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
quiesce
Labels (FS):
Pull request ID:
Tags (freeform):
Fixed In:
v19.3.0-2435-g25e4ee2fa7
Released In:
v20.2.0~2814
Upkeep Timestamp:
2025-11-01T01:11:28+00:00

Description

from: https://pulpito.ceph.com/leonidus-2024-05-24_04:50:13-fs-wip-lusov-quiesce-distro-default-smithi/7724309/

A lone outstanding op:

7724309/teuthology.log:2024-05-24T06:39:00.254 DEBUG:tasks.quiescer.fs.[cephfs]:Outstanding op at rank 0 (a) for 870dafb6-1: 'client_request(client.4773:10 mkdir owner_uid=1000, owner_gid=1316 #0x10000000000/tmp 2024-05-24T06:38:26.594414+0000 caller_uid=1000, caller_gid=1316{6,36,1000,1316,})'

points to a lost ack.

Looking at the db leader (mds.a):

2024-05-24T06:39:43.042+0000 7f82c6df6640 20 quiesce.mgr.4296 <check_peer_reports> [870dafb6@9,file:/] up_to_date_peers: 5 min_reported_state: QS_QUIESCING max_reported_state: QS_QUIESCED peer_acks: {{QS_QUIESCING=4293,(16:9),QS_QUIESCED=4288,(16:9),QS_QUIESCED=4290,(16:9),QS_QUIESCED=4294,(16:9),QS_QUIESCED=4296,(16:9)}}

gid 4293 is mds.e:

2024-05-24T06:38:13.042+0000 7fc08dad8640 20 quiesce.mgr.4293 <quiesce_db_thread_main> notifying agent with db version (16:9)
2024-05-24T06:38:13.042+0000 7fc08dad8640 20 quiesce.agt <db_update> got a db update version (16:9) with 1 roots
2024-05-24T06:38:13.042+0000 7fc08e2d9640 20 quiesce.agt <agent_thread_main> old = (16:8), current = (16:9)
...
2024-05-24T06:38:13.042+0000 7fc08e2d9640 20 quiesce.agt <agent_thread_main> asyncrhonous ack for a new version: q-map[v:(16:9) roots:1/0]
2024-05-24T06:38:13.042+0000 7fc08e2d9640 10 quiesce.mds.2 <operator()> sending ack q-map[v:(16:9) roots:1/0] to the leader 4296
...
2024-05-24T06:38:13.042+0000 7fc08dad8640 20 quiesce.mgr.4293 <quiesce_db_thread_main> synchronous agent ack: q-map[v:(16:9) roots:0/0]
2024-05-24T06:38:13.042+0000 7fc08dad8640 10 quiesce.mds.2 <operator()> sending ack q-map[v:(16:9) roots:0/0] to the leader 4296

we see that the agent main thread took over and completed the quiesce, sending an `async ack` to the leader before the quiesce.mgr got chance to return from the agent update and then send its outdated `sync ack`


Related issues 1 (0 open1 closed)

Copied to CephFS - Backport #66256: squid: [quiesce timeout] QuiesceAgent may send an async QUIESCED ack before the QuiesceManager does the sync QUIESCING ack, which causes the QUIESCED ack to be lostResolvedLeonid UsovActions
Actions #1

Updated by Leonid Usov almost 2 years ago

https://pulpito.ceph.com/leonidus-2024-05-24_04:50:13-fs-wip-lusov-quiesce-distro-default-smithi/7724310/ is the same issue.
Now that we know what to look for, we can see this from the leader log:

=== recording 4308 as QUIESCED:
2024-05-24T06:37:26.466+0000 7f9670644640 20 quiesce.mgr.4311 <leader_record_ack> ack q-map[v:(17:5) roots:1/0] from peer 4308
2024-05-24T06:37:26.466+0000 7f9670644640 20 quiesce.mgr.4311 <check_peer_reports> [5c1026b8@5,file:/] up_to_date_peers: 5 min_reported_state: QS_QUIESCING max_reported_state: QS_QUIESCED peer_acks: {{QS_QUIESCING=4294,(17:5),QS_QUIESCING=4311,(17:5),QS_QUIESCED=4287,(17:5),QS_QUIESCED=4293,(17:5),QS_QUIESCED=4308,(17:5)}}
2024-05-24T06:37:26.466+0000 7f9670644640 20 quiesce.mgr.4311 <check_peer_reports> [5c1026b8@5,file:/] up_to_date_peers: 5 min_reported_state: QS_QUIESCING max_reported_state: QS_QUIESCED peer_acks: {{QS_QUIESCING=4294,(17:5),QS_QUIESCING=4311,(17:5),QS_QUIESCED=4287,(17:5),QS_QUIESCED=4293,(17:5),QS_QUIESCED=4308,(17:5)}}

=== overwriting the report as QUIESCING:
2024-05-24T06:37:26.466+0000 7f9670644640 20 quiesce.mgr.4311 <leader_record_ack> ack q-map[v:(17:5) roots:0/0] from peer 4308
2024-05-24T06:37:26.466+0000 7f9670644640 20 quiesce.mgr.4311 <check_peer_reports> [5c1026b8@5,file:/] up_to_date_peers: 5 min_reported_state: QS_QUIESCING max_reported_state: QS_QUIESCED peer_acks: {{QS_QUIESCING=4294,(17:5),QS_QUIESCING=4308,(17:5),QS_QUIESCING=4311,(17:5),QS_QUIESCED=4287,(17:5),QS_QUIESCED=4293,(17:5)}}
Actions #2

Updated by Leonid Usov almost 2 years ago

  • Subject changed from QuiesceAgent may send an async QUIESCED ack before the QuiesceManager does the sync QUIESCING ack, which causes the QUIESCED ack to be lost to [quiesce timeout] QuiesceAgent may send an async QUIESCED ack before the QuiesceManager does the sync QUIESCING ack, which causes the QUIESCED ack to be lost
Actions #3

Updated by Leonid Usov almost 2 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 57579
Actions #4

Updated by Patrick Donnelly almost 2 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Target version set to v20.0.0
  • Source set to Q/A
Actions #5

Updated by Upkeep Bot almost 2 years ago

  • Copied to Backport #66256: squid: [quiesce timeout] QuiesceAgent may send an async QUIESCED ack before the QuiesceManager does the sync QUIESCING ack, which causes the QUIESCED ack to be lost added
Actions #7

Updated by Leonid Usov almost 2 years ago

  • Status changed from Pending Backport to Resolved
Actions #8

Updated by Upkeep Bot 9 months ago

  • Merge Commit set to 25e4ee2fa7e8913bac09af9e43706ddeba1cd14a
  • Fixed In set to v19.3.0-2435-g25e4ee2fa7e
  • Upkeep Timestamp set to 2025-06-26T20:14:47+00:00
Actions #9

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v19.3.0-2435-g25e4ee2fa7e to v19.3.0-2435-g25e4ee2fa7
  • Upkeep Timestamp changed from 2025-06-26T20:14:47+00:00 to 2025-07-14T16:44:53+00:00
Actions #10

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~2814
  • Upkeep Timestamp changed from 2025-07-14T16:44:53+00:00 to 2025-11-01T01:11:28+00:00
Actions

Also available in: Atom PDF