Bug #65977 (closed)

Quiesce times out while the ops dump shows all existing quiesce ops as complete;

Added by Leonid Usov almost 2 years ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Backport:
squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
quiesce
Labels (FS):
Pull request ID:
Tags (freeform):
Fixed In:
v19.3.0-2196-gcbb1da1189
Released In:
v20.2.0~2890
Upkeep Timestamp:
2025-11-01T01:11:35+00:00

Description

From: http://qa-proxy.ceph.com/teuthology/leonidus-2024-05-13_05:53:33-fs-wip-lusov-quiesce-distro-default-smithi/7704534/

2024-05-13T06:55:03.254 WARNING:tasks.quiescer.fs.[cephfs]:Set '9ce534df' hasn't quiesced after 55.6 seconds (timeout: 90). Dumping ops with locks from all ranks.

2024-05-13T06:55:07.364 DEBUG:tasks.quiescer.fs.[cephfs]:Dumping ops on rank 4 (i) to a remote file /var/run/ceph/1717f1ac-10f0-11ef-bc98-c7b262605968/ops-9ce534df-1-mds.i.json
2024-05-13T06:55:12.292 DEBUG:tasks.quiescer.fs.[cephfs]:Dumping ops on rank 3 (h) to a remote file /var/run/ceph/1717f1ac-10f0-11ef-bc98-c7b262605968/ops-9ce534df-1-mds.h.json
2024-05-13T06:55:20.018 DEBUG:tasks.quiescer.fs.[cephfs]:Dumping ops on rank 2 (g) to a remote file /var/run/ceph/1717f1ac-10f0-11ef-bc98-c7b262605968/ops-9ce534df-1-mds.g.json
2024-05-13T06:55:24.991 DEBUG:tasks.quiescer.fs.[cephfs]:Dumping ops on rank 1 (a) to a remote file /var/run/ceph/1717f1ac-10f0-11ef-bc98-c7b262605968/ops-9ce534df-1-mds.a.json
2024-05-13T06:55:29.578 DEBUG:tasks.quiescer.fs.[cephfs]:Dumping ops on rank 0 (b) to a remote file /var/run/ceph/1717f1ac-10f0-11ef-bc98-c7b262605968/ops-9ce534df-1-mds.b.json

2024-05-13T06:55:42.118 DEBUG:tasks.quiescer.fs.[cephfs]:Outstanding op at rank 3 (h) for 9ce534df-1: 'client_request(client.25126:364918 unlink #0x100000079bc/gspca_se401.ko.xz 2024-05-13T06:54:18.557051+0000 caller_uid=0, caller_gid=0{0,})'

The only pending op is

{
  "description": "client_request(client.25126:364918 unlink #0x100000079bc/gspca_se401.ko.xz 2024-05-13T06:54:18.557051+0000 caller_uid=0, caller_gid=0{0,})",
  "initiated_at": "2024-05-13T06:54:18.558547+0000",
  "age": 59.436941433,
  "duration": 59.650470112,
  "continuous": false,
  "type_data": {
    "result": null,
    "flag_point": "failed to acquire quiesce lock",
    "reqid": {
      "entity": {
        "type": "client",
        "num": 25126
      },
      "tid": 364918
    },

and all quiesce_inode ops in all other dumps are complete.
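The symptom above can be checked mechanically against the dumped ops files. A minimal sketch (the helper name is mine; the JSON shape follows the excerpt above) that filters a per-rank ops dump for operations that have not yet produced a result:

```python
import json

def pending_ops(dump):
    """Return ops from an MDS ops dump whose result is still null.

    In the dump shown above, the stuck unlink has "result": null and
    flag_point "failed to acquire quiesce lock", while completed
    quiesce_inode ops carry a non-null result.
    """
    return [
        op for op in dump.get("ops", [])
        if op.get("type_data", {}).get("result") is None
    ]

# Example shaped after the excerpt from the description:
dump = {
    "ops": [
        {
            "description": "client_request(client.25126:364918 unlink ...)",
            "type_data": {
                "result": None,
                "flag_point": "failed to acquire quiesce lock",
            },
        },
        {
            "description": "quiesce_inode(...)",
            "type_data": {"result": 0, "flag_point": "done"},
        },
    ]
}
stuck = pending_ops(dump)
print(len(stuck))  # 1: only the unlink blocked on the quiesce lock
```

Running this over each ops-9ce534df-1-mds.*.json file would confirm what the quiescer task reported: a single outstanding client request on rank 3, with every quiesce_inode op complete.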

Potentially, the same issue: http://qa-proxy.ceph.com/teuthology/leonidus-2024-05-13_05:53:33-fs-wip-lusov-quiesce-distro-default-smithi/7704563/


Related issues 1 (0 open, 1 closed)

Copied to CephFS - Backport #66070: squid: Quiesce times out while the ops dump shows all existing quiesce ops as complete; (Resolved, Leonid Usov)
Actions #1

Updated by Leonid Usov almost 2 years ago

  • Description updated (diff)
Actions #2

Updated by Leonid Usov almost 2 years ago · Edited

This looks like a quiesce db issue: it's missing an ack, even though we can clearly see them all coming in:

2024-05-13T06:54:19.158+0000 7fa2d1209700 10 quiesce.mds.0 <quiesce_dispatch> got ack q-map[v:(102:73) roots:1/0] from 24512
2024-05-13T06:54:19.998+0000 7fa2d1209700 10 quiesce.mds.0 <quiesce_dispatch> got ack q-map[v:(103:73) roots:1/0] from 24512
2024-05-13T06:54:19.998+0000 7fa2d1209700 10 quiesce.mds.0 <quiesce_dispatch> got ack q-map[v:(103:73) roots:1/0] from 24512
2024-05-13T06:54:20.229+0000 7fa2d1209700 10 quiesce.mds.0 <quiesce_dispatch> got ack q-map[v:(103:73) roots:1/0] from 24478
2024-05-13T06:54:21.010+0000 7fa2d1209700 10 quiesce.mds.0 <quiesce_dispatch> got ack q-map[v:(104:73) roots:1/0] from 24478
2024-05-13T06:54:21.010+0000 7fa2d1209700 10 quiesce.mds.0 <quiesce_dispatch> got ack q-map[v:(104:73) roots:1/0] from 24478
2024-05-13T06:54:21.011+0000 7fa2d1209700 10 quiesce.mds.0 <quiesce_dispatch> got ack q-map[v:(104:73) roots:1/0] from 24512
2024-05-13T06:54:21.011+0000 7fa2d1209700 10 quiesce.mds.0 <quiesce_dispatch> got ack q-map[v:(104:73) roots:1/0] from 24512
2024-05-13T06:54:21.878+0000 7fa2d1209700 10 quiesce.mds.0 <quiesce_dispatch> got ack q-map[v:(104:73) roots:1/0] from 24491
2024-05-13T06:54:22.556+0000 7fa2d1209700 10 quiesce.mds.0 <quiesce_dispatch> got ack q-map[v:(104:73) roots:1/0] from 24506

Unfortunately, the logging from the quiesce db is insufficient to understand what prevents it from registering this set as quiesced. Along with the fix, I'll add better logging to the quiesce db manager where it evaluates the peer reports.
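The ack traffic above can be tallied from the log to show that every peer did report. A throwaway sketch (the log format is copied from the excerpt; the regex is mine) that counts acks per map version and peer:

```python
import re
from collections import Counter

# Matches lines like:
#   ... got ack q-map[v:(104:73) roots:1/0] from 24478
ACK_RE = re.compile(
    r"got ack q-map\[v:\((\d+):(\d+)\) roots:(\d+)/(\d+)\] from (\d+)"
)

log = """\
2024-05-13T06:54:21.010+0000 ... got ack q-map[v:(104:73) roots:1/0] from 24478
2024-05-13T06:54:21.011+0000 ... got ack q-map[v:(104:73) roots:1/0] from 24512
2024-05-13T06:54:21.878+0000 ... got ack q-map[v:(104:73) roots:1/0] from 24491
2024-05-13T06:54:22.556+0000 ... got ack q-map[v:(104:73) roots:1/0] from 24506
"""

acks = Counter()
for line in log.splitlines():
    m = ACK_RE.search(line)
    if m:
        version, peer = f"{m.group(1)}:{m.group(2)}", m.group(5)
        acks[(version, peer)] += 1

print(sorted(acks))  # all four peers acked at version 104:73
```

All four peers ack the same map version, which is what makes the leader's failure to register the set as quiesced a db-side bookkeeping problem rather than a missing ack on the wire.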

Actions #3

Updated by Leonid Usov almost 2 years ago · Edited

  • Status changed from New to In Progress

This is a real quiesce db issue which I could reproduce in a unit test.

It happens when the peer receives a state update while on a newer epoch than the leader. The fix is coming soon.
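The failure mode can be modeled in a few lines. This is an illustrative sketch only, not the actual QuiesceDbManager code: a leader that only credits acks carrying its own epoch will silently drop acks from a peer that has already advanced to a newer epoch, so the set never registers as quiesced.

```python
# Illustrative model of the epoch-mismatch bug -- names are hypothetical,
# not taken from the Ceph source.
from dataclasses import dataclass, field

@dataclass
class Leader:
    epoch: int
    acked: set = field(default_factory=set)

    def on_ack(self, peer: int, peer_epoch: int) -> bool:
        # If the leader insists on an exact epoch match, an ack from a
        # peer on a newer epoch is discarded and the peer is never
        # counted toward the quiesced set.
        if peer_epoch != self.epoch:
            return False
        self.acked.add(peer)
        return True

leader = Leader(epoch=73)
assert leader.on_ack(24478, peer_epoch=73)      # normal ack: accepted
assert not leader.on_ack(24512, peer_epoch=74)  # newer-epoch peer: dropped
print(leader.acked)  # only 24478 counted; the set stalls forever
```

Under this model the quiesce request times out exactly as observed: every peer has in fact quiesced and acked, but the leader's view of the set stays incomplete.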

Actions #4

Updated by Leonid Usov almost 2 years ago

  • Pull request ID set to 57454
Actions #5

Updated by Leonid Usov almost 2 years ago

  • Status changed from In Progress to Pending Backport
  • Backport set to squid
Actions #6

Updated by Upkeep Bot almost 2 years ago

  • Copied to Backport #66070: squid: Quiesce times out while the ops dump shows all existing quiesce ops as complete; added
Actions #8

Updated by Leonid Usov almost 2 years ago

  • Status changed from Pending Backport to Resolved
Actions #9

Updated by Upkeep Bot 9 months ago

  • Merge Commit set to cbb1da11892bf2915bbe3a1a03e2755cf33b9035
  • Fixed In set to v19.3.0-2196-gcbb1da11892
  • Upkeep Timestamp set to 2025-06-30T19:34:12+00:00
Actions #10

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v19.3.0-2196-gcbb1da11892 to v19.3.0-2196-gcbb1da1189
  • Upkeep Timestamp changed from 2025-06-30T19:34:12+00:00 to 2025-07-14T16:45:00+00:00
Actions #11

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~2890
  • Upkeep Timestamp changed from 2025-07-14T16:45:00+00:00 to 2025-11-01T01:11:35+00:00
