Bug #56577

closed

mds: client request may complete without queueing next replay request

Added by Patrick Donnelly over 3 years ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Development
Backport: reef,quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Tags (freeform):
Fixed In: v18.0.0-7007-g97961ae81a
Released In: v19.2.0~1322
Upkeep Timestamp: 2025-08-02T04:59:49+00:00

Description

We received a report of a cluster with a single active MDS stuck in up:clientreplay. The status was:

> ceph tell mds.ocs-storagecluster-cephfilesystem:0 status
> {
>     "cluster_fsid": "XXX",
>     "whoami": 0,
>     "id": 19987341,
>     "want_state": "up:clientreplay",
>     "state": "up:clientreplay",
>     "fs_name": "ocs-storagecluster-cephfilesystem",
>     "clientreplay_status": {
>         "clientreplay_queue": 125048,
>         "active_replay": 0
>     },
>     "rank_uptime": 191060.81145907301,
>     "mdsmap_epoch": 8735,
>     "osdmap_epoch": 4421,
>     "osdmap_epoch_barrier": 3296,
>     "uptime": 191061.807527136
> }

The MDS had no outstanding ops or objecter requests. Increasing the debug level did not reveal any client request activity.

It's not clear how this could happen unless the MDS failed to call MDSRank::queue_one_replay during error handling for some request. I believe the most likely place is here:

https://github.com/ceph/ceph/blob/a6f1a1c6c09d74f5918c715b05789f34f2ea0e90/src/mds/Server.cc#L2253-L2262

If !(mdr->has_completed || reply->get_result() < 0) then the request is cleaned up without queueing the next request. I don't know of a scenario in which that condition would be false in this code path.
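
To make the suspected failure mode concrete, here is a toy model of a replay queue in which the next request is chained only from the reply path, and only when the gating condition holds. This is a sketch under assumptions, not the Ceph code: ToyRequest, ToyReplayRank, and stranded_after_run are hypothetical names, and the two fields merely stand in for mdr->has_completed and reply->get_result().

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>

// Hypothetical stand-in for a replayed client request.
struct ToyRequest {
    bool has_completed;  // stands in for mdr->has_completed
    int result;          // stands in for reply->get_result()
};

// Hypothetical stand-in for the rank's client replay machinery.
class ToyReplayRank {
public:
    explicit ToyReplayRank(std::deque<ToyRequest> pending)
        : queue_(std::move(pending)) {}

    // Start the next queued request, if there is one.
    void queue_one_replay() {
        if (queue_.empty()) return;
        current_ = queue_.front();
        queue_.pop_front();
        active_ = true;
    }

    // Reply path: the next replay is chained only when the condition holds.
    void reply_current() {
        assert(active_);
        active_ = false;
        if (current_.has_completed || current_.result < 0) {
            queue_one_replay();
        }
        // Otherwise the request is simply dropped and nothing is chained.
    }

    // Drive the queue until nothing is in flight; return what is left queued.
    std::size_t run() {
        queue_one_replay();
        while (active_) reply_current();
        return queue_.size();
    }

private:
    std::deque<ToyRequest> queue_;
    ToyRequest current_{false, 0};
    bool active_ = false;
};

// Number of requests left queued with nothing in flight after a run.
std::size_t stranded_after_run(std::deque<ToyRequest> pending) {
    return ToyReplayRank(std::move(pending)).run();
}
```

In this model, a single request for which the condition is false leaves every later request queued with nothing active, which resembles the reported status (a large clientreplay_queue with active_replay at 0).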

I think a reasonable fix for now is to move this call to MDCache::request_cleanup, which is generally called for every client request during cleanup of any kind. We do need to preserve the behavior that Server::journal_and_reply may queue the next op even if the current request is not yet safe.
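
The shape of the proposed fix can be sketched with a toy model: cleanup always chains the next replay, and a per-request flag keeps an early chain from the journal-and-reply step from being repeated at cleanup. This is an illustration under assumptions, not the actual patch: ToyRequest, ToyReplayRank, ReplayOutcome, run_with_cleanup_hook, and the queued_next flag are all hypothetical.

```cpp
#include <cstddef>
#include <deque>
#include <utility>

// Hypothetical stand-in for a replayed client request.
struct ToyRequest {
    bool chains_early;  // the journal-and-reply step queues the next op early
    bool queued_next;   // hypothetical guard: successor already queued
};

struct ReplayOutcome {
    std::size_t stranded;  // requests still queued when nothing is in flight
    std::size_t started;   // requests that were actually started
};

class ToyReplayRank {
public:
    explicit ToyReplayRank(std::deque<ToyRequest> pending)
        : queue_(std::move(pending)) {}

    ReplayOutcome run() {
        queue_one_replay();  // kick off the first request
        while (!in_flight_.empty()) {
            ToyRequest r = in_flight_.front();
            in_flight_.pop_front();
            if (r.chains_early) chain_next(r);  // early path, before "safe"
            request_cleanup(r);                 // runs for every request
        }
        return {queue_.size(), started_};
    }

private:
    // Move the next queued request into flight, if there is one.
    void queue_one_replay() {
        if (queue_.empty()) return;
        in_flight_.push_back(queue_.front());
        queue_.pop_front();
        ++started_;
    }

    // Queue the successor at most once per request.
    void chain_next(ToyRequest& r) {
        if (r.queued_next) return;
        r.queued_next = true;
        queue_one_replay();
    }

    // Cleanup hook: chains the next replay no matter how the request ended.
    void request_cleanup(ToyRequest& r) { chain_next(r); }

    std::deque<ToyRequest> queue_;
    std::deque<ToyRequest> in_flight_;
    std::size_t started_ = 0;
};

ReplayOutcome run_with_cleanup_hook(std::deque<ToyRequest> pending) {
    return ToyReplayRank(std::move(pending)).run();
}
```

Because the hook runs on every cleanup, the toy queue always drains, and the guard ensures each request starts exactly one successor whether or not it chained early.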


Related issues 3 (0 open, 3 closed)

Copied to CephFS - Backport #63418: reef: mds: client request may complete without queueing next replay request (Resolved, Patrick Donnelly)
Copied to CephFS - Backport #63419: pacific: mds: client request may complete without queueing next replay request (Resolved, Patrick Donnelly)
Copied to CephFS - Backport #63420: quincy: mds: client request may complete without queueing next replay request (Resolved, Patrick Donnelly)
Actions #1

Updated by Patrick Donnelly over 3 years ago

  • Pull request ID set to 47121
Actions #2

Updated by Patrick Donnelly over 2 years ago

  • Target version deleted (v18.0.0)
Actions #3

Updated by Patrick Donnelly over 2 years ago

  • Category set to Correctness/Safety
  • Status changed from In Progress to Fix Under Review
  • Target version set to v19.0.0
  • Backport changed from quincy,pacific to reef,quincy,pacific
Actions #4

Updated by Patrick Donnelly over 2 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #5

Updated by Upkeep Bot over 2 years ago

  • Copied to Backport #63418: reef: mds: client request may complete without queueing next replay request added
Actions #6

Updated by Upkeep Bot over 2 years ago

  • Copied to Backport #63419: pacific: mds: client request may complete without queueing next replay request added
Actions #7

Updated by Upkeep Bot over 2 years ago

  • Copied to Backport #63420: quincy: mds: client request may complete without queueing next replay request added
Actions #9

Updated by Upkeep Bot 9 months ago

  • Status changed from Pending Backport to Resolved
  • Upkeep Timestamp set to 2025-07-09T16:45:04+00:00
Actions #10

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 97961ae81a218ed7f8a8e1336c095b9a16144df7
  • Fixed In set to v18.0.0-7007-g97961ae81a
  • Released In set to v19.2.0~1322
  • Upkeep Timestamp changed from 2025-07-09T16:45:04+00:00 to 2025-08-02T04:59:49+00:00