Bug #56577
Closed
mds: client request may complete without queueing next replay request
0%
Description
We received a report of a cluster with a single active MDS stuck in up:clientreplay. The status was:
> ceph tell mds.ocs-storagecluster-cephfilesystem:0 status
> {
>     "cluster_fsid": "XXX",
>     "whoami": 0,
>     "id": 19987341,
>     "want_state": "up:clientreplay",
>     "state": "up:clientreplay",
>     "fs_name": "ocs-storagecluster-cephfilesystem",
>     "clientreplay_status": {
>         "clientreplay_queue": 125048,
>         "active_replay": 0
>     },
>     "rank_uptime": 191060.81145907301,
>     "mdsmap_epoch": 8735,
>     "osdmap_epoch": 4421,
>     "osdmap_epoch_barrier": 3296,
>     "uptime": 191061.807527136
> }
The MDS had no outstanding ops or objecter requests. Raising the debug level did not reveal any client request activity.
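To make the symptom concrete, here is a minimal sketch of the check implied by the status above: replay work remains queued, nothing is actively replaying, and nothing else is in flight. The struct and function names are hypothetical (they only mirror the JSON keys), and the op counts are assumed to be gathered separately; none of this is Ceph code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of the fields relevant to the stall. The member names
// mirror the JSON keys in the status output, but this is not a Ceph type.
struct ClientReplayStatus {
  std::uint64_t clientreplay_queue;  // requests still waiting to be replayed
  std::uint64_t active_replay;       // replay requests currently in progress
};

// Returns true when replay work remains but nothing is driving it forward.
// ops_in_flight and objecter_requests are assumed inputs collected elsewhere.
inline bool replay_looks_stalled(const ClientReplayStatus& s,
                                 std::uint64_t ops_in_flight,
                                 std::uint64_t objecter_requests) {
  return s.clientreplay_queue > 0 && s.active_replay == 0 &&
         ops_in_flight == 0 && objecter_requests == 0;
}
```

With the values reported here (a queue of 125048, no active replay, no outstanding ops or objecter requests) the predicate holds, which matches the observation that the rank was idle rather than slow.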
It's not clear how this could happen other than the MDS failing to call MDSRank::queue_one_replay during some error handling of a request. I believe the most likely place for this is here:
If !(mdr->has_completed || reply->get_result() < 0), the request is cleaned up without queueing the next request. I don't know of a scenario in which that condition may be false in this code path.
I think a reasonable fix for now is to move this to MDCache::request_cleanup, which is generally called for every client request during cleanup of any kind. We do need to preserve the behavior that Server::journal_and_reply may queue the next op even if the current request is not yet safe.
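The difference between the two placements can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyRequest, ToyReplayer and their members are hypothetical stand-ins, not Ceph types, and the model deliberately ignores the early queueing done by Server::journal_and_reply. It shows how a single request that fails the condition breaks the replay chain when queueing is conditional, and how queueing from an unconditional cleanup step drains the queue.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Toy model, not Ceph code: every name below is hypothetical.
struct ToyRequest {
  bool has_completed;  // stands in for mdr->has_completed
  int result;          // stands in for reply->get_result()
};

class ToyReplayer {
public:
  // queue_in_cleanup == false models the conditional placement in the reply
  // path; true models queueing the next replay from request cleanup.
  explicit ToyReplayer(bool queue_in_cleanup)
      : queue_in_cleanup_(queue_in_cleanup) {}

  void add(ToyRequest r) { queue_.push_back(r); }

  // Replays requests until the chain breaks or the queue drains.
  // Returns the number of requests left unreplayed.
  std::size_t run() {
    bool next_queued = true;  // the first replay is kicked off unconditionally
    while (next_queued && !queue_.empty()) {
      ToyRequest r = queue_.front();
      queue_.pop_front();
      next_queued = finish(r);
    }
    return queue_.size();
  }

private:
  // Returns whether finishing this request queued the next replay.
  bool finish(const ToyRequest& r) const {
    if (queue_in_cleanup_) {
      return true;  // cleanup runs for every request, so the chain continues
    }
    // Conditional placement: the chain continues only if this holds.
    return r.has_completed || r.result < 0;
  }

  bool queue_in_cleanup_;
  std::deque<ToyRequest> queue_;
};
```

For example, feeding both variants the requests {true, 0}, {false, 0}, {true, 0} leaves one request stranded with the conditional placement and none with the cleanup placement. Whether a request with has_completed unset and a non-negative result can actually reach this path in the MDS is exactly the open question in the description above.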
Updated by Patrick Donnelly over 2 years ago
- Category set to Correctness/Safety
- Status changed from In Progress to Fix Under Review
- Target version set to v19.0.0
- Backport changed from quincy,pacific to reef,quincy,pacific
Updated by Patrick Donnelly over 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Upkeep Bot over 2 years ago
- Copied to Backport #63418: reef: mds: client request may complete without queueing next replay request added
Updated by Upkeep Bot over 2 years ago
- Copied to Backport #63419: pacific: mds: client request may complete without queueing next replay request added
Updated by Upkeep Bot over 2 years ago
- Copied to Backport #63420: quincy: mds: client request may complete without queueing next replay request added
Updated by Upkeep Bot 9 months ago
- Status changed from Pending Backport to Resolved
- Upkeep Timestamp set to 2025-07-09T16:45:04+00:00
Updated by Upkeep Bot 8 months ago
- Merge Commit set to 97961ae81a218ed7f8a8e1336c095b9a16144df7
- Fixed In set to v18.0.0-7007-g97961ae81a
- Released In set to v19.2.0~1322
- Upkeep Timestamp changed from 2025-07-09T16:45:04+00:00 to 2025-08-02T04:59:49+00:00