Bug #71631: Commands using Mgr Modules fail if run immediately post a mgr failover/ restart - mgr - Ceph

bbfedafcf532f649edc771d5d03fcc8207b806f4

Category:

Target version:

% Done:

Source:

Backport: squid,tentacle

Regression:

Severity: 3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID: 63859

Tags (freeform): backport_processed

Merge Commit:

Fixed In: v20.3.0-6266-gbbfedafcf5

Released In:

Upkeep Timestamp: 2026-03-20T21:42:02+00:00

Description

Reproduced by:

$ ceph mgr fail; ceph fs volume ls
Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date
Module 'volumes' is not enabled/loaded (required by command 'fs volume ls'): use `ceph mgr module enable volumes` to enable it
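Since the failure shows up only when the command runs right after the failover, a caller could, as a stopgap, retry while the module-not-loaded error persists. The following is a minimal sketch and not the fix from the pull request. It assumes the "is not enabled/loaded" text in stderr marks a transient state. The names run_ceph and run_with_mgr_retry, the attempt count and the delay are all hypothetical.

```python
import subprocess
import time

# Substring taken from the error message in this report. Treating it as the
# marker of a transient post-failover state is an assumption.
TRANSIENT_MARKER = "is not enabled/loaded"


def run_ceph(args):
    """Run a ceph CLI command and return (returncode, stdout, stderr)."""
    proc = subprocess.run(["ceph", *args], capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def run_with_mgr_retry(args, attempts=5, delay=2.0,
                       runner=run_ceph, sleep=time.sleep):
    """Retry a command while it fails with the module-not-loaded error.

    Any other outcome (success or a different error) is returned at once.
    The runner and sleep arguments are injectable so the loop can be
    exercised without a cluster.
    """
    for attempt in range(1, attempts + 1):
        rc, out, err = runner(args)
        if rc == 0 or TRANSIENT_MARKER not in err:
            return rc, out, err
        if attempt < attempts:
            sleep(delay)
    # Still failing with the transient marker after the last attempt.
    return rc, out, err
```

A call such as run_with_mgr_retry(["fs", "volume", "ls"]) would then stand in for the second half of the reproducer. This only covers the immediate-error case. A command that blocks instead of failing would need a timeout on the subprocess call as well.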

Related BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2314146

Related issues 7 (6 open — 1 closed)

Related to CephFS - Bug #67230: mgr: should be declared available only after all python modules have been loaded (Fix Under Review, Mahesh Mohan)

Related to mgr - Bug #68657: squid: mgr/balancer preventing orchestrator and dashboard functionality (Resolved)

Related to CephFS - Bug #70456: qa: Command failed on smithi012 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs volume ls' (Triaged, Mahesh Mohan)

Related to Orchestrator - Bug #71830: Upgrade tests stuck when upgrading ceph-mgr daemon (New, Redouane Kachach Elhicou)

Related to mgr - Bug #75422: Rocky10 - Module 'orchestrator' is not enabled/loaded (New)

Copied to mgr - Backport #75564: tentacle: Commands using Mgr Modules fail if run immediately post a mgr failover/ restart (In Progress)

Copied to mgr - Backport #75565: squid: Commands using Mgr Modules fail if run immediately post a mgr failover/ restart (New)

Updated by Laura Flores 9 months ago

Assignee set to Laura Flores


Updated by Laura Flores 9 months ago

Description updated (diff)


Updated by Laura Flores 9 months ago

Description updated (diff)


Updated by Laura Flores 9 months ago

Status changed from New to Fix Under Review
Pull request ID set to 63859


Updated by Laura Flores 9 months ago

Related to Bug #67230: mgr: should be declared available only after all python modules have been loaded added


Updated by Laura Flores 9 months ago

Related to deleted (Bug #67230: mgr: should be declared available only after all python modules have been loaded)


Updated by Laura Flores 9 months ago

Related to Bug #67230: mgr: should be declared available only after all python modules have been loaded added


Updated by Laura Flores 9 months ago

Related to Bug #68657: squid: mgr/balancer preventing orchestrator and dashboard functionality added


Updated by Venky Shankar 9 months ago

Related to Bug #70456: qa: Command failed on smithi012 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs volume ls' added


#10

Updated by Venky Shankar 9 months ago

https://pulpito.ceph.com/vshankar-2025-06-13_17:03:06-fs-wip-vshankar-testing-20250613.134551-debug-testing-default-smithi/8327080/ is likely another instance of this issue.

$ zgrep -v "client\." ./remote/smithi159/log/ceph-mgr.x.log.gz | egrep "_handle_command|ceph-mgr, pid" 
...
...
...
2025-06-15T01:22:33.693+0000 7f07749cd100  0 ceph version 20.3.0-896-g1a8c963f (1a8c963f6e5d0aa68a79fd7c4ea3e0bb861d7d90) tentacle (dev - Debug), process ceph-mgr, pid 62269
2025-06-15T01:22:38.107+0000 7f07132f2640 10 mgr.server _handle_command decoded-size=4 prefix=fs subvolumegroup create
2025-06-15T01:22:38.108+0000 7f07132f2640 10 mgr.server _handle_command passing through command 'fs subvolumegroup create' size 4
2025-06-15T01:23:36.228+0000 7f07132f2640 10 mgr.server _handle_command decoded-size=7 prefix=fs snap-schedule add
2025-06-15T01:23:36.229+0000 7f07132f2640 10 mgr.server _handle_command passing through command 'fs snap-schedule add' size 7
2025-06-15T01:23:36.697+0000 7f07132f2640 10 mgr.server _handle_command decoded-size=6 prefix=fs snap-schedule retention add
2025-06-15T01:23:36.698+0000 7f07132f2640 10 mgr.server _handle_command passing through command 'fs snap-schedule retention add' size 6
2025-06-15T01:23:37.069+0000 7f07132f2640 10 mgr.server _handle_command decoded-size=7 prefix=fs snap-schedule remove
2025-06-15T01:23:37.069+0000 7f07132f2640 10 mgr.server _handle_command passing through command 'fs snap-schedule remove' size 7
2025-06-15T01:23:37.536+0000 7f07132f2640 10 mgr.server _handle_command decoded-size=5 prefix=fs subvolume getpath
2025-06-15T01:23:37.536+0000 7f07132f2640 10 mgr.server _handle_command passing through command 'fs subvolume getpath' size 5
2025-06-15T01:23:40.762+0000 7f07132f2640 10 mgr.server _handle_command decoded-size=5 prefix=fs subvolume rm
2025-06-15T01:23:40.763+0000 7f07132f2640 10 mgr.server _handle_command passing through command 'fs subvolume rm' size 5
2025-06-15T01:23:41.153+0000 7f07132f2640 10 mgr.server _handle_command decoded-size=4 prefix=fs subvolumegroup rm
2025-06-15T01:23:41.154+0000 7f07132f2640 10 mgr.server _handle_command passing through command 'fs subvolumegroup rm' size 4
2025-06-15T01:23:42.365+0000 7f16e2a0d100  0 ceph version 20.3.0-896-g1a8c963f (1a8c963f6e5d0aa68a79fd7c4ea3e0bb861d7d90) tentacle (dev - Debug), process ceph-mgr, pid 62269
2025-06-15T01:24:02.460+0000 7f16811a4640 10 mgr.server _handle_command decoded-size=3 prefix=pg dump
2025-06-15T01:24:02.842+0000 7f16811a4640 10 mgr.server _handle_command decoded-size=3 prefix=pg dump
2025-06-15T01:24:03.222+0000 7f16811a4640 10 mgr.server _handle_command decoded-size=3 prefix=pg dump
2025-06-15T01:24:04.030+0000 7f16811a4640 10 mgr.server _handle_command decoded-size=3 prefix=pg dump
2025-06-15T01:24:07.390+0000 7f16811a4640 10 mgr.server _handle_command decoded-size=3 prefix=pg dump
2025-06-15T01:24:10.816+0000 7f16811a4640 10 mgr.server _handle_command decoded-size=2 prefix=fs volume ls
2025-06-15T01:24:10.816+0000 7f16811a4640 10 mgr.server _handle_command passing through command 'fs volume ls' size 2

In this case, the fs volume ls command made no progress after ceph-mgr was restarted. The command hit its 120-second timeout, which failed the test.
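The same log filtering can be sketched in Python so that the commands received after each ceph-mgr start are grouped together. This is an illustrative helper only. The function name commands_per_start and both regular expressions are assumptions based on the log lines quoted in this note, not on any documented log format.

```python
import re

# Patterns derived from the log excerpt in this note (assumed, not a spec).
START_RE = re.compile(r"process ceph-mgr, pid (\d+)")
PREFIX_RE = re.compile(r"_handle_command decoded-size=\d+ prefix=(.+)$")


def commands_per_start(lines):
    """Return one list of command prefixes per ceph-mgr start in the log.

    Lines mentioning "client." are skipped, mirroring the zgrep -v filter.
    Commands logged before the first start line are ignored.
    """
    runs = []
    for line in lines:
        line = line.rstrip("\n")
        if "client." in line:
            continue
        if START_RE.search(line):
            runs.append([])  # a new mgr process begins here
            continue
        match = PREFIX_RE.search(line)
        if match and runs:
            runs[-1].append(match.group(1))
    return runs
```

Fed the decompressed ceph-mgr log, the last entry of the last list is the final command the restarted mgr received. For the excerpt above, that is fs volume ls.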


#11

Updated by Venky Shankar 9 months ago

@Laura Flores I see that a command run just after a ceph-mgr restart could fail. However, as I mentioned in note-10, the command was blocked. Is that also a possibility?
