Bug #72386
The following counters failed to be set on mds daemons: {'mds.imported', 'mds.exported'} (even with 20% distribution randomness)
Updated by Jos Collin 7 months ago
- Status changed from New to In Progress
Debugging Update:
The passed and failed tests for 'mds.exported' take two different paths from here: https://github.com/ceph/ceph/blob/main/src/mds/MDBalancer.cc#L220
In the MDS logs, the passed test shows:
remote/smithi176/log/fbeceffc-73d3-11f0-8734-adfe0268badd/ceph-mds.c.log:2025-08-07T21:39:56.942+0000 7f9f4ddc3640 10 mds.2.bal handle_export_pins set auxsubtree bit on [dir 0x10000000005 /volumes/qa/sv_1/cd3fb8a2-26cb-4dd0-9985-dd5ea390d342/client.0/ [2,head] auth{0=1} pv=38 v=37 cv=0/0 dir_auth=2 ap=1+0 state=1610874881|complete f(v0 m2025-08-07T21:39:50.953430+0000 1=0+1)->f(v0 m2025-08-07T21:39:50.953430+0000 1=0+1) n(v0 rc2025-08-07T21:39:56.045299+0000 b32 2=1+1)/n(v0 rc2025-08-07T21:39:52.989377+0000 b64 2=1+1)->n(v1 rc2025-08-07T21:39:56.045299+0000 b32 2=1+1) hs=1+0,ss=0+0 dirty=1 | ptrwaiter=0 child=1 frozen=0 subtree=1 importing=0 replicated=1 dirty=1 authpin=1 scrubqueue=0 0x55b7bffdc880]
The failed test shows:
remote/smithi144/log/94cab758-6d63-11f0-8731-adfe0268badd/ceph-mds.b.log:2025-07-30T16:53:33.098+0000 7fa2371d1640 10 mds.0.bal handle_export_pins create aux subtree on [dir 0x10000000236 /volumes/qa/sv_0/21635501-fbe7-4f05-a39a-66ab58aa910a/client.0/tmp/ffsb/.git/logs/refs/remotes/ [2,head] auth v=5 cv=0/0 dir_auth=0 state=1611399169|complete|auxsubtree f(v0 m2025-07-30T16:53:31.694171+0000 1=0+1) n(v0 rc2025-07-30T16:53:31.696758+0000 2=1+1)/n(v0 rc2025-07-30T16:53:31.694171+0000 1=0+1) hs=1+0,ss=0+0 dirty=1 | child=1 subtree=1 dirty=1 authpin=0 0x5645c0e02900]
As shown above, the failed test logs "create aux subtree" and has "dir_auth == CDIR_AUTH_DEFAULT", while the passed test logs "set auxsubtree bit" with dir_auth=2. The "state" values differ too.
Debugging continues...
Updated by Jos Collin 7 months ago · Edited
@Venky Shankar
We need to raise 'mds_beacon_grace' for the random.yaml test, as we have 132 missed calls to MDBalancer::tick() --> handle_export_pins().
This is not a good sign, because handle_export_pins() is what subsequently calls Migrator::export_dir(), after which l_mds_exported is incremented.
Another thing to consider is: https://github.com/ceph/ceph/blob/main/src/mds/CInode.cc#L5650.
We generate a random number between 0.0 and 1.0 there. If the test keeps a threshold of 0.02, very few items get added to the queue, and that keeps the counters from being incremented too (fewer iterations in MDBalancer::handle_export_pins and even fewer export_dir hits).
So the test needs a higher threshold, so that the counters actually get incremented.
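To put rough numbers on the threshold argument, here is a minimal sketch in Python. It is a toy model, not the logic in CInode.cc: it assumes each directory independently draws a uniform number in [0.0, 1.0) and is queued only when the draw falls below the threshold. The function names and the directory counts are hypothetical, chosen only for illustration.

```python
import random


def expected_queued(num_dirs: int, threshold: float) -> float:
    """Expected number of directories queued under the toy model."""
    return num_dirs * threshold


def prob_none_queued(num_dirs: int, threshold: float) -> float:
    """Probability that no directory at all is queued under the toy model."""
    return (1.0 - threshold) ** num_dirs


def simulate_queued(num_dirs: int, threshold: float, seed: int = 0) -> int:
    """Count directories whose uniform draw falls below the threshold.

    Assumption: draws are independent, one per directory. The real MDS
    code path may differ; this only illustrates the arithmetic.
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(num_dirs) if rng.random() < threshold)


if __name__ == "__main__":
    # Hypothetical workload of 50 directories, compared at two thresholds.
    for threshold in (0.02, 0.2):
        print(
            f"threshold={threshold}: "
            f"expected queued={expected_queued(50, threshold):.1f}, "
            f"P(none queued)={prob_none_queued(50, threshold):.5f}, "
            f"one simulated run={simulate_queued(50, threshold)}"
        )
```

Under these assumptions, a test that creates only a few dozen directories with a threshold of 0.02 has a sizeable chance of queueing nothing at all, so the export path is never exercised. With a higher threshold that chance becomes negligible.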
Updated by Jos Collin 7 months ago
- Status changed from In Progress to Fix Under Review