Bug #72386

open

The following counters failed to be set on mds daemons: {'mds.imported', 'mds.exported'} (even with 20% distribution randomness)

Added by Venky Shankar 8 months ago. Updated 4 months ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
Testing
Target version:
% Done:

0%

Source:
Q/A
Backport:
tentacle,squid,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Actions #2

Updated by Jos Collin 7 months ago

  • Status changed from New to In Progress

Debugging Update:

The passing and failing runs of the 'mds.exported' check take two different paths from here: https://github.com/ceph/ceph/blob/main/src/mds/MDBalancer.cc#L220

In the MDS logs, the passed test shows:

remote/smithi176/log/fbeceffc-73d3-11f0-8734-adfe0268badd/ceph-mds.c.log:2025-08-07T21:39:56.942+0000 7f9f4ddc3640 10 mds.2.bal handle_export_pins  set auxsubtree bit on [dir 0x10000000005 /volumes/qa/sv_1/cd3fb8a2-26cb-4dd0-9985-dd5ea390d342/client.0/ [2,head] auth{0=1} pv=38 v=37 cv=0/0 dir_auth=2 ap=1+0 state=1610874881|complete f(v0 m2025-08-07T21:39:50.953430+0000 1=0+1)->f(v0 m2025-08-07T21:39:50.953430+0000 1=0+1) n(v0 rc2025-08-07T21:39:56.045299+0000 b32 2=1+1)/n(v0 rc2025-08-07T21:39:52.989377+0000 b64 2=1+1)->n(v1 rc2025-08-07T21:39:56.045299+0000 b32 2=1+1) hs=1+0,ss=0+0 dirty=1 | ptrwaiter=0 child=1 frozen=0 subtree=1 importing=0 replicated=1 dirty=1 authpin=1 scrubqueue=0 0x55b7bffdc880]

The failed test shows:

remote/smithi144/log/94cab758-6d63-11f0-8731-adfe0268badd/ceph-mds.b.log:2025-07-30T16:53:33.098+0000 7fa2371d1640 10 mds.0.bal handle_export_pins  create aux subtree on [dir 0x10000000236 /volumes/qa/sv_0/21635501-fbe7-4f05-a39a-66ab58aa910a/client.0/tmp/ffsb/.git/logs/refs/remotes/ [2,head] auth v=5 cv=0/0 dir_auth=0 state=1611399169|complete|auxsubtree f(v0 m2025-07-30T16:53:31.694171+0000 1=0+1) n(v0 rc2025-07-30T16:53:31.696758+0000 2=1+1)/n(v0 rc2025-07-30T16:53:31.694171+0000 1=0+1) hs=1+0,ss=0+0 dirty=1 | child=1 subtree=1 dirty=1 authpin=0 0x5645c0e02900]

As shown above, the failed test has "dir_auth == CDIR_AUTH_DEFAULT", and the "state" differs as well (the failed run already carries the auxsubtree flag).
Debugging continues...
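For comparing the two log lines field by field, a small helper along these lines may be useful. It is only a sketch: the function names are hypothetical, the regular expressions assume the `handle_export_pins` line layout seen in the two excerpts above, and the sample strings are abbreviated copies of those excerpts (the `...` marks omitted text).

```python
import re

# Patterns assume the handle_export_pins log layout shown in this ticket.
FIELD_RE = {
    "action": re.compile(r"handle_export_pins\s+(.*?)\s+on \[dir"),
    "dir_auth": re.compile(r"\bdir_auth=(\S+)"),
    "state": re.compile(r"\bstate=(\S+)"),
}


def extract_fields(line):
    """Return the action, dir_auth and state fields of one log line (None if absent)."""
    fields = {}
    for name, pattern in FIELD_RE.items():
        match = pattern.search(line)
        fields[name] = match.group(1) if match else None
    return fields


def state_flags(state):
    """Split 'NUMBER|flag|flag' into the list of named flags."""
    if state is None:
        return []
    return state.split("|")[1:]


# Abbreviated excerpts of the two log lines quoted above.
PASSED_LINE = (
    "10 mds.2.bal handle_export_pins  set auxsubtree bit on [dir 0x10000000005 ... "
    "auth{0=1} pv=38 v=37 cv=0/0 dir_auth=2 ap=1+0 state=1610874881|complete ..."
)
FAILED_LINE = (
    "10 mds.0.bal handle_export_pins  create aux subtree on [dir 0x10000000236 ... "
    "auth v=5 cv=0/0 dir_auth=0 state=1611399169|complete|auxsubtree ..."
)

if __name__ == "__main__":
    for label, line in (("passed", PASSED_LINE), ("failed", FAILED_LINE)):
        fields = extract_fields(line)
        print(label, fields["action"], fields["dir_auth"], state_flags(fields["state"]))
```

Running it over the full ceph-mds logs of both jobs would make it easier to see how often each branch is taken, rather than comparing single lines by eye.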

Actions #3

Updated by Jos Collin 7 months ago · Edited

@Venky Shankar
We need to increase 'mds_beacon_grace' to a higher value for the random.yaml test, as we have 132 missed calls to MDBalancer::tick() --> handle_export_pins().
This matters because handle_export_pins() is what subsequently calls Migrator::export_dir(), after which l_mds_exported is incremented.

Another thing to consider is: https://github.com/ceph/ceph/blob/main/src/mds/CInode.cc#L5650.
We generate a random number between 0.0 and 1.0. So if the test keeps a threshold of 0.02, very few items get added to the queue, which also keeps the counters from being incremented (fewer iterations in MDBalancer::handle_export_pins and even fewer export_dir hits).
So the test needs a higher threshold, so that the counters have a chance to be incremented.
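To illustrate why a small threshold starves the queue, here is a minimal simulation sketch. It is not the CInode.cc code: it assumes each directory is queued when a uniform draw in [0.0, 1.0) falls below the threshold, and all function names are hypothetical.

```python
import random


def expected_queued(num_dirs, threshold):
    """Expected number of directories whose draw falls below the threshold."""
    return num_dirs * threshold


def prob_none_queued(num_dirs, threshold):
    """Probability that not a single directory is queued."""
    return (1.0 - threshold) ** num_dirs


def simulate_queued(num_dirs, threshold, seed=0):
    """Count directories queued in one simulated run (assumed rule: draw < threshold)."""
    rng = random.Random(seed)
    return sum(1 for _ in range(num_dirs) if rng.random() < threshold)


if __name__ == "__main__":
    for threshold in (0.02, 0.2):
        print(
            threshold,
            expected_queued(1000, threshold),
            prob_none_queued(50, threshold),
            simulate_queued(1000, threshold),
        )
```

Under this assumption, 1000 directories with a threshold of 0.02 yield about 20 queued items on average, while a threshold of 0.2 yields about 200, so a short test run has far more opportunities to reach export_dir with the higher value.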

Actions #4

Updated by Jos Collin 7 months ago

  • Pull request ID set to 65109
Actions #5

Updated by Jos Collin 7 months ago

  • Status changed from In Progress to Fix Under Review