Bug #72260


[CephFS - Subvolume] Not all subvolumes are listed when Cluster Storage is in almost full state

Added by Kotresh Hiremath Ravishankar 8 months ago. Updated 7 months ago.

Status:
Pending Backport
Priority:
Normal
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
other
Backport:
tentacle,squid,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v20.3.0-2695-g100415d600
Released In:
Upkeep Timestamp:
2025-09-01T13:21:24+00:00

Description

This is raised as part of downstream BZ - https://bugzilla.redhat.com/show_bug.cgi?id=2313740

Description of problem:

When the cluster storage is almost full, listing subvolumes in a filesystem is expected to produce the complete list of subvolumes. This matters because a user may want to pick subvolumes from that list to free up space when the cluster is full.

However, the output of 'ceph fs subvolume ls cephfs' prints only a partial list of subvolumes.

Logs:

*****************************************************************

[root@ceph-sumar-tfa-fix-8nhx13-node9 ~]# ceph df
--- RAW STORAGE ---
CLASS     SIZE   AVAIL     USED  RAW USED  %RAW USED
hdd    560 GiB  76 GiB  484 GiB   484 GiB      86.45
TOTAL  560 GiB  76 GiB  484 GiB   484 GiB      86.45

--- POOLS ---
POOL                ID   PGS   STORED  OBJECTS     USED   %USED  MAX AVAIL
.mgr                 1     1  598 KiB        2  1.8 MiB  100.00        0 B
cephfs.cephfs.meta   2    16  613 MiB   24.75k  1.8 GiB  100.00        0 B
cephfs.cephfs.data   3  1024  147 GiB    1.15M  446 GiB  100.00        0 B

root@ceph-sumar-tfa-fix-8nhx13-node9 ~]# ceph fs status
cephfs - 6 clients
======
RANK      STATE                            MDS                          ACTIVITY     DNS    INOS   DIRS   CAPS  
 0        active      cephfs.ceph-sumar-tfa-fix-8nhx13-node4.swgmdk  Reqs:   28 /s   485k   459k  11.7k  30.7k  
 1        active      cephfs.ceph-sumar-tfa-fix-8nhx13-node7.njetga  Reqs:    0 /s   101k  48.1k   471      5   
 2        active      cephfs.ceph-sumar-tfa-fix-8nhx13-node2.jcaybs  Reqs:    0 /s   170    175    130      2   
0-s   standby-replay  cephfs.ceph-sumar-tfa-fix-8nhx13-node5.hyskuk  Evts:   80 /s   203k   163k  8571      0   
1-s   standby-replay  cephfs.ceph-sumar-tfa-fix-8nhx13-node8.cilvdm  Evts:    0 /s   136k  44.5k   464      0   
2-s   standby-replay  cephfs.ceph-sumar-tfa-fix-8nhx13-node6.wjzpkf  Evts:    0 /s     0      3      1      0   
       POOL           TYPE     USED  AVAIL  
cephfs.cephfs.meta  metadata  1733M     0   
cephfs.cephfs.data    data     473G     0   
                 STANDBY MDS                   
cephfs.ceph-sumar-tfa-fix-8nhx13-node3.qxvevu  
MDS version: ceph version 19.1.1-62.el9cp (15005be6f81af48462a2de37e490f5d6a6d2e860) squid (rc)

[root@ceph-sumar-tfa-fix-8nhx13-node9 ~]# ceph fs subvolume ls cephfs
[
    {
        "name": "sv1" 
    },
    {
        "name": "sv2" 
    }
]
[root@ceph-sumar-tfa-fix-8nhx13-node9 ~]# ceph fs subvolume ls cephfs svg1
[
    {
        "name": "sv3" 
    }
]
[root@ceph-sumar-tfa-fix-8nhx13-node9 ~]# 

root@ceph-sumar-tfa-fix-8nhx13-node10 mnt]# cd cephfs/
[root@ceph-sumar-tfa-fix-8nhx13-node10 cephfs]# ls
volumes
[root@ceph-sumar-tfa-fix-8nhx13-node10 cephfs]# cd volumes/
[root@ceph-sumar-tfa-fix-8nhx13-node10 volumes]# ls
_:sv1.meta  _:sv2.meta  _deleting  _index  _nogroup  _svg1:sv4.meta  svg1
[root@ceph-sumar-tfa-fix-8nhx13-node10 volumes]# cd _deleting/
[root@ceph-sumar-tfa-fix-8nhx13-node10 _deleting]# ls
[root@ceph-sumar-tfa-fix-8nhx13-node10 _deleting]# cd ..
[root@ceph-sumar-tfa-fix-8nhx13-node10 volumes]# cd _nogroup/
[root@ceph-sumar-tfa-fix-8nhx13-node10 _nogroup]# ls
sv1  sv2  sv3_clone
[root@ceph-sumar-tfa-fix-8nhx13-node10 _nogroup]# cd sv3_clone
[root@ceph-sumar-tfa-fix-8nhx13-node10 sv3_clone]# ls
fd6926f6-68a8-469d-9b94-375238da6df9
[root@ceph-sumar-tfa-fix-8nhx13-node10 sv3_clone]# cd fd6926f6-68a8-469d-9b94-375238da6df9/
[root@ceph-sumar-tfa-fix-8nhx13-node10 fd6926f6-68a8-469d-9b94-375238da6df9]# ls
smallfile_dir1  smallfile_dir2
[root@ceph-sumar-tfa-fix-8nhx13-node10 fd6926f6-68a8-469d-9b94-375238da6df9]# cd ..

[root@ceph-sumar-tfa-fix-8nhx13-node10 sv3_clone]# cd ..
[root@ceph-sumar-tfa-fix-8nhx13-node10 _nogroup]# cd ../svg1
[root@ceph-sumar-tfa-fix-8nhx13-node10 svg1]# ls
sv1_clone  sv3  sv4

[root@ceph-sumar-tfa-fix-8nhx13-node9 ~]# ceph fs subvolume ls cephfs
[
    {
        "name": "sv1" 
    },
    {
        "name": "sv2" 
    },
    {
        "name": "sv3_clone" 
    }
]
[root@ceph-sumar-tfa-fix-8nhx13-node9 ~]# ceph fs subvolume ls cephfs svg1
[
    {
        "name": "sv3" 
    }
]

root@ceph-sumar-tfa-fix-8nhx13-node9 ~]# ceph fs subvolume ls cephfs svg1
[
    {
        "name": "sv3" 
    },
    {
        "name": "sv4" 
    },
    {
        "name": "sv1_clone" 
    }
]
*******************************************************

Version-Release number of selected component (if applicable): 19.1.1-62.el9cp

How reproducible:

Steps to Reproduce:
1. Create a few subvolumes across the default and non-default groups
2. Add data to fill storage space
3. List subvolumes when cluster is full.

Actual results: Not all subvolumes were listed

Expected results: All subvolumes should be listed

Additional info: After multiple retries and after traversing the filesystem mount path to search for the corresponding subvolumes, the missing subvolumes eventually appeared in the 'ceph fs subvolume ls cephfs' output.


Related issues 3 (1 open, 2 closed)

Copied to CephFS - Backport #72802: squid: [CephFS - Subvolume] Not all subvolumes are listed when Cluster Storage is in almost full state (QA Testing, Jos Collin)
Copied to CephFS - Backport #72803: tentacle: [CephFS - Subvolume] Not all subvolumes are listed when Cluster Storage is in almost full state (Resolved, Jos Collin)
Copied to CephFS - Backport #72804: reef: [CephFS - Subvolume] Not all subvolumes are listed when Cluster Storage is in almost full state (Resolved, Jos Collin)
Actions #1

Updated by Kotresh Hiremath Ravishankar 8 months ago

Bug Reproduction and cluster state:

When the OSD is full, the following subvolume ls command returned one subvolume fewer than expected: the subvolume 'sv5' is not listed.

[khiremat@vossi01 build]$ ceph fs subvolume ls a svg3
2025-07-23T07:25:34.376+0000 7f55c302b640 -1 WARNING: all dangerous and experimental features are enabled.
2025-07-23T07:25:34.390+0000 7f55c302b640 -1 WARNING: all dangerous and experimental features are enabled.
[
    {
        "name": "sv3" 
    },
    {
        "name": "sv1" 
    },
    {
        "name": "sv4" 
    },
    {
        "name": "sv2" 
    }
]

The subvolumegroup directory 'svg3' is pinned to rank 0 (mds.d), and mgr.x is the libcephfs client performing the readdir.

[khiremat@vossi01 BZ2313740]$ history | grep setfattr
 2406  setfattr -n ceph.dir.pin -v 0 ~/mnt/volumes/svg1
 2407  sudo setfattr -n ceph.dir.pin -v 0 ~/mnt/volumes/svg1
 2408  sudo setfattr -n ceph.dir.pin -v 0 ~/mnt/volumes/svg2
 2409  sudo setfattr -n ceph.dir.pin -v 0 ~/mnt/volumes/svg3
 2410  sudo setfattr -n ceph.dir.pin -v 1 ~/mnt/volumes/svg4
 2411  sudo setfattr -n ceph.dir.pin -v 1 ~/mnt/volumes/svg5

Cluster state:

[khiremat@vossi01 build]$ ceph df
2025-07-23T07:25:08.464+0000 7f571e909640 -1 WARNING: all dangerous and experimental features are enabled.
2025-07-23T07:25:08.471+0000 7f571e909640 -1 WARNING: all dangerous and experimental features are enabled.
--- RAW STORAGE ---
CLASS   SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    6 GiB  491 MiB  5.5 GiB   5.5 GiB      92.00
TOTAL  6 GiB  491 MiB  5.5 GiB   5.5 GiB      92.00

--- POOLS ---
POOL           ID  PGS   STORED  OBJECTS     USED   %USED  MAX AVAIL
.mgr            1    1  577 KiB        2  580 KiB  100.00        0 B
cephfs.a.meta   2   16  3.2 MiB      107  3.3 MiB  100.00        0 B
cephfs.a.data   3  128  4.5 GiB    1.24k  4.5 GiB  100.00        0 B

[khiremat@vossi01 build]$ ceph fs status
2025-07-23T07:29:47.652+0000 7f9b98ba2640 -1 WARNING: all dangerous and experimental features are enabled.
2025-07-23T07:29:47.763+0000 7f9b98ba2640 -1 WARNING: all dangerous and experimental features are enabled.
a - 2 clients
=
RANK  STATE   MDS     ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active   d   Reqs: 2214 /s   161    145     59     69
 1    active   c   Reqs: 2213 /s   105     88     59      0
     POOL        TYPE     USED  AVAIL
cephfs.a.meta  metadata  3406k     0
cephfs.a.data    data    4624M     0
STANDBY MDS
     a
     b
MDS version: ceph version Development (no_version) tentacle (dev - Debug)

[khiremat@vossi01 build]$ ceph status
2025-07-23T07:29:59.139+0000 7ff94954b640 -1 WARNING: all dangerous and experimental features are enabled.
2025-07-23T07:29:59.721+0000 7ff94954b640 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     9b77c6f6-a19b-4192-a3cc-c1ce5222d158
    health: HEALTH_ERR
            Module 'telemetry' has failed dependency:
            1 full osd(s)
            3 pool(s) full
            3 pool(s) have no replicas configured

  services:
    mon: 1 daemons, quorum a (age 71m) [leader: a]
    mgr: x(active, since 4m)
    mds: 2/2 daemons up, 2 standby
    osd: 1 osds: 1 up (since 70m), 1 in (since 71m)

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 145 pgs
    objects: 1.35k objects, 4.5 GiB
    usage:   5.5 GiB used, 491 MiB / 6 GiB avail
    pgs:     145 active+clean

I will update the analysis.

Thanks,
Kotresh HR

Actions #2

Updated by Kotresh Hiremath Ravishankar 8 months ago

Analysing the client log (in this case, the mgr's libcephfs client), it turns out that all five subvolumes are returned by the MDS during readdir. Since rstats are enabled, the client fetches the 'Fa' cap for each dentry to fill in the size information. While getting the 'Fa' cap for 'sv5', the getattr request went to the non-auth MDS (mds.1) with the mask CEPH_STAT_RSTAT, i.e. want_auth=True. Instead of returning the auth MDS to the client, mds.1 returned ESTALE, causing the issue. Check the mgr client logs below.

mgr.x.log:

57487 2025-07-23T07:25:34.627+0000 7f34c3da2640  0 [volumes INFO volumes.module] Starting _cmd_fs_subvolume_ls(group_name:svg3, prefix:fs subvolume ls, targe        t:['mon-mgr', ''], vol_name:a) < "" 
...
...
  58036 2025-07-23T07:25:34.706+0000 7f34c3da2640 10 client.4390 send_request client_request(unknown.0:6 readdir #0x10000000003 2025-07-23T07:25:34.707019+0000         caller_uid=0, caller_gid=0{}) to mds.0
  58037 2025-07-23T07:25:34.706+0000 7f34c3da2640  1 -- 172.21.10.1:0/903661256 --> [v2:172.21.10.1:6826/267184127,v1:172.21.10.1:6827/267184127] -- client_req        uest(unknown.0:6 readdir #0x10000000003 2025-07-23T07:25:34.707019+0000 caller_uid=0, caller_gid=0{}) -- 0x56374ccc4a80 con 0x56374cdae000
  58038 2025-07-23T07:25:34.706+0000 7f34c3da2640 20 client.4390 awaiting reply|forward|kick on 0x7f34c3d9c4e0
  58039 2025-07-23T07:25:34.706+0000 7f34b6d88640  1 -- 172.21.10.1:0/903661256 <== mds.0 v2:172.21.10.1:6826/267184127 8 ==== client_reply(???:6 = 0 (0) Succe        ss) ==== 2304+0+0 (secure 0 0 0) 0x56374ce95600 con 0x56374cdae000
  58040 2025-07-23T07:25:34.706+0000 7f34b6d88640 20 client.4390 handle_client_reply got a reply. Safe:1 tid 6
  58041 2025-07-23T07:25:34.706+0000 7f34b6d88640 10 client.4390 insert_trace from 2025-07-23T07:25:34.707035+0000 mds.0 is_target=1 is_dentry=0

...
...
  58053 2025-07-23T07:25:34.706+0000 7f34b6d88640 10 client.4390 insert_readdir_results 5 readdir items, end=1, hash_order=1, readdir_start , last_hash 0, next        _offset 2
  58054 2025-07-23T07:25:34.706+0000 7f34b6d88640 10 client.4390 _readdir_drop_dirp_buffer 0x56374cd9f340
  58055 2025-07-23T07:25:34.706+0000 7f34b6d88640 15 client.4390 0: 'sv3'
...
  58064 2025-07-23T07:25:34.706+0000 7f34b6d88640 15 client.4390 insert_readdir_results  ff234b150000002: 'sv3' -> 0x1000000023f
  58065 2025-07-23T07:25:34.706+0000 7f34b6d88640 15 client.4390 1: 'sv1'
...
  58074 2025-07-23T07:25:34.706+0000 7f34b6d88640 15 client.4390 insert_readdir_results  ff3216cd0000002: 'sv1' -> 0x10000000237
  58075 2025-07-23T07:25:34.706+0000 7f34b6d88640 15 client.4390 2: 'sv4'
...
  58084 2025-07-23T07:25:34.706+0000 7f34b6d88640 15 client.4390 insert_readdir_results  ffc8a7160000002: 'sv4' -> 0x10000000243
  58085 2025-07-23T07:25:34.706+0000 7f34b6d88640 15 client.4390 3: 'sv2'
...
  58094 2025-07-23T07:25:34.706+0000 7f34b6d88640 15 client.4390 insert_readdir_results  ffcc49d80000002: 'sv2' -> 0x1000000023b
  58095 2025-07-23T07:25:34.706+0000 7f34b6d88640 15 client.4390 4: 'sv5'
...
  58104 2025-07-23T07:25:34.706+0000 7f34b6d88640 15 client.4390 insert_readdir_results  fffe0de50000002: 'sv5' -> 0x10000000247
  58105 2025-07-23T07:25:34.706+0000 7f34b6d88640 20 client.4390 put_snap_realm 0x1 0x56374c5af180 4 -> 3
  58106 2025-07-23T07:25:34.706+0000 7f34b6d88640 20 client.4390 handle_client_reply signalling caller 0x7f34c3d9c4e0
  58107 2025-07-23T07:25:34.706+0000 7f34b6d88640 20 client.4390 handle_client_reply awaiting kickback on tid 6 0x7f34b6d84060
  58108 2025-07-23T07:25:34.706+0000 7f34c3da2640 20 client.4390 sendrecv kickback on tid 6 0x7f34b6d84060
  58109 2025-07-23T07:25:34.706+0000 7f34c3da2640 20 client.4390 lat 0.000590
  58110 2025-07-23T07:25:34.706+0000 7f34c3da2640 10 client.4390 _readdir_get_frag 0x56374cd9f340 got frag * size 5
  58111 2025-07-23T07:25:34.706+0000 7f34c3da2640 10 client.4390 _readdir_r_cb frag * buffer size 5 offset 2
  58112 2025-07-23T07:25:34.706+0000 7f34c3da2640 10 client.4390 fill_statx on 0x1000000023f snap/devhead mode 040777 mtime 2025-07-23T06:53:42.699049+0000 cti        me 2025-07-23T06:53:42.699049+0000 change_attr 5
  58113 2025-07-23T07:25:34.706+0000 7f34c3da2640 20 client.4390 _unwrap_name: (sv3, ) on 0x10000000003.head(faked_ino=0 nref=17 ll_ref=0 cap_refs={} open={} m        ode=40755 size=0/0 nlink=1 btime=2025-07-23T06:52:49.343505+0000 mtime=2025-07-23T06:53:43.705400+0000 ctime=2025-07-23T07:14:52.204869+0000 change_att        r=7 caps=pAsLsXsFs(0=pAsLsXsFs) parents=0x10000000000.head["svg3" ref=2 ino=0x10000000003.head csg=1] has_dir_layout 0x56374ce45e00)
  58114 2025-07-23T07:25:34.706+0000 7f34c3da2640 10 client.4390 fill_dirent 'sv3' -> 0x1000000023f type 4 w/ next_off ff234b150000003
  58115 2025-07-23T07:25:34.706+0000 7f34c3da2640 15 client.4390 _readdir_r_cb de sv3 off ff234b150000002 snap head = 1
...
  58120 2025-07-23T07:25:34.706+0000 7f34c3da2640 10 client.4390 _readdir_r_cb frag * buffer size 5 offset ff234b150000003
  58121 2025-07-23T07:25:34.706+0000 7f34c3da2640 10 client.4390 _getattr mask Fa issued=0
...
  58123 2025-07-23T07:25:34.706+0000 7f34c3da2640 20 client.4390 choose_target_mds 0x10000000237.head(faked_ino=0 nref=5 ll_ref=0 cap_refs={} open={} mode=4077        7 size=0/0 nlink=1 btime=2025-07-23T06:53:41.420920+0000 mtime=2025-07-23T06:53:41.541765+0000 ctime=2025-07-23T06:53:41.541765+0000 change_attr=5 caps        =- parents=0x10000000003.head["sv1" ref=1 ino=0x10000000237.head csg=1] 0x56374ce46a00) is_hash=0 hash=0
  58124 2025-07-23T07:25:34.706+0000 7f34c3da2640 10 client.4390 did not get mds through better means, so chose random mds 1
  58125 2025-07-23T07:25:34.706+0000 7f34c3da2640 20 client.4390 mds is 1
  58126 2025-07-23T07:25:34.706+0000 7f34c3da2640 10 client.4390 send_request rebuilding request 7 for mds.1
  58127 2025-07-23T07:25:34.706+0000 7f34c3da2640 20 client.4390 encode_cap_releases enter (req: 0x56374ccc4380, mds: 1)
  58128 2025-07-23T07:25:34.706+0000 7f34c3da2640 20 client.4390 send_request set sent_stamp to 2025-07-23T07:25:34.707676+0000
  58129 2025-07-23T07:25:34.706+0000 7f34c3da2640 10 client.4390 send_request client_request(unknown.0:7 getattr Fa #0x10000000237 2025-07-23T07:25:34.707663+0        000 caller_uid=0, caller_gid=0{}) to mds.1
  58130 2025-07-23T07:25:34.706+0000 7f34c3da2640  1 -- 172.21.10.1:0/903661256 --> [v2:172.21.10.1:6824/4130805393,v1:172.21.10.1:6825/4130805393] -- client_r        equest(unknown.0:7 getattr Fa #0x10000000237 2025-07-23T07:25:34.707663+0000 caller_uid=0, caller_gid=0{}) -- 0x56374ccc5500 con 0x56374ccd1800
  58131 2025-07-23T07:25:34.706+0000 7f34c3da2640 20 client.4390 awaiting reply|forward|kick on 0x7f34c3d9c4c0
  58132 2025-07-23T07:25:34.707+0000 7f34b6d88640  1 -- 172.21.10.1:0/903661256 <== mds.1 v2:172.21.10.1:6824/4130805393 3 ==== client_request_forward(7 to mds        .0 num_fwd=1 client_must_resend) ==== 9+0+0 (secure 0 0 0) 0x56374cdc8b60 con 0x56374ccd1800
  58133 2025-07-23T07:25:34.707+0000 7f34b6d88640 10 client.4390 handle_client_request_forward tid 7 fwd 1 to mds.0, resending to 0
  58134 2025-07-23T07:25:34.707+0000 7f34c3da2640 10 client.4390 choose_target_mds resend_mds specified as mds.0
...
  58139 2025-07-23T07:25:34.707+0000 7f34c3da2640 10 client.4390 send_request client_request(unknown.0:7 getattr Fa #0x10000000237 2025-07-23T07:25:34.707663+0        000 FWD=1 caller_uid=0, caller_gid=0{}) to mds.0
...
  58142 2025-07-23T07:25:34.707+0000 7f34b6d88640  1 -- 172.21.10.1:0/903661256 <== mds.0 v2:172.21.10.1:6826/267184127 9 ==== client_reply(???:7 = 0 (0) Succe        ss) ==== 511+0+0 (secure 0 0 0) 0x56374ce95800 con 0x56374cdae000
...
...
  58178 2025-07-23T07:25:34.707+0000 7f34c3da2640 10 client.4390 fill_statx on 0x10000000237 snap/devhead mode 040777 mtime 2025-07-23T06:53:41.541765+0000 cti        me 2025-07-23T06:53:41.541765+0000 change_attr 5
  58179 2025-07-23T07:25:34.707+0000 7f34c3da2640 20 client.4390 _unwrap_name: (sv1, ) on 0x10000000003.head(faked_ino=0 nref=17 ll_ref=0 cap_refs={} open={} m        ode=40755 size=0/0 nlink=1 btime=2025-07-23T06:52:49.343505+0000 mtime=2025-07-23T06:53:43.705400+0000 ctime=2025-07-23T07:14:52.204869+0000 change_att        r=7 caps=pAsLsXsFs(0=pAsLsXsFs) parents=0x10000000000.head["svg3" ref=2 ino=0x10000000003.head csg=1] has_dir_layout 0x56374ce45e00)
...
...      --------------------Same for other 3 entries ---------------------------
...
  58306 2025-07-23T07:25:34.708+0000 7f34c3da2640 10 client.4390 _readdir_r_cb 0x10000000003.head(faked_ino=0 nref=17 ll_ref=0 cap_refs={} open={} mode=40755 s        ize=0/0 nlink=1 btime=2025-07-23T06:52:49.343505+0000 mtime=2025-07-23T06:53:43.705400+0000 ctime=2025-07-23T07:14:52.204869+0000 change_attr=7 caps=pA        sLsXsFs(0=pAsLsXsFs) parents=0x10000000000.head["svg3" ref=2 ino=0x10000000003.head csg=1] has_dir_layout 0x56374ce45e00) offset ffcc49d80000003 at_end        =0 hash_order=1
  58307 2025-07-23T07:25:34.708+0000 7f34c3da2640 10 client.4390 _readdir_r_cb offset ffcc49d80000003 snapid head (complete && ordered) 0 issued pAsLsXsFs
  58308 2025-07-23T07:25:34.708+0000 7f34c3da2640 10 client.4390 _readdir_r_cb frag * buffer size 5 offset ffcc49d80000003
  58309 2025-07-23T07:25:34.708+0000 7f34c3da2640 10 client.4390 _getattr mask Fa issued=0
  58310 2025-07-23T07:25:34.708+0000 7f34c3da2640 20 client.4390 choose_target_mds starting with req->inode 0x10000000247.head(faked_ino=0 nref=5 ll_ref=0 cap_        refs={} open={} mode=40777 size=0/0 nlink=1 btime=2025-07-23T06:53:43.705400+0000 mtime=2025-07-23T06:53:43.872969+0000 ctime=2025-07-23T06:53:43.87296        9+0000 change_attr=5 caps=- parents=0x10000000003.head["sv5" ref=1 ino=0x10000000247.head csg=1] 0x56374ceba000)
  58311 2025-07-23T07:25:34.708+0000 7f34c3da2640 20 client.4390 choose_target_mds 0x10000000247.head(faked_ino=0 nref=5 ll_ref=0 cap_refs={} open={} mode=4077        7 size=0/0 nlink=1 btime=2025-07-23T06:53:43.705400+0000 mtime=2025-07-23T06:53:43.872969+0000 ctime=2025-07-23T06:53:43.872969+0000 change_attr=5 caps        =- parents=0x10000000003.head["sv5" ref=1 ino=0x10000000247.head csg=1] 0x56374ceba000) is_hash=0 hash=0
  58312 2025-07-23T07:25:34.708+0000 7f34c3da2640 10 client.4390 did not get mds through better means, so chose random mds 1
  58313 2025-07-23T07:25:34.708+0000 7f34c3da2640 20 client.4390 mds is 1
  58314 2025-07-23T07:25:34.708+0000 7f34c3da2640 10 client.4390 send_request rebuilding request 10 for mds.1
  58315 2025-07-23T07:25:34.708+0000 7f34c3da2640 20 client.4390 encode_cap_releases enter (req: 0x56374cf56380, mds: 1)
  58316 2025-07-23T07:25:34.708+0000 7f34c3da2640 20 client.4390 send_request set sent_stamp to 2025-07-23T07:25:34.709338+0000
  58317 2025-07-23T07:25:34.708+0000 7f34c3da2640 10 client.4390 send_request client_request(unknown.0:10 getattr Fa #0x10000000247 2025-07-23T07:25:34.709326+0000 caller_uid=0, caller_gid=0{}) to mds.1
  58318 2025-07-23T07:25:34.708+0000 7f34c3da2640  1 -- 172.21.10.1:0/903661256 --> [v2:172.21.10.1:6824/4130805393,v1:172.21.10.1:6825/4130805393] -- client_request(unknown.0:10 getattr Fa #0x10000000247 2025-07-23T07:25:34.709326+0000 caller_uid=0, caller_gid=0{}) -- 0x56374cf56a80 con 0x56374ccd1800
  58319 2025-07-23T07:25:34.708+0000 7f34c3da2640 20 client.4390 awaiting reply|forward|kick on 0x7f34c3d9c4c0
  58320 2025-07-23T07:25:34.708+0000 7f34b6d88640  1 -- 172.21.10.1:0/903661256 <== mds.1 v2:172.21.10.1:6824/4130805393 5 ==== client_reply(???:10 = -116 (116) Stale file handle) ==== 27+0+0 (secure 0 0 0) 0x56374cb53200 con 0x56374ccd1800

...  ------ The mds1 didn't ask the client to forward it to mds.0 but instead returned ESTALE -------------------
...
...
  58341 2025-07-23T07:25:34.708+0000 7f34c3da2640  0 [volumes DEBUG volumes.fs.operations.lock] exited global lock
  58342 2025-07-23T07:25:34.708+0000 7f34c3da2640  0 [volumes INFO volumes.module] Finishing _cmd_fs_subvolume_ls(group_name:svg3, prefix:fs subvolume ls, targ        et:['mon-mgr', ''], vol_name:a) < "" 
...
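The client-side effect seen in the log above can be modeled with a small, self-contained sketch (all names here are hypothetical, not the actual libcephfs API): readdir returns every dentry, but each entry still needs a per-dentry getattr (to satisfy the 'Fa' cap / rstat size) before it is emitted, so a single failed getattr silently shrinks the listing.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy model of the listing path: the MDS returned every dentry via
// readdir, but an entry is only emitted if its follow-up getattr
// succeeds. An ESTALE on one entry drops it from the result, which is
// exactly the "partial list" symptom reported in this ticket.
std::vector<std::string> list_subvolumes(
    const std::vector<std::string>& dentries,
    const std::function<bool(const std::string&)>& getattr_ok) {
  std::vector<std::string> out;
  for (const auto& name : dentries) {
    if (getattr_ok(name))   // getattr failed -> entry silently skipped
      out.push_back(name);
  }
  return out;
}
```

With dentries {sv3, sv1, sv4, sv2, sv5} and a getattr that fails only for 'sv5' (mimicking the ESTALE from mds.1), the sketch reproduces the four-entry output shown in comment #1.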

Going through the mds.1 (mds.c) logs for the getattr request that received ESTALE: 'dispatch_client_request' returns ESTALE directly. It never reaches 'handle_client_getattr', which would have done the path traversal with want_auth=True and returned the auth MDS so that the client could forward the request there.

Please see the logs below.

mds.c.log

...
2025-07-23T07:25:34.708+0000 7f62dfc4e640  1 -- [v2:172.21.10.1:6824/4130805393,v1:172.21.10.1:6825/4130805393] <== client.4390 172.21.10.1:0/903661256 5 ==== client_request(client.4390:10 getatt
r Fa #0x10000000247 2025-07-23T07:25:34.709326+0000 caller_uid=0, caller_gid=0{}) ==== 168+0+0 (secure 0 0 0) 0x55b9644cc000 con 0x55b964434000
2025-07-23T07:25:34.708+0000 7f62dfc4e640  4 mds.1.server handle_client_request client_request(client.4390:10 getattr Fa #0x10000000247 2025-07-23T07:25:34.709326+0000 caller_uid=0, caller_gid=0{
})
2025-07-23T07:25:34.708+0000 7f62dfc4e640 20 mds.1.8 get_session have 0x55b9644f8000 client.4390 172.21.10.1:0/903661256 state open
2025-07-23T07:25:34.708+0000 7f62dfc4e640 15 mds.1.server  oldest_client_tid=10
2025-07-23T07:25:34.708+0000 7f62dfc4e640  7 mds.1.cache request_start request(client.4390:10 nref=2 cr=0x55b9644cc000)
2025-07-23T07:25:34.708+0000 7f62dfc4e640  7 mds.1.server dispatch_client_request client_request(client.4390:10 getattr Fa #0x10000000247 2025-07-23T07:25:34.709326+0000 caller_uid=0, caller_gid=
0{})
2025-07-23T07:25:34.708+0000 7f62dfc4e640  7 mds.1.server reply_client_request -116 ((116) Stale file handle) client_request(client.4390:10 getattr Fa #0x10000000247 2025-07-23T07:25:34.709326+00
00 caller_uid=0, caller_gid=0{})
...

Looking further into the code: mds.1 didn't have the inode #0x10000000247 of 'sv5' in its mdcache. When the OSD is full, the inode is fetched to validate FULL cap access for certain operations, and this fetch failed with ESTALE.
Please check the code below.

void Server::dispatch_client_request(const MDRequestRef& mdr)
{
...
...
   if (is_full) {
    CInode *cur = try_get_auth_inode(mdr, req->get_filepath().get_ino());
    if (!cur) {
      // the request is already responded to
      return;
    }
...
...
}

CInode* Server::try_get_auth_inode(const MDRequestRef& mdr, inodeno_t ino)
{
  CInode *in = mdcache->get_inode(ino);
  if (!in || in->state_test(CInode::STATE_PURGING)) {
    respond_to_request(mdr, -ESTALE);
    return nullptr;
  }
  if (!in->is_auth()) {
    mdcache->request_forward(mdr, in->authority().first);
    return nullptr;
  }

  return in;
}

So this looks like it was introduced by commit 6db81d8479b539d3ca6b98dc244c525e71a36437,
which introduced the FULL cap to allow subvolume deletion when OSDs are full.
I think the code that fetches the inode should run only for the operations
that are actually affected by it.
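As a rough illustration of that suggestion, here is a toy model of the ESTALE-vs-forward decision (the types, the op names in needs_full_check, and the gating are hypothetical simplifications, not Ceph's actual code): an inode that is not even in the local cache yields ESTALE, a cached but non-auth inode is forwarded, and the proposed fix gates the full-cluster inode lookup to only the operations that need the FULL cap check, so a plain getattr falls through to the normal path that forwards to the auth MDS.

```cpp
#include <cassert>
#include <set>
#include <string>

enum class Reply { Ok, Forwarded, Estale };

struct MockCache {
  std::set<long> cached_inos;  // inodes present in this MDS's cache
  std::set<long> auth_inos;    // inodes this MDS is authoritative for
};

// Mirrors the shape of Server::try_get_auth_inode quoted above: an inode
// missing from the cache gets ESTALE; cached but non-auth gets forwarded.
Reply try_get_auth_inode(const MockCache& c, long ino) {
  if (!c.cached_inos.count(ino))
    return Reply::Estale;      // the bug path hit by the 'sv5' getattr
  if (!c.auth_inos.count(ino))
    return Reply::Forwarded;   // normal path: client retries on auth MDS
  return Reply::Ok;
}

// Sketch of the suggested fix: run the FULL-cap inode fetch only for the
// operations affected by the full-OSD check (op set is an assumption), so
// getattr keeps the want_auth traversal that forwards to the auth MDS.
Reply dispatch(const MockCache& c, const std::string& op, long ino,
               bool osd_full) {
  const std::set<std::string> needs_full_check = {"unlink", "rmdir"};
  if (osd_full && needs_full_check.count(op))
    return try_get_auth_inode(c, ino);
  // getattr-like ops: path traversal with want_auth forwards instead
  return c.auth_inos.count(ino) ? Reply::Ok : Reply::Forwarded;
}
```

Under this model, a getattr for an uncached, non-auth inode on a full cluster is forwarded rather than failed, which is the behavior the client log shows for the requests that went to mds.0 successfully.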

Thanks and Regards,
Kotresh H R

Actions #3

Updated by Kotresh Hiremath Ravishankar 8 months ago

  • Pull request ID set to 64663
Actions #4

Updated by Kotresh Hiremath Ravishankar 8 months ago

  • Description updated (diff)
Actions #5

Updated by Kotresh Hiremath Ravishankar 8 months ago

  • Status changed from In Progress to Fix Under Review
Actions #6

Updated by Venky Shankar 7 months ago

  • Category set to Correctness/Safety
  • Status changed from Fix Under Review to Pending Backport
  • Target version set to v21.0.0
  • Source set to other
  • Component(FS) MDS added
Actions #7

Updated by Upkeep Bot 7 months ago

  • Copied to Backport #72802: squid: [CephFS - Subvolume] Not all subvolumes are listed when Cluster Storage is in almost full state added
Actions #8

Updated by Upkeep Bot 7 months ago

  • Copied to Backport #72803: tentacle: [CephFS - Subvolume] Not all subvolumes are listed when Cluster Storage is in almost full state added
Actions #9

Updated by Upkeep Bot 7 months ago

  • Copied to Backport #72804: reef: [CephFS - Subvolume] Not all subvolumes are listed when Cluster Storage is in almost full state added
Actions #10

Updated by Upkeep Bot 7 months ago

  • Tags (freeform) set to backport_processed
Actions #11

Updated by Upkeep Bot 7 months ago

  • Merge Commit set to 100415d600d74c48258c0a07063ca140737c278a
  • Fixed In set to v20.3.0-2695-g100415d600
  • Upkeep Timestamp set to 2025-09-01T13:21:24+00:00