Bug #73183


join_fscid is incorrectly reset for active MDS in remaining filesystems when filesystem is removed

Added by ethan wu 6 months ago. Updated 6 months ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Community (user)
Backport:
tentacle,squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v20.3.0-3386-gf8aa413815
Released In:
Upkeep Timestamp:
2025-10-03T11:20:05+00:00

Description

Active MDS daemons in the remaining filesystems incorrectly have their join_fscid cleared to FS_CLUSTER_ID_NONE when any other
filesystem is removed.

The issue is caused by variable-name shadowing in erase_filesystem(),
where the loop variable 'fscid' shadows the function parameter 'fscid'.
Inside the loop, the check if (info.join_fscid == fscid) compares against the
loop variable (a remaining filesystem's ID) instead of the parameter (the removed filesystem's ID).
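The pattern can be sketched with a minimal model (hypothetical types and layout, not the actual Ceph FSMap code): the structured binding in the range-for re-declares 'fscid' in the loop scope, so the comparison silently targets each remaining filesystem's ID.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using fs_cluster_id_t = int64_t;
constexpr fs_cluster_id_t FS_CLUSTER_ID_NONE = -1;  // sentinel value assumed

struct MDSInfo {
  fs_cluster_id_t join_fscid = FS_CLUSTER_ID_NONE;
};

struct Filesystem {
  std::vector<MDSInfo> mds_info;  // active daemons of this filesystem
};

// Toy stand-in for the fsmap: filesystem id -> filesystem.
using FSMapModel = std::map<fs_cluster_id_t, Filesystem>;

// Buggy shape: the structured binding 'fscid' shadows the parameter,
// so join_fscid is compared against each *remaining* filesystem's id.
void erase_filesystem_buggy(fs_cluster_id_t fscid, FSMapModel& fsmap) {
  fsmap.erase(fscid);
  for (auto& [fscid, fs] : fsmap) {     // shadows the parameter!
    for (auto& info : fs.mds_info) {
      if (info.join_fscid == fscid)     // remaining FS id, not the removed one
        info.join_fscid = FS_CLUSTER_ID_NONE;
    }
  }
}

// Fixed shape: give the loop id a distinct name and compare against
// the removed filesystem's id, as intended.
void erase_filesystem_fixed(fs_cluster_id_t fscid, FSMapModel& fsmap) {
  fsmap.erase(fscid);
  for (auto& [other_fscid, fs] : fsmap) {
    (void)other_fscid;                  // not needed for the check
    for (auto& info : fs.mds_info) {
      if (info.join_fscid == fscid)     // the removed FS id
        info.join_fscid = FS_CLUSTER_ID_NONE;
    }
  }
}
```

With an MDS pinned to filesystem 1 and filesystem 2 being removed, the buggy version clears join_fscid on the surviving MDS while the fixed version leaves it intact.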

../src/vstart.sh --new -x --localhost --bluestore
FS=b
./bin/ceph osd pool create cephfs.${FS}.meta 64 64 replicated
./bin/ceph osd pool create cephfs.${FS}.data 64 64 replicated
./bin/ceph fs new ${FS} cephfs.${FS}.meta cephfs.${FS}.data
./bin/ceph config set mds.a mds_join_fs a
./bin/ceph config set mds.b mds_join_fs a
./bin/ceph fs fail ${FS}
./bin/ceph fs rm ${FS} --yes-i-really-mean-it

Then, from ./bin/ceph fs dump,
we can see that join_fscid of the active MDS in filesystem 'a' has been reset.
Since there are standby MDS daemons with join_fscid=1,
MDSMonitor thinks they have better affinity and triggers a switchover.

./bin/ceph fs dump 18
2025-09-23T10:11:33.491+0800 7e02ba0fe6c0 -1 WARNING: all dangerous and experimental features are enabled.
e18
btime 2025-09-23T09:53:01:858886+0800
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments,12=quiesce subvolumes}
legacy client fscid: 1

Filesystem 'a' (1)
fs_name a
epoch 11
flags 12 joinable allow_snaps allow_multimds_snaps
created 2025-09-23T09:44:48.563241+0800
modified 2025-09-23T09:50:23.574683+0800
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
max_xattr_size 65536
required_client_features {}
last_failure 0
last_failure_osd_epoch 58
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=md
s uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments,12=quiesce subvolumes}
max_mds 1
in 0
up {0=4224}
failed
damaged
stopped
data_pools [3]
metadata_pool 2
inline_data disabled
balancer
bal_rank_mask -1
standby_count_wanted 1
qdb_cluster leader: 4224 members: 4224
[mds.a{0:4224} state up:active seq 89 join_fscid=1 addr [v2:127.0.0.1:6826/1935635411,v1:127.0.0.1:6827/1935635411] compat {c=[1],r=[1],i=[1fff]}]

Filesystem 'b' (2)
fs_name b
epoch 17
flags 13 allow_snaps allow_multimds_snaps
created 2025-09-23T09:52:05.137094+0800
modified 2025-09-23T09:53:00.840319+0800
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
max_xattr_size 65536
required_client_features {}
last_failure 0
last_failure_osd_epoch 90
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=md
s uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments,12=quiesce subvolumes}
max_mds 1
in 0
up {}
failed 0
damaged
stopped
data_pools [4]
metadata_pool 5
inline_data disabled
balancer
bal_rank_mask -1
standby_count_wanted 1
qdb_cluster leader: 0 members:

Standby daemons:

[mds.c{-1:4316} state up:standby seq 9 join_fscid=1 addr [v2:127.0.0.1:6830/171423485,v1:127.0.0.1:6831/171423485] compat {c=[1],r=[1],i=[1fff]}]
[mds.b{-1:4395} state up:standby seq 1 join_fscid=1 addr [v2:127.0.0.1:6828/1636512126,v1:127.0.0.1:6829/1636512126] compat {c=[1],r=[1],i=[1fff]}]
dumped fsmap epoch 18

./bin/ceph fs dump 19
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2025-09-23T10:13:01.327+0800 70f9d1c2b6c0 -1 WARNING: all dangerous and experimental features are enabled.
2025-09-23T10:13:01.340+0800 70f9d1c2b6c0 -1 WARNING: all dangerous and experimental features are enabled.
e19
btime 2025-09-23T09:53:32:668544+0800
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments,12=quiesce subvolumes}
legacy client fscid: 1

Filesystem 'a' (1)
fs_name a
epoch 19
flags 12 joinable allow_snaps allow_multimds_snaps
created 2025-09-23T09:44:48.563241+0800
modified 2025-09-23T09:53:32.352779+0800
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
max_xattr_size 65536
required_client_features {}
last_failure 0
last_failure_osd_epoch 58
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments,12=quiesce subvolumes}
max_mds 1
in 0
up {0=4224}
failed
damaged
stopped
data_pools [3]
metadata_pool 2
inline_data disabled
balancer
bal_rank_mask -1
standby_count_wanted 1
qdb_cluster leader: 4224 members: 4224
[mds.a{0:4224} state up:active seq 89 addr [v2:127.0.0.1:6826/1935635411,v1:127.0.0.1:6827/1935635411] compat {c=[1],r=[1],i=[1fff]}]
(note: join_fscid in mds.a is missing)

Standby daemons:

[mds.c{-1:4316} state up:standby seq 9 join_fscid=1 addr [v2:127.0.0.1:6830/171423485,v1:127.0.0.1:6831/171423485] compat {c=[1],r=[1],i=[1fff]}]
[mds.b{-1:4395} state up:standby seq 1 join_fscid=1 addr [v2:127.0.0.1:6828/1636512126,v1:127.0.0.1:6829/1636512126] compat {c=[1],r=[1],i=[1fff]}]

./bin/ceph fs dump 20
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2025-09-23T10:14:19.894+0800 7d41290c36c0 -1 WARNING: all dangerous and experimental features are enabled.
2025-09-23T10:14:19.907+0800 7d41290c36c0 -1 WARNING: all dangerous and experimental features are enabled.
e20
btime 2025-09-23T09:53:32:750063+0800
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments,12=quiesce subvolumes}
legacy client fscid: 1

Filesystem 'a' (1)
fs_name a
epoch 20
flags 12 joinable allow_snaps allow_multimds_snaps
created 2025-09-23T09:44:48.563241+0800
modified 2025-09-23T09:53:32.750059+0800
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
max_xattr_size 65536
required_client_features {}
last_failure 0
last_failure_osd_epoch 116
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments,12=quiesce subvolumes}
max_mds 1
in 0
up {0=4316}
failed
damaged
stopped
data_pools [3]
metadata_pool 2
inline_data disabled
balancer
bal_rank_mask -1
standby_count_wanted 1
qdb_cluster leader: 0 members:
[mds.c{0:4316} state up:replay seq 9 join_fscid=1 addr [v2:127.0.0.1:6830/171423485,v1:127.0.0.1:6831/171423485] compat {c=[1],r=[1],i=[1fff]}]


Related issues 2 (1 open, 1 closed)

Copied to CephFS - Backport #73349: squid: join_fscid is incorrectly reset for active MDS in remaining filesystems when filesystem is removed (QA Testing, Jos Collin)
Copied to CephFS - Backport #73350: tentacle: join_fscid is incorrectly reset for active MDS in remaining filesystems when filesystem is removed (Resolved, Jos Collin)
#1

Updated by ethan wu 6 months ago

  • Pull request ID set to 65640
#2

Updated by Rishabh Dave 6 months ago

  • Status changed from New to Pending Backport
  • Backport set to tentacle,squid

@Venky Shankar I believe we are not taking new reef backports anymore. Is that correct? I've set this ticket only to squid and tentacle.

#3

Updated by Upkeep Bot 6 months ago

  • Status changed from Pending Backport to Resolved
  • Merge Commit set to f8aa4138150778a417f64eea2ec64de55c676dba
  • Fixed In set to v20.3.0-3386-gf8aa413815
  • Upkeep Timestamp set to 2025-10-03T11:20:05+00:00
#4

Updated by Rishabh Dave 6 months ago

  • Status changed from Resolved to Pending Backport
#5

Updated by Upkeep Bot 6 months ago

  • Copied to Backport #73349: squid: join_fscid is incorrectly reset for active MDS in remaining filesystems when filesystem is removed added
#6

Updated by Upkeep Bot 6 months ago

  • Copied to Backport #73350: tentacle: join_fscid is incorrectly reset for active MDS in remaining filesystems when filesystem is removed added
#7

Updated by Upkeep Bot 6 months ago

  • Tags (freeform) set to backport_processed