Skip to content

mds: check relevant caps for fs include root_squash#57192

Merged
batrick merged 13 commits intoceph:mainfrom
batrick:i65733
May 7, 2024
Merged

mds: check relevant caps for fs include root_squash#57192
batrick merged 13 commits intoceph:mainfrom
batrick:i65733

Conversation

@batrick
Copy link
Member

@batrick batrick commented May 1, 2024

I'm not satisfied with this approach completely:

  • There should be a way to override the MDS and allow the broken root_squash caps to continue working. Evicting all affected clients is not acceptable.
  • Setting the client feature bit on the MDSMap has no effect.

Will think about this more tomorrow and start some discussions with the team.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
  • Component impact
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@batrick batrick added the cephfs Ceph File System label May 1, 2024
@batrick batrick marked this pull request as ready for review May 3, 2024 00:57
@batrick batrick requested review from a team as code owners May 3, 2024 00:57
@batrick batrick requested review from a team, lxbsz and vshankar May 3, 2024 00:57
@batrick
Copy link
Member Author

batrick commented May 3, 2024

This PR is under test in https://tracker.ceph.com/issues/65771.

@github-actions
Copy link

github-actions bot commented May 3, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Copy link
Contributor

@leonid-s-usov leonid-s-usov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A speedy turnaround with this one, great job, Patrick!
As usual, I'm cautious about adding a managed cache to the internal state where const calculation is cheap. We can afford to scan a list of a few tens of sessions once in 5 seconds and thus avoid inflation of the managed state and potential bugs this can bring.

n += bits_per_block;
n /= bits_per_block;
_vec.resize(n, 0);
n += bits_per_block;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: something's wrong with the indentation in this function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll fix it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@batrick batrick requested a review from rishabh-d-dave May 3, 2024 18:27
and a client upgrade is required.

This is a HEALTH_ERR warning because of the danger of inconsistency and lost
data. It is recommended to either upgrade your clients, discontinue using
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lost data

What sort of scenario would this happen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rishabh-d-dave
Copy link
Contributor

@batrick I know it's going to be squashed so it doesn't matter, but sign-off in last commit is missing. Signed-off-by check in CI failed due to it

Copy link
Member

@lxbsz lxbsz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@batrick batrick requested a review from chrisphoffman May 6, 2024 15:22
Comment on lines 583 to +584
self.update_attrs(**kwargs)

retval = self.mount(mntopts=mntopts, check_status=check_status)
retval = self.mount(mntopts=mntopts, check_status=check_status, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing kwargs is redundant because each k-v pair in kwargs are set object attributes by update_attrs()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See commit message in qa: pass kwargs to mount from remount.

# should succeed
with self.assert_cluster_log("with broken root_squash implementation"):
keyring_path = self.mount_a.client_remote.mktemp(data=keyring)
self.mount_a.remount(client_id=self.client_id, client_keyring_path=keyring_path, mntargs=mntargs, cephfs_name=self.fs.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same, will be better if wrap such lines before 81 chars.


CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK = 21
# all but CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK
features = ",".join([str(i) for i in range(CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same, will be better if wrap such lines before 81 chars.


def test_rootsquash_nofeature(self):
"""
That having root_squash on an fs without the feature bit raises an HEALTH_ERR.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How this line would currently look in teuthology.log -

2024-05-01T05:49:57.520 INFO:tasks.cephfs_test_runner:Starting test: test_add_already_in_use_data_pool (tasks.cephfs.test_admin.TestAddDataPool)
That command try to add data pool which is already in use with another fs.

So I suggest slight modification -

Suggested change
That having root_squash on an fs without the feature bit raises an HEALTH_ERR.
Test that having root_squash on an fs without the feature bit raises an HEALTH_ERR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Test ..." is implicit in this construction. I'm using the style begun by John Spray.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, but it's not printed in teuthology.log. I've copied an example from recent run.

Anyways, I don't really mind. Since this PR is urgent, I've approved it regardless of these nits.

# should succeed
keyring_path = self.mount_a.client_remote.mktemp(data=keyring)
self.mount_a.remount(client_id=self.client_id, client_keyring_path=keyring_path, mntargs=mntargs, cephfs_name=self.fs1.name)
with self.assert_cluster_log("report clients with broken root_squash", present=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same, will be better if wrap such lines before 81 chars.

self.mount_a.remount(client_id=self.client_id, client_keyring_path=keyring_path, mntargs=mntargs, cephfs_name=self.fs1.name)
with self.assert_cluster_log("report clients with broken root_squash", present=False):
keyring_path = self.mount_a.client_remote.mktemp(data=keyring)
self.mount_a.remount(client_id=self.client_id, client_keyring_path=keyring_path, mntargs=mntargs, cephfs_name=self.fs1.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same, will be better if wrap such lines before 81 chars.

@batrick
Copy link
Member Author

batrick commented May 6, 2024

Folks, I'm not considering further nits on this PR in the interest of time as this PR is blocking v18.2.3.

@batrick
Copy link
Member Author

batrick commented May 6, 2024

I will be rebasing and squashing fixes now.

For testing purposes.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
batrick added a commit to batrick/ceph that referenced this pull request May 7, 2024
* refs/pull/57192/head:
	PendingReleaseNotes: add note on the client incompatibility health warning and feature bit
	doc/cephfs: add client_mds_auth_caps client feature bit
	doc/cephfs: add missing client feature bits
	doc/cephfs: document MDS_CLIENTS_BROKEN_ROOTSQUASH health error
	qa: add tests for MDS_CLIENTS_BROKEN_ROOTSQUASH
	mds: raise health warning if client lacks feature for root_squash
	mon/MDSMonitor: add note about missing metadata inclusion
	mds: check relevant caps for fs include root_squash
	mds: refactor out fs_name match in MDSAuthCaps
	qa: test for root_squash with multiple caps
	qa: pass kwargs to mount from remount
	qa: simplify update_attrs and only update relevant keys
	client: allow overriding client features
@vshankar
Copy link
Contributor

vshankar commented May 7, 2024

Folks, I'm not considering further nits on this PR in the interest of time as this PR is blocking v18.2.3.

Agreed and +1 for this.

batrick added 12 commits May 7, 2024 08:19
So we can just pass the caller's kwargs to update_attrs.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
So we can pass mntargs.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Where the client has root_squash for one cap but not for another. The fs
without root_squash should not necessarily reject the client.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
When denying client reconnects because the MDS caps include root_squash and the
client features do not include CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK, ensure those
caps are only for the file system the MDS is joined to.

Fixes: https://tracker.ceph.com/issues/65733
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
There is a "client_count" metadata on the health warning that apparently was
intended to be used for aggregating warnings but never was. Add a TODO item for
that.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Rather than evict all clients lacking this feature bit, raise a health error
that pushes the administrator to address it. This avoids the surprise of having
all affected clients suddenly evicted in the cluster.

Fixes: https://tracker.ceph.com/issues/65733
Fixes: 954ed30
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
…rning and feature bit

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
@batrick
Copy link
Member Author

batrick commented May 7, 2024

jenkins test make check arm64

@batrick
Copy link
Member Author

batrick commented May 7, 2024

@batrick
Copy link
Member Author

batrick commented May 7, 2024

jenkins test make check

@batrick
Copy link
Member Author

batrick commented May 7, 2024

jenkins test make check arm64

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants