Skip to content

mds: fix client isn't responding to mclientcaps(revoke)#47752

Merged
vshankar merged 4 commits intoceph:mainfrom
lxbsz:wip-57244
May 15, 2023
Merged

mds: fix client isn't responding to mclientcaps(revoke)#47752
vshankar merged 4 commits intoceph:mainfrom
lxbsz:wip-57244

Conversation

@lxbsz
Copy link
Member

@lxbsz lxbsz commented Aug 23, 2022

When revoking caps from clients and if the clients could release
some of the caps references and the clients still could send cap
update request back to MDS, while the confirm_receipt() will clear
the _revokes list anyway. But this cap will still be kept in revoking_caps
list.

Fixes: https://tracker.ceph.com/issues/57244
Signed-off-by: Xiubo Li xiubli@redhat.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@lxbsz lxbsz requested a review from a team August 23, 2022 09:15
@github-actions github-actions bot added the cephfs Ceph File System label Aug 23, 2022
@vshankar
Copy link
Contributor

vshankar commented Sep 5, 2022

@vshankar
Copy link
Contributor

vshankar commented Sep 6, 2022

@lxbsz -- this change seems to be causing MDS_CLIENT_LATE_RELEASE warnings to be generated. E.g.: https://pulpito.ceph.com/vshankar-2022-09-06_06:13:42-fs-wip-vshankar-testing-20220901-100101-testing-default-smithi/7013838/

@lxbsz
Copy link
Member Author

lxbsz commented Sep 7, 2022

@lxbsz -- this change seems to be causing MDS_CLIENT_LATE_RELEASE warnings to be generated. E.g.: https://pulpito.ceph.com/vshankar-2022-09-06_06:13:42-fs-wip-vshankar-testing-20220901-100101-testing-default-smithi/7013838/

Fixed it by calling calc_issued(); just after adding the revoking back. b118147#diff-e108a3d821c93ed8411c681edfe5adff57ef41ef021e83e0a769aa6c8e6bbcc9R198

We need to calculate the isssued again after that.

@vshankar vshankar added the wip-rishabh-testing Rishabh's testing label label Sep 16, 2022
Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A synthetic test case would be helpful.

I think we also need to have the MDS give better information at dout(0) in the debug log when this situation happens:

  • dump the inode
  • dump the cap list for the inode
  • dump the session holding the cap

@rishabh-d-dave
Copy link
Contributor

@lxbsz @vshankar I haven't included this PR in last QA testing batch since changes have been requested on it. I am removing my testing label. Please let me know when it's ready.

@rishabh-d-dave
Copy link
Contributor

jenkins test make check

@rishabh-d-dave
Copy link
Contributor

jenkins test windows

@lxbsz
Copy link
Member Author

lxbsz commented Oct 8, 2022

A synthetic test case would be helpful.

Not sure how hard it is to reproduce this. Locally I haven't successfully done that yet.

I think we also need to have the MDS give better information at dout(0) in the debug log when this situation happens:

  • dump the inode
  • dump the cap list for the inode
  • dump the session holding the cap

Yeah, I will add this.

@gregsfortytwo
Copy link
Member

bump @vshankar @batrick

@vshankar vshankar added the wip-vshankar-backlog Backlog of CephFS PRs to pick for testing label Mar 2, 2023
@lxbsz
Copy link
Member Author

lxbsz commented Mar 3, 2023

jenkins test api

lxbsz added 4 commits March 8, 2023 21:44
Will add the debug logs later in confirm_receipt().

Fixes: https://tracker.ceph.com/issues/57244
Signed-off-by: Xiubo Li <xiubli@redhat.com>
When revoking caps from clients and if the clients could release
some of the caps references and the clients still could send cap
update request back to MDS, while the confirm_receipt() will clear
the _revokes list anyway.

But this cap will still be kept in revoking_caps list.

At the same time add one debug log when revocation is not totally
finished.

Fixes: https://tracker.ceph.com/issues/57244
Signed-off-by: Xiubo Li <xiubli@redhat.com>
In case:

           mds                             client
	                        - Releases cap and put Inode
  - Increase cap->seq and sends
    revokes req to the client
  - Receives release req and    - Receives & drops the revoke req
    skip removing the cap and
    then eval the CInode and
    issue or revoke caps again.
                                - Receives & drops the caps update
			          or revoke req
  - Health warning for client
    isn't responding to
    mclientcaps(revoke)

Fixes: https://tracker.ceph.com/issues/57244
Signed-off-by: Xiubo Li <xiubli@redhat.com>
When writing to a file and the max_size is approaching the client
will try to trigger to call check_caps() and flush the caps to MDS.
But just in case the MDS is revoking Fsxrw caps, since the client
keeps writing and holding the Fw caps it may only release part of
the caps but the Fw.

Fixes: https://tracker.ceph.com/issues/57244
Signed-off-by: Xiubo Li <xiubli@redhat.com>
@vshankar
Copy link
Contributor

@vshankar
Copy link
Contributor

jenkins test make check arm64

@vshankar
Copy link
Contributor

This would need a retest since there are failing tests in main branch which is being tracked by: https://tracker.ceph.com/issues/59534

@vshankar
Copy link
Contributor

vshankar commented May 9, 2023

Copy link
Contributor

@vshankar vshankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@candychencan
Copy link
Contributor

This change will cause inconsistent cap_id in client:

src/client/Client.cc: 4073: FAILED ceph_assert(cap.cap_id == cap_id)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7fa326d6ade1]
2: (()+0x263008) [0x7fa326d6b008]
3: (Client::add_update_cap(Inode*, MetaSession*, unsigned long, unsigned int, unsigned int, unsigned int, unsigned int, inodeno_t, int, UserPerm const&)+0xae1) [0x559dc29a6c81]
4: (Client::add_update_inode(InodeStat*, utime_t, MetaSession*, UserPerm const&)+0xb73) [0x559dc29ab083]
5: (Client::insert_trace(MetaRequest*, MetaSession*)+0xf2b) [0x559dc29bef8b]
6: (Client::handle_client_reply(boost::intrusive_ptr const&)+0x16f) [0x559dc29c84ff]
7: (Client::ms_dispatch2(boost::intrusive_ptr const&)+0x17f) [0x559dc2a0a5cf]
8: (DispatchQueue::entry()+0x137a) [0x7fa326fa784a]
9: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fa32705fcbd]
10: (()+0x7ea5) [0x7fa32500fea5]
11: (clone()+0x6d) [0x7fa323ee896d]

I have append a picture to description to this question:
https://tracker.ceph.com/issues/57244

Do you have some suggestion to solve this problem? Thanks!

@vshankar
Copy link
Contributor

This change will cause inconsistent cap_id in client:

src/client/Client.cc: 4073: FAILED ceph_assert(cap.cap_id == cap_id) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7fa326d6ade1] 2: (()+0x263008) [0x7fa326d6b008] 3: (Client::add_update_cap(Inode*, MetaSession*, unsigned long, unsigned int, unsigned int, unsigned int, unsigned int, inodeno_t, int, UserPerm const&)+0xae1) [0x559dc29a6c81] 4: (Client::add_update_inode(InodeStat*, utime_t, MetaSession*, UserPerm const&)+0xb73) [0x559dc29ab083] 5: (Client::insert_trace(MetaRequest*, MetaSession*)+0xf2b) [0x559dc29bef8b] 6: (Client::handle_client_reply(boost::intrusive_ptr const&)+0x16f) [0x559dc29c84ff] 7: (Client::ms_dispatch2(boost::intrusive_ptr const&)+0x17f) [0x559dc2a0a5cf] 8: (DispatchQueue::entry()+0x137a) [0x7fa326fa784a] 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fa32705fcbd] 10: (()+0x7ea5) [0x7fa32500fea5] 11: (clone()+0x6d) [0x7fa323ee896d]

I have append a picture to description to this question: https://tracker.ceph.com/issues/57244

Do you have some suggestion to solve this problem? Thanks!

That's why the change was reverted: #51661

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants