Skip to content

mds/cache: don't assume non-auth xlocks to be remote locks#57020

Merged
batrick merged 1 commit intomainfrom
wip-lusov-foreign-xlock
Apr 30, 2024
Merged

mds/cache: don't assume non-auth xlocks to be remote locks#57020
batrick merged 1 commit intomainfrom
wip-lusov-foreign-xlock

Conversation

@leonid-s-usov
Copy link
Contributor

@leonid-s-usov leonid-s-usov commented Apr 21, 2024

A few places in the code assumed that non-auth xlocks
must be remote, which prevented a proper drop lock procedure
when those locks turned out to be locallocks.

Fixes: https://tracker.ceph.com/issues/65606

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@github-actions github-actions bot added the cephfs Ceph File System label Apr 21, 2024
@leonid-s-usov leonid-s-usov requested review from a team and batrick April 21, 2024 19:18
@petrutlucian94
Copy link
Contributor

jenkins test windows

@leonid-s-usov leonid-s-usov force-pushed the wip-lusov-foreign-xlock branch from 4386926 to e3f10dc Compare April 22, 2024 09:35
@leonid-s-usov
Copy link
Contributor Author

We know we're on the right track by seeing all tests fail after the first fix: https://pulpito.ceph.com/leonidus-2024-04-22_05:52:53-fs-wip-lusov-quiescer-distro-default-smithi/ :D

I will be running again soon with the second fix in drop_locks, pending on the shaman to publish the builds. The failures above show how often it was the case that we didn't properly release the local xlock by stripping it in request_drop_foreign_locks, leaving requests to wait forever.

@leonid-s-usov leonid-s-usov requested a review from a team April 22, 2024 15:00
@leonid-s-usov
Copy link
Contributor Author

These changes have been validated by https://pulpito.ceph.com/leonidus-2024-04-22_12:36:42-fs-wip-lusov-quiescer-distro-default-smithi/. That run shows a few failures, but no signs of what this PR should have prevented, as opposed to the prior similar runs.

However, to be on the safe side, this PR should be run separately as part of the fs workload to make sure it doesn't break anything.

@leonid-s-usov leonid-s-usov force-pushed the wip-lusov-foreign-xlock branch from e3f10dc to 6c9a4ad Compare April 22, 2024 15:31
@leonid-s-usov leonid-s-usov changed the title mds/cache: request_drop_foreign_locks shouldn't strip local xlocks mds/cache: non-auth xlocks may be locallocks Apr 22, 2024
@leonid-s-usov leonid-s-usov changed the title mds/cache: non-auth xlocks may be locallocks mds/cache: don't assume non-auth xlocks as remote Apr 22, 2024
@leonid-s-usov leonid-s-usov force-pushed the wip-lusov-foreign-xlock branch from 6c9a4ad to e0784bd Compare April 22, 2024 15:44
A few places in the code assumed that non-auth xlocks
must be remote, which prevented a proper drop lock procedure
when those locks turned out to be locallocks.

Fixes: https://tracker.ceph.com/issues/65606
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
@leonid-s-usov leonid-s-usov force-pushed the wip-lusov-foreign-xlock branch from e0784bd to 3aa055d Compare April 22, 2024 15:45
@leonid-s-usov leonid-s-usov changed the title mds/cache: don't assume non-auth xlocks as remote mds/cache: don't assume non-auth xlocks to be remote locks Apr 22, 2024
@batrick
Copy link
Member

batrick commented Apr 22, 2024

I'll add this to my next qa run.

Copy link
Contributor

@mchangir mchangir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are talking about locks ...

  static const uint64_t WAIT_RD          = (1<<0);  // to read
  static const uint64_t WAIT_WR          = (1<<1);  // to write
  static const uint64_t WAIT_XLOCK       = (1<<2);  // to xlock   (** dup)
  static const uint64_t WAIT_STABLE      = (1<<2);  // for a stable state
  static const uint64_t WAIT_REMOTEXLOCK = (1<<3);  // for a remote xlock
  static const int WAIT_BITS        = 4;
  static const uint64_t WAIT_ALL         = ((1<<WAIT_BITS)-1);

why do WAIT_XLOCK and WAIT_STABLE occupy the same bit pos ?

I found this commit, but no explanation about this specific change:

commit a6f5abd95e80a2137c1e3e463fb4bdbcc95e49d2
Author: Sage Weil <sweil@redhat.com>
Date:   Tue Jun 19 16:11:50 2007 +0000

    * force trim of replicated null dentries that sync to non-null
    * fixed authpinnable waits in server (now wait only if frozen; locker->acquire_locks will wait while freezing, and handle auth_pins properly)
    
    
    git-svn-id: https://ceph.svn.sf.net/svnroot/ceph@1428 29311d96-e01e-0410-9327-a35deaab8ce9

@leonid-s-usov
Copy link
Contributor Author

leonid-s-usov commented Apr 24, 2024

why do WAIT_XLOCK and WAIT_STABLE occupy the same bit pos ?

I think it's a combination of history, limited number of wait bits (until Patrick's change to a 128 bitmask in the context of the quiesce project), and the fact that up until now these two wait bits weren't ever needed as separate wait tags. WAIT_STABLE is a superset of the functionality from the perspective of mirrored locks (SimpleLock, ScatterLock, FileLock), and then locallock which also supports xlocking doesn't have the "stable" notion. I think that's what's pinning this wait bit define.

In a few places around Locker.cc one can still see a mask like ScatterLock::WAIT_XLOCK|ScatterLock::WAIT_WR|ScatterLock::WAIT_STABLE which are of course superfluous given that two of the three operands define the same bit, which suggests that maybe historically they were distinct bits until someone realized they can harmlessly occupy the same one

P.S.: this question is not related to the PR

@mchangir
Copy link
Contributor

P.S.: this question is not related to the PR

thanks for answering this

@leonid-s-usov
Copy link
Contributor Author

Jenkins test windows

@batrick
Copy link
Member

batrick commented Apr 25, 2024

jenkins test windows

@batrick
Copy link
Member

batrick commented Apr 25, 2024

jenkins test make check arm64

@batrick
Copy link
Member

batrick commented Apr 25, 2024

This PR is under test in https://tracker.ceph.com/issues/65661.

Copy link
Contributor

@mchangir mchangir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

batrick added a commit to batrick/ceph that referenced this pull request Apr 29, 2024
* refs/pull/57020/head:
	mds/cache: don't assume non-auth xlocks to be remote locks
@batrick
Copy link
Member

batrick commented Apr 29, 2024

jenkins test make check arm64

@batrick
Copy link
Member

batrick commented Apr 29, 2024

This PR is under test in https://tracker.ceph.com/issues/65694.

@batrick
Copy link
Member

batrick commented Apr 30, 2024

@batrick batrick merged commit 5658db7 into main Apr 30, 2024
@batrick batrick deleted the wip-lusov-foreign-xlock branch April 30, 2024 16:21
leonid-s-usov added a commit that referenced this pull request May 1, 2024
A few places in the code assumed that non-auth xlocks
must be remote, which prevented a proper drop lock procedure
when those locks turned out to be locallocks.

Fixes: https://tracker.ceph.com/issues/65710
Original-Issue: https://tracker.ceph.com/issues/65606
Original-PR: #57020
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
(cherry picked from commit 3aa055d)
leonid-s-usov added a commit that referenced this pull request May 1, 2024
A few places in the code assumed that non-auth xlocks
must be remote, which prevented a proper drop lock procedure
when those locks turned out to be locallocks.

Fixes: https://tracker.ceph.com/issues/65710
Original-Issue: https://tracker.ceph.com/issues/65606
Original-PR: #57020
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
(cherry picked from commit 3aa055d)
mkogan1 pushed a commit to mkogan1/ceph that referenced this pull request Aug 7, 2024
A few places in the code assumed that non-auth xlocks
must be remote, which prevented a proper drop lock procedure
when those locks turned out to be locallocks.

Fixes: https://tracker.ceph.com/issues/65710
Original-Issue: https://tracker.ceph.com/issues/65606
Original-PR: ceph#57020
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
(cherry picked from commit 3aa055d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants