
mds: prevent deadlocks between quiesce and fragmenting#57250

Closed
leonid-s-usov wants to merge 3 commits into main from
wip-lusov-quiesce-bypass-freezing

Conversation

@leonid-s-usov
Contributor

@leonid-s-usov leonid-s-usov commented May 2, 2024

Quiesce requires revocation of capabilities,
which does not work for freezing/frozen nodes.
Since fragmenting is best effort, abort an ongoing fragmenting
for the sake of a faster quiesce.

Fixes: https://tracker.ceph.com/issues/65716


@leonid-s-usov leonid-s-usov requested a review from batrick May 2, 2024 20:14
@github-actions github-actions bot added cephfs Ceph File System common tests labels May 2, 2024
@leonid-s-usov leonid-s-usov force-pushed the wip-lusov-quiesce-bypass-freezing branch from 9b6adf6 to 00c97f7 Compare May 2, 2024 20:55
@leonid-s-usov
Contributor Author

jenkins test api

@leonid-s-usov leonid-s-usov force-pushed the wip-lusov-quiesce-bypass-freezing branch from 00c97f7 to a74db31 Compare May 2, 2024 22:53
self.mount_a.run_shell_payload("mkdir -p root/sub2")
self.mount_a.write_file("root/sub2/file2", "file2")

self.config_set('mds', 'mds_freeze_delay_ms', '15000') # fragments will spend at least 15 seconds freezing
Member

Commenting on this before I head to bed for the night: did the procedure I outlined in slack not work? Why? I like this config for debugging but I don't think it should be necessary for this test.

Contributor Author

Sorry, I'm not sure which procedure you're referring to; would you mind repeating it?

In this test, the delay is crucial. It reproduces, in a predictable way, the situation where a directory stays freezing for a long time. As a result, I can be sure that if quiesce completes faster than a directory could reach the frozen state, we have overcome the issue in the ticket.
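To make the timing argument concrete, here is a self-contained sketch of the reasoning; only `mds_freeze_delay_ms` is a real config name from the test above, the function and the observed latency are illustrative:

```python
# Sketch of the timing argument above. With mds_freeze_delay_ms = 15000, a
# fragmenting directory stays in the freezing state for at least 15 seconds.
# If a quiesce op completes in well under that, it cannot have waited for the
# directory to become frozen - it must have bypassed or aborted fragmenting.

FREEZE_DELAY_MS = 15_000  # mds_freeze_delay_ms as set by the test

def quiesce_waited_for_frozen(quiesce_latency_ms: int,
                              freeze_delay_ms: int = FREEZE_DELAY_MS) -> bool:
    """True only if the observed quiesce latency is compatible with having
    waited out the configured freeze delay."""
    return quiesce_latency_ms >= freeze_delay_ms

# A quiesce finishing in ~2s while the delay is 15s demonstrates the bypass.
print(quiesce_waited_for_frozen(2_000))  # False: quiesce did not wait
```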

Member

https://ceph-storage.slack.com/archives/D05LGDB5DQE/p1714661056831499?thread_ts=1714581766.305309&cid=D05LGDB5DQE

Here is that test that works for me:

    def test_quiesce_parent_frag(self):
        """
        """

        self._configure_subvolume()

        self.mount_a.run_shell_payload("for i in `seq 1 100`; do mkdir -p foo/$i; done")

        path_parent = self.mount_a.cephfs_mntpt + "/foo"

        path_q1 = path_parent + "/1"
        J = self.fs.rank_tell("quiesce", "path", path_q1)
        log.debug(f"{J}")
        reqid = self._reqid_tostr(J['op']['reqid'])
        self._wait_for_quiesce_complete(reqid)

        self.fs.rank_tell("dirfrag", "split", path_parent, "0/0", "1")

        path_q2 = path_parent + "/2"
        J = self.fs.rank_tell("quiesce", "path", path_q2)
        log.debug(f"{J}")
        reqid = self._reqid_tostr(J['op']['reqid'])
        self._wait_for_quiesce_complete(reqid) # will fail without fix

        # TODO: verify dir is fragmented

and it can go in the TestQuiesce class because it does not require multiple actives. I believe this is simpler, no?

Contributor Author

I don't find it simpler, TBH, though it's similar. Additionally, my version predictably goes into an endless balancing tug of war with as few as two directories and two files, which I think makes for a stable approach.

Finally, my current test can be moved to TestQuiesce as is, too; it doesn't depend on the pinning.

Member

I'm puzzled you don't find this simpler. It conceptually is three operations:

  • quiesce one dir below a parent
  • start fragmenting the parent (will reliably block because step 1 holds auth pins for the duration of the quiesce)
  • quiesce another dir below the parent

Your test relies on delaying freezing, configuration-driven fragmentation, and numerous quiesce ops.

Contributor Author

@batrick I have combined our approaches. Thanks for the trick with two quiesce ops under the same parent; that's the key. I used my infinite rebalancing approach and your double quiesce approach together in a single test with just two subdirectories, and it works well to reproduce the issue. I will push the new version shortly for your review.

@leonid-s-usov
Contributor Author

jenkins test api

@leonid-s-usov
Contributor Author

jenkins test windows

@leonid-s-usov
Contributor Author

This is under test: with-quiesce (thrasher) x3,
and the functional quiesce x10.

@leonid-s-usov leonid-s-usov force-pushed the wip-lusov-quiesce-bypass-freezing branch from a74db31 to 7669374 Compare May 7, 2024 18:38
Member

@batrick batrick left a comment

Otherwise, the bypass_freezing changes look good to me.


@leonid-s-usov leonid-s-usov force-pushed the wip-lusov-quiesce-bypass-freezing branch from 7669374 to 89f0b13 Compare May 9, 2024 01:41
@leonid-s-usov leonid-s-usov requested a review from batrick May 9, 2024 01:43
@leonid-s-usov leonid-s-usov changed the title mds: add want_bypass_freezing flag to MDR and set it for quiesce ops mds: prevent deadlocks between quiesce and fragmenting May 9, 2024
@leonid-s-usov
Contributor Author

Here's what the merge detector traces look like:

2024-05-09T09:06:02.553+0300 16b573000  5 mds.0.cache dispatch_quiesce_path: dispatching

2024-05-09T09:06:02.554+0300 16b573000 10 mds.0.locker  can't auth_pin (freezing?), waiting to authpin [dentry #0x1/volumes/_nogroup/subvol_quiesce/ff03569a-fe67-4aa5-8c6c-20efafed0e7d/root/sub2 [2,head] auth (dversion lock) v=14 ino=0x100000007d9 state=1073741826 | request=0 lock=0 fragmenting=1 inodepin=1 dirty=0 authpin=0 0x107601d40]
2024-05-09T09:06:02.554+0300 16b573000 10 mds.0.locker  can't auth_pin dir: [dir 0x100000005e2.01* /volumes/_nogroup/subvol_quiesce/ff03569a-fe67-4aa5-8c6c-20efafed0e7d/root/ [2,head] auth v=18 cv=18/18 state=1074024457|complete|frozendir|fragmenting|dnpinnedfrag f(v0 1=0+1) n(v1 rc2024-05-09T09:03:19.022916+0300 b10 2=1+1) hs=1+0,ss=0+0 | child=1 frozen=1 dirty=0 authpin=0 0x109e15f80]
2024-05-09T09:06:02.554+0300 16b573000 10 mds.0.cache traverse: failed to rdlock (dn sync) [dentry #0x1/volumes/_nogroup/subvol_quiesce/ff03569a-fe67-4aa5-8c6c-20efafed0e7d/root/sub2 [2,head] auth (dversion lock) v=14 ino=0x100000007d9 state=1073741826 | request=0 lock=0 fragmenting=1 inodepin=1 dirty=0 authpin=0 0x107601d40]
2024-05-09T09:06:02.554+0300 16b573000 20 mds.0.cache abort_fragmenting_if_merging: will check for fragmenting ops
2024-05-09T09:06:02.554+0300 16b573000 20 mds.0.cache abort_fragmenting_if_merging: found a fragmenting dir [dir 0x100000005e2.01* /volumes/_nogroup/subvol_quiesce/ff03569a-fe67-4aa5-8c6c-20efafed0e7d/root/ [2,head] auth v=18 cv=18/18 state=1074024457|complete|frozendir|fragmenting|dnpinnedfrag f(v0 1=0+1) n(v1 rc2024-05-09T09:03:19.022916+0300 b10 2=1+1) hs=1+0,ss=0+0 | child=1 frozen=1 dirty=0 waiter=1 authpin=0 0x109e15f80]
2024-05-09T09:06:02.554+0300 16b573000 20 mds.0.cache abort_fragmenting_if_merging: detected dir merge, dirfrag 0x100000005e2 contains my dirfrag 0x100000005e2.01*
2024-05-09T09:06:02.554+0300 16b573000 10 mds.0.cache abort_fragmenting_if_merging: aborting the merge op request(mds.0:188 nref=3)

2024-05-09T09:06:02.554+0300 16bd1b000  7 mds.0.cache fragment_frozen 0x100000005e2 must have aborted
2024-05-09T09:06:02.554+0300 16bd1b000  7 mds.0.cache request_finish request(mds.0:188 nref=4)
2024-05-09T09:06:02.554+0300 16bd1b000 15 mds.0.cache request_cleanup request(mds.0:188 nref=4)

2024-05-09T09:06:02.555+0300 16bd1b000 10 MDSContext::complete: 18C_MDS_RetryRequest
2024-05-09T09:06:02.555+0300 16bd1b000  5 mds.0.cache dispatch_quiesce_path: dispatching

2024-05-09T09:06:02.555+0300 16bd1b000 10 mds.0.locker  auth_pinning [dentry #0x1/volumes/_nogroup/subvol_quiesce/ff03569a-fe67-4aa5-8c6c-20efafed0e7d/root/sub2 [2,head] auth (dversion lock) v=14 ino=0x100000007d9 state=1073741824 | request=0 lock=0 fragmenting=0 inodepin=1 dirty=0 authpin=0 0x107601d40]

@leonid-s-usov leonid-s-usov force-pushed the wip-lusov-quiesce-bypass-freezing branch 2 times, most recently from ecb5a06 to 1b6a7dd Compare May 10, 2024 22:02
@leonid-s-usov
Contributor Author

jenkins test make check

@leonid-s-usov leonid-s-usov force-pushed the wip-lusov-quiesce-bypass-freezing branch 3 times, most recently from 7802755 to e263233 Compare May 13, 2024 05:51
@leonid-s-usov
Contributor Author

jenkins test make check

@leonid-s-usov
Contributor Author

This was tested together with #57332 in https://pulpito.ceph.com/leonidus-2024-05-13_05:53:33-fs-wip-lusov-quiesce-distro-default-smithi/. The results are promising: the only two quiesce timeouts are instances of a different issue, https://tracker.ceph.com/issues/65977.

The EMEDIUMTYPE errors are timeouts from the teuthology command runner and are usually signs of unrelated issues.

[leonidus@vossi04 leonidus-2024-05-13_05:53:33-fs-wip-lusov-quiesce-distro-default-smithi]$ grep "ERROR:tasks.quiescer" */teuthology.log
7704534/teuthology.log:2024-05-13T06:56:08.322 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't quiesce root with rc: 110 (ETIMEDOUT), stdout:
7704534/teuthology.log:2024-05-13T06:56:08.323 ERROR:tasks.quiescer.fs.[cephfs]:exception:
7704542/teuthology.log:2024-05-13T07:29:20.358 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't parse response with error Expecting value: line 1 column 1 (char 0); rc: 124 (EMEDIUMTYPE); stdout:
7704542/teuthology.log:2024-05-13T07:29:20.358 ERROR:tasks.quiescer.fs.[cephfs]:exception:
7704543/teuthology.log:2024-05-13T07:34:41.685 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't parse response with error Expecting value: line 1 column 1 (char 0); rc: 124 (EMEDIUMTYPE); stdout:
7704543/teuthology.log:2024-05-13T07:34:41.686 ERROR:tasks.quiescer.fs.[cephfs]:exception:
7704544/teuthology.log:2024-05-13T06:38:37.093 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't parse response with error Expecting value: line 1 column 1 (char 0); rc: 124 (EMEDIUMTYPE); stdout:
7704544/teuthology.log:2024-05-13T06:38:37.093 ERROR:tasks.quiescer.fs.[cephfs]:exception:
7704563/teuthology.log:2024-05-13T06:50:21.065 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't quiesce root with rc: 110 (ETIMEDOUT), stdout:
7704563/teuthology.log:2024-05-13T06:50:21.065 ERROR:tasks.quiescer.fs.[cephfs]:exception:

This reverts a9964a7
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Quiesce requires revocation of capabilities,
which does not work for freezing/frozen nodes.
Since fragmenting is best effort, abort an ongoing fragmenting
for the sake of a faster quiesce.

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Fixes: https://tracker.ceph.com/issues/65716
@leonid-s-usov leonid-s-usov force-pushed the wip-lusov-quiesce-bypass-freezing branch from e263233 to 164db2d Compare May 14, 2024 13:11
@leonid-s-usov leonid-s-usov requested a review from a team May 15, 2024 11:14
@leonid-s-usov
Contributor Author

@batrick please approve for merge. See #57332 (comment)

Member

@batrick batrick left a comment

Notes from our discussion:

  • Please open a tracker to check that the fragmentation/merging is occurring between the quiesce calls. (I feel this one is tricky because you want to check that fragmentation starts after the first quiesce but before the second. That's one reason I think the test I originally provided is a good addition to your changes to test_quiesce_dir_fragment.)

  • Please update the commit message for revert: mds: provide mechanism to authpin while freezing to explain why this approach didn't work.

No code changes should be required.

@leonid-s-usov
Copy link
Contributor Author

I am closing this PR and will submit the three commits via #57332. These two PRs have a dependency, and it's easier to manage them as a single stream of commits.

Please update the commit message for revert: mds: provide a mechanism to authpin while freezing to explain why this approach didn't work.

I'll apply the requested change in the next revision of #57332

@leonid-s-usov leonid-s-usov deleted the wip-lusov-quiesce-bypass-freezing branch May 16, 2024 06:59
@leonid-s-usov
Copy link
Contributor Author

  • Please open a tracker to check that the fragmentation/merging is occurring between the quiesce calls. (I feel this one is tricky because you want to check that fragmentation starts after the first quiesce but before the second. That's one reason I think the test I originally provided is a good addition to your changes to test_quiesce_dir_fragment.)

@batrick
https://tracker.ceph.com/issues/66074
