
qa: failfast mount for better performance and unblock fs volume ls #58547

Merged: vshankar merged 1 commit into ceph:main from mchangir:qa-failfast-mount-for-better-test-performance on Aug 30, 2024
Conversation

@mchangir (Contributor) commented Jul 12, 2024:

During teuthology tests, tearing down the cluster between two tests resets the config and generates a config_notify. This triggers a race in which a new mount is created using the old fscid, but by the time the mount is attempted, a new fs has already been created with a new fscid. The client mount then waits for a connection-completion notification from the mds for 5 minutes (the default timeout) before eventually giving up.
However, the default teuthology command timeout is 2 minutes, so teuthology fails the command and declares the job failed well before the mount can time out.

The resolution is to lower the client mount timeout to 30 seconds so that the config_notify fails fast, paving the way for subsequent commands to be executed against the new fs.

An unhandled cluster warning about an unresponsive client also gets emitted later, during qa job termination, which leads teuthology to declare the job failed. For now this warning seems harmless, since it is emitted during the cluster cleanup phase, so it is added to the log-ignorelist section of the snap-schedule YAML.

Fixes: https://tracker.ceph.com/issues/66009
Signed-off-by: Milind Changire <mchangir@redhat.com>
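
A minimal sketch of what such a teuthology override could look like, assuming the standard qa overrides layout (the exact file, key placement, and surrounding entries in this PR may differ):

overrides:
  ceph:
    conf:
      client:
        # lower the 5-minute default so a mount attempted against a stale fscid fails fast
        client_mount_timeout: 30
    log-ignorelist:
      # harmless warning emitted during the qa cleanup phase; see the review thread below
      - cluster \[WRN\] evicting unresponsive client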

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

Inline review thread on the log-ignorelist addition in the snap-schedule YAML (neighboring entries shown for context):

- is full \(reached quota
- POOL_FULL
- POOL_BACKFILLFULL
- cluster \[WRN\] evicting unresponsive client
Reviewer (Contributor) asked:

You did mention in https://tracker.ceph.com/issues/66009#note-35 that the mount is unresponsive in the mgr. What is the reason for the unresponsiveness?

@mchangir replied on Jul 12, 2024:

That unresponsiveness is due to the mgr respawning without formally closing its client connections before the respawn. The leaked old client connection is then evicted by the mds, which emits a warning in the logs.

Reviewer (Contributor):

oh, and given that we respawn everything after each test, this would show up even more, yes?

@mchangir:

yes, that's right

@mchangir:

should I move this item to a more generic log-ignorelist?

Reviewer (Member):

The mgr respawning is due to enabling/disabling the snap-schedule module, no?

@mchangir:

yeah, but the mgr discards its cephfs clients across respawns, and the mds eventually notices that one of those older, discarded clients has no activity and throws up this warning

@mchangir:

jenkins test make check

1 similar comment:

@mchangir: jenkins test make check

@vshankar requested a review from a team on July 12, 2024 09:43
@mchangir:

jenkins test make check

@mchangir force-pushed the qa-failfast-mount-for-better-test-performance branch from 576c160 to 956e055 on July 12, 2024 16:43
@mchangir:

small correction to the commit message

@batrick (Member) left a review comment:

It seems to me the issue centers around the file system getting deleted / created while the ceph-mgr snap-schedule/volumes are "running". Perhaps the better fix here is to ceph mgr fail whenever we recreate the file system?

@vshankar:

> It seems to me the issue centers around the file system getting deleted / created while the ceph-mgr snap-schedule/volumes are "running". Perhaps the better fix here is to ceph mgr fail whenever we recreate the file system?

That works too - and it's kind of a fix-all for anything in ceph-mgr that's blocked due to a stuck mds request.

@mchangir commented Jul 17, 2024:

> It seems to me the issue centers around the file system getting deleted / created while the ceph-mgr snap-schedule/volumes are "running". Perhaps the better fix here is to ceph mgr fail whenever we recreate the file system?
>
> That works too - and it's kind of a fix-all for anything in ceph-mgr that's blocked due to a stuck mds request.

ceph mgr fail isn't helping
The sequence followed to fail the mgr was:

  • CephfsTestCase.tearDown: fail mgr as the very first thing when tearing down
  • CephfsTestCase.setUp: revive the last active mgr as the very last thing when starting up

@vshankar:

> It seems to me the issue centers around the file system getting deleted / created while the ceph-mgr snap-schedule/volumes are "running". Perhaps the better fix here is to ceph mgr fail whenever we recreate the file system?
>
> That works too - and it's kind of a fix-all for anything in ceph-mgr that's blocked due to a stuck mds request.
>
> ceph mgr fail isn't helping. The sequence followed to fail the mgr was:
>
>   • CephfsTestCase.tearDown: fail mgr as the very first thing when tearing down
>   • CephfsTestCase.setUp: revive the last active mgr as the very last thing when starting up

@mchangir pointed me to https://pulpito.ceph.com/mchangir-2024-07-16_15:51:23-fs-main-distro-default-smithi/ which uses mgr fail to work around the issue. In this failed job: /a/mchangir-2024-07-16_15:51:23-fs-main-distro-default-smithi/7803943

The standby mgr (mgr.y) switches to active:

2024-07-16T16:28:46.816+0000 7f3b9ce00640 10 mgr ms_dispatch2 active (starting) mon_map magic: 0
2024-07-16T16:28:46.816+0000 7f3b9ce00640 10 mgr ms_dispatch2 mon_map magic: 0
2024-07-16T16:28:46.816+0000 7f3b9ce00640 20 mgr handle_mon_map handle_mon_map
2024-07-16T16:28:46.816+0000 7f3b9ce00640  1 -- 172.21.15.26:0/3512266401 --> [v2:172.21.15.26:3300/0,v1:172.21.15.26:6789/0] -- mon_command({"prefix": "mon metadata", "id": "a"} v 0) -- 0x55a3174ba540 con 0x55a316665800
2024-07-16T16:28:46.816+0000 7f3b9ce00640  1 -- 172.21.15.26:0/3512266401 --> [v2:172.21.15.26:3300/0,v1:172.21.15.26:6789/0] -- mon_command({"prefix": "mon metadata", "id": "b"} v 0) -- 0x55a3174ba700 con 0x55a316665800
2024-07-16T16:28:46.816+0000 7f3b9ce00640  1 -- 172.21.15.26:0/3512266401 --> [v2:172.21.15.26:3300/0,v1:172.21.15.26:6789/0] -- mon_command({"prefix": "mon metadata", "id": "c"} v 0) -- 0x55a3174ba8c0 con 0x55a316665800
2024-07-16T16:28:46.816+0000 7f3b9ce00640  1 -- 172.21.15.26:0/3512266401 <== mon.0 v2:172.21.15.26:3300/0 2 ==== config(2 keys) ==== 97+0+0 (secure 0 0 0) 0x55a3174ba000 con 0x55a316665800
2024-07-16T16:28:46.816+0000 7f3b9ce00640  1 -- 172.21.15.26:0/3512266401 <== mon.0 v2:172.21.15.26:3300/0 3 ==== fsmap(e 19) ==== 2893+0+0 (secure 0 0 0) 0x55a3174bc000 con 0x55a316665800
2024-07-16T16:28:46.816+0000 7f3b9ce00640 10 mgr ms_dispatch2 active (starting) fsmap(e 19)
2024-07-16T16:28:46.816+0000 7f3b9ce00640 10 mgr ms_dispatch2 fsmap(e 19)

followed by ceph-mgr starting plugins:

2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting iostat
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting nfs
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting orchestrator
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting pg_autoscaler
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting progress
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting rbd_support
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting restful
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting status
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting telemetry
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting volumes

... and nothing much after that. So, some thread is still not making progress. @mchangir FYI

@mchangir:

my bad ... my code had a typo which resulted in an Exception in the volumes module
following up on the issue and retesting

@mchangir:

btw, the typo was in a much later version of the patch I was testing

@mchangir:

This PR has been superseded by #58771

@vshankar:

> This PR has been superseded by #58771

I had a conversation with @mchangir regarding this alternate approach (PR #58771) and there is a bit more work to get the snap-schedule tests (and other cephfs tests in general) working before that PR can be reviewed. I propose that we take in this change (after running tests, of course) and then work on stabilizing PR #58771 after that. @batrick WDYT?

@vshankar:

jenkins test make check

@vshankar:

This PR is under test in https://tracker.ceph.com/issues/67257.

@vshankar commented Aug 2, 2024:

This PR is under test in https://tracker.ceph.com/issues/67318.

joscollin pushed a commit to joscollin/ceph that referenced this pull request Aug 7, 2024
* refs/pull/58547/head:
	qa: failfast mount for better performance

Reviewed-by: Venky Shankar <vshankar@redhat.com>
vshankar added a commit to vshankar/ceph that referenced this pull request Aug 20, 2024
* refs/pull/58547/head:
	qa: failfast mount for better performance

Reviewed-by: Venky Shankar <vshankar@redhat.com>
@vshankar left a review.

@vshankar:

jenkins test make check

@vshankar:

@mchangir just rebase and push maybe (jenkins test acting up)

@mchangir force-pushed the qa-failfast-mount-for-better-test-performance branch from 956e055 to daf4798 on August 23, 2024 09:36
@mchangir:

> @mchangir just rebase and push maybe (jenkins test acting up)

rebased

@mchangir:

jenkins test make check

@mchangir:

jenkins test api

1 similar comment:

@vshankar:

jenkins test api

@vshankar requested a review from batrick on August 26, 2024 12:16

@mchangir:

@vshankar looks like all the required jobs have passed and this can be merged?

@vshankar:

> @vshankar looks like all the required jobs have passed and this can be merged?

I requested a review from @batrick since a change was requested earlier. Otherwise this is good to go.

@batrick (Member) left a review comment:

This looks good to me.

@batrick commented Aug 29, 2024:

jenkins test make check arm64

Labels: cephfs (Ceph File System), tests
