
qa: failfast mount for better performance and unblock fs volume ls #58547

Merged: vshankar merged 1 commit into ceph:main from mchangir:qa-failfast-mount-for-better-test-performance on Aug 30, 2024
Conversation

@mchangir (Contributor) commented Jul 12, 2024:

During teuthology tests, tearing down the cluster between two tests resets the config and generates a config_notify. This triggers a race in which a new mount is created using the old fscid, but by the time the mount is attempted, a new fs has already been created with a new fscid. The client mount then waits for a connection-completion notification from the mds for 5 minutes (the default timeout) before eventually giving up.
However, the default teuthology command timeout is 2 minutes, so teuthology fails the command and declares the job failed well before the mount can time out.

The resolution is to lower the client mount timeout to 30 seconds so that the config_notify fails fast, paving the way for subsequent commands to be executed against the new fs.

An unhandled cluster warning about an unresponsive client also gets emitted later, during qa job termination, which leads teuthology to declare the job failed. For now this warning seems harmless, since it is emitted during the cluster cleanup phase, so it is added to the log-ignorelist section of the snap-schedule YAML.

Fixes: https://tracker.ceph.com/issues/66009
Signed-off-by: Milind Changire <mchangir@redhat.com>
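
A minimal sketch of what such a teuthology override could look like, assuming the standard qa overrides layout (the exact file, key placement, and surrounding entries in this PR may differ):

overrides:
  ceph:
    conf:
      client:
        # lower the 5-minute default so a mount attempted against a stale fscid fails fast
        client_mount_timeout: 30
    log-ignorelist:
      # harmless warning emitted during the qa cleanup phase; see the review thread below
      - cluster \[WRN\] evicting unresponsive client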

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

Inline review thread on the log-ignorelist addition in the snap-schedule YAML (neighboring entries shown for context):

- is full \(reached quota
- POOL_FULL
- POOL_BACKFILLFULL
- cluster \[WRN\] evicting unresponsive client
Reviewer (Contributor) asked:

You did mention in https://tracker.ceph.com/issues/66009#note-35 that the mount is unresponsive in the mgr. What is the reason for the unresponsiveness?

@mchangir replied on Jul 12, 2024:

That unresponsiveness is due to the mgr respawning without formally closing its client connections before the respawn. The leaked old client connection is then evicted by the mds, which emits a warning in the logs.

Reviewer (Contributor):

oh, and given that we respawn everything after each test, this would show up even more, yes?

@mchangir:

yes, that's right

@mchangir:

should I move this item to a more generic log-ignorelist?

Reviewer (Member):

The mgr respawning is due to enabling/disabling the snap-schedule module, no?

@mchangir:

yeah, but the mgr discards its cephfs clients across respawns, and the mds eventually notices that one of those older, discarded clients has no activity and throws up this warning

@mchangir:

jenkins test make check

1 similar comment:

@mchangir: jenkins test make check

@vshankar requested a review from a team on July 12, 2024 09:43
@mchangir:

jenkins test make check

@mchangir force-pushed the qa-failfast-mount-for-better-test-performance branch from 576c160 to 956e055 on July 12, 2024 16:43
@mchangir:

small correction to the commit message

@batrick (Member) left a review comment:

It seems to me the issue centers around the file system getting deleted / created while the ceph-mgr snap-schedule/volumes are "running". Perhaps the better fix here is to ceph mgr fail whenever we recreate the file system?

@vshankar:

> It seems to me the issue centers around the file system getting deleted / created while the ceph-mgr snap-schedule/volumes are "running". Perhaps the better fix here is to ceph mgr fail whenever we recreate the file system?

That works too - and it's kind of a fix-all for anything in ceph-mgr that's blocked due to a stuck mds request.

@mchangir commented Jul 17, 2024:

> It seems to me the issue centers around the file system getting deleted / created while the ceph-mgr snap-schedule/volumes are "running". Perhaps the better fix here is to ceph mgr fail whenever we recreate the file system?
>
> That works too - and it's kind of a fix-all for anything in ceph-mgr that's blocked due to a stuck mds request.

ceph mgr fail isn't helping
The sequence followed to fail the mgr was:

  • CephfsTestCase.tearDown: fail mgr as the very first thing when tearing down
  • CephfsTestCase.setUp: revive the last active mgr as the very last thing when starting up

@vshankar:

> It seems to me the issue centers around the file system getting deleted / created while the ceph-mgr snap-schedule/volumes are "running". Perhaps the better fix here is to ceph mgr fail whenever we recreate the file system?
>
> That works too - and it's kind of a fix-all for anything in ceph-mgr that's blocked due to a stuck mds request.
>
> ceph mgr fail isn't helping. The sequence followed to fail the mgr was:
>
>   • CephfsTestCase.tearDown: fail mgr as the very first thing when tearing down
>   • CephfsTestCase.setUp: revive the last active mgr as the very last thing when starting up

@mchangir pointed me to https://pulpito.ceph.com/mchangir-2024-07-16_15:51:23-fs-main-distro-default-smithi/ which uses mgr fail to work around the issue. In this failed job: /a/mchangir-2024-07-16_15:51:23-fs-main-distro-default-smithi/7803943

The standby mgr (mgr.y) switches to active:

2024-07-16T16:28:46.816+0000 7f3b9ce00640 10 mgr ms_dispatch2 active (starting) mon_map magic: 0
2024-07-16T16:28:46.816+0000 7f3b9ce00640 10 mgr ms_dispatch2 mon_map magic: 0
2024-07-16T16:28:46.816+0000 7f3b9ce00640 20 mgr handle_mon_map handle_mon_map
2024-07-16T16:28:46.816+0000 7f3b9ce00640  1 -- 172.21.15.26:0/3512266401 --> [v2:172.21.15.26:3300/0,v1:172.21.15.26:6789/0] -- mon_command({"prefix": "mon metadata", "id": "a"} v 0) -- 0x55a3174ba540 con 0x55a316665800
2024-07-16T16:28:46.816+0000 7f3b9ce00640  1 -- 172.21.15.26:0/3512266401 --> [v2:172.21.15.26:3300/0,v1:172.21.15.26:6789/0] -- mon_command({"prefix": "mon metadata", "id": "b"} v 0) -- 0x55a3174ba700 con 0x55a316665800
2024-07-16T16:28:46.816+0000 7f3b9ce00640  1 -- 172.21.15.26:0/3512266401 --> [v2:172.21.15.26:3300/0,v1:172.21.15.26:6789/0] -- mon_command({"prefix": "mon metadata", "id": "c"} v 0) -- 0x55a3174ba8c0 con 0x55a316665800
2024-07-16T16:28:46.816+0000 7f3b9ce00640  1 -- 172.21.15.26:0/3512266401 <== mon.0 v2:172.21.15.26:3300/0 2 ==== config(2 keys) ==== 97+0+0 (secure 0 0 0) 0x55a3174ba000 con 0x55a316665800
2024-07-16T16:28:46.816+0000 7f3b9ce00640  1 -- 172.21.15.26:0/3512266401 <== mon.0 v2:172.21.15.26:3300/0 3 ==== fsmap(e 19) ==== 2893+0+0 (secure 0 0 0) 0x55a3174bc000 con 0x55a316665800
2024-07-16T16:28:46.816+0000 7f3b9ce00640 10 mgr ms_dispatch2 active (starting) fsmap(e 19)
2024-07-16T16:28:46.816+0000 7f3b9ce00640 10 mgr ms_dispatch2 fsmap(e 19)

followed by ceph-mgr starting plugins:

2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting iostat
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting nfs
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting orchestrator
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting pg_autoscaler
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting progress
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting rbd_support
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting restful
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting status
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting telemetry
2024-07-16T16:28:46.824+0000 7f3b48a00640  4 mgr[py] Starting volumes

... and nothing much after that. So, some thread is still not making progress. @mchangir FYI

@mchangir:

my bad ... my code had a typo which resulted in an Exception in the volumes module
following up on the issue and retesting

@mchangir:

btw, the typo was in a much later version of the patch I was testing

@mchangir:

This PR has been superseded by #58771

@vshankar:

> This PR has been superseded by #58771

I had a conversation with @mchangir regarding this alternate approach (PR #58771) and there is a bit more work to get the snap-schedule tests (and other cephfs tests in general) working before that PR can be reviewed. I propose that we take in this change (after running tests, of course) and then work on stabilizing PR #58771 after that. @batrick WDYT?

@vshankar:

jenkins test make check

@vshankar:

This PR is under test in https://tracker.ceph.com/issues/67257.

@vshankar commented Aug 2, 2024:

This PR is under test in https://tracker.ceph.com/issues/67318.

joscollin pushed a commit to joscollin/ceph that referenced this pull request Aug 7, 2024
* refs/pull/58547/head:
	qa: failfast mount for better performance

Reviewed-by: Venky Shankar <vshankar@redhat.com>
vshankar added a commit to vshankar/ceph that referenced this pull request Aug 20, 2024
* refs/pull/58547/head:
	qa: failfast mount for better performance

Reviewed-by: Venky Shankar <vshankar@redhat.com>
@vshankar left a review.

@vshankar:

jenkins test make check

@vshankar:

@mchangir just rebase and push maybe (jenkins test acting up)

@mchangir force-pushed the qa-failfast-mount-for-better-test-performance branch from 956e055 to daf4798 on August 23, 2024 09:36
@mchangir:

> @mchangir just rebase and push maybe (jenkins test acting up)

rebased

@mchangir:

jenkins test make check

@mchangir:

jenkins test api

1 similar comment:

@vshankar:

jenkins test api

@vshankar requested a review from batrick on August 26, 2024 12:16

@mchangir:

@vshankar looks like all the required jobs have passed and this can be merged?

@vshankar:

> @vshankar looks like all the required jobs have passed and this can be merged?

I requested a review from @batrick since a change was requested earlier. Otherwise this is good to go.

@batrick (Member) left a review comment:

This looks good to me.

@batrick commented Aug 29, 2024:

jenkins test make check arm64

Labels: cephfs (Ceph File System), tests
