Log-ignorelist entries under discussion:
- is full \(reached quota
- POOL_FULL
- POOL_BACKFILLFULL
- cluster \[WRN\] evicting unresponsive client
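For context, teuthology treats ignorelist entries as regular expressions matched against cluster log lines; a failed match on a `[WRN]`/`[ERR]` line fails the job. A minimal self-contained sketch of that matching (the entry list mirrors the fragment above; `is_ignored` is an illustrative helper, not teuthology code):

```python
import re

# Illustrative ignorelist entries (regex fragments), mirroring the YAML above.
ignorelist = [
    r"is full \(reached quota",
    r"POOL_FULL",
    r"POOL_BACKFILLFULL",
    r"evicting unresponsive client",
]

def is_ignored(log_line: str) -> bool:
    """Return True if any ignorelist regex matches somewhere in the log line."""
    return any(re.search(pat, log_line) for pat in ignorelist)
```

With this list in place, the `cluster [WRN] evicting unresponsive client ...` line emitted during teardown would match the last entry and no longer fail the job.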
You did mention in https://tracker.ceph.com/issues/66009#note-35 that the mount is unresponsive in the mgr. What is the reason for the unresponsiveness?
That unresponsiveness is due to mgr respawning without formally closing client connections before respawn. The old client connection leak is then evicted by the mds while emitting a warning in the logs.
oh, and given that we respawn everything after each test, this would show up even more, yes?
should I move this item to a more generic log-ignorelist ?
The mgr respawning is due to enabling/disabling the snap-schedule module, no?
yeah, but the mgr discards its cephfs clients across respawns, whereas the mds eventually notices one of those older discarded clients with no activity and emits this warning
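To illustrate the eviction mechanics being described: a toy model (not actual MDS code) where sessions that stop renewing, like the mgr's leaked clients, eventually cross an idle cutoff and get evicted with the warning in question. `SESSION_AUTOCLOSE` stands in for the `mds_session_autoclose` option, whose default is 300 seconds.

```python
SESSION_AUTOCLOSE = 300.0  # seconds; stand-in for the mds_session_autoclose default

class SessionMap:
    """Toy session tracker: clients renew periodically, stale ones get evicted."""

    def __init__(self):
        self.last_seen = {}  # client name -> last renewal timestamp (seconds)

    def renew(self, client, now):
        self.last_seen[client] = now

    def evict_stale(self, now):
        """Evict sessions idle longer than the cutoff; return the evicted names."""
        stale = [c for c, t in self.last_seen.items()
                 if now - t > SESSION_AUTOCLOSE]
        for c in stale:
            del self.last_seen[c]
            print(f"cluster [WRN] evicting unresponsive client {c}")
        return stale
```

In this picture the respawned mgr simply never calls `renew` for its old sessions, so the mds-side sweep evicts them during teardown, which is why the warning shows up late in the job.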
jenkins test make check

jenkins test make check

jenkins test make check
Force-pushed 576c160 to 956e055
small correction to the commit message
batrick left a comment:
It seems to me the issue centers around the file system getting deleted / created while the ceph-mgr snap-schedule/volumes modules are "running". Perhaps the better fix here is to `ceph mgr fail` whenever we recreate the file system?
That works too, and it's kind of a fix-all for anything in ceph-mgr that's blocked due to a stuck mds request.
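A sketch of what that suggestion could look like as a qa-style helper (entirely hypothetical: `run_ceph` and the recreate flow are assumptions made up for illustration; only the CLI subcommands themselves are real `ceph` commands):

```python
def recreate_filesystem(run_ceph, fs_name="cephfs"):
    """Hypothetical helper: tear down and recreate a cephfs, then fail the
    active mgr so mgr modules drop any state tied to the old fscid."""
    run_ceph(["fs", "fail", fs_name])
    run_ceph(["fs", "rm", fs_name, "--yes-i-really-mean-it"])
    run_ceph(["fs", "new", fs_name,
              f"{fs_name}_metadata", f"{fs_name}_data"])
    # The key step from the suggestion: bounce the mgr after recreation so
    # snap-schedule/volumes reconnect against the new fscid instead of
    # blocking on a stale mds session.
    run_ceph(["mgr", "fail"])
```

The design point is that `ceph mgr fail` forces a respawn regardless of which module is wedged, which is why it acts as a fix-all for stuck mds requests in mgr modules.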
@mchangir pointed me to https://pulpito.ceph.com/mchangir-2024-07-16_15:51:23-fs-main-distro-default-smithi/ which shows `The standby mgr (mgr.y) switches to active:` followed by `ceph-mgr starting plugins:` ... and nothing much after that. So, some thread is still not making progress. @mchangir FYI
my bad ... my code had a typo which resulted in an Exception in the volumes module
btw, the typo was in a much later version of the patch I was testing
This PR has been superseded by #58771

I had a conversation with @mchangir regarding this alternate approach (pr #58771) and there is a bit more work needed to get the snap-schedule tests (and other cephfs tests in general) passing before that PR can be reviewed. I propose that we take in this change (after running tests, of course) and then work on stabilizing pr #58771 after that. @batrick WDYT?
jenkins test make check

This PR is under test in https://tracker.ceph.com/issues/67257.

This PR is under test in https://tracker.ceph.com/issues/67318.
* refs/pull/58547/head: qa: failfast mount for better performance Reviewed-by: Venky Shankar <vshankar@redhat.com>
jenkins test make check

@mchangir just rebase and push maybe (jenkins test acting up)
During teuthology tests, the tearing down of the cluster between two tests causes the config to be reset and a config_notify generated. This leads to a race to create a new mount using the old fscid. But by the time the mount is attempted, the new fs has been created with a new fscid. This situation leads to the client mount waiting for a connection completion notification from the mds for 5 minutes (default timeout) and eventually giving up.

However, the default teuthology command timeout is 2 minutes. So, teuthology fails the command and declares the job as failed well before the mount can time out.

The resolution is to lower the client mount timeout to 30 seconds so that the config_notify fails fast, paving the way for successive commands to get executed with the new fs.

An unhandled cluster warning about an unresponsive client also gets emitted later, during qa job termination, which leads to teuthology declaring the job as failed. As of now this warning seems harmless since it is emitted during the cluster cleanup phase. So, this warning is added to the log-ignorelist section in the snap-schedule YAML.

Fixes: https://tracker.ceph.com/issues/66009
Signed-off-by: Milind Changire <mchangir@redhat.com>
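For reference, the timeout change described above would amount to a client conf override along these lines (a sketch; `client_mount_timeout` is a real client option with a 300-second default, but the exact placement in the suite YAML is an assumption):

```yaml
overrides:
  ceph:
    conf:
      client:
        client_mount_timeout: 30
```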
Force-pushed 956e055 to daf4798
rebased

jenkins test make check

jenkins test api
jenkins test api

@vshankar looks like all the required jobs have passed and this can be merged?

jenkins test make check arm64