
qa/tasks/mgr: clean crash reports before waiting for clean #42438

Merged
neha-ojha merged 1 commit into ceph:master from tchaikov:wip-qa-test_module_selftest
Jul 23, 2021

Conversation

@tchaikov
Contributor

otherwise, we have the following warning in the health report:

{"status":"HEALTH_WARN","checks":{"RECENT_MGR_MODULE_CRASH":{"severity":"HEALTH_WARN","summary":{"message":"1 mgr modules have recently crashed","count":1},"muted":false}},"mutes":[]}

and it does not disappear after the test waits for 30 seconds,
so the tasks.mgr.test_module_selftest.TestModuleSelftest test
fails like:

2021-07-21T09:59:52.560 INFO:tasks.cephfs_test_runner:======================================================================
2021-07-21T09:59:52.561 INFO:tasks.cephfs_test_runner:ERROR: test_module_commands (tasks.mgr.test_module_selftest.TestModuleSelftest)
2021-07-21T09:59:52.561 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-07-21T09:59:52.561 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2021-07-21T09:59:52.562 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_6a5d5abc027f706687dec92f92ff6fc6f074d2ae/qa/tasks/mgr/test_module_selftest.py", line 201, in test_module_commands
2021-07-21T09:59:52.562 INFO:tasks.cephfs_test_runner:    self.wait_for_health_clear(timeout=30)
2021-07-21T09:59:52.562 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_6a5d5abc027f706687dec92f92ff6fc6f074d2ae/qa/tasks/ceph_test_case.py", line 172, in wait_for_health_clear
2021-07-21T09:59:52.563 INFO:tasks.cephfs_test_runner:    self.wait_until_true(is_clear, timeout)
2021-07-21T09:59:52.563 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_6a5d5abc027f706687dec92f92ff6fc6f074d2ae/qa/tasks/ceph_test_case.py", line 209, in wait_until_true
2021-07-21T09:59:52.563 INFO:tasks.cephfs_test_runner:    raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
2021-07-21T09:59:52.564 INFO:tasks.cephfs_test_runner:tasks.ceph_test_case.TestTimeoutError: Timed out after 30s and 0 retries
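
The traceback ends in a polling helper that gives up once its timeout elapses. A minimal sketch of that pattern follows. It is not the code from qa/tasks/ceph_test_case.py; `PollTimeoutError`, the `period` default, and the injectable `sleep`/`clock` parameters are hypothetical names chosen for illustration.

```python
import time


class PollTimeoutError(RuntimeError):
    """Hypothetical stand-in for the TestTimeoutError seen in the traceback."""


def wait_until_true(condition, timeout, period=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll condition() until it returns True or `timeout` seconds elapse.

    Returns the number of retries that were needed. `sleep` and `clock`
    are injectable so the helper can be exercised without real waiting.
    """
    start = clock()
    retries = 0
    while not condition():
        elapsed = clock() - start
        if elapsed >= timeout:
            raise PollTimeoutError(
                "Timed out after {0:.0f}s and {1} retries".format(
                    elapsed, retries))
        sleep(period)
        retries += 1
    return retries
```

If the condition can never become true on its own, as with a crash report that stays in the cluster, a helper like this always runs into its timeout, no matter how long that timeout is.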

in this change, the crash reports are pruned right after
we see the warning, so that we can have a clean health
report.

Signed-off-by: Kefu Chai <kchai@redhat.com>
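
For readers who want to inspect a health report like the one quoted above, here is a small sketch that parses the JSON and lists the checks it carries. The helper names `check_names` and `is_clear` are hypothetical, and treating "clear" as "no checks listed" is an assumption made for this sketch, not something confirmed by the source.

```python
import json

# The health report quoted in the description, verbatim.
HEALTH_JSON = (
    '{"status":"HEALTH_WARN","checks":{"RECENT_MGR_MODULE_CRASH":'
    '{"severity":"HEALTH_WARN","summary":{"message":'
    '"1 mgr modules have recently crashed","count":1},"muted":false}},'
    '"mutes":[]}'
)


def check_names(health_json):
    """Return the sorted names of all health checks in the report."""
    health = json.loads(health_json)
    return sorted(health.get("checks", {}))


def is_clear(health_json):
    """Assumption: the report counts as clear when it lists no checks."""
    return not check_names(health_json)


if __name__ == "__main__":
    print(check_names(HEALTH_JSON))
    print(is_clear(HEALTH_JSON))
```

With the report above, `check_names` yields the single entry `RECENT_MGR_MODULE_CRASH`, so `is_clear` stays false for as long as that check is present.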

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@tchaikov tchaikov requested review from a team, avanthakkar and callithea and removed request for a team July 21, 2021 10:51
@github-actions github-actions bot added the tests label Jul 21, 2021
@tchaikov
Contributor Author

jenkins test api

@tchaikov tchaikov force-pushed the wip-qa-test_module_selftest branch from a2bd8a5 to 03e1c31 Compare July 21, 2021 14:34
@neha-ojha
Member

@tchaikov Looks like this PR will address https://tracker.ceph.com/issues/51743?

otherwise, we have the following warning in the health report:

{"status":"HEALTH_WARN","checks":{"RECENT_MGR_MODULE_CRASH":{"severity":"HEALTH_WARN","summary":{"message":"1 mgr modules have recently crashed","count":1},"muted":false}},"mutes":[]}

and it does not disappear after the test waits for 30 seconds,
so the tasks.mgr.test_module_selftest.TestModuleSelftest test
fails like:

2021-07-21T09:59:52.560 INFO:tasks.cephfs_test_runner:======================================================================
2021-07-21T09:59:52.561 INFO:tasks.cephfs_test_runner:ERROR: test_module_commands (tasks.mgr.test_module_selftest.TestModuleSelftest)
2021-07-21T09:59:52.561 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-07-21T09:59:52.561 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2021-07-21T09:59:52.562 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_6a5d5abc027f706687dec92f92ff6fc6f074d2ae/qa/tasks/mgr/test_module_selftest.py", line 201, in test_module_commands
2021-07-21T09:59:52.562 INFO:tasks.cephfs_test_runner:    self.wait_for_health_clear(timeout=30)
2021-07-21T09:59:52.562 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_6a5d5abc027f706687dec92f92ff6fc6f074d2ae/qa/tasks/ceph_test_case.py", line 172, in wait_for_health_clear
2021-07-21T09:59:52.563 INFO:tasks.cephfs_test_runner:    self.wait_until_true(is_clear, timeout)
2021-07-21T09:59:52.563 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_6a5d5abc027f706687dec92f92ff6fc6f074d2ae/qa/tasks/ceph_test_case.py", line 209, in wait_until_true
2021-07-21T09:59:52.563 INFO:tasks.cephfs_test_runner:    raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
2021-07-21T09:59:52.564 INFO:tasks.cephfs_test_runner:tasks.ceph_test_case.TestTimeoutError: Timed out after 30s and 0 retries

in this change, the crash reports are pruned right after
we see the warning, so that we can have a clean health
report.

Fixes: https://tracker.ceph.com/issues/51743
Signed-off-by: Kefu Chai <kchai@redhat.com>
@tchaikov tchaikov force-pushed the wip-qa-test_module_selftest branch from 03e1c31 to ec8a40b Compare July 21, 2021 14:46
@tchaikov
Contributor Author

@tchaikov Looks like this PR will address https://tracker.ceph.com/issues/51743?

yes, it should address this issue. updated the commit message and tracker ticket.


@tchaikov tchaikov requested a review from neha-ojha July 21, 2021 15:31
# prune the crash reports, so that the health report is back to
# clean
self.mgr_cluster.mon_manager.raw_cluster_cmd(
    "crash", "prune", "0")
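
The effect of the lines under review can be sketched without a cluster: confirm the warning, prune the crash reports, then read health again. `FakeMonManager` and `prune_then_check` below are hypothetical in-memory stand-ins written for illustration only. They are not the teuthology classes, and the fake merely assumes that pruning with a keep value of 0 removes every stored report.

```python
class FakeMonManager:
    """Hypothetical in-memory stand-in for the test's mon_manager."""

    def __init__(self):
        self.crash_reports = ["fake-crash-id"]  # one pending report
        self.commands = []                      # record of issued commands

    def raw_cluster_cmd(self, *args):
        """Record the command; emulate pruning by dropping all reports."""
        self.commands.append(args)
        if args[:2] == ("crash", "prune"):
            self.crash_reports.clear()
        return ""

    def get_mon_health(self):
        """Build a health dict shaped like the report in the description."""
        checks = {}
        if self.crash_reports:
            checks["RECENT_MGR_MODULE_CRASH"] = {"severity": "HEALTH_WARN"}
        status = "HEALTH_WARN" if checks else "HEALTH_OK"
        return {"status": status, "checks": checks}


def prune_then_check(mon_manager):
    """Confirm the crash warning is present, prune, and return new health."""
    if "RECENT_MGR_MODULE_CRASH" not in mon_manager.get_mon_health()["checks"]:
        raise RuntimeError("expected the crash warning before pruning")
    mon_manager.raw_cluster_cmd("crash", "prune", "0")
    return mon_manager.get_mon_health()


if __name__ == "__main__":
    print(prune_then_check(FakeMonManager()))
```

Pruning before the wait means the health check no longer depends on the warning ageing out within the 30-second timeout.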
Member

@tchaikov do you know why we started seeing this issue recently?

Contributor Author

It is a regression introduced by #41937. I thought that the ignorelist would do the trick (see #41937 (comment)), hence I merged it without retesting the changeset with 3edc04a. My bad.

Member

Looks like this issue is not deterministic, probably that's why the TestTimeoutError (https://tracker.ceph.com/issues/51743) did not show up in your runs.

It showed up in https://pulpito.ceph.com/yuriw-2021-07-16_18:39:18-rados-wip-yuri-testing-master-7.16.21-distro-basic-smithi/, but not in the rerun https://pulpito.ceph.com/yuriw-2021-07-17_14:59:42-rados-wip-yuri-testing-master-7.16.21-distro-basic-smithi/.

It didn't show up in 10 runs of the test https://pulpito.ceph.com/nojha-2021-07-22_19:56:40-rados:mgr-wip-39871-distro-basic-smithi/

In any case, I don't see any issues with pruning the crash reports. @liewegas WDYT?

Member

@neha-ojha neha-ojha left a comment

I think this should be fine.

Member

@jdurgin jdurgin left a comment

looks like a reasonable workaround for the test to me!

@neha-ojha neha-ojha merged commit c9ad86e into ceph:master Jul 23, 2021
@tchaikov tchaikov deleted the wip-qa-test_module_selftest branch July 24, 2021 02:08