
mgr/cephadm: don't include agents in CEPHADM_FAILED_DAEMON #44158

Merged
sebastian-philipp merged 1 commit into ceph:master from adk3798:agent-failed
Jan 5, 2022

Conversation


@adk3798 adk3798 commented Dec 1, 2021

Agents already have their own, stricter health warning.
There are very few cases where they would show up in the
failed daemon health check but not the agent down health
check, and even if they did it would be temporary. Also,
before this change, agents marked as down would automatically
be marked as failed even if they didn't meet the typical
criteria for failed (systemd status is in error).

Fixes: https://tracker.ceph.com/issues/53448

Signed-off-by: Adam King <adking@redhat.com>
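The gist of the change can be illustrated with a small sketch: when collecting daemons for the CEPHADM_FAILED_DAEMON warning, daemons of type `agent` are skipped entirely, because a non-reporting agent already raises the stricter CEPHADM_AGENT_DOWN warning. This is a minimal illustration, not the real mgr/cephadm code; `DaemonDescription` here is a stand-in dataclass and `failed_daemons` is a hypothetical helper name.

```python
from dataclasses import dataclass

@dataclass
class DaemonDescription:
    """Illustrative stand-in for cephadm's daemon metadata."""
    daemon_type: str
    daemon_id: str
    status: int  # -1 roughly mirrors a systemd "error" state

def failed_daemons(daemons):
    """Return names of daemons that would trigger CEPHADM_FAILED_DAEMON.

    Agents are excluded: a down agent is reported via CEPHADM_AGENT_DOWN
    instead of being counted as a failed daemon.
    """
    return [
        f'{d.daemon_type}.{d.daemon_id}'
        for d in daemons
        if d.daemon_type != 'agent' and d.status == -1
    ]

daemons = [
    DaemonDescription('osd', '0', -1),
    DaemonDescription('agent', 'host1', -1),  # down agent: not "failed"
]
print(failed_daemons(daemons))  # only the osd is reported
```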

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@adk3798 adk3798 requested a review from a team as a code owner December 1, 2021 09:15
@sebastian-philipp
Contributor

Still getting:

2021-12-03T10:32:00.602 INFO:teuthology.orchestra.run.smithi094.stdout:{"status":"HEALTH_WARN","checks":{"CEPHADM_AGENT_DOWN":{"severity":"HEALTH_WARN","summary":{"message":"1 Cephadm Agent(s) are not reporting. Hosts may be offline","count":1},"muted":false},"CEPHADM_FAILED_DAEMON":{"severity":"HEALTH_WARN","summary":{"message":"1 failed cephadm daemon(s)","count":1},"muted":false}},"mutes":[]}
2021-12-03T10:32:01.362 INFO:journalctl@ceph.mon.a.smithi094.stdout:Dec 03 10:32:00 smithi094 bash[14604]: audit 2021-12-03T10:32:00.595778+0000 mon.a (mon.0) 348 : audit [DBG] from='client.? 172.21.15.94:0/3020743751' entity='client.admin' cmd=[{"prefix": "health", "format": "json"}]: dispatch
2021-12-03T10:32:01.803 INFO:tasks.cephadm:Teardown begin
2021-12-03T10:32:01.803 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_844d13778e221f3c1f3cb6626445c4bc2f71766d/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/git.ceph.com_ceph-c_d7478168d0435fd739f8f5f54dca33609f9cb014/qa/tasks/cephadm.py", line 1548, in task
    healthy(ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_d7478168d0435fd739f8f5f54dca33609f9cb014/qa/tasks/ceph.py", line 1469, in healthy
    manager.wait_until_healthy(timeout=300)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_d7478168d0435fd739f8f5f54dca33609f9cb014/qa/tasks/ceph_manager.py", line 3146, in wait_until_healthy
    'timeout expired in wait_until_healthy'
AssertionError: timeout expired in wait_until_healthy

https://pulpito.ceph.com/swagner-2021-12-03_09:41:55-orch:cephadm-wip-swagner-testing-2021-12-02-1454-distro-default-smithi/6542308
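The traceback above comes from teuthology polling cluster health until it reaches HEALTH_OK or a 300-second timeout expires; with CEPHADM_FAILED_DAEMON stuck at HEALTH_WARN, the wait can only time out. A minimal sketch of that kind of polling loop, under the assumption that it works roughly like this (the real `ceph_manager.wait_until_healthy` shells out to the `ceph` CLI; the parameter names here are illustrative):

```python
import time

def wait_until_healthy(get_health_status, timeout=300, interval=5,
                       sleep=time.sleep, clock=time.time):
    """Poll a health-status callable until it returns 'HEALTH_OK'.

    Raises AssertionError once `timeout` seconds have elapsed, mirroring
    the 'timeout expired in wait_until_healthy' failure in the log above.
    """
    start = clock()
    while True:
        if get_health_status() == 'HEALTH_OK':
            return
        assert clock() - start < timeout, \
            'timeout expired in wait_until_healthy'
        sleep(interval)

# A cluster that clears its warning in time lets the loop return:
statuses = iter(['HEALTH_WARN', 'HEALTH_WARN', 'HEALTH_OK'])
wait_until_healthy(lambda: next(statuses), interval=0, sleep=lambda s: None)
print('healthy')
```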

@adk3798
Contributor Author

adk3798 commented Dec 13, 2021

> Still getting:
>
> 2021-12-03T10:32:00.602 INFO:teuthology.orchestra.run.smithi094.stdout:{"status":"HEALTH_WARN","checks":{"CEPHADM_AGENT_DOWN":{"severity":"HEALTH_WARN","summary":{"message":"1 Cephadm Agent(s) are not reporting. Hosts may be offline","count":1},"muted":false},"CEPHADM_FAILED_DAEMON":{"severity":"HEALTH_WARN","summary":{"message":"1 failed cephadm daemon(s)","count":1},"muted":false}},"mutes":[]}
> 2021-12-03T10:32:01.362 INFO:journalctl@ceph.mon.a.smithi094.stdout:Dec 03 10:32:00 smithi094 bash[14604]: audit 2021-12-03T10:32:00.595778+0000 mon.a (mon.0) 348 : audit [DBG] from='client.? 172.21.15.94:0/3020743751' entity='client.admin' cmd=[{"prefix": "health", "format": "json"}]: dispatch
> 2021-12-03T10:32:01.803 INFO:tasks.cephadm:Teardown begin
> 2021-12-03T10:32:01.803 ERROR:teuthology.contextutil:Saw exception from nested tasks
> Traceback (most recent call last):
>   File "/home/teuthworker/src/git.ceph.com_git_teuthology_844d13778e221f3c1f3cb6626445c4bc2f71766d/teuthology/contextutil.py", line 33, in nested
>     yield vars
>   File "/home/teuthworker/src/git.ceph.com_ceph-c_d7478168d0435fd739f8f5f54dca33609f9cb014/qa/tasks/cephadm.py", line 1548, in task
>     healthy(ctx=ctx, config=config)
>   File "/home/teuthworker/src/git.ceph.com_ceph-c_d7478168d0435fd739f8f5f54dca33609f9cb014/qa/tasks/ceph.py", line 1469, in healthy
>     manager.wait_until_healthy(timeout=300)
>   File "/home/teuthworker/src/git.ceph.com_ceph-c_d7478168d0435fd739f8f5f54dca33609f9cb014/qa/tasks/ceph_manager.py", line 3146, in wait_until_healthy
>     'timeout expired in wait_until_healthy'
> AssertionError: timeout expired in wait_until_healthy
>
> https://pulpito.ceph.com/swagner-2021-12-03_09:41:55-orch:cephadm-wip-swagner-testing-2021-12-02-1454-distro-default-smithi/6542308

Looking at the pulpito link, it looks like that run was testing https://github.com/ceph/ceph-ci/tree/wip-swagner-testing-2021-12-02-1454, which doesn't include this PR. I could see how #44031 (which does look like it was included) could cause this error, but I'm confused how this one would contribute to it.

Am I just getting confused with the branches and what was tested here? @sebastian-philipp

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@sebastian-philipp
Contributor

jenkins test api

