
mgr/cephadm: don't include agents in CEPHADM_FAILED_DAEMON #44158

Merged
sebastian-philipp merged 1 commit into ceph:master from adk3798:agent-failed
Jan 5, 2022

Conversation


@adk3798 adk3798 commented Dec 1, 2021

Agents already have their own, stricter health warning.
There are very few cases where they would show up in the
failed daemon health check but not the agent down health
check, and even if they did it would be temporary. Also,
before this change, agents marked as down would automatically
be marked as failed even if they didn't meet the typical
criteria for failed (systemd status is in error).

Fixes: https://tracker.ceph.com/issues/53448

Signed-off-by: Adam King <adking@redhat.com>
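The gist of the change can be illustrated with a small sketch: when collecting daemons for the CEPHADM_FAILED_DAEMON warning, daemons of type `agent` are skipped entirely, because a non-reporting agent already raises the stricter CEPHADM_AGENT_DOWN warning. This is a minimal illustration, not the real mgr/cephadm code; `DaemonDescription` here is a stand-in dataclass and `failed_daemons` is a hypothetical helper name.

```python
from dataclasses import dataclass

@dataclass
class DaemonDescription:
    """Illustrative stand-in for cephadm's daemon metadata."""
    daemon_type: str
    daemon_id: str
    status: int  # -1 roughly mirrors a systemd "error" state

def failed_daemons(daemons):
    """Return names of daemons that would trigger CEPHADM_FAILED_DAEMON.

    Agents are excluded: a down agent is reported via CEPHADM_AGENT_DOWN
    instead of being counted as a failed daemon.
    """
    return [
        f'{d.daemon_type}.{d.daemon_id}'
        for d in daemons
        if d.daemon_type != 'agent' and d.status == -1
    ]

daemons = [
    DaemonDescription('osd', '0', -1),
    DaemonDescription('agent', 'host1', -1),  # down agent: not "failed"
]
print(failed_daemons(daemons))  # only the osd is reported
```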

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@adk3798 adk3798 requested a review from a team as a code owner December 1, 2021 09:15
@sebastian-philipp
Contributor

Still getting:

2021-12-03T10:32:00.602 INFO:teuthology.orchestra.run.smithi094.stdout:{"status":"HEALTH_WARN","checks":{"CEPHADM_AGENT_DOWN":{"severity":"HEALTH_WARN","summary":{"message":"1 Cephadm Agent(s) are not reporting. Hosts may be offline","count":1},"muted":false},"CEPHADM_FAILED_DAEMON":{"severity":"HEALTH_WARN","summary":{"message":"1 failed cephadm daemon(s)","count":1},"muted":false}},"mutes":[]}
2021-12-03T10:32:01.362 INFO:journalctl@ceph.mon.a.smithi094.stdout:Dec 03 10:32:00 smithi094 bash[14604]: audit 2021-12-03T10:32:00.595778+0000 mon.a (mon.0) 348 : audit [DBG] from='client.? 172.21.15.94:0/3020743751' entity='client.admin' cmd=[{"prefix": "health", "format": "json"}]: dispatch
2021-12-03T10:32:01.803 INFO:tasks.cephadm:Teardown begin
2021-12-03T10:32:01.803 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_844d13778e221f3c1f3cb6626445c4bc2f71766d/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/git.ceph.com_ceph-c_d7478168d0435fd739f8f5f54dca33609f9cb014/qa/tasks/cephadm.py", line 1548, in task
    healthy(ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_d7478168d0435fd739f8f5f54dca33609f9cb014/qa/tasks/ceph.py", line 1469, in healthy
    manager.wait_until_healthy(timeout=300)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_d7478168d0435fd739f8f5f54dca33609f9cb014/qa/tasks/ceph_manager.py", line 3146, in wait_until_healthy
    'timeout expired in wait_until_healthy'
AssertionError: timeout expired in wait_until_healthy

https://pulpito.ceph.com/swagner-2021-12-03_09:41:55-orch:cephadm-wip-swagner-testing-2021-12-02-1454-distro-default-smithi/6542308
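The traceback above comes from teuthology polling cluster health until it reaches HEALTH_OK or a 300-second timeout expires; with CEPHADM_FAILED_DAEMON stuck at HEALTH_WARN, the wait can only time out. A minimal sketch of that kind of polling loop, under the assumption that it works roughly like this (the real `ceph_manager.wait_until_healthy` shells out to the `ceph` CLI; the parameter names here are illustrative):

```python
import time

def wait_until_healthy(get_health_status, timeout=300, interval=5,
                       sleep=time.sleep, clock=time.time):
    """Poll a health-status callable until it returns 'HEALTH_OK'.

    Raises AssertionError once `timeout` seconds have elapsed, mirroring
    the 'timeout expired in wait_until_healthy' failure in the log above.
    """
    start = clock()
    while True:
        if get_health_status() == 'HEALTH_OK':
            return
        assert clock() - start < timeout, \
            'timeout expired in wait_until_healthy'
        sleep(interval)

# A cluster that clears its warning in time lets the loop return:
statuses = iter(['HEALTH_WARN', 'HEALTH_WARN', 'HEALTH_OK'])
wait_until_healthy(lambda: next(statuses), interval=0, sleep=lambda s: None)
print('healthy')
```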

@adk3798
Contributor Author

adk3798 commented Dec 13, 2021

> Still getting:
>
> 2021-12-03T10:32:00.602 INFO:teuthology.orchestra.run.smithi094.stdout:{"status":"HEALTH_WARN","checks":{"CEPHADM_AGENT_DOWN":{"severity":"HEALTH_WARN","summary":{"message":"1 Cephadm Agent(s) are not reporting. Hosts may be offline","count":1},"muted":false},"CEPHADM_FAILED_DAEMON":{"severity":"HEALTH_WARN","summary":{"message":"1 failed cephadm daemon(s)","count":1},"muted":false}},"mutes":[]}
> 2021-12-03T10:32:01.362 INFO:journalctl@ceph.mon.a.smithi094.stdout:Dec 03 10:32:00 smithi094 bash[14604]: audit 2021-12-03T10:32:00.595778+0000 mon.a (mon.0) 348 : audit [DBG] from='client.? 172.21.15.94:0/3020743751' entity='client.admin' cmd=[{"prefix": "health", "format": "json"}]: dispatch
> 2021-12-03T10:32:01.803 INFO:tasks.cephadm:Teardown begin
> 2021-12-03T10:32:01.803 ERROR:teuthology.contextutil:Saw exception from nested tasks
> Traceback (most recent call last):
>   File "/home/teuthworker/src/git.ceph.com_git_teuthology_844d13778e221f3c1f3cb6626445c4bc2f71766d/teuthology/contextutil.py", line 33, in nested
>     yield vars
>   File "/home/teuthworker/src/git.ceph.com_ceph-c_d7478168d0435fd739f8f5f54dca33609f9cb014/qa/tasks/cephadm.py", line 1548, in task
>     healthy(ctx=ctx, config=config)
>   File "/home/teuthworker/src/git.ceph.com_ceph-c_d7478168d0435fd739f8f5f54dca33609f9cb014/qa/tasks/ceph.py", line 1469, in healthy
>     manager.wait_until_healthy(timeout=300)
>   File "/home/teuthworker/src/git.ceph.com_ceph-c_d7478168d0435fd739f8f5f54dca33609f9cb014/qa/tasks/ceph_manager.py", line 3146, in wait_until_healthy
>     'timeout expired in wait_until_healthy'
> AssertionError: timeout expired in wait_until_healthy
>
> https://pulpito.ceph.com/swagner-2021-12-03_09:41:55-orch:cephadm-wip-swagner-testing-2021-12-02-1454-distro-default-smithi/6542308

Looking at the pulpito link, it looks like that run was testing https://github.com/ceph/ceph-ci/tree/wip-swagner-testing-2021-12-02-1454, which doesn't include this PR. I could see how #44031 (which does look like it was included) could cause this error, but I'm confused how this one would contribute to it.

Am I just getting confused with the branches and what was tested here? @sebastian-philipp

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@sebastian-philipp
Contributor

jenkins test api

