mgr/cephadm: agent: simplify handling of agent reports#44031

Merged
sebastian-philipp merged 1 commit into ceph:master from adk3798:agent-thrash on Dec 16, 2021

Conversation

@adk3798
Contributor

@adk3798 adk3798 commented Nov 19, 2021

Don't try to do extra things like checking other agents
or updating health checks when agents report. Rely on
serve loop for that

This also should help with thrashing we've been seeing

Signed-off-by: Adam King <adking@redhat.com>
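
The split described above can be illustrated with a minimal sketch. It is not the code from this PR: `AgentTracker`, `handle_report`, `serve_once`, and the `down_after_secs` threshold are hypothetical names chosen for illustration. Only the `CEPHADM_AGENT_DOWN` check name and its summary wording are taken from the health output quoted later in this thread. The report path only records a timestamp, and the serve loop is the one place that looks at all agents and updates the health check.

```python
from typing import Dict, Set


class AgentTracker:
    """Hypothetical stand-in for module state; not the real cephadm classes."""

    def __init__(self, down_after_secs: float = 60.0) -> None:
        # Assumed threshold: an agent counts as down after this much silence.
        self.down_after_secs = down_after_secs
        self.last_report: Dict[str, float] = {}  # host -> time of last report
        self.health_checks: Dict[str, dict] = {}

    def handle_report(self, host: str, now: float) -> None:
        # Report path: record the timestamp and return. It does not scan
        # other agents and does not touch health checks.
        self.last_report[host] = now

    def agents_down(self, now: float) -> Set[str]:
        # Hosts whose last report is older than the threshold.
        return {
            host
            for host, seen in self.last_report.items()
            if now - seen > self.down_after_secs
        }

    def serve_once(self, now: float) -> None:
        # Serve-loop path: evaluate every agent and set or clear the warning.
        down = self.agents_down(now)
        if down:
            self.health_checks['CEPHADM_AGENT_DOWN'] = {
                'summary': f'{len(down)} Cephadm Agent(s) are not reporting. '
                           'Hosts may be offline',
                'count': len(down),
                'detail': sorted(down),
            }
        else:
            self.health_checks.pop('CEPHADM_AGENT_DOWN', None)
```

With this shape, a burst of agent reports costs one dictionary write each, and the health state changes at most once per serve-loop pass, which is one plausible way the change could reduce thrashing.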

@sebastian-philipp
Contributor

Adam, would it make sense to redeploy them after a day or so?

@adk3798
Contributor Author

adk3798 commented Dec 1, 2021

> Adam, would it make sense to redeploy them after a day or so?

It wouldn't hurt, as long as we're okay with having to track how long they've been down.

@sebastian-philipp
Contributor

> > Adam, would it make sense to redeploy them after a day or so?
>
> it wouldn't hurt as long as we're okay with having to track how long they've been down

For https://tracker.ceph.com/issues/47038 we need to track this anyway.
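
As a rough illustration of what "tracking how long they've been down" might involve, here is a minimal sketch. Everything in it is hypothetical: `DowntimeTracker`, its method names, and the one-day threshold are assumptions for the example, not cephadm code and not something this PR implements.

```python
from typing import Dict, List

ONE_DAY_SECS = 24 * 60 * 60  # assumed threshold, from "after a day or so"


class DowntimeTracker:
    """Hypothetical helper that remembers when each agent was first seen down."""

    def __init__(self, redeploy_after_secs: float = ONE_DAY_SECS) -> None:
        self.redeploy_after_secs = redeploy_after_secs
        self.down_since: Dict[str, float] = {}  # host -> first time seen down

    def mark_down(self, host: str, now: float) -> None:
        # Keep the earliest timestamp so repeated checks do not reset the clock.
        self.down_since.setdefault(host, now)

    def mark_up(self, host: str) -> None:
        # A fresh report clears the downtime record.
        self.down_since.pop(host, None)

    def redeploy_candidates(self, now: float) -> List[str]:
        # Hosts whose agent has been down for at least the threshold.
        return sorted(
            host
            for host, since in self.down_since.items()
            if now - since >= self.redeploy_after_secs
        )
```

The extra state is small, but it has to be kept consistent with the agent-down detection, which is the cost the comment above refers to.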

@sebastian-philipp
Contributor

actually I think we can merge this as it is

@sebastian-philipp sebastian-philipp added the wip-swagner-testing My Teuthology tests label Dec 2, 2021
@adk3798 adk3798 changed the title from "mgr/cephadm: agent: don't attempt redeploy of down agents" to "mgr/cephadm: agent: simplify handling of agent reports" Dec 2, 2021
@adk3798 adk3798 added the DNM label Dec 13, 2021
@adk3798
Contributor Author

adk3798 commented Dec 15, 2021

This was causing the following teuthology failure:

2021-12-03T10:32:00.602 INFO:teuthology.orchestra.run.smithi094.stdout:{"status":"HEALTH_WARN","checks":{"CEPHADM_AGENT_DOWN":{"severity":"HEALTH_WARN","summary":{"message":"1 Cephadm Agent(s) are not reporting. Hosts may be offline","count":1},"muted":false},"CEPHADM_FAILED_DAEMON":{"severity":"HEALTH_WARN","summary":{"message":"1 failed cephadm daemon(s)","count":1},"muted":false}},"mutes":[]}
2021-12-03T10:32:01.362 INFO:journalctl@ceph.mon.a.smithi094.stdout:Dec 03 10:32:00 smithi094 bash[14604]: audit 2021-12-03T10:32:00.595778+0000 mon.a (mon.0) 348 : audit [DBG] from='client.? 172.21.15.94:0/3020743751' entity='client.admin' cmd=[{"prefix": "health", "format": "json"}]: dispatch
2021-12-03T10:32:01.803 INFO:tasks.cephadm:Teardown begin
2021-12-03T10:32:01.803 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_844d13778e221f3c1f3cb6626445c4bc2f71766d/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/git.ceph.com_ceph-c_d7478168d0435fd739f8f5f54dca33609f9cb014/qa/tasks/cephadm.py", line 1548, in task
    healthy(ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_d7478168d0435fd739f8f5f54dca33609f9cb014/qa/tasks/ceph.py", line 1469, in healthy
    manager.wait_until_healthy(timeout=300)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_d7478168d0435fd739f8f5f54dca33609f9cb014/qa/tasks/ceph_manager.py", line 3146, in wait_until_healthy
    'timeout expired in wait_until_healthy'
AssertionError: timeout expired in wait_until_healthy
https://pulpito.ceph.com/swagner-2021-12-03_09:41:55-orch:cephadm-wip-swagner-testing-2021-12-02-1454-distro-default-smithi/6542308

but I think it should be okay now

@sebastian-philipp sebastian-philipp merged commit aa37f5c into ceph:master Dec 16, 2021