mgr/cephadm: Add k8s-style event system#35456

Merged
sebastian-philipp merged 6 commits intoceph:masterfrom
sebastian-philipp:cephadm-events
Jul 17, 2020

Conversation

@sebastian-philipp
Contributor

@sebastian-philipp sebastian-philipp commented Jun 6, 2020

blocked by

We're getting repeated complaints that cephadm is not transparent: we still don't have enough visibility into what cephadm is actually doing.

Adding progress events is going to be complicated, as cephadm contains a declarative state and a loop that tries to make reality match the configured state. Adding progress events requires some non-trivial bookkeeping, which I'm not willing to do by myself right now.

Well, this is how Kubernetes solves this problem: https://kubernetes.io/docs/tasks/debug-application-cluster/

Does this also work for us? Let's give it a try:

$ ceph orch ls --format yaml
events:
- 2020-06-06T23:23:46.910845 service:crash [INFO] "service was created"
placement:
  host_pattern: '*'
service_name: crash
service_type: crash
status:
  container_image_id: 74803e884bea289d2d2d3ebdf6d37cd560499e955595695b1390a89800f4e37a
  container_image_name: docker.io/ceph/daemon-base:latest-master-devel
  created: '2020-06-06T23:23:46.744005'
  last_refresh: '2020-06-06T23:23:57.109881'
  running: 1
  size: 1

$ ceph orch ps --format yaml
container_id: c804e5b26fdd
container_image_id: 74803e884bea289d2d2d3ebdf6d37cd560499e955595695b1390a89800f4e37a
container_image_name: docker.io/ceph/daemon-base:latest-master-devel
created: '2020-06-06T23:23:55.416572'
daemon_id: ubuntu
daemon_type: crash
events:
- 2020-06-06T23:23:55.634997 daemon:crash.ubuntu [INFO] "Deployed crash.ubuntu on
  host 'ubuntu'"
hostname: ubuntu
last_refresh: '2020-06-06T23:44:18.474924'
started: '2020-06-06T23:23:55.503624'
status: 1
status_desc: running
version: 16.0.0-901-g713ef3c

$ ceph orch apply --service_type node-exporter                                                                                                       
Scheduled node-exporter update...

$ ceph orch ls --format yaml                  
events:
- 2020-06-06T23:23:46.910845 service:crash [INFO] "service was created"
placement:
  host_pattern: '*'
service_name: crash
service_type: crash
status:
  container_image_id: 74803e884bea289d2d2d3ebdf6d37cd560499e955595695b1390a89800f4e37a
  container_image_name: docker.io/ceph/daemon-base:latest-master-devel
  created: '2020-06-06T23:23:46.744005'
  last_refresh: '2020-06-06T23:23:57.109881'
  running: 1
  size: 1
---
events:
- 2020-06-06T23:24:10.214343 service:node-exporter [INFO] "service was created"
- '2020-06-06T23:24:10.606714 service:node-exporter [ERROR] "cephadm exited with an
  error code: 1, stderr:INFO:cephadm:Deploy daemon node-exporter.ubuntu ...
  INFO:cephadm:Verifying port 9100 ...
  WARNING:cephadm:Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already
  in use
  ERROR: TCP Port(s) ''9100'' required for node-exporter is already in use"'
placement:
  host_pattern: '*'
service_name: node-exporter
service_type: node-exporter
status:
  running: 0
  size: 1
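The per-subject event lists shown above need very little machinery. The following is a minimal sketch of the idea only, not the PR's actual code; class and attribute names are illustrative:

```python
from collections import defaultdict, deque
from datetime import datetime


class OrchestratorEvent:
    """One k8s-style event attached to a 'subject' (a service or a daemon)."""

    def __init__(self, subject: str, level: str, message: str):
        self.created = datetime.utcnow()
        self.subject = subject   # e.g. 'service:crash' or 'daemon:crash.ubuntu'
        self.level = level       # 'INFO' or 'ERROR'
        self.message = message

    def __str__(self) -> str:
        # Matches the shape of the lines in the YAML output above.
        return f'{self.created.isoformat()} {self.subject} [{self.level}] "{self.message}"'


class EventStore:
    """Bounded per-subject history: old events fall off automatically."""

    MAX_EVENTS = 5  # the PR keeps at most 5 events per subject

    def __init__(self) -> None:
        self._events = defaultdict(lambda: deque(maxlen=self.MAX_EVENTS))

    def add(self, event: OrchestratorEvent) -> None:
        self._events[event.subject].append(event)

    def for_subject(self, subject: str) -> list:
        return list(self._events[subject])
```

A `deque(maxlen=...)` gives the bounded history for free: appending a sixth event silently drops the oldest one.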

There are obviously many open questions. Some of them:

  • Is this a bad idea in general, and we should use HEALTH warnings instead?
  • How to convert those events into a meaningful HEALTH warning? Especially clearing the warnings again will be interesting: When exactly is the port conflict solved?
  • How to properly present the events in a way that looks good?
  • How to properly manage exceptions in serve()?
  • Is storing lots of events in config-key a bad idea?
  • Are we going to overload config-key, if we generate events for all OSDs at once? How to avoid this?
  • This needs testing obviously
  • The YAML representation is bad. How should we show the events then?
  • Should we store the events in memory only?
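Regarding the presentation question: one option for a non-YAML view is the tabular layout that `kubectl get events` uses. A hypothetical renderer (function name and column choices are made up for illustration):

```python
def render_events(rows):
    """Render (age, subject, level, message) tuples as a kubectl-style table."""
    table = [('AGE', 'SUBJECT', 'LEVEL', 'MESSAGE')] + [tuple(r) for r in rows]
    # Pad every column to the widest cell in that column.
    widths = [max(len(row[col]) for row in table) for col in range(4)]
    return '\n'.join(
        '  '.join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    )
```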

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard backend
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@mgfritch
Contributor

mgfritch commented Jun 7, 2020

This is an interesting idea!

@jschmid1
Contributor

jschmid1 commented Jun 8, 2020

Indeed interesting!

A couple of random thoughts:

This is super helpful for troubleshooting/debugging and adds a lot of transparency.
The way that kubectl displays it looks great too; we could probably adopt that for any non-YAML/JSON format.

Otoh, this feels a bit redundant: the 'events' should be present in the logs. (If not, we did a bad job at logging.) It should be rather easy to filter 'events' out of the logs and print them upon request.

Using a separate 'events' implementation would enable us to react to them. This, however, adds a lot of complexity and probably requires a more sophisticated solution.

@sebastian-philipp
Contributor Author

Otoh, this feels a bit redundant, the 'events' should be present in the logs.

Right. This is mainly for diagnosing obvious errors. I'm just keeping 5 events per subject in the system.

(If not, we did a bad job at logging). It should be rather easy to filter out 'events' from the logs and print them upon request.

We already have

class LogEntry(object):

but I don't see this as particularly helpful for finding domain-specific issues. We'd probably have to write something else.

Using a separate 'events' implementation would enable us to react to them. This however adds a lot of complexity and probably requires a more sophisticated solution.

Adding context information is already pretty hard to do right. No idea if that would also work for a generic solution.


@sebastian-philipp
Contributor Author

It would be great to see that error in the events list.

So, you think we should go ahead with this approach?

@votdev
Member

votdev commented Jun 8, 2020

It would be great to see that error in the events list.

So, you think we should go ahead with this approach?

Got another problem while testing my PR to deploy services via Dashboard where your PR was really helpful to identify the problem easily.

# ceph orch ls --format yaml
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2020-06-08T14:44:37.004+0000 7f62acc81700 -1 WARNING: all dangerous and experimental features are enabled.
2020-06-08T14:44:37.048+0000 7f62a759e700 -1 WARNING: all dangerous and experimental features are enabled.
events:
- 2020-06-08T14:19:08.879649 service:grafana [INFO] "service was created"
- '2020-06-08T14:19:09.792884 service:grafana [ERROR] "cephadm exited with an error
  code: 1, stderr:INFO:cephadm:Deploy daemon grafana.mgr0 ...
  INFO:cephadm:Verifying port 3000 ...
  WARNING:cephadm:Cannot bind to IP 0.0.0.0 port 3000: [Errno 98] Address already
  in use
  ERROR: TCP Port(s) ''3000'' required for grafana is already in use"'
- '2020-06-08T14:19:10.299231 service:grafana [ERROR] "cephadm exited with an error
  code: 1, stderr:INFO:cephadm:Deploy daemon grafana.mgr0 ...
  INFO:cephadm:Verifying port 3000 ...
  WARNING:cephadm:Cannot bind to IP 0.0.0.0 port 3000: [Errno 98] Address already
  in use
  ERROR: TCP Port(s) ''3000'' required for grafana is already in use"'
- 2020-06-08T14:43:51.040352 service:grafana [INFO] "service was created"
- '2020-06-08T14:43:51.667427 service:grafana [ERROR] "cephadm exited with an error
  code: 1, stderr:INFO:cephadm:Deploy daemon grafana.mgr0 ...
  INFO:cephadm:Verifying port 3000 ...
  WARNING:cephadm:Cannot bind to IP 0.0.0.0 port 3000: [Errno 98] Address already
  in use
  ERROR: TCP Port(s) ''3000'' required for grafana is already in use"'
placement:
  label: aaaa
service_name: grafana
service_type: grafana
status:
  running: 0
  size: 1

@sebastian-philipp
Contributor Author

@neha-ojha + @jecluis. Are you ok with putting O(daemons) objects into config-key? Or do I need a different way to store the data?

@neha-ojha
Member

@neha-ojha + @jecluis. Are you ok with putting O(daemons) objects into config-key? Or do I need a different way to store the data?

@sebastian-philipp I need to understand the context for this, will revisit it next week.

@sebastian-philipp
Contributor Author

sebastian-philipp commented Jun 15, 2020

OK, the context is: cephadm is really opaque, and no one knows what cephadm is doing or what went wrong, except by looking into the MGR log file. We're talking about things like

  • deployed daemon x on host y,
  • will remove daemon x from host y,
  • service x was created,
  • failed to deploy daemon x on host y, cause: foobar.

With the exception of the last message, this is nothing we can push to the user via HEALTH, which is a different topic.

I'd like to store those messages for each daemon; right now I'm storing at most 5 messages. Now the question: is config-key capable of storing a JSON document per daemon?
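One plausible shape for the config-key question is to serialize each subject's (at most 5) events as one small JSON document under a per-subject key. The sketch below is an assumption, not the PR's implementation; a plain dict stands in for the mgr module's get_store()/set_store() key-value interface, and the key prefix is made up:

```python
import json


class EventStoreKV:
    """Persist each subject's event list as one JSON document per config-key.

    `kv` stands in for the mgr module's config-key store; in a real module
    the save/load calls would go through set_store()/get_store() instead.
    """

    PREFIX = 'mgr/cephadm/events/'  # hypothetical key namespace

    def __init__(self, kv: dict) -> None:
        self.kv = kv

    def save(self, subject: str, events: list) -> None:
        self.kv[self.PREFIX + subject] = json.dumps(events)

    def load(self, subject: str) -> list:
        raw = self.kv.get(self.PREFIX + subject)
        return json.loads(raw) if raw is not None else []
```

Because the history is capped at 5 events per subject, each document stays small; the open question is only the number of keys, which grows with the number of daemons.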

sebastian-philipp changed the title from "[RFC] mgr/cephadm: Add k8s-style event system" to "mgr/cephadm: Add k8s-style event system" on Jun 26, 2020

@sebastian-philipp
Contributor Author

jenkins retest this please

@sebastian-philipp
Contributor Author

jenkins test make check


@sebastian-philipp
Contributor Author

cephadm/module.py:45: note: In module imported here:
cephadm/inventory.py: note: In member "cleanup" of class "EventStore":
cephadm/inventory.py:500: error: Unpacking a string is disallowed
cephadm/inventory.py:501: error: Cannot determine type of 'k_s'
cephadm/inventory.py:504: error: Cannot determine type of 'k_s'
cephadm/inventory.py:507: error: Cannot determine type of 'k_s'
Found 4 errors in 1 file (checked 11 source files)
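For context on the first error: iterating over a dict yields its keys, which are plain strings, so tuple-unpacking the loop variable is what mypy rejects as "Unpacking a string is disallowed". A hedged reconstruction of the pattern (the actual inventory.py code may differ):

```python
# The subjects are 'kind:name' strings, as in the event output above.
events = {'service:crash': [], 'daemon:crash.ubuntu': []}

# Broken (mypy: "Unpacking a string is disallowed") -- each item of the
# iteration is a single str key, not a pair:
#
#   for kind, subject in events:
#       ...

# One fix: split the key explicitly, so every variable has an unambiguous type.
subjects = []
for key in events:
    kind, name = key.split(':', 1)  # both halves are plain str
    subjects.append((kind, name))
```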


@smithfarm
Contributor

@sebastian-philipp Would it make sense to also emit these events to the cluster log?

https://docs.ceph.com/docs/master/dev/logging/

@sebastian-philipp
Contributor Author

sebastian-philipp commented Jul 24, 2020

Everything is logged in the mgr log right now. Do we really need to log there as well?

@sebastian-philipp
Contributor Author

[screenshot]

hehe. I think you clicked on edit, instead of quote :-)

@smithfarm
Contributor

smithfarm commented Jul 24, 2020

hehe. I think you clicked on edit, instead of quote :-)

Yes, but not on purpose. A bad habit of mine :-(

@smithfarm
Contributor

Everything is logged in the mgr log right now. Do we really need to log there as well?

Depends on whom you mean by "we". Developers or users?

Searching the individual MGR logs for significant events might be easy for developers (?), but most users probably find it hard and mystifying.

The cluster log's purpose is to aggregate significant cluster-wide events in a single place. In my understanding, these "k8s-style" events would be a good fit for the cluster log.
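If the events were forwarded there, a natural filter would be to send only errors, since INFO events ("deployed daemon x", ...) would flood the cluster log. A sketch under the assumption that the mgr module exposes a cluster_log(channel, priority, message)-style hook; a stub stands in for the module here:

```python
class MgrStub:
    """Stands in for the mgr module object; the cluster_log() signature
    used below is an assumption for illustration."""

    def __init__(self):
        self.lines = []

    def cluster_log(self, channel, priority, message):
        self.lines.append((channel, priority, message))


def forward_event(mgr, subject, level, message):
    # Forward only errors: failures are worth surfacing cluster-wide,
    # routine INFO events are not.
    if level == 'ERROR':
        mgr.cluster_log('cluster', 'error', f'{subject}: {message}')
```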
