mgr/cephadm: Add k8s-style event system #35456
|
This is an interesting idea! |
|
Indeed interesting! A couple of random thoughts: This is super helpful for troubleshooting/debugging and adds a lot of transparency. OTOH, this feels a bit redundant: the 'events' should be present in the logs. (If not, we did a bad job at logging.) It should be rather easy to filter 'events' out of the logs and print them upon request. Using a separate 'events' implementation would enable us to react to them. This, however, adds a lot of complexity and probably requires a more sophisticated solution. |
Right. This is mainly for diagnosing obvious errors. I'm just keeping 5 events per subject in the system.
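To illustrate the "5 events per subject" limit, here is a minimal sketch of a bounded per-subject store. It is not the code from this PR: the `Event` and `EventStore` names, the field names, and the subject string format are all assumptions made for the example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Deque, Dict, List

MAX_EVENTS_PER_SUBJECT = 5  # the limit mentioned in the comment above


@dataclass
class Event:
    # All field names are hypothetical, not taken from the PR.
    subject: str  # e.g. "daemon:mgr.a" (format is an assumption)
    level: str    # e.g. "INFO" or "ERROR"
    message: str
    created: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))


class EventStore:
    """Keeps only the newest N events for each subject."""

    def __init__(self, max_per_subject: int = MAX_EVENTS_PER_SUBJECT) -> None:
        self._max = max_per_subject
        # deque(maxlen=N) silently drops the oldest entry when full.
        self._events: Dict[str, Deque[Event]] = defaultdict(
            lambda: deque(maxlen=self._max))

    def add(self, event: Event) -> None:
        self._events[event.subject].append(event)

    def for_subject(self, subject: str) -> List[Event]:
        # .get() avoids creating an empty deque for unknown subjects.
        return list(self._events.get(subject, ()))
```

With a fixed `maxlen`, memory stays bounded at O(subjects) no matter how often a failing operation is retried.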
We already have ceph/src/pybind/mgr/k8sevents/module.py Line 159 in 779de8c but I don't see this as particularly helpful for finding domain-specific issues. We'd probably have to write something else.
Adding context information is already pretty hard to do right. No idea if that would also work for a generic solution. |
So, you think we should go ahead with this approach? |
I hit another problem while testing my PR to deploy services via the Dashboard, and your PR made it really easy to identify. |
e56aa3d to
97f77e1
|
@neha-ojha + @jecluis: Are you OK with putting O(daemons) objects into config-key? Or do I need a different way to store the data? |
@sebastian-philipp I need to understand the context for this, will revisit it next week. |
|
OK, the context is: cephadm is really opaque, and no one knows what cephadm is doing or what went wrong without looking into the MGR log file. We're talking about things like
With the exception of the last message, this is nothing we can push to the user via HEALTH, which is a different topic. I'd like to store those messages for each daemon, and right now I'm storing at most 5 messages. Now the question: is config-key capable of storing a JSON document per daemon? |
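To make the storage question concrete, here is a rough sketch of "one JSON document per daemon" in a key/value store. A plain `dict` stands in for config-key; the `events/` key prefix, the function names, and the document layout are assumptions for illustration only, not what this PR implements.

```python
import json
from typing import Dict, List

KEY_PREFIX = "events/"  # hypothetical key prefix
MAX_EVENTS = 5          # at most 5 messages per daemon, as described above


def save_event(kv: Dict[str, str], daemon: str,
               level: str, message: str) -> None:
    """Append one event to the daemon's JSON document, keeping the newest 5."""
    key = KEY_PREFIX + daemon
    events: List[dict] = json.loads(kv.get(key, "[]"))
    events.append({"level": level, "message": message})
    # The store only ever holds one bounded string value per daemon.
    kv[key] = json.dumps(events[-MAX_EVENTS:])


def load_events(kv: Dict[str, str], daemon: str) -> List[dict]:
    """Return the stored events for a daemon, or an empty list."""
    return json.loads(kv.get(KEY_PREFIX + daemon, "[]"))
```

The number of keys grows as O(daemons), and each value is bounded by the 5-event cap, which is the load the question above is asking about.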
1cb7496 to
800b5b7
800b5b7 to
4737030
4737030 to
8b226fb
|
jenkins retest this please |
|
jenkins test make check |
8b226fb to
c337bfe
|
cephadm/module.py:45: note: In module imported here: |
Signed-off-by: Sebastian Wagner <sebastian.wagner@suse.com>
Like when daemon deployment fails
Signed-off-by: Sebastian Wagner <sebastian.wagner@suse.com>
c337bfe to
1dd2c5c
|
@sebastian-philipp Would it make sense to also emit these events to the cluster log? |
|
Everything is logged in the mgr log right now. Do we really need to log there as well? |
Yes, but not on purpose. A bad habit of mine :-( |
Depends on whom you mean by "we". Developers or users? Searching the individual MGR logs for significant events might be easy for developers (?), but most users probably find it hard and mystifying. The cluster log's purpose is to aggregate significant cluster-wide events in a single place. In my understanding, these "k8s-style" events would be a good fit for the cluster log. |

blocked by
We're getting repeated complaints that cephadm is not transparent. We still don't have enough visibility into what cephadm is actually doing.
Adding progress events is going to be complicated, as cephadm contains a declarative state and a loop that tries to make the reality match the configured state. Adding progress events requires some non-trivial bookkeeping, which I'm not willing to do by myself right now.
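For readers unfamiliar with the pattern, here is a minimal sketch of one pass of such a reconciliation loop. It is an illustration only, not cephadm's actual loop: `reconcile_once`, the `deploy` callback, and the `(subject, level, message)` tuple layout are all hypothetical. It shows that recording the outcome of each attempt needs no state carried across passes, unlike progress tracking.

```python
from typing import Callable, List, Set, Tuple

# (subject, level, message); this layout is an assumption for the sketch
EventTuple = Tuple[str, str, str]


def reconcile_once(desired: Set[str], running: Set[str],
                   deploy: Callable[[str], None]) -> List[EventTuple]:
    """Try to start every desired-but-missing daemon and report outcomes."""
    events: List[EventTuple] = []
    for name in sorted(desired - running):
        try:
            deploy(name)  # hypothetical callback that may raise on failure
        except Exception as e:
            events.append((name, "ERROR", f"deploy failed: {e}"))
        else:
            running.add(name)
            events.append((name, "INFO", "deployed"))
    return events
```

A failed daemon stays out of `running`, so the next pass retries it and records a fresh event.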
Well, how does Kubernetes solve this problem? https://kubernetes.io/docs/tasks/debug-application-cluster/
Does this also work for us? Let's give it a try:
There are obviously many open questions. Some of them:
serve()
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard backend
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox