mgr/cephadm: Provide an integrated configuration validation feature#39541
mgr/cephadm: Provide an integrated configuration validation feature#39541sebastian-philipp merged 22 commits intoceph:masterfrom pcuzner:config-checker
Conversation
| self.mgr.health_checks.pop('CEPHADM_CHECK_KERNEL_LSM', None) | ||
|
|
||
| def _check_subscription(self) -> None: | ||
| if len(self.subscribed['yes']) > 0 and len(self.subscribed['no']) > 0: |
There was a problem hiding this comment.
We need also to take into account the "unknown" subscription hosts....
what if we have a couple of them "subscribed" and one in "unknown" state.?
There was a problem hiding this comment.
that sounds like another check. If we have a subscribed state of yes/no that's red hat - so if we have subsciption state of unknown as well we have another OS in the cluster, which would be a consistency issue
Agree?
jmolmo
left a comment
There was a problem hiding this comment.
Nice feature!!.
Apart of the subscription state issue iIthink that everything is ok. Please add the documentation for the new feature and commands.
sebastian-philipp
left a comment
There was a problem hiding this comment.
awesome work. Just some nits
| defaults = {check.name: 'enabled' for check in self.health_checks} | ||
| self.mgr.set_store('config_checks', json.dumps(defaults)) |
There was a problem hiding this comment.
This means we can't change the the default enabled flag flag for checks in future versions. Maybe we can avoid writing the defaults into the store.
I wen through this specific pain often enough in the past.
There was a problem hiding this comment.
would you like to see each check with a default state - and the apply that? Is that what you're asking?
There was a problem hiding this comment.
If we write default configuration values into the store, we no longer can change those default is future versions for existing clusters. I'd like to be able to disable a check by default. I mean, I don't plan to disable any specific check, but not being able to do so might be problematic in future versions.
There was a problem hiding this comment.
OK...but the checks can be turned off and on individually - so we have to persist the state of the check. Checks that we add in later releases could come in with a different default. Given that doesn't my suggestion to have the default state in the code give you what you want? New checks could then be disabled - but existing checks will adhere to what the admin has set.
Am I missing something?
There was a problem hiding this comment.
OK...but the checks can be turned off and on individually - so we have to persist the state of the check.
Right.
Checks that we add in later releases could come in with a different default.
Right.
New checks could then be disabled - but existing checks will adhere to what the admin has set.
Right.
Am I missing something?
There are indeed some minor downsides when writing program default values to the user's config.
- We have information loss. We can longer distinguish between a value set by the users and the program. That might become relevant if we need to change the architecture of how we persistently store things. I went through this pain already.
- Changes like c896292#diff-4f2fb7d330e74b64ac41457b7c7a723cd78db86433e0b0c398874531e5a7e39eR258 are getting much harder, as we also would need to cope with the persistent user configuration.
- Imagine not being able to enable a check afterwards 54ac36e
- Imagine having to update the persistent user config in order to change the default for existing checks d1ad1a9
- Right now, you need to care about keeping they keys of disabled checks in sync with the checks that are implemented. You no longer need to do that if you don't write the program defaults into the user's config.
- We'd like to be able to support downgrade to previous patch releases. Removing keys for non-existent checks like https://github.com/ceph/ceph/pull/39541/files#diff-0c0d9742d542b6fac8c95276ccb6c24bc3099f006f5d11d17649d8eadda4a0b1R234 becomes a bug then.
|
Once ready, I'll squash to reduce the commits for backport |
|
Docs...Looking at this from the frontend, I was think about the following changes; Also, it appears that even though we have a TOC entry for cephadm CLI, when you click it you get to the orch cli..so where are the ceph cephadm commands documented? |
Please have a look at #39551 This PR tries to improve this. |
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Adds config_checks_enabled (bool) option Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Add unit tests to test suite to verify functionality. The unit tests use a sample host definition and scale that to simulate a cluster to run the tests against Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Multiple updates to ensure - mgr health checks are raised correctly - checks are independent and may be disabled Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Upgrades may change the config checks, so the tests now validate that new checks and old bogus checks are handled correctly Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Initial implementation used the Serve class as the owner of the configuration checker. This patch moves the checker up to the cephadm module itself, to make the CLI command logic cleaner Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Some changes needed to support the introduction of the CLI commands used to manage the cephadm checks. For example, the main Cephadm check class now interacts with the keystore directly to determine status, and provides support for commands like ls to list the check definitions. In addition the main class now handles existing configuration checks and ensure that the stored state in the keystore matches the checks defined by the module Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Patch to add CLI commands to show and manage the state of the configuration checker feature Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Changes to reflect review comments - picked up on subscribed = unknown state - using get_daemon_types() call - use log.exception more - changed logic and errors from the public_network check Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
A check for _ceph_get_server was included for unit testing, but the tests have been updated to make this obsolete. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Since switching how the roles for a host are determining the type hint was missed..this patch addresses that Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
The logic was issuing a healthcheck if the linkspeed was different to the majority. But if the difference is good (i.e. better!) we should not be raising a healthcheck Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
checks that we're not raising a healthcheck for a host if it's nic speed it better than the rest! Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
The healthcheck could already be active when the admin attempts to disable it. This patch removes the related healthcheck if it's set during a config-check disable request. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Build of ceph metadata needed addition type hints. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
This patch updates the docs to describe the config-check feature, describing how these checks can be enabled and managed. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
|
@sebastian-philipp docs added |

This PR provides a new feature to cephadm, enabling it to actively look for configuration anomalies within the configuration of the clusters hosts and daemons. This initial implementation provides 8 checks that are executed during each 'reconcile' with any issues resulting in ceph health checks. Each check can be independently disabled via new commands that have been added:
Signed-off-by: Paul Cuzner pcuzner@redhat.com