Skip to content

mgr/cephadm: Provide an integrated configuration validation feature#39541

Merged
sebastian-philipp merged 22 commits intoceph:masterfrom
pcuzner:config-checker
Mar 10, 2021
Merged

mgr/cephadm: Provide an integrated configuration validation feature#39541
sebastian-philipp merged 22 commits intoceph:masterfrom
pcuzner:config-checker

Conversation

@pcuzner
Copy link
Contributor

@pcuzner pcuzner commented Feb 18, 2021

This PR provides a new feature to cephadm, enabling it to actively look for configuration anomalies within the configuration of the clusters hosts and daemons. This initial implementation provides 8 checks that are executed during each 'reconcile' with any issues resulting in ceph health checks. Each check can be independently disabled via new commands that have been added:

ceph cephadm config-check status
ceph cephadm config-check ls
ceph cephadm config-check enable <check_name>
ceph cephadm config-check disable <check_name>

Signed-off-by: Paul Cuzner pcuzner@redhat.com

@pcuzner pcuzner requested a review from a team as a code owner February 18, 2021 03:40
@pcuzner
Copy link
Contributor Author

pcuzner commented Feb 18, 2021

CLI interaction example
Peek 2021-02-18 16-48

self.mgr.health_checks.pop('CEPHADM_CHECK_KERNEL_LSM', None)

def _check_subscription(self) -> None:
if len(self.subscribed['yes']) > 0 and len(self.subscribed['no']) > 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need also to take into account the "unknown" subscription hosts....
what if we have a couple of them "subscribed" and one in "unknown" state.?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that sounds like another check. If we have a subscribed state of yes/no that's red hat - so if we have subsciption state of unknown as well we have another OS in the cluster, which would be a consistency issue

Agree?

Copy link
Member

@jmolmo jmolmo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice feature!!.
Apart of the subscription state issue iIthink that everything is ok. Please add the documentation for the new feature and commands.

Copy link
Contributor

@sebastian-philipp sebastian-philipp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome work. Just some nits

Comment on lines +253 to +264
defaults = {check.name: 'enabled' for check in self.health_checks}
self.mgr.set_store('config_checks', json.dumps(defaults))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means we can't change the the default enabled flag flag for checks in future versions. Maybe we can avoid writing the defaults into the store.

I wen through this specific pain often enough in the past.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you like to see each check with a default state - and the apply that? Is that what you're asking?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we write default configuration values into the store, we no longer can change those default is future versions for existing clusters. I'd like to be able to disable a check by default. I mean, I don't plan to disable any specific check, but not being able to do so might be problematic in future versions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK...but the checks can be turned off and on individually - so we have to persist the state of the check. Checks that we add in later releases could come in with a different default. Given that doesn't my suggestion to have the default state in the code give you what you want? New checks could then be disabled - but existing checks will adhere to what the admin has set.

Am I missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK...but the checks can be turned off and on individually - so we have to persist the state of the check.

Right.

Checks that we add in later releases could come in with a different default.

Right.

New checks could then be disabled - but existing checks will adhere to what the admin has set.

Right.

Am I missing something?

There are indeed some minor downsides when writing program default values to the user's config.

  1. We have information loss. We can longer distinguish between a value set by the users and the program. That might become relevant if we need to change the architecture of how we persistently store things. I went through this pain already.
  2. Changes like c896292#diff-4f2fb7d330e74b64ac41457b7c7a723cd78db86433e0b0c398874531e5a7e39eR258 are getting much harder, as we also would need to cope with the persistent user configuration.
  3. Imagine not being able to enable a check afterwards 54ac36e
  4. Imagine having to update the persistent user config in order to change the default for existing checks d1ad1a9
  5. Right now, you need to care about keeping they keys of disabled checks in sync with the checks that are implemented. You no longer need to do that if you don't write the program defaults into the user's config.
  6. We'd like to be able to support downgrade to previous patch releases. Removing keys for non-existent checks like https://github.com/ceph/ceph/pull/39541/files#diff-0c0d9742d542b6fac8c95276ccb6c24bc3099f006f5d11d17649d8eadda4a0b1R234 becomes a bug then.

@liewegas liewegas changed the title mgr/cephadm:Provide an integrated configuration validation feature mgr/cephadm: Provide an integrated configuration validation feature Feb 18, 2021
@pcuzner
Copy link
Contributor Author

pcuzner commented Feb 19, 2021

Once ready, I'll squash to reduce the commits for backport

@pcuzner
Copy link
Contributor Author

pcuzner commented Feb 19, 2021

Docs...Looking at this from the frontend, I was think about the following changes;
updating the feature page https://docs.ceph.com/en/latest/dev/cephadm/compliance-check/
updating https://docs.ceph.com/en/latest/cephadm/operations/#health-checks

Also, it appears that even though we have a TOC entry for cephadm CLI, when you click it you get to the orch cli..so where are the ceph cephadm commands documented?

@sebastian-philipp @jmolmo

@sebastian-philipp
Copy link
Contributor

Docs...Looking at this from the frontend, I was think about the following changes;
updating the feature page https://docs.ceph.com/en/latest/dev/cephadm/compliance-check/
updating https://docs.ceph.com/en/latest/cephadm/operations/#health-checks

Also, it appears that even though we have a TOC entry for cephadm CLI, when you click it you get to the orch cli..so where are the ceph cephadm commands documented?

@sebastian-philipp @jmolmo

Please have a look at #39551 This PR tries to improve this.

pcuzner added 15 commits March 2, 2021 11:57
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Adds config_checks_enabled (bool) option

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Add unit tests to test suite to verify functionality. The unit tests use
a sample host definition and scale that to simulate a cluster to run
the tests against

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Multiple updates to ensure
- mgr health checks are raised correctly
- checks are independent and may be disabled

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Upgrades may change the config checks, so the tests now validate
that new checks and old bogus checks are handled correctly

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Initial implementation used the Serve class as the owner of the
configuration checker. This patch moves the checker up to the
cephadm module itself, to make the CLI command logic cleaner

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Some changes needed to support the introduction of the CLI commands
used to manage the cephadm checks. For example, the main Cephadm
check class now interacts with the keystore directly to determine
status, and provides support for commands like ls to list the
check definitions. In addition the main class now handles existing
configuration checks and ensure that the stored state in the keystore
matches the checks defined by the module

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Patch to add CLI commands to show and manage the state of the
configuration checker feature

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Changes to reflect review comments
- picked up on subscribed = unknown state
- using get_daemon_types() call
- use log.exception more
- changed logic and errors from the public_network check

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
A check for _ceph_get_server was included for unit testing, but
the tests have been updated to make this obsolete.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Since switching how the roles for a host are determining the
type hint was missed..this patch addresses that

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
pcuzner added 7 commits March 2, 2021 11:59
The logic was issuing a healthcheck if the linkspeed was different
to the majority. But if the difference is good (i.e. better!) we should
not be raising a healthcheck

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
checks that we're not raising a healthcheck for a host if it's
nic speed it better than the rest!

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
The healthcheck could already be active when the admin attempts
to disable it. This patch removes the related healthcheck if it's set
during a config-check disable request.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Build of ceph metadata needed addition type hints.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
This patch updates the docs to describe the config-check feature,
describing how these checks can be enabled and managed.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
@pcuzner
Copy link
Contributor Author

pcuzner commented Mar 2, 2021

@sebastian-philipp docs added

@sebastian-philipp sebastian-philipp added the wip-swagner-testing My Teuthology tests label Mar 5, 2021
@sebastian-philipp
Copy link
Contributor

@sebastian-philipp sebastian-philipp removed the wip-swagner-testing My Teuthology tests label Mar 10, 2021
@sebastian-philipp sebastian-philipp merged commit 744edab into ceph:master Mar 10, 2021
@pcuzner pcuzner deleted the config-checker branch March 11, 2021 00:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants