mgr/cephadm: Provide an integrated configuration validation feature by pcuzner · Pull Request #39541 · ceph/ceph

pcuzner · 2021-02-18T03:40:54Z

This PR adds a new feature to cephadm, enabling it to actively look for anomalies in the configuration of the cluster's hosts and daemons. This initial implementation provides 8 checks that are executed during each 'reconcile', with any issues surfacing as ceph health checks. Each check can be independently disabled via the new commands that have been added:

ceph cephadm config-check status
ceph cephadm config-check ls
ceph cephadm config-check enable <check_name>
ceph cephadm config-check disable <check_name>

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

pcuzner · 2021-02-18T03:50:57Z

CLI interaction example

src/pybind/mgr/cephadm/configchecks.py

jmolmo · 2021-02-18T10:12:36Z

src/pybind/mgr/cephadm/configchecks.py

+            self.mgr.health_checks.pop('CEPHADM_CHECK_KERNEL_LSM', None)
+
+    def _check_subscription(self) -> None:
+        if len(self.subscribed['yes']) > 0 and len(self.subscribed['no']) > 0:


We also need to take the hosts with an "unknown" subscription state into account. What if a couple of them are "subscribed" and one is in the "unknown" state?

That sounds like another check. If we have a subscribed state of yes/no, that's Red Hat. So if we also have a subscription state of unknown, we have another OS in the cluster, which would be a consistency issue.

Agree?
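To make the question concrete, here is a minimal sketch of one way the "unknown" state could be folded into the subscription check. It is not the code from this PR: the function name `find_subscription_mismatch` and the rule "flag every host that is not subscribed once at least one host is" are assumptions for illustration. Only the `subscribed` dict with `'yes'` and `'no'` keys comes from the diff above; the `'unknown'` key is taken from the review comment.

```python
from typing import Dict, List


def find_subscription_mismatch(subscribed: Dict[str, List[str]]) -> List[str]:
    """Return the hosts whose subscription state differs from subscribed hosts.

    Hypothetical helper: if no host is subscribed there is nothing to be
    inconsistent with, so an empty list is returned.
    """
    yes = subscribed.get('yes', [])
    no = subscribed.get('no', [])
    unknown = subscribed.get('unknown', [])

    if not yes:
        return []

    # Assumption: both 'no' and 'unknown' hosts count as inconsistent
    # once at least one host in the cluster is subscribed.
    return sorted(no + unknown)
```

With this rule, a cluster of two subscribed hosts and one "unknown" host reports the unknown host, which is the case raised in the review.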

src/pybind/mgr/cephadm/configchecks.py

jmolmo

Nice feature!
Apart from the subscription state issue, I think everything is OK. Please add documentation for the new feature and commands.

sebastian-philipp

awesome work. Just some nits

src/pybind/mgr/cephadm/configchecks.py

sebastian-philipp · 2021-02-18T09:30:29Z

src/pybind/mgr/cephadm/configchecks.py

+        defaults = {check.name: 'enabled' for check in self.health_checks}
+        self.mgr.set_store('config_checks', json.dumps(defaults))


This means we can't change the default enabled flag for checks in future versions. Maybe we can avoid writing the defaults into the store.

I went through this specific pain often enough in the past.

Would you like to see each check with a default state, and then apply that? Is that what you're asking?

If we write default configuration values into the store, we can no longer change those defaults in future versions for existing clusters. I'd like to be able to disable a check by default. I mean, I don't plan to disable any specific check, but not being able to do so might be problematic in future versions.

OK...but the checks can be turned off and on individually, so we have to persist the state of the check. Checks that we add in later releases could come in with a different default. Given that, doesn't my suggestion to have the default state in the code give you what you want? New checks could then be disabled, but existing checks will adhere to what the admin has set.

Am I missing something?

> OK...but the checks can be turned off and on individually - so we have to persist the state of the check.

Right.

> Checks that we add in later releases could come in with a different default.

Right.

> New checks could then be disabled - but existing checks will adhere to what the admin has set.

Right.

> Am I missing something?

There are indeed some minor downsides when writing program default values to the user's config.

We have information loss. We can no longer distinguish between a value set by the user and one set by the program. That might become relevant if we need to change the architecture of how we persistently store things. I went through this pain already.

Changes like c896292#diff-4f2fb7d330e74b64ac41457b7c7a723cd78db86433e0b0c398874531e5a7e39eR258 are getting much harder, as we also would need to cope with the persistent user configuration.

Imagine not being able to enable a check afterwards 54ac36e

Imagine having to update the persistent user config in order to change the default for existing checks d1ad1a9

Right now, you need to take care to keep the keys of disabled checks in sync with the checks that are implemented. You no longer need to do that if you don't write the program defaults into the user's config.

We'd like to be able to support downgrade to previous patch releases. Removing keys for non-existent checks like https://github.com/ceph/ceph/pull/39541/files#diff-0c0d9742d542b6fac8c95276ccb6c24bc3099f006f5d11d17649d8eadda4a0b1R234 becomes a bug then.
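The alternative argued for in this thread can be sketched as follows: keep the defaults in code and persist only what the admin explicitly changed. This is an illustration under stated assumptions, not the implementation that was merged. The check names `check_a` and `check_b`, the `CHECK_DEFAULTS` table, and both helper functions are hypothetical; the only thing borrowed from the PR is that the state is stored as a JSON string.

```python
import json
from typing import Dict, Optional

# Hypothetical in-code defaults; a later release can change these freely
# because they are never written to the store.
CHECK_DEFAULTS: Dict[str, bool] = {
    'check_a': True,
    'check_b': False,
}


def effective_state(stored_json: Optional[str]) -> Dict[str, bool]:
    """Merge admin overrides from the store over the in-code defaults."""
    overrides = json.loads(stored_json) if stored_json else {}
    # Keys for checks this version does not know are ignored, not deleted,
    # so a downgrade or upgrade never has to rewrite the stored value.
    return {name: overrides.get(name, default)
            for name, default in CHECK_DEFAULTS.items()}


def set_override(stored_json: Optional[str], name: str, enabled: bool) -> str:
    """Record an explicit admin choice and return the new JSON to persist."""
    if name not in CHECK_DEFAULTS:
        raise KeyError(f"unknown check: {name}")
    overrides = json.loads(stored_json) if stored_json else {}
    overrides[name] = enabled
    return json.dumps(overrides)
```

Because the store only ever contains explicit choices, a value found there is known to come from the admin, which addresses the information-loss point above.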

src/pybind/mgr/cephadm/configchecks.py

src/pybind/mgr/cephadm/tests/test_configchecks.py

pcuzner · 2021-02-19T02:12:04Z

Once ready, I'll squash to reduce the commits for backport

pcuzner · 2021-02-19T02:53:52Z

Docs: looking at this from the frontend, I was thinking about the following changes:
updating the feature page https://docs.ceph.com/en/latest/dev/cephadm/compliance-check/
updating https://docs.ceph.com/en/latest/cephadm/operations/#health-checks

Also, it appears that even though we have a TOC entry for the cephadm CLI, when you click it you land on the orch CLI, so where are the ceph cephadm commands documented?

@sebastian-philipp @jmolmo

sebastian-philipp · 2021-02-19T12:59:45Z

Please have a look at #39551. That PR tries to improve this.

Initial implementation used the Serve class as the owner of the configuration checker. This patch moves the checker up to the cephadm module itself, to make the CLI command logic cleaner.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

Some changes needed to support the introduction of the CLI commands used to manage the cephadm checks. For example, the main Cephadm check class now interacts with the keystore directly to determine status, and provides support for commands like ls to list the check definitions. In addition, the main class now handles existing configuration checks and ensures that the stored state in the keystore matches the checks defined by the module.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

Changes to reflect review comments:
- picked up on subscribed = unknown state
- using get_daemon_types() call
- use log.exception more
- changed logic and errors from the public_network check

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

pcuzner · 2021-03-02T04:55:10Z

@sebastian-philipp docs added

sebastian-philipp · 2021-03-10T11:13:00Z

https://pulpito.ceph.com/swagner-2021-03-09_13:58:13-rados:cephadm-wip-swagner-testing-2021-03-09-1014-distro-basic-smithi/

https://tracker.ceph.com/issues/49633
dashboard e2e only runs on centos

ack

pcuzner requested a review from a team as a code owner February 18, 2021 03:40

github-actions bot added cephadm pybind labels Feb 18, 2021

jmolmo reviewed Feb 18, 2021

jmolmo reviewed Feb 18, 2021

jmolmo requested changes Feb 18, 2021

sebastian-philipp previously requested changes Feb 18, 2021

liewegas changed the title ~~mgr/cephadm:Provide an integrated configuration validation feature~~ mgr/cephadm: Provide an integrated configuration validation feature Feb 18, 2021

pcuzner requested review from jmolmo and sebastian-philipp February 19, 2021 02:12

pcuzner added 15 commits March 2, 2021 11:57

mgr/cephadm: resolve rebase conflicts

80a9d71

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

mgr/cephadm: adding check logic

5a555ba

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

mgr/cephadm: added config checker to main serve loop

3ea8eaf

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

mgr/cephadm:added ceph version consistency check

bcd6acf

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

mgr/cephadm:add module option to enable configuration checks

fae0adf

Adds config_checks_enabled (bool) option.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

mgr/cephadm:Adds unit tests for the CephadmConfigChecks class

4b1b136

Add unit tests to test suite to verify functionality. The unit tests use a sample host definition and scale that to simulate a cluster to run the tests against.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

mgr/cephadm:Updates to CephadmConfigChecks class

13713b5

Multiple updates to ensure:
- mgr health checks are raised correctly
- checks are independent and may be disabled

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

mgr/cephadm:Unit tests updated to account for upgrades

bf02fb9

Upgrades may change the config checks, so the tests now validate that new checks and old bogus checks are handled correctly.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

mgr/cephadm:Added CLI interface for the configuration checker

0668ffb

Patch to add CLI commands to show and manage the state of the configuration checker feature.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

mgr/cephadm:Add unit test for hosts without public network NIC

d271987

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

mgr/cephadm:Remove check from ceph metadata gathering

b2f0b6f

A check for _ceph_get_server was included for unit testing, but the tests have been updated to make this obsolete.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

mgr/cephadm:fix mypy warning

83f5312

Since switching how the roles for a host are determined, the type hint was missed. This patch addresses that.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

pcuzner added 7 commits March 2, 2021 11:59

mgr/cephadm:skip an alert if the linkspeed is better than most

74a3599

The logic was issuing a healthcheck if the linkspeed was different to the majority. But if the difference is good (i.e. better!) we should not be raising a healthcheck.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
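The rule described in this commit message can be sketched as below. This is a simplified illustration, not the code from the commit: the function name `hosts_slower_than_majority` and the flat `host -> speed` mapping are assumptions, and the real check may gather link speeds per network interface rather than per host.

```python
from collections import Counter
from typing import Dict, List


def hosts_slower_than_majority(speeds: Dict[str, int]) -> List[str]:
    """Return hosts whose link speed is below the most common speed.

    Hypothetical helper: hosts that are faster than the majority are
    deliberately not reported, since a better link is not a problem.
    """
    if not speeds:
        return []

    # The most common speed across all hosts is treated as the baseline.
    majority_speed, _count = Counter(speeds.values()).most_common(1)[0]

    return sorted(host for host, speed in speeds.items()
                  if speed < majority_speed)
```

A plain "differs from the majority" comparison (`!=`) would also flag a host with a faster NIC; using `<` is the change the commit message describes.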

mgr/cephadm:unit test added for nics better than most

0d7ecb1

Checks that we're not raising a healthcheck for a host if its NIC speed is better than the rest!

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

mgr/cephadm:Added helper function to return a specific healthcheck

baa2c93

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

mgr/cephadm:Drop active healthcheck during a disable request

f3cf41f

The healthcheck could already be active when the admin attempts to disable it. This patch removes the related healthcheck if it's set during a config-check disable request.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
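A minimal sketch of that behaviour, assuming the check states and the active health checks are both plain dicts. The function `disable_check` and the check name `kernel_security` are hypothetical; the `pop(..., None)` pattern and the `CEPHADM_CHECK_KERNEL_LSM` key mirror the diff quoted earlier in this conversation, and the `'enabled'` string matches the stored defaults shown there.

```python
from typing import Dict


def disable_check(states: Dict[str, str],
                  active_health_checks: Dict[str, dict],
                  check_name: str,
                  healthcheck_name: str) -> None:
    """Mark a check as disabled and clear any health check it already raised."""
    if check_name not in states:
        raise KeyError(f"unknown check: {check_name}")

    states[check_name] = 'disabled'

    # pop() with a default is a no-op when the health check is not active,
    # so the call is safe whether or not the alert was raised.
    active_health_checks.pop(healthcheck_name, None)
```

Without the `pop`, the alert would stay visible until something else cleared it, because a disabled check no longer runs and so never gets the chance to retract its own health check.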

mgr/cephadm:add unit test for the lookup_check helper

c29e6ac

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

mgr/cephadm:fix to resolve mypy issue

1407488

Build of ceph metadata needed additional type hints.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

mgr/cephadm:Document the cephadm config-check feature

e407c63

This patch updates the docs to describe the config-check feature, describing how these checks can be enabled and managed.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

github-actions bot added the documentation label Mar 2, 2021

sebastian-philipp added the wip-swagner-testing My Teuthology tests label Mar 5, 2021

sebastian-philipp removed the wip-swagner-testing My Teuthology tests label Mar 10, 2021

jmolmo approved these changes Mar 10, 2021

sebastian-philipp merged commit 744edab into ceph:master Mar 10, 2021

pcuzner deleted the config-checker branch March 11, 2021 00:35

liewegas mentioned this pull request Mar 15, 2021

pacific: cephadm: Batch backport March (2) #40135

Merged
