cephadm: add gather_facts command by pcuzner · Pull Request #36509 · ceph/ceph

pcuzner · 2020-08-07T04:31:36Z

The gather_facts command is intended to provide host-level metadata to the caller, which could then be used
in a number of places: the orchestrator inventory, the UI, etc.

Data is extracted mainly from sysfs and /proc.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

sebastian-philipp · 2020-08-07T08:24:31Z

src/cephadm/cephadm

+    def _get_block_devs(self):
+        # type: () -> List[str]
+        return [dev for dev in os.listdir('/sys/block')
+                if not dev.startswith('dm')]
+
+    def _get_devs_by_type(self, rota='0'):
+        # type: (str) -> List[str]
+        devs = list()
+        for blk_dev in self._get_block_devs():
+            rot_path = '/sys/block/{}/queue/rotational'.format(blk_dev)
+            rot_value = rfiles([rot_path])
+            if rot_value == rota:
+                devs.append(blk_dev)
+        return devs
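
The approach in the snippet above can be sketched in a standalone form. This is only an illustration, not the PR's code: the function name `classify_block_devs` and the `sys_block` parameter are made up here so the logic can be pointed at a test directory. It strips the value read from sysfs, because the kernel terminates the `rotational` attribute with a newline and a raw comparison against '0' or '1' would otherwise never match (whether the PR's own read helper strips is not shown in this thread).

```python
import os
from typing import Dict, List


def classify_block_devs(sys_block: str = '/sys/block') -> Dict[str, List[str]]:
    """Split block devices into 'hdd' and 'flash' using queue/rotational."""
    result: Dict[str, List[str]] = {'hdd': [], 'flash': []}
    for dev in sorted(os.listdir(sys_block)):
        if dev.startswith('dm'):
            continue  # skip device-mapper devices, as the snippet above does
        rot_path = os.path.join(sys_block, dev, 'queue', 'rotational')
        try:
            with open(rot_path) as f:
                rota = f.read().strip()  # drop the trailing newline from sysfs
        except OSError:
            continue  # no rotational attribute for this entry
        result['hdd' if rota == '1' else 'flash'].append(dev)
    return result
```

Calling `classify_block_devs()` with no arguments reads the live `/sys/block` tree; passing a directory makes the function testable without real hardware.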


I see some overlap with ceph-volume inventory here

Yep - I get a list of the devices but I don't inspect them - and the goal I have is to be as quick as possible. On my physical test machines ceph-volume can take a couple of seconds, whereas what I've done takes way less.

Hm. Hmm. But if you want to do anything with the devices (device ls, create OSDs, etc) you need the c-v inventory anyway. Why bother with this then here?

I'm just trying to return the most complete picture of the host as quickly as possible - that's all. I'm thinking about things like host metadata in the UI, or a "ceph orch host info" command - both give the admin what they need to understand the host config quickly and easily. Ceph-volume remains the ultimate source of truth for disks - in future we could think about running gather-facts more often in order to auto-refresh the ceph-volume inventory when disk adds/removes have been detected.

sebastian-philipp · 2020-08-07T08:30:49Z

src/cephadm/cephadm

+    @property
+    def hdd_count(self):
+        # type: () -> int
+        return len(self._get_devs_by_type(rota='1'))


cephadm already has access to ceph-volume inventory. Why do you need this here as well?

ceph-volume takes a few seconds to gather its stuff - I'm trying to not have dependencies and gather the critical data as quickly as possible

Imo it's inadvisable to have two different sources for the same kind of data. If users base their assumptions on this source, we're potentially running into problems.

I think the intent here is not to duplicate c-v inventory, but something like fdisk -l to get an overview of the physical server. Especially also for servers that are not intended for OSD deployments, such as gateways or mons.

Minor changes to address issues raised in the PR:
- formatting of ipv6 addresses
- missing docstrings
- NIC mtu and speed now int instead of string
- added NIC driver name
- removed discrete JSON method making gather-facts JSON only
- added upper/lower device lists to show NIC relationships
- added hostname to the JSON!
- added selinux/apparmor status
- added timestamp (epoch) for the gather-facts run
- added system uptime (secs)

Closes:
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

mykaul · 2020-08-10T06:37:31Z

General question - why not just use the Prometheus node exporter which probably has all that we need already?

Devp00l

Works nicely and super fast 👍

Devp00l · 2020-08-10T14:55:30Z

General question - why not just use the Prometheus node exporter which probably has all that we need already?

As I understand it this command can be run before a cephadm cluster is bootstrapped, this way you can gather the information before having prometheus running.

There is a lot of potential for the gather-facts command inside the dashboard.

Added free/available memory state, and cpu loadavg info Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
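
As a rough illustration of how memory state and load averages can be read from /proc, here is a minimal sketch. The function names `parse_meminfo` and `parse_loadavg` are hypothetical and are not taken from the patch; they take the file contents as a string so they can be exercised without touching the real /proc files.

```python
from typing import Dict, Tuple


def parse_meminfo(text: str) -> Dict[str, int]:
    """Map /proc/meminfo field names to their numeric value (kB for most fields)."""
    info: Dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(':')
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def parse_loadavg(text: str) -> Tuple[float, float, float]:
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg content."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)
```

On a live host the inputs would come from `open('/proc/meminfo').read()` and `open('/proc/loadavg').read()`; fields such as `MemFree` and `MemAvailable` are the ones relevant to a free/available memory report.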

pcuzner · 2020-08-10T22:07:23Z

General question - why not just use the Prometheus node exporter which probably has all that we need already?

@mykaul it is a good question. The goal of gather facts is to provide a holistic view of the host - this covers areas that node_exporter doesn't touch, since the focus of node_exporter is on metrics/performance not metadata. Also, remember that prometheus is an optional component to Ceph - recommended, but nevertheless optional.

The data that gather_facts yields covers things like the machine hw model, bios version, bios date, the disk types and capacities, nic speeds, os version, subscription state, kernel version, linux kernel security module state as well as NIC parent/child relationships - all of this is metadata, not metrics.

Having metadata like this can underpin pre-flight checks, scheduling actions etc. It could also enhance the UI to provide more information to the Admin about the cluster, and reduce their need for dropping to the CLI or using other toolchains.
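
To make the "metadata from sysfs" idea concrete, here is a small sketch that reads a few DMI attributes. It is an assumption-laden illustration rather than the PR's implementation: `read_dmi`, `DMI_FIELDS`, and the `dmi_dir` parameter are names invented for this example, and the 'Unknown' fallback simply mirrors the placeholder seen in the gather-facts output elsewhere in this thread.

```python
import os
from typing import Dict

# Attribute files the kernel exposes under /sys/class/dmi/id on most x86 hosts.
DMI_FIELDS = ('sys_vendor', 'product_name', 'bios_version', 'bios_date')


def read_dmi(dmi_dir: str = '/sys/class/dmi/id') -> Dict[str, str]:
    """Return selected DMI attributes, using 'Unknown' when one is unavailable."""
    facts: Dict[str, str] = {}
    for name in DMI_FIELDS:
        try:
            with open(os.path.join(dmi_dir, name)) as f:
                facts[name] = f.read().strip() or 'Unknown'
        except OSError:
            facts[name] = 'Unknown'  # missing or unreadable attribute
    return facts
```

Because each value is a plain file read, a scan like this needs no external tools and no running monitoring stack, which is the property the comment above relies on.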

pcuzner · 2020-08-11T03:21:01Z

I've mocked up a UI that takes the data that gather-facts extracts, for anyone who's interested

sebastian-philipp · 2020-08-11T12:24:26Z

@ceph/dashboard ^^

Adds a gather-facts invocation after check-host is run. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

sebastian-philipp · 2020-08-17T07:48:00Z

src/cephadm/cephadm

+def get_ipv6_address(ifname):
+    # type: (str) -> str
+    if not os.path.exists('/proc/net/if_inet6'):
+        return ''
+
+    raw = read_file(['/proc/net/if_inet6'])
+    data = raw.splitlines()
+    # based on docs @ https://www.tldp.org/HOWTO/Linux+IPv6-HOWTO/ch11s04.html
+    # field 0 is ipv6, field 2 is the prefix length (hex)
+    for iface_setting in data:
+        field = iface_setting.split()
+        if field[-1] == ifname:
+            ipv6_raw = field[0]
+            ipv6_fmtd = ":".join([ipv6_raw[_p:_p+4] for _p in range(0, len(field[0]),4)])
+            # apply naming rules using ipaddress module
+            ipv6 = ipaddress.ip_address(ipv6_fmtd)
+            return "{}/{}".format(str(ipv6), int('0x{}'.format(field[2]), 16))
+    return ''
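
The parsing step in the function above can be tried in isolation. The sketch below is a simplified restatement, not the PR's code: `parse_if_inet6_line` is a hypothetical helper that handles one line of /proc/net/if_inet6, where the first field is the address as 32 hex digits, the third field is the prefix length in hex, and the last field is the interface name.

```python
import ipaddress
from typing import Tuple


def parse_if_inet6_line(line: str) -> Tuple[str, str]:
    """Return (interface name, 'address/prefix') for one /proc/net/if_inet6 line."""
    fields = line.split()
    raw_addr, prefix_hex, ifname = fields[0], fields[2], fields[-1]
    # Re-insert the colons between each group of four hex digits.
    groups = [raw_addr[i:i + 4] for i in range(0, len(raw_addr), 4)]
    # The ipaddress module applies the standard compressed notation.
    addr = ipaddress.IPv6Address(':'.join(groups))
    return ifname, '{}/{}'.format(addr, int(prefix_hex, 16))
```

For example, the line `fe800000000000000a0027fffe123456 02 40 20 80 eth0` yields `('eth0', 'fe80::a00:27ff:fe12:3456/64')`.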


cc @matthewoliver

sebastian-philipp · 2020-08-17T08:25:02Z

src/cephadm/cephadm

+            vendor = read_file(['/sys/block/{}/device/vendor'.format(dev)]).strip()
+            disk_vendor = HostFacts._disk_vendor_workarounds.get(vendor, vendor)
+            disk_size_bytes = self._get_capacity(dev)
+            disk_list.append("{} {} ({})".format(disk_vendor, disk_model, bytes_to_human(disk_size_bytes)))


I'm getting

"hdd_list": [
  "Unknown Unknown (4.2MB)",
  "Unknown Unknown (101.7MB)",
  "Unknown Unknown (58.0MB)",
  "Unknown Unknown (10.6MB)",
  "Unknown Unknown (130.9MB)",
  "Unknown Unknown (130.0MB)",
  "Unknown Unknown (57.6MB)",
  "Unknown Unknown (10.3MB)",
  "Unknown Unknown (101.3MB)",
  "Unknown Unknown (3.0MB)",
  "ATA ST2000LM007-1R81 (2.0TB)",
  "Unknown Unknown (0.0)",
  "Unknown Unknown (130.4MB)",
  "Unknown Unknown (10.6MB)",
  "Unknown Unknown (10.4MB)",
  "Unknown Unknown (4.2MB)"
],

I'm just wondering how useful that data is.

Would it make sense to expose some more information?

Would it make sense to expose this as objects?

"hdd_list": [ { "vendor": "Unknown", "model": "Unknown", "disk_size": ... } ]

This way the structure is extensible without introducing potentially backward-incompatible changes.

what on earth are those drives :)

Gather facts is supposed to be quick and easy - I'll patch to use a dict per device for extensibility and include a description field that shows what this currently does. I don't want to include objects - my aim is to keep it simple - gather-facts is supposed to be a quick scan of the machine - anything more complicated than that needs ceph-volume etc etc

fyi, the teuthology machines return something similar:

"hdd_list": [
  "Unknown Unknown (0.0)",
  "Unknown Unknown (0.0)",
  "Unknown Unknown (0.0)",
  "Unknown Unknown (0.0)",
  "Unknown Unknown (0.0)",
  "Unknown Unknown (0.0)",
  "ATA ST1000NM0033-9ZM (1.0TB)",
  "Unknown Unknown (0.0)",
  "Unknown Unknown (0.0)"
]

http://qa-proxy.ceph.com/teuthology/swagner-2020-08-18_11:50:15-rados:cephadm-wip-swagner3-testing-2020-08-18-1033-distro-basic-smithi/5356265/teuthology.log

I don't want to include objects - my aim is to keep it simple

This will make it hard for the dashboard to display vendor and model etc in a table or something. @ceph/dashboard folks: Are you ok with this?

If it's easy to achieve it would be useful to have that info.

@pcuzner Should some tests be added to protect these data structures from potential regressions?

gather-facts returns JSON - why would that make it hard for the dashboard to consume?

Latest patch breaks down the disk metadata, so it is a list of dicts - as follows:

{
  "description": "Virtio Block Device Unknown (42.9GB)",
  "dev_name": "vda",
  "disk_size_bytes": 42949672960,
  "model": "Unknown",
  "rev": "Unknown",
  "vendor": "Virtio Block Device",
  "wwid": "Unknown"
}

@alfonsomthd maybe - but it would be nice to get this in place first!
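
The dict-per-device layout discussed in this thread could be built along the following lines. This is a sketch under stated assumptions: `disk_entry` is a made-up function, and the `bytes_to_human` shown here is a stand-in written for this example (decimal units, one decimal place) that happens to reproduce the strings quoted in the thread; it is not cephadm's actual helper.

```python
from typing import Dict, Union


def bytes_to_human(num: float) -> str:
    """Hypothetical stand-in: format a byte count using decimal (1000-based) units."""
    for unit in ('', 'KB', 'MB', 'GB', 'TB'):
        if num < 1000:
            return '{:.1f}{}'.format(num, unit)
        num /= 1000
    return '{:.1f}PB'.format(num)


def disk_entry(dev_name: str, vendor: str, model: str,
               size_bytes: int) -> Dict[str, Union[str, int]]:
    """Build one per-device dict; 'description' keeps the older one-line summary."""
    return {
        'description': '{} {} ({})'.format(vendor, model, bytes_to_human(size_bytes)),
        'dev_name': dev_name,
        'disk_size_bytes': size_bytes,
        'model': model,
        'vendor': vendor,
    }
```

With separate `vendor`, `model`, and `disk_size_bytes` keys, a consumer such as the dashboard can fill table columns directly, and new keys can be added later without changing the meaning of existing ones.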

The gather-facts command returns a list of hdd and flash devices. This patch changes the list content from being simple strings representing the disks to a dict, making it easier to extend in the future. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

Since py2 is EOL, and cephadm requires py3 anyway this patch removes the py2 test iteration from the functional testing suite. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

sebastian-philipp · 2020-08-21T09:00:02Z

2020-08-21T08:50:25.059 INFO:tasks.workunit.client.0.smithi050.stderr:  File "/tmp/tmp.8nxD1U3xqe/cephadm", line 5140
2020-08-21T08:50:25.059 INFO:tasks.workunit.client.0.smithi050.stderr:    security = {**security, **summary} # type: ignore
2020-08-21T08:50:25.060 INFO:tasks.workunit.client.0.smithi050.stderr:                 ^
2020-08-21T08:50:25.060 INFO:tasks.workunit.client.0.smithi050.stderr:SyntaxError: invalid syntax

https://pulpito.ceph.com/swagner-2020-08-21_07:45:38-rados:cephadm-wip-swagner-testing-2020-08-20-1433-distro-basic-smithi/5365094/

See

https://github.com/ceph/ceph/blob/master/qa/workunits/cephadm/test_adoption.sh#L19-L48
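
For context on the failure above: the `{**a, **b}` dict-unpacking syntax was introduced in Python 3.5 (PEP 448), so an older interpreter rejects the file at parse time, before any code runs. The PR resolved this by dropping the py2 test iteration. Purely as an illustration of an alternative, a merge that parses on older interpreters could look like the sketch below; `merge_dicts` is a hypothetical helper and is not part of the patch.

```python
def merge_dicts(base, extra):
    # type: (dict, dict) -> dict
    """Return a new dict with the keys of base, overridden by those of extra."""
    merged = dict(base)   # shallow copy so the inputs are left untouched
    merged.update(extra)  # later keys win, matching {**base, **extra}
    return merged
```

The type comment follows the annotation style already used in the cephadm snippets in this thread, which also keeps the function parseable by interpreters that lack annotation syntax.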

sebastian-philipp

lgtm, minus the test failure in test_adoption.sh

pcuzner · 2020-08-23T23:25:53Z

jenkins retest this please

Since cephadm is py3 based, and py2 is EOL this patch removes the py2 test iteration from test_adoption.sh Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

sebastian-philipp · 2020-08-25T16:30:56Z

pass: https://pulpito.ceph.com/swagner-2020-08-25_14:55:02-rados:cephadm-wip-swagner2-testing-2020-08-25-1503-distro-basic-smithi/

sebastian-philipp · 2020-08-25T16:33:16Z

jenkins test make check

pcuzner added the cephadm label Aug 7, 2020

pcuzner requested a review from sebastian-philipp August 7, 2020 04:31

pcuzner requested a review from a team as a code owner August 7, 2020 04:31

sebastian-philipp suggested changes Aug 7, 2020

pcuzner requested a review from Daniel-Pivonka August 10, 2020 06:04

ricardoasmarques mentioned this pull request Aug 10, 2020

Net diagnose runner ceph/ceph-salt#314

Closed

Devp00l approved these changes Aug 10, 2020

cephadm: enhance memory and cpu data

fb72a14

Added free/available memory state, and cpu loadavg info Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

pcuzner requested a review from jmolmo August 11, 2020 03:25

LenzGr added dashboard needs-review labels Aug 11, 2020

cephadm: add gather-facts to the qa test

e24321b

Adds a gather-facts invocation after check-host is run. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

sebastian-philipp approved these changes Aug 17, 2020

sebastian-philipp added the wip-swagner3-testing label Aug 17, 2020

sebastian-philipp suggested changes Aug 17, 2020

sebastian-philipp suggested changes Aug 18, 2020

sebastian-philipp removed the wip-swagner3-testing label Aug 18, 2020

pcuzner added 2 commits August 20, 2020 11:12

cephadm: remove py2 from test stream

db5f758

Since py2 is EOL, and cephadm requires py3 anyway this patch removes the py2 test iteration from the functional testing suite. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

pcuzner mentioned this pull request Aug 20, 2020

cephadm: remove py2 from tox tests #36571

Merged

sebastian-philipp added the wip-swagner-testing My Teuthology tests label Aug 20, 2020

sebastian-philipp removed needs-review wip-swagner-testing My Teuthology tests labels Aug 21, 2020

sebastian-philipp approved these changes Aug 21, 2020

This was referenced Aug 21, 2020

cephadm: Allow users to use a custom dashboard ssl port #36716

Merged

mgr/cephadm: rgw update period after realm, zone creation #36496

Merged

cephadm: remove py2 from test_adoption

225cbd8

Since cephadm is py3 based, and py2 is EOL this patch removes the py2 test iteration from test_adoption.sh Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

sebastian-philipp added the wip-swagner2-testing label Aug 25, 2020

sebastian-philipp added ready-to-merge and removed wip-swagner2-testing labels Aug 25, 2020

sebastian-philipp merged commit f144e5e into ceph:master Aug 25, 2020

pcuzner deleted the cephadm-add-host-facts branch August 26, 2020 00:14

sebastian-philipp mentioned this pull request Sep 3, 2020

octopus: cephadm batch backport September (1) #36975

Merged
