cephadm: add gather_facts command#36509
cephadm: add gather_facts command#36509sebastian-philipp merged 7 commits intoceph:masterfrom pcuzner:cephadm-add-host-facts
Conversation
The gather_facts command is intended to provide host level metadata to the caller, which could then be used in a number of places: orchestrator inventory, displayed in the UI etc Data is extracted mainly from sysfs and /proc Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
| def _get_block_devs(self): | ||
| # type: () -> List[str] | ||
| return [dev for dev in os.listdir('/sys/block') | ||
| if not dev.startswith('dm')] | ||
|
|
||
| def _get_devs_by_type(self, rota='0'): | ||
| # type: (str) -> List[str] | ||
| devs = list() | ||
| for blk_dev in self._get_block_devs(): | ||
| rot_path = '/sys/block/{}/queue/rotational'.format(blk_dev) | ||
| rot_value = rfiles([rot_path]) | ||
| if rot_value == rota: | ||
| devs.append(blk_dev) | ||
| return devs |
There was a problem hiding this comment.
I see some overlap with ceph-volume inventory here
There was a problem hiding this comment.
Yep - I get a list of the devices but I don't inspect - and the goal I have is to be as quick as possible. On my physical test machines ceph-volume can take a couple of seconds, whereas what I've done takes way less.
There was a problem hiding this comment.
Hm. Hmm. But if you want to do anything with the devices (device ls, create OSDs, etc) you need the c-v inventory anyway. Why bother with this then here?
There was a problem hiding this comment.
I'm trying to just return the most complete picture of the host as quickly as possible - that's all. I'm thinking about things like; host metadata in the UI, or a "ceph orch host info" command - both give the admin what they need to understand the host config quickly and easily. Ceph-volume remains the ultimate source of truth for disks - in future we could think about running gather facts more often in order to auto-refresh ceph-volume inventory, when disk adds/removes have been detected.
| @property | ||
| def hdd_count(self): | ||
| # type: () -> int | ||
| return len(self._get_devs_by_type(rota='1')) |
There was a problem hiding this comment.
cephadm already has access to ceph-volume inventory. Why do you need this here as well?
There was a problem hiding this comment.
ceph-volume takes a few seconds to gather it's stuff - I'm trying to not have dependencies and gather the critical data as quick as possible
There was a problem hiding this comment.
Imo it's inadvisable to have two different sources for the same kind of data. If users rely on this source to base their assumptions on we're potentially running into problems.
There was a problem hiding this comment.
I think the intend here is not to duplicate c-v inventory, but something like fdisk -l to get an overview about the physical server. Especially also for servers that are not intended for OSD deployments, such as gateways or mons.
Minor changes to address issues raised in the PR; - formatting of ipv6 addresses - missing docstrings - NIC mtu and speed now int instead of string - added NIC driver name - removed discrete JSON method making gather-facts JSON only - added upper/lower device lists to show NIC relationships - added hostname to the JSON! - added selinux/apparmor status - added timestamp (epoch) for the gather-facts run - added system uptime (secs) Closes: Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
|
General question - why not just use the Prometheus node exporter which probably has all that we need already? |
As I understand it this command can be run before a cephadm cluster is bootstrapped, this way you can gather the information before having prometheus running. There is a lot of potential the gather-facts command inside the dashboard. |
Added free/available memory state, and cpu loadavg info Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
@mykaul it is a good question. The goal of gather facts is to provide a holistic view of the host - this covers areas that node_exporter doesn't touch, since the focus of node_exporter is on metrics/performance not metadata. Also, remember that prometheus is an optional component to Ceph - recommended, but nevertheless optional. The data that gather_facts yields covers things like; the machine hw model, bios version, bios date, the disk types and capacities, nic speeds, os version, subscription state, kernel version, linux kernel security module state as well as NIC parent/child relationships - all of this is metadata, not metrics. Having metadata like this can underpin pre-flight checks, scheduling actions etc. It could also enhance the UI to provide more information to the Admin about the cluster, and reduce their need for dropping to the CLI or using other tooolchains. |
|
@ceph/dashboard ^^ |
Adds a gather-facts invocatio after check-host is run. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
| def get_ipv6_address(ifname): | ||
| # type: (str) -> str | ||
| if not os.path.exists('/proc/net/if_inet6'): | ||
| return '' | ||
|
|
||
| raw = read_file(['/proc/net/if_inet6']) | ||
| data = raw.splitlines() | ||
| # based on docs @ https://www.tldp.org/HOWTO/Linux+IPv6-HOWTO/ch11s04.html | ||
| # field 0 is ipv6, field 2 is scope | ||
| for iface_setting in data: | ||
| field = iface_setting.split() | ||
| if field[-1] == ifname: | ||
| ipv6_raw = field[0] | ||
| ipv6_fmtd = ":".join([ipv6_raw[_p:_p+4] for _p in range(0, len(field[0]),4)]) | ||
| # apply naming rules using ipaddress module | ||
| ipv6 = ipaddress.ip_address(ipv6_fmtd) | ||
| return "{}/{}".format(str(ipv6), int('0x{}'.format(field[2]), 16)) | ||
| return '' |
There was a problem hiding this comment.
src/cephadm/cephadm
Outdated
| vendor = read_file(['/sys/block/{}/device/vendor'.format(dev)]).strip() | ||
| disk_vendor = HostFacts._disk_vendor_workarounds.get(vendor, vendor) | ||
| disk_size_bytes = self._get_capacity(dev) | ||
| disk_list.append("{} {} ({})".format(disk_vendor, disk_model, bytes_to_human(disk_size_bytes))) |
There was a problem hiding this comment.
I'm getting
"hdd_list": [
"Unknown Unknown (4.2MB)",
"Unknown Unknown (101.7MB)",
"Unknown Unknown (58.0MB)",
"Unknown Unknown (10.6MB)",
"Unknown Unknown (130.9MB)",
"Unknown Unknown (130.0MB)",
"Unknown Unknown (57.6MB)",
"Unknown Unknown (10.3MB)",
"Unknown Unknown (101.3MB)",
"Unknown Unknown (3.0MB)",
"ATA ST2000LM007-1R81 (2.0TB)",
"Unknown Unknown (0.0)",
"Unknown Unknown (130.4MB)",
"Unknown Unknown (10.6MB)",
"Unknown Unknown (10.4MB)",
"Unknown Unknown (4.2MB)"
],
I'm just wondering how useful that data is.
- Would it make sense to expose some more information?
- Would it make sense to expose this as objects?
"hdd_list": [
{
"vendor": "Unknown",
"model": "Unknown"
"disk_size": ...
}
]this way the structure is extensible without intoducing potential backward incompatible changes
There was a problem hiding this comment.
what on earth are those drives :)
Gather facts is supposed to be quick and easy - I'll patch to use a dict per device for extensibility and include a description field that shows what this current does. I don't want to include objects - my aim is keep it simple - gather-facts is supposed to be a quick scan of the machine - anything more complicated than that needs ceph-volume etc etc
There was a problem hiding this comment.
fyi, the teuthology machines return something similar:
"hdd_list": [
"Unknown Unknown (0.0)",
"Unknown Unknown (0.0)",
"Unknown Unknown (0.0)",
"Unknown Unknown (0.0)",
"Unknown Unknown (0.0)",
"Unknown Unknown (0.0)",
"ATA ST1000NM0033-9ZM (1.0TB)",
"Unknown Unknown (0.0)",
"Unknown Unknown (0.0)"
]There was a problem hiding this comment.
I don't want to include objects - my aim is keep it simple
This will make it hard for the dashboard to display vendor and model etc in a table or something. @ceph/dashboard folks: Are you ok with this?
There was a problem hiding this comment.
If it's easy to achieve it would be useful to have that info.
@pcuzner Should some tests be added to protect these data structures from potential regressions?
There was a problem hiding this comment.
gather-facts returns JSON - why would that make it hard for the dashboard to consume?
Latest patch breaks down the disk metadata, so it is a list of dicts - as follows;
"description": "Virtio Block Device Unknown (42.9GB)",
"dev_name": "vda",
"disk_size_bytes": 42949672960,
"model": "Unknown",
"rev": "Unknown",
"vendor": "Virtio Block Device",
"wwid": "Unknown"
}
@alfonsomthd maybe - but it would be nice to get this in place first!
The gather-facts commands returns a list of hdd and flash devices. This patch changes the list content from being simple strings representing the disks to a dict, making it easier to extend in the future. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Since py2 is EOL, and cephadm requires py3 anyway this patch removes the py2 test iteration from the functional testing suite. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
See https://github.com/ceph/ceph/blob/master/qa/workunits/cephadm/test_adoption.sh#L19-L48 |
sebastian-philipp
left a comment
There was a problem hiding this comment.
lgrm, minus the test failure in test_adoption.py
|
jenkins retest this please |
Since cephadm is py3 based, and py2 is EOL this patch removes the py2 test iteration from test_adoption.sh Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
|
jenkins test make check |

The gather_facts command is intended to provide host level metadata to the caller, which could then be used
in a number of places: orchestrator inventory, displayed in the UI etc
Data is extracted mainly from sysfs and /proc
Signed-off-by: Paul Cuzner pcuzner@redhat.com