
python-common: drive_selection: fix KeyError when osdspec_affinity is not set#52532

Merged
ljflores merged 1 commit into ceph:main from guits:fix-tracker-58946
Aug 25, 2023

Conversation

@guits
Contributor

@guits guits commented Jul 19, 2023

When osdspec_affinity is not set, the drive selection code will fail. This can happen when a device has multiple LVs where some of them are used by Ceph and at least one LV isn't used by Ceph.

Fixes: https://tracker.ceph.com/issues/58946
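The failure mode reduces to a plain dict lookup. A minimal sketch, assuming the inventory LV metadata is a list of dicts shaped like the log entry below; the helper names here are hypothetical illustrations, not the actual drive_selection code:

```python
# Two LVs on the same device: one created outside Ceph (no Ceph metadata),
# one Ceph-owned block LV. This mirrors the 'lvs' list in the debug log.
lvs = [
    {'comment': 'not used by ceph', 'name': 'lv1'},  # no 'osdspec_affinity' key
    {'name': 'lv2', 'osd_id': '0', 'osdspec_affinity': 'None', 'type': 'block'},
]

def affinities_unsafe(lvs):
    # Direct indexing assumes every LV carries the key; the non-Ceph LV
    # does not, so this raises KeyError('osdspec_affinity').
    return [lv['osdspec_affinity'] for lv in lvs]

def affinities_safe(lvs):
    # Defensive lookup: treat a missing key as "no affinity set".
    return [lv.get('osdspec_affinity', '') for lv in lvs]

print(affinities_safe(lvs))  # ['', 'None']
```

The defensive `dict.get` with a default is the general shape of this class of fix: any code walking LV metadata must tolerate LVs that were never created by Ceph and therefore carry none of the Ceph-specific keys.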

@guits
Contributor Author

guits commented Jul 19, 2023

2023-07-19T09:44:46.642447+0000 mgr.debug-teutho-1.nxhjec [DBG] Found inventory for host [Device(path=/dev/vdb, lvs=[{'comment': 'not used by ceph', 'name': 'lv1'}, {'block_uuid': 'zWUoQX-Rk9d-OUf9-b01b-6U1l-KPrp-22L0i5', 'cluster_fsid': 'eb283af2-25c2-11ee-874d-5254008fbc85', 'cluster_name': 'ceph', 'name': 'lv2', 'osd_fsid': '2153c6ac-03b2-4b18-8989-b4c1adeda11b', 'osd_id': '0', 'osdspec_affinity': 'None', 'type': 'block'}], available=False, ceph_device=True, crush_device_class=None, rejection reasons=['LVM detected', 'locked'])]
2023-07-19T09:44:46.642515+0000 mgr.debug-teutho-1.nxhjec [DBG] Processing disk /dev/vdb
2023-07-19T09:44:46.642546+0000 mgr.debug-teutho-1.nxhjec [DBG] /dev/vdb is already used in spec None, skipping it.
2023-07-19T09:44:46.642582+0000 mgr.debug-teutho-1.nxhjec [DBG] device_filter is None
2023-07-19T09:44:46.642608+0000 mgr.debug-teutho-1.nxhjec [DBG] device_filter is None
2023-07-19T09:44:46.642643+0000 mgr.debug-teutho-1.nxhjec [DBG] device_filter is None
2023-07-19T09:44:46.642682+0000 mgr.debug-teutho-1.nxhjec [DBG] Found drive selection DeviceSelection(data devices=[], wal_devices=[], db devices=[], journal devices=[])
2023-07-19T09:44:46.643481+0000 mgr.debug-teutho-1.nxhjec [DBG] Translating DriveGroup <DriveGroupSpec.from_json(yaml.safe_load('''service_type: osd
service_id: dashboard-admin-1678346078356
service_name: osd.dashboard-admin-1678346078356
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: true
  filter_logic: AND
  objectstore: bluestore
'''))> to ceph-volume command
2023-07-19T09:44:46.643616+0000 mgr.debug-teutho-1.nxhjec [DBG] Resulting ceph-volume cmds: []
2023-07-19T09:44:46.643656+0000 mgr.debug-teutho-1.nxhjec [DBG] No data_devices, skipping DriveGroup: dashboard-admin-1678346078356

@ljflores
Member

jenkins test make check

1 similar comment
@guits
Contributor Author

guits commented Jul 24, 2023

jenkins test make check

@guits
Contributor Author

guits commented Jul 24, 2023

@ljflores jobs are failing with the following error

    Command failed (workunit test cephadm/test_dashboard_e2e.sh) on smithi130
    with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd
    -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1
    CEPH_REF=a13bde4b6f056ba4773932cfd784d11bffd1524f
    TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0"
    PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0
    CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0
    CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage
    /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/cephadm/test_dashboard_e2e.sh'

@ljflores
Member

@ljflores jobs are failing with the following error […]

@guits that looks like https://tracker.ceph.com/issues/59142, which was merged to main. Is your branch on the latest commit?

@guits guits force-pushed the fix-tracker-58946 branch from f6aa5f8 to e4a50d7 on July 25, 2023 at 15:45
… not set

When osdspec_affinity is not set, the drive selection code will fail.
This can happen when a device has multiple LVs where some of them are used
by Ceph and at least one LV isn't used by Ceph.

Fixes: https://tracker.ceph.com/issues/58946

Signed-off-by: Guillaume Abrioux <gabrioux@ibm.com>
@guits guits force-pushed the fix-tracker-58946 branch from e4a50d7 to 908f1d1 on July 25, 2023 at 15:46
@guits
Contributor Author

guits commented Jul 25, 2023

@guits that looks like https://tracker.ceph.com/issues/59142, which was merged to main. Is your branch on the latest commit?

just rebased, thanks 🙂

@guits
Contributor Author

guits commented Jul 26, 2023

@ljflores the fix for https://tracker.ceph.com/issues/59142 was merged 3 months ago, and my branch was created only a few days ago, so that fix was already present in my branch. That being said, it is still failing.

@ljflores
Member

ljflores commented Aug 14, 2023

@ljflores the fix for https://tracker.ceph.com/issues/59142 was merged 3 months ago, and my branch was created only a few days ago, so that fix was already present in my branch. That being said, it is still failing.

@guits can you link to the still-failing tests? All I can see are the ones that were scheduled on the non-rebased branch.

I see this link https://pulpito.ceph.com/gabrioux-2023-08-10_21:20:10-orch:cephadm-wip-guits-testing-5-2023-08-10-1324-distro-default-smithi/ as well, but I don't see the dashboard test that runs into this failure.

@ljflores
Member

ljflores commented Aug 16, 2023

Okay, I studied the two runs, and they are both exhibiting a new, unrelated bug which I tracked here: https://tracker.ceph.com/issues/62491

Since in both runs the failure occurs before we can get to the spec file that concerns this bug, it is difficult to evaluate this fix.

@ceph/dashboard can you help us out here? TL;DR is that @guits has a fix for one of the spec files, but a new bug tracked above fails the test before we can get to the original point of failure. Is there a way we can isolate this spec file locally to verify the fix?

@ljflores ljflores requested review from a team, aaSharma14, and pereman2, and removed the request for a team on August 17, 2023 at 21:26
@avanthakkar
Contributor

jenkins test dashboard

@avanthakkar
Contributor

avanthakkar commented Aug 23, 2023

@guits Do you mind rebasing the PR and pushing it, so it triggers the dashboard e2e jenkins job (jenkins test dashboard)?

@ljflores
Member

jenkins test dashboard cephadm

@ljflores
Member

@guits Do you mind rebasing the PR and pushing it, so it triggers the dashboard e2e jenkins job (jenkins test dashboard)?

@avanthakkar I got it to retrigger with jenkins test dashboard cephadm

@avanthakkar
Contributor

@guits Do you mind rebasing the PR and pushing it, so it triggers the dashboard e2e jenkins job (jenkins test dashboard)?

@avanthakkar I got it to retrigger with jenkins test dashboard cephadm

@ljflores Those are a different set of e2e tests (which are cephadm-based). We also need to make sure the dashboard e2e tests are passing: jenkins test dashboard.

@avanthakkar
Contributor

jenkins test dashboard cephadm

@ljflores
Member

ljflores commented Aug 25, 2023

Rebuilding here on the tip of main, which includes #53141.

https://shaman.ceph.com/builds/ceph/wip-lflores-testing-2-2023-08-25-1435/abb43274df0bbfcee9f7033ec961c90b171961b7/

Running some tests here: http://pulpito.front.sepia.ceph.com/lflores-2023-08-25_16:17:14-rados-wip-lflores-testing-2-2023-08-25-1435-distro-default-smithi/

The tests failed again due to another unrelated dashboard failure, but this time the affected spec file has progressed past host assignment, where it failed previously.

With the fix:

2023-08-25T17:02:07.686 INFO:tasks.workunit.client.0.smithi102.stdout:  Running:  04-osds.e2e-spec.ts                                                             (1 of 1)
2023-08-25T17:02:08.654 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:08 smithi102 ceph-mon[111467]: pgmap v300: 1 pgs: 1 active+clean; 577 KiB data, 80 MiB used, 268 GiB / 268 GiB avail
2023-08-25T17:02:10.644 INFO:tasks.workunit.client.0.smithi102.stderr:Couldn't determine Mocha version
2023-08-25T17:02:10.648 INFO:tasks.workunit.client.0.smithi102.stdout:
2023-08-25T17:02:10.652 INFO:tasks.workunit.client.0.smithi102.stdout:
2023-08-25T17:02:10.654 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:10 smithi102 ceph-mon[111467]: pgmap v301: 1 pgs: 1 active+clean; 577 KiB data, 80 MiB used, 268 GiB / 268 GiB avail
2023-08-25T17:02:10.681 INFO:tasks.workunit.client.0.smithi102.stdout:  OSDs page
2023-08-25T17:02:10.706 INFO:tasks.workunit.client.0.smithi102.stdout:    when Orchestrator is available
2023-08-25T17:02:12.404 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:12 smithi102 ceph-mon[111467]: from='mgr.14152 172.21.15.102:0/843330708' entity='mgr.a'
2023-08-25T17:02:12.405 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:12 smithi102 ceph-mon[111467]: pgmap v302: 1 pgs: 1 active+clean; 577 KiB data, 80 MiB used, 268 GiB / 268 GiB avail
2023-08-25T17:02:13.260 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:13 smithi102 ceph-mon[111467]: from='mgr.14152 172.21.15.102:0/843330708' entity='mgr.a' cmd=[{"prefix": "osd dump", "format": "json"}]: dispatch
2023-08-25T17:02:13.260 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:13 smithi102 ceph-mon[111467]: from='mgr.14152 172.21.15.102:0/843330708' entity='mgr.a' cmd=[{"prefix": "config dump", "format": "json"}]: dispatch
2023-08-25T17:02:13.260 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:13 smithi102 ceph-mon[111467]: from='mgr.14152 172.21.15.102:0/843330708' entity='mgr.a'
2023-08-25T17:02:13.261 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:13 smithi102 ceph-mon[111467]: from='mgr.14152 172.21.15.102:0/843330708' entity='mgr.a'
2023-08-25T17:02:13.261 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:13 smithi102 ceph-mon[111467]: from='mgr.14152 172.21.15.102:0/843330708' entity='mgr.a'
2023-08-25T17:02:13.261 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:13 smithi102 ceph-mon[111467]: from='mgr.14152 172.21.15.102:0/843330708' entity='mgr.a'
2023-08-25T17:02:14.154 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:14 smithi102 ceph-mon[111467]: pgmap v303: 1 pgs: 1 active+clean; 577 KiB data, 80 MiB used, 268 GiB / 268 GiB avail
2023-08-25T17:02:15.404 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:15 smithi102 ceph-mon[111467]: Marking host: smithi102 for OSDSpec preview refresh.
2023-08-25T17:02:15.405 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:15 smithi102 ceph-mon[111467]: Marking host: smithi190 for OSDSpec preview refresh.
2023-08-25T17:02:15.405 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:15 smithi102 ceph-mon[111467]: Saving service osd.dashboard-admin-1692982933825 spec with placement *

W/o the fix:

2023-07-18T14:54:37.744 INFO:tasks.workunit.client.0.smithi026.stdout:  Running:  04-osds.e2e-spec.ts                                                             (1 of 1)
2023-07-18T14:54:39.649 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:39 smithi026 ceph-mon[108533]: pgmap v557: 1 pgs: 1 active+clean; 577 KiB data, 80 MiB used, 268 GiB / 268 GiB avail
2023-07-18T14:54:41.040 INFO:tasks.workunit.client.0.smithi026.stderr:Couldn't determine Mocha version
2023-07-18T14:54:41.044 INFO:tasks.workunit.client.0.smithi026.stdout:
2023-07-18T14:54:41.050 INFO:tasks.workunit.client.0.smithi026.stdout:
2023-07-18T14:54:41.086 INFO:tasks.workunit.client.0.smithi026.stdout:  OSDs page
2023-07-18T14:54:41.106 INFO:tasks.workunit.client.0.smithi026.stdout:    when Orchestrator is available
2023-07-18T14:54:41.649 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:41 smithi026 ceph-mon[108533]: pgmap v558: 1 pgs: 1 active+clean; 577 KiB data, 80 MiB used, 268 GiB / 268 GiB avail
2023-07-18T14:54:42.899 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:42 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a'
2023-07-18T14:54:42.899 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:42 smithi026 ceph-mon[108533]: pgmap v559: 1 pgs: 1 active+clean; 577 KiB data, 80 MiB used, 268 GiB / 268 GiB avail
2023-07-18T14:54:42.899 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:42 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a' cmd=[{"prefix": "osd dump", "format": "json"}]: dispatch
2023-07-18T14:54:43.899 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:43 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a' cmd=[{"prefix": "config dump", "format": "json"}]: dispatch
2023-07-18T14:54:43.899 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:43 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a'
2023-07-18T14:54:43.899 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:43 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a'
2023-07-18T14:54:44.771 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:44 smithi026 ceph-mon[108533]: pgmap v560: 1 pgs: 1 active+clean; 577 KiB data, 80 MiB used, 268 GiB / 268 GiB avail
2023-07-18T14:54:44.772 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:44 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a'
2023-07-18T14:54:44.772 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:44 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a'
2023-07-18T14:54:44.772 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:44 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a'
2023-07-18T14:54:44.772 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:44 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a' cmd=[{"prefix": "osd dump", "format": "json"}]: dispatch
2023-07-18T14:54:46.003 INFO:journalctl@ceph.mgr.a.smithi026.stdout:Jul 18 14:54:45 smithi026 ceph-47d85cfa-2578-11ee-9b34-001a4aab830c-mgr-a[108759]: 2023-07-18T14:54:45.889+0000 7ff01c2f6700 -1 log_channel(cephadm) log [ERR] : Failed to apply osd.dashboard-admin-1689692084361 spec DriveGroupSpec.from_json(yaml.safe_load('''service_type: osd

Contributor

@phlogistonjohn phlogistonjohn left a comment

Looks OK to me.
