
python-common: drive_selection: fix KeyError when osdspec_affinity is not set#52532

Merged
ljflores merged 1 commit into ceph:main from guits:fix-tracker-58946
Aug 25, 2023

Conversation

@guits
Contributor

@guits guits commented Jul 19, 2023

When osdspec_affinity is not set, the drive selection code will fail. This can happen when a device has multiple LVs where some of them are used by Ceph and at least one LV isn't used by Ceph.

Fixes: https://tracker.ceph.com/issues/58946
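The failure mode reduces to a plain dict lookup. A minimal sketch, assuming the inventory LV metadata is a list of dicts shaped like the log entry below; the helper names here are hypothetical illustrations, not the actual drive_selection code:

```python
# Two LVs on the same device: one created outside Ceph (no Ceph metadata),
# one Ceph-owned block LV. This mirrors the 'lvs' list in the debug log.
lvs = [
    {'comment': 'not used by ceph', 'name': 'lv1'},  # no 'osdspec_affinity' key
    {'name': 'lv2', 'osd_id': '0', 'osdspec_affinity': 'None', 'type': 'block'},
]

def affinities_unsafe(lvs):
    # Direct indexing assumes every LV carries the key; the non-Ceph LV
    # does not, so this raises KeyError('osdspec_affinity').
    return [lv['osdspec_affinity'] for lv in lvs]

def affinities_safe(lvs):
    # Defensive lookup: treat a missing key as "no affinity set".
    return [lv.get('osdspec_affinity', '') for lv in lvs]

print(affinities_safe(lvs))  # ['', 'None']
```

The defensive `dict.get` with a default is the general shape of this class of fix: any code walking LV metadata must tolerate LVs that were never created by Ceph and therefore carry none of the Ceph-specific keys.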

@guits
Contributor Author

guits commented Jul 19, 2023

2023-07-19T09:44:46.642447+0000 mgr.debug-teutho-1.nxhjec [DBG] Found inventory for host [Device(path=/dev/vdb, lvs=[{'comment': 'not used by ceph', 'name': 'lv1'}, {'block_uuid': 'zWUoQX-Rk9d-OUf9-b01b-6U1l-KPrp-22L0i5', 'cluster_fsid': 'eb283af2-25c2-11ee-874d-5254008fbc85', 'cluster_name': 'ceph', 'name': 'lv2', 'osd_fsid': '2153c6ac-03b2-4b18-8989-b4c1adeda11b', 'osd_id': '0', 'osdspec_affinity': 'None', 'type': 'block'}], available=False, ceph_device=True, crush_device_class=None, rejection reasons=['LVM detected', 'locked'])]
2023-07-19T09:44:46.642515+0000 mgr.debug-teutho-1.nxhjec [DBG] Processing disk /dev/vdb
2023-07-19T09:44:46.642546+0000 mgr.debug-teutho-1.nxhjec [DBG] /dev/vdb is already used in spec None, skipping it.
2023-07-19T09:44:46.642582+0000 mgr.debug-teutho-1.nxhjec [DBG] device_filter is None
2023-07-19T09:44:46.642608+0000 mgr.debug-teutho-1.nxhjec [DBG] device_filter is None
2023-07-19T09:44:46.642643+0000 mgr.debug-teutho-1.nxhjec [DBG] device_filter is None
2023-07-19T09:44:46.642682+0000 mgr.debug-teutho-1.nxhjec [DBG] Found drive selection DeviceSelection(data devices=[], wal_devices=[], db devices=[], journal devices=[])
2023-07-19T09:44:46.643481+0000 mgr.debug-teutho-1.nxhjec [DBG] Translating DriveGroup <DriveGroupSpec.from_json(yaml.safe_load('''service_type: osd
service_id: dashboard-admin-1678346078356
service_name: osd.dashboard-admin-1678346078356
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: true
  filter_logic: AND
  objectstore: bluestore
'''))> to ceph-volume command
2023-07-19T09:44:46.643616+0000 mgr.debug-teutho-1.nxhjec [DBG] Resulting ceph-volume cmds: []
2023-07-19T09:44:46.643656+0000 mgr.debug-teutho-1.nxhjec [DBG] No data_devices, skipping DriveGroup: dashboard-admin-1678346078356

@ljflores
Member

jenkins test make check

1 similar comment
@guits
Contributor Author

guits commented Jul 24, 2023

jenkins test make check

@guits
Contributor Author

guits commented Jul 24, 2023

@ljflores jobs are failing with the following error

    Command failed (workunit test cephadm/test_dashboard_e2e.sh) on smithi130
    with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd
    -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1
    CEPH_REF=a13bde4b6f056ba4773932cfd784d11bffd1524f
    TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0"
    PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0
    CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0
    CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage
    /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/cephadm/test_dashboard_e2e.sh'

@ljflores
Member

@ljflores jobs are failing with the following error […]

@guits that looks like https://tracker.ceph.com/issues/59142, which was merged to main. Is your branch on the latest commit?

@guits guits force-pushed the fix-tracker-58946 branch from f6aa5f8 to e4a50d7 on July 25, 2023 at 15:45
… not set

When osdspec_affinity is not set, the drive selection code will fail.
This can happen when a device has multiple LVs where some of them are used
by Ceph and at least one LV isn't used by Ceph.

Fixes: https://tracker.ceph.com/issues/58946

Signed-off-by: Guillaume Abrioux <gabrioux@ibm.com>
@guits guits force-pushed the fix-tracker-58946 branch from e4a50d7 to 908f1d1 on July 25, 2023 at 15:46
@guits
Contributor Author

guits commented Jul 25, 2023

@guits that looks like https://tracker.ceph.com/issues/59142, which was merged to main. Is your branch on the latest commit?

just rebased, thanks 🙂

@guits
Contributor Author

guits commented Jul 26, 2023

@ljflores the fix for https://tracker.ceph.com/issues/59142 was merged 3 months ago, and my branch was created only a few days ago, so that fix was already present in my branch. That being said, it is still failing.

@ljflores
Member

ljflores commented Aug 14, 2023

@ljflores the fix for https://tracker.ceph.com/issues/59142 was merged 3 months ago, and my branch was created only a few days ago, so that fix was already present in my branch. That being said, it is still failing.

@guits can you link to the still-failing tests? All I can see are the ones that were scheduled on the non-rebased branch.

I see this link https://pulpito.ceph.com/gabrioux-2023-08-10_21:20:10-orch:cephadm-wip-guits-testing-5-2023-08-10-1324-distro-default-smithi/ as well, but I don't see the dashboard test that runs into this failure.

@ljflores
Member

ljflores commented Aug 16, 2023

Okay, I studied the two runs, and they are both exhibiting a new, unrelated bug which I tracked here: https://tracker.ceph.com/issues/62491

Since in both runs the failure occurs before we can get to the spec file that concerns this bug, it is difficult to evaluate this fix.

@ceph/dashboard can you help us out here? TL;DR is that @guits has a fix for one of the spec files, but a new bug tracked above fails the test before we can get to the original point of failure. Is there a way we can isolate this spec file locally to verify the fix?

@ljflores ljflores requested review from a team, aaSharma14, and pereman2, and removed the request for a team on August 17, 2023 at 21:26
@avanthakkar
Contributor

jenkins test dashboard

@avanthakkar
Contributor

avanthakkar commented Aug 23, 2023

@guits Do you mind rebasing the PR and pushing it, so it triggers the dashboard e2e jenkins job (jenkins test dashboard)?

@ljflores
Member

jenkins test dashboard cephadm

@ljflores
Member

@guits Do you mind rebasing the PR and pushing it, so it triggers the dashboard e2e jenkins job (jenkins test dashboard)?

@avanthakkar I got it to retrigger with jenkins test dashboard cephadm

@avanthakkar
Contributor

@guits Do you mind rebasing the PR and pushing it, so it triggers the dashboard e2e jenkins job (jenkins test dashboard)?

@avanthakkar I got it to retrigger with jenkins test dashboard cephadm

@ljflores Those are a different set of e2e tests (which are cephadm-based). We also need to make sure the dashboard e2e tests are passing: jenkins test dashboard.

@avanthakkar
Contributor

jenkins test dashboard cephadm

@ljflores
Member

ljflores commented Aug 25, 2023

Rebuilding here on the tip of main, which includes #53141.

https://shaman.ceph.com/builds/ceph/wip-lflores-testing-2-2023-08-25-1435/abb43274df0bbfcee9f7033ec961c90b171961b7/

Running some tests here: http://pulpito.front.sepia.ceph.com/lflores-2023-08-25_16:17:14-rados-wip-lflores-testing-2-2023-08-25-1435-distro-default-smithi/

The tests failed again due to another unrelated dashboard failure, but this time the affected spec file has progressed past host assignment, where it failed previously.

With the fix:

2023-08-25T17:02:07.686 INFO:tasks.workunit.client.0.smithi102.stdout:  Running:  04-osds.e2e-spec.ts                                                             (1 of 1)
2023-08-25T17:02:08.654 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:08 smithi102 ceph-mon[111467]: pgmap v300: 1 pgs: 1 active+clean; 577 KiB data, 80 MiB used, 268 GiB / 268 GiB avail
2023-08-25T17:02:10.644 INFO:tasks.workunit.client.0.smithi102.stderr:Couldn't determine Mocha version
2023-08-25T17:02:10.648 INFO:tasks.workunit.client.0.smithi102.stdout:
2023-08-25T17:02:10.652 INFO:tasks.workunit.client.0.smithi102.stdout:
2023-08-25T17:02:10.654 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:10 smithi102 ceph-mon[111467]: pgmap v301: 1 pgs: 1 active+clean; 577 KiB data, 80 MiB used, 268 GiB / 268 GiB avail
2023-08-25T17:02:10.681 INFO:tasks.workunit.client.0.smithi102.stdout:  OSDs page
2023-08-25T17:02:10.706 INFO:tasks.workunit.client.0.smithi102.stdout:    when Orchestrator is available
2023-08-25T17:02:12.404 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:12 smithi102 ceph-mon[111467]: from='mgr.14152 172.21.15.102:0/843330708' entity='mgr.a'
2023-08-25T17:02:12.405 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:12 smithi102 ceph-mon[111467]: pgmap v302: 1 pgs: 1 active+clean; 577 KiB data, 80 MiB used, 268 GiB / 268 GiB avail
2023-08-25T17:02:13.260 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:13 smithi102 ceph-mon[111467]: from='mgr.14152 172.21.15.102:0/843330708' entity='mgr.a' cmd=[{"prefix": "osd dump", "format": "json"}]: dispatch
2023-08-25T17:02:13.260 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:13 smithi102 ceph-mon[111467]: from='mgr.14152 172.21.15.102:0/843330708' entity='mgr.a' cmd=[{"prefix": "config dump", "format": "json"}]: dispatch
2023-08-25T17:02:13.260 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:13 smithi102 ceph-mon[111467]: from='mgr.14152 172.21.15.102:0/843330708' entity='mgr.a'
2023-08-25T17:02:13.261 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:13 smithi102 ceph-mon[111467]: from='mgr.14152 172.21.15.102:0/843330708' entity='mgr.a'
2023-08-25T17:02:13.261 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:13 smithi102 ceph-mon[111467]: from='mgr.14152 172.21.15.102:0/843330708' entity='mgr.a'
2023-08-25T17:02:13.261 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:13 smithi102 ceph-mon[111467]: from='mgr.14152 172.21.15.102:0/843330708' entity='mgr.a'
2023-08-25T17:02:14.154 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:14 smithi102 ceph-mon[111467]: pgmap v303: 1 pgs: 1 active+clean; 577 KiB data, 80 MiB used, 268 GiB / 268 GiB avail
2023-08-25T17:02:15.404 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:15 smithi102 ceph-mon[111467]: Marking host: smithi102 for OSDSpec preview refresh.
2023-08-25T17:02:15.405 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:15 smithi102 ceph-mon[111467]: Marking host: smithi190 for OSDSpec preview refresh.
2023-08-25T17:02:15.405 INFO:journalctl@ceph.mon.a.smithi102.stdout:Aug 25 17:02:15 smithi102 ceph-mon[111467]: Saving service osd.dashboard-admin-1692982933825 spec with placement *

W/o the fix:

2023-07-18T14:54:37.744 INFO:tasks.workunit.client.0.smithi026.stdout:  Running:  04-osds.e2e-spec.ts                                                             (1 of 1)
2023-07-18T14:54:39.649 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:39 smithi026 ceph-mon[108533]: pgmap v557: 1 pgs: 1 active+clean; 577 KiB data, 80 MiB used, 268 GiB / 268 GiB avail
2023-07-18T14:54:41.040 INFO:tasks.workunit.client.0.smithi026.stderr:Couldn't determine Mocha version
2023-07-18T14:54:41.044 INFO:tasks.workunit.client.0.smithi026.stdout:
2023-07-18T14:54:41.050 INFO:tasks.workunit.client.0.smithi026.stdout:
2023-07-18T14:54:41.086 INFO:tasks.workunit.client.0.smithi026.stdout:  OSDs page
2023-07-18T14:54:41.106 INFO:tasks.workunit.client.0.smithi026.stdout:    when Orchestrator is available
2023-07-18T14:54:41.649 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:41 smithi026 ceph-mon[108533]: pgmap v558: 1 pgs: 1 active+clean; 577 KiB data, 80 MiB used, 268 GiB / 268 GiB avail
2023-07-18T14:54:42.899 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:42 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a'
2023-07-18T14:54:42.899 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:42 smithi026 ceph-mon[108533]: pgmap v559: 1 pgs: 1 active+clean; 577 KiB data, 80 MiB used, 268 GiB / 268 GiB avail
2023-07-18T14:54:42.899 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:42 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a' cmd=[{"prefix": "osd dump", "format": "json"}]: dispatch
2023-07-18T14:54:43.899 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:43 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a' cmd=[{"prefix": "config dump", "format": "json"}]: dispatch
2023-07-18T14:54:43.899 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:43 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a'
2023-07-18T14:54:43.899 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:43 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a'
2023-07-18T14:54:44.771 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:44 smithi026 ceph-mon[108533]: pgmap v560: 1 pgs: 1 active+clean; 577 KiB data, 80 MiB used, 268 GiB / 268 GiB avail
2023-07-18T14:54:44.772 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:44 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a'
2023-07-18T14:54:44.772 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:44 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a'
2023-07-18T14:54:44.772 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:44 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a'
2023-07-18T14:54:44.772 INFO:journalctl@ceph.mon.a.smithi026.stdout:Jul 18 14:54:44 smithi026 ceph-mon[108533]: from='mgr.14152 172.21.15.26:0/1231066348' entity='mgr.a' cmd=[{"prefix": "osd dump", "format": "json"}]: dispatch
2023-07-18T14:54:46.003 INFO:journalctl@ceph.mgr.a.smithi026.stdout:Jul 18 14:54:45 smithi026 ceph-47d85cfa-2578-11ee-9b34-001a4aab830c-mgr-a[108759]: 2023-07-18T14:54:45.889+0000 7ff01c2f6700 -1 log_channel(cephadm) log [ERR] : Failed to apply osd.dashboard-admin-1689692084361 spec DriveGroupSpec.from_json(yaml.safe_load('''service_type: osd

Contributor

@phlogistonjohn phlogistonjohn left a comment

Looks OK to me.
