quincy: qa/cephfs: no reliance on centos #59037

Merged
vshankar merged 2 commits into ceph:quincy from vshankar:wip-quincy-rm-centos8
Oct 16, 2024

Conversation

@vshankar
Contributor

@vshankar vshankar commented Aug 6, 2024

Although off-putting, this is probably the way forward to get tests running (esp. fs:upgrade) without relying on centos8.

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@vshankar vshankar added the cephfs Ceph File System label Aug 6, 2024
@vshankar vshankar requested a review from a team August 6, 2024 05:12
@github-actions github-actions bot added the tests label Aug 6, 2024
@github-actions github-actions bot added this to the quincy milestone Aug 6, 2024
@vshankar vshankar marked this pull request as ready for review August 6, 2024 05:13
@vshankar vshankar changed the title from "quincy: qa/cephfs: no reliance on centos" to "[RFC] quincy: qa/cephfs: no reliance on centos" Aug 6, 2024
@vshankar vshankar changed the title from "[RFC] quincy: qa/cephfs: no reliance on centos" to "quincy: qa/cephfs: no reliance on centos" Aug 9, 2024
@vshankar
Contributor Author

vshankar commented Aug 9, 2024

cc @lxbsz @joscollin

@lxbsz
Member

lxbsz commented Aug 12, 2024

@vshankar There are still two jobs that will fail, and you missed:

qa/suites/fs/mixed-clients/kclient-overrides/distro/stock/rhel_8.yaml

If I filter these two tests out with --filter-out kernel_cfuse_workunits_dbench_iozone,kernel_cfuse_workunits_untarbuild_blogbench, 19 jobs are filtered out:

2024-08-12 00:54:28,313.313 INFO:teuthology.suite.run:19/204 jobs were filtered out.
2024-08-12 00:54:28,313.313 INFO:teuthology.suite.run:Scheduled 185 jobs in total.
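
For reference, a complete scheduling invocation with that filter might look like the following; the suite, branch, and machine-type values here are illustrative, not taken from the actual run:

    teuthology-suite --suite fs --ceph quincy --machine-type smithi \
        --filter-out kernel_cfuse_workunits_dbench_iozone,kernel_cfuse_workunits_untarbuild_blogbench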

@vshankar
Contributor Author

vshankar commented Aug 12, 2024

@vshankar There are still two jobs that will fail, and you missed:

qa/suites/fs/mixed-clients/kclient-overrides/distro/stock/rhel_8.yaml

If I filter these two tests out with --filter-out kernel_cfuse_workunits_dbench_iozone,kernel_cfuse_workunits_untarbuild_blogbench, 19 jobs are filtered out:

2024-08-12 00:54:28,313.313 INFO:teuthology.suite.run:19/204 jobs were filtered out.
2024-08-12 00:54:28,313.313 INFO:teuthology.suite.run:Scheduled 185 jobs in total.

I'll fix that up in quincy. I likely just checked the centos yamls and missed rhel8.

@lxbsz
Member

lxbsz commented Aug 12, 2024

@vshankar There are still two jobs that will fail, and you missed:
qa/suites/fs/mixed-clients/kclient-overrides/distro/stock/rhel_8.yaml
If I filter these two tests out with --filter-out kernel_cfuse_workunits_dbench_iozone,kernel_cfuse_workunits_untarbuild_blogbench, 19 jobs are filtered out:

2024-08-12 00:54:28,313.313 INFO:teuthology.suite.run:19/204 jobs were filtered out.
2024-08-12 00:54:28,313.313 INFO:teuthology.suite.run:Scheduled 185 jobs in total.

I'll fix that up in quincy. I likely just checked the centos yamls and missed rhel8.

Currently I have triggered scheduling of 185/204 jobs, and will trigger the remaining ones after this is fixed.

Signed-off-by: Venky Shankar <vshankar@redhat.com>
And switch to ubuntu.

Signed-off-by: Venky Shankar <vshankar@redhat.com>
@vshankar vshankar force-pushed the wip-quincy-rm-centos8 branch from b0ca4b3 to e954116 Compare August 12, 2024 05:01
@vshankar
Contributor Author

@lxbsz Try scheduling with this update. I've kept the fix in a separate commit for now.

@lxbsz
Member

lxbsz commented Aug 12, 2024

@lxbsz Try scheduling with this update. I've kept the fix in a separate commit for now.

Trying it now. Thanks

Contributor

@rishabh-d-dave rishabh-d-dave left a comment


Hi @vshankar. I think the following failures are related to this PR -

    2024-08-13T05:20:16.778 DEBUG:teuthology.orchestra.run.smithi028:> sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage /bin/mount -t ceph :/ /home/ubuntu/cephtest/mnt.0 -v -o norequire_active_mds,conf=/etc/ceph/ceph.conf,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync
    2024-08-13T05:20:16.821 INFO:teuthology.orchestra.run.smithi028.stdout:parsing options: rw,norequire_active_mds,conf=/etc/ceph/ceph.conf,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync
    2024-08-13T05:20:16.822 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: options "norequire_active_mds,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync".
    2024-08-13T05:20:16.822 INFO:teuthology.orchestra.run.smithi028.stdout:invalid new device string format
    2024-08-13T05:20:16.969 INFO:teuthology.orchestra.run.smithi028.stderr:mount error 22 = Invalid argument
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:parsing options: rw,norequire_active_mds,conf=/etc/ceph/ceph.conf,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: options "norequire_active_mds,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync".
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:invalid new device string format
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: resolved to: "172.21.15.28:6789,172.21.15.53:6789,172.21.15.155:6789"
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: trying mount with old device syntax: 172.21.15.28:6789,172.21.15.53:6789,172.21.15.155:6789:/
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: options "norequire_active_mds,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync,key=0,fsid=cc3ed6a8-5930-11ef-bcce-c7b262605968" will pass to kernel
    2024-08-13T05:20:16.972 DEBUG:teuthology.orchestra.run:got remote process result: 32
    2024-08-13T05:20:16.973 INFO:tasks.cephfs.kernel_mount:mount command failed
    2024-08-13T05:20:16.974 ERROR:teuthology.run_tasks:Saw exception from tasks.
2024-08-13T05:29:58.237 INFO:tasks.cephfs_test_runner:======================================================================
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:ERROR: test_cap_acquisition_throttle_readdir (tasks.cephfs.test_client_limits.TestClientLimits)
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:Mostly readdir acquires caps faster than the mds recalls, so the cap
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_e0e452b3a276271196a54dadb3ac706afad6f142/qa/tasks/cephfs/test_client_limits.py", line 189, in test_cap_acquisition_throttle_readdir
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:    cap_acquisition_value = self.get_session(mount_a_client_id)['cap_acquisition']['value']
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_e0e452b3a276271196a54dadb3ac706afad6f142/qa/tasks/cephfs/cephfs_test_case.py", line 257, in get_session
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:    return self._session_by_id(session_ls)[client_id]
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:KeyError: '5358'
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:======================================================================
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:ERROR: test_client_metrics_and_metadata (tasks.cephfs.test_mds_metrics.TestMDSMetrics)
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_e0e452b3a276271196a54dadb3ac706afad6f142/qa/tasks/cephfs/test_mds_metrics.py", line 541, in test_client_metrics_and_metadata
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:    raise RuntimeError("valid_metrics of fs1 not found!")
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:RuntimeError: valid_metrics of fs1 not found!
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:Ran 1 test in 94.801s

There are more new failures in this run, and other QA runs including this PR also have the failures mentioned above -

https://pulpito.ceph.com/xiubli-2024-08-13_04:53:58-fs-wip-xiubli-testing-20240812.051138-quincy-distro-default-smithi/
https://pulpito.ceph.com/xiubli-2024-08-13_04:48:23-fs-wip-jcollin-testing-20240812.053224-quincy-distro-default-smithi/
https://pulpito.ceph.com/xiubli-2024-08-13_10:32:13-fs-wip-xiubli-testing-20240813.052545-quincy-distro-default-smithi/

@lxbsz
Member

lxbsz commented Aug 20, 2024

2024-08-13T05:29:58.237 INFO:tasks.cephfs_test_runner:======================================================================
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:ERROR: test_cap_acquisition_throttle_readdir (tasks.cephfs.test_client_limits.TestClientLimits)
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:Mostly readdir acquires caps faster than the mds recalls, so the cap
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_e0e452b3a276271196a54dadb3ac706afad6f142/qa/tasks/cephfs/test_client_limits.py", line 189, in test_cap_acquisition_throttle_readdir
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:    cap_acquisition_value = self.get_session(mount_a_client_id)['cap_acquisition']['value']
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_e0e452b3a276271196a54dadb3ac706afad6f142/qa/tasks/cephfs/cephfs_test_case.py", line 257, in get_session
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:    return self._session_by_id(session_ls)[client_id]
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:KeyError: '5358'

Hi @rishabh-d-dave

Having checked the failures, I found that the id 5358 was in the session ls output, but the test didn't parse it correctly:

    },
    "id": 5358,
    "inst": "client.5358 v1:192.168.0.1:0/3430441988",
    "last_trim_completed_flushes_tid": 101,
    "last_trim_completed_requests_tid": 119,
    "num_caps": 105,
    "num_completed_flushes": 1,
    "num_completed_requests": 0,
    "num_leases": 1,
    "prealloc_inos": [
      {

I don't know how this failure is related to this PR. Possibly a Python issue on a different distro?

@rishabh-d-dave
Contributor

rishabh-d-dave commented Aug 20, 2024

We see this failure across multiple QA runs, and this PR was the only one present on all of them. Plus, this failure is seen only on ubuntu_latest jobs.

After spending some time digging deeper into this, the code itself looks alright to me. It creates a dict of session_id/session key-value pairs and returns only the session for the client_id it was passed.

    def get_session(self, client_id, session_ls=None):
        if session_ls is None:
            session_ls = self.fs.mds_asok(['session', 'ls'])

        return self._session_by_id(session_ls)[client_id]

    def _session_by_id(self, session_ls):
        return dict([(s['id'], s) for s in session_ls])

This same code is present on the main branch too. Extra debug code probably needs to be added to find out exactly what went wrong; a sketch of such instrumentation follows.
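
A minimal sketch of such instrumentation (the logging calls and their placement are assumptions, not an actual fix). One detail worth capturing is the key types: the traceback shows KeyError: '5358' (a string) while the session ls output above has "id": 5358 (an integer), so a str-vs-int key mismatch is one possibility to rule out:

    import logging
    log = logging.getLogger(__name__)

    def _session_by_id(self, session_ls):
        sessions = dict((s['id'], s) for s in session_ls)
        # Record the ids and their Python types; KeyError('5358') against
        # "id": 5358 would point at a key-type mismatch rather than a
        # genuinely missing session.
        log.debug("session ids: %s",
                  [(i, type(i).__name__) for i in sessions])
        return sessions

    def get_session(self, client_id, session_ls=None):
        if session_ls is None:
            session_ls = self.fs.mds_asok(['session', 'ls'])
        sessions = self._session_by_id(session_ls)
        if client_id not in sessions:
            log.error("client_id %r (%s) not among session ids %r",
                      client_id, type(client_id).__name__, list(sessions))
        return sessions[client_id]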

@vshankar
Contributor Author

Hi @vshankar. I think the following failures are related to this PR -

    2024-08-13T05:20:16.778 DEBUG:teuthology.orchestra.run.smithi028:> sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage /bin/mount -t ceph :/ /home/ubuntu/cephtest/mnt.0 -v -o norequire_active_mds,conf=/etc/ceph/ceph.conf,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync
    2024-08-13T05:20:16.821 INFO:teuthology.orchestra.run.smithi028.stdout:parsing options: rw,norequire_active_mds,conf=/etc/ceph/ceph.conf,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync
    2024-08-13T05:20:16.822 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: options "norequire_active_mds,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync".
    2024-08-13T05:20:16.822 INFO:teuthology.orchestra.run.smithi028.stdout:invalid new device string format
    2024-08-13T05:20:16.969 INFO:teuthology.orchestra.run.smithi028.stderr:mount error 22 = Invalid argument
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:parsing options: rw,norequire_active_mds,conf=/etc/ceph/ceph.conf,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: options "norequire_active_mds,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync".
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:invalid new device string format
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: resolved to: "172.21.15.28:6789,172.21.15.53:6789,172.21.15.155:6789"
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: trying mount with old device syntax: 172.21.15.28:6789,172.21.15.53:6789,172.21.15.155:6789:/
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: options "norequire_active_mds,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync,key=0,fsid=cc3ed6a8-5930-11ef-bcce-c7b262605968" will pass to kernel
    2024-08-13T05:20:16.972 DEBUG:teuthology.orchestra.run:got remote process result: 32
    2024-08-13T05:20:16.973 INFO:tasks.cephfs.kernel_mount:mount command failed
    2024-08-13T05:20:16.974 ERROR:teuthology.run_tasks:Saw exception from tasks.

For this run, the kernel ring buffer shows

2024-08-13T05:20:16.962741+00:00 smithi028 kernel: [ 1809.357395] ceph: loaded (mds proto 32)
2024-08-13T05:20:16.974761+00:00 smithi028 kernel: [ 1809.362716] libceph: bad option at 'ms_mode=legacy'

ms_mode=legacy is coming from the yaml itself:

fs/workload/{begin/{0-install 1-cephadm 2-logrotate} clusters/1a11s-mds-1c-client-3node conf/{client mds mgr mon osd} mount/kclient/{mount-syntax/{v1} mount overrides/{distro/stock/{k-stock ubuntu_latest} ms-die-on-skipped}} ms_mode/{legacy} objectstore-ec/bluestore-comp-ec-root omap_limit/10000 overrides/{frag ignorelist_health ignorelist_wrongly_marked_down osd-asserts pg_health session_timeout} ranks/3 scrub/yes standby-replay tasks/{0-check-counter 3-snaps/yes workunit/fs/misc} ubuntu_latest wsync/{yes}}

and this is ubuntu 20.04 (focal) with the stock kernel -- so this is trying to use the v1 messenger protocol, which is getting denied by the kernel driver.

cc @lxbsz
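
A minimal sketch of how the qa code could guard against this (the helper name is hypothetical, and the v5.11 cutoff is an assumption: commit 00498b99 landed upstream around v5.11, and distro kernels may carry backports):

    # Only pass ms_mode= when the client kernel is new enough to parse
    # the option; older kernels fail the mount with EINVAL ("bad option"),
    # as seen in the ring buffer above.
    def kernel_mount_opts(kernel_release, ms_mode="legacy"):
        major, minor = (int(x) for x in kernel_release.split(".")[:2])
        opts = ["name=0", "wsync"]
        if (major, minor) >= (5, 11):  # assumed first release with ms_mode
            opts.append("ms_mode=%s" % ms_mode)
        return ",".join(opts)

    kernel_mount_opts("5.4.0-192-generic")   # -> 'name=0,wsync'
    kernel_mount_opts("5.15.0-100-generic")  # -> 'name=0,wsync,ms_mode=legacy'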

@lxbsz
Member

lxbsz commented Aug 21, 2024

Hi @vshankar. I think the following failures are related to this PR -

    2024-08-13T05:20:16.778 DEBUG:teuthology.orchestra.run.smithi028:> sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage /bin/mount -t ceph :/ /home/ubuntu/cephtest/mnt.0 -v -o norequire_active_mds,conf=/etc/ceph/ceph.conf,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync
    2024-08-13T05:20:16.821 INFO:teuthology.orchestra.run.smithi028.stdout:parsing options: rw,norequire_active_mds,conf=/etc/ceph/ceph.conf,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync
    2024-08-13T05:20:16.822 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: options "norequire_active_mds,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync".
    2024-08-13T05:20:16.822 INFO:teuthology.orchestra.run.smithi028.stdout:invalid new device string format
    2024-08-13T05:20:16.969 INFO:teuthology.orchestra.run.smithi028.stderr:mount error 22 = Invalid argument
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:parsing options: rw,norequire_active_mds,conf=/etc/ceph/ceph.conf,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: options "norequire_active_mds,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync".
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:invalid new device string format
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: resolved to: "172.21.15.28:6789,172.21.15.53:6789,172.21.15.155:6789"
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: trying mount with old device syntax: 172.21.15.28:6789,172.21.15.53:6789,172.21.15.155:6789:/
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: options "norequire_active_mds,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync,key=0,fsid=cc3ed6a8-5930-11ef-bcce-c7b262605968" will pass to kernel
    2024-08-13T05:20:16.972 DEBUG:teuthology.orchestra.run:got remote process result: 32
    2024-08-13T05:20:16.973 INFO:tasks.cephfs.kernel_mount:mount command failed
    2024-08-13T05:20:16.974 ERROR:teuthology.run_tasks:Saw exception from tasks.

For this run, the kernel ring buffer shows

2024-08-13T05:20:16.962741+00:00 smithi028 kernel: [ 1809.357395] ceph: loaded (mds proto 32)
2024-08-13T05:20:16.974761+00:00 smithi028 kernel: [ 1809.362716] libceph: bad option at 'ms_mode=legacy'

ms_mode=legacy is coming from the yaml itself:

fs/workload/{begin/{0-install 1-cephadm 2-logrotate} clusters/1a11s-mds-1c-client-3node conf/{client mds mgr mon osd} mount/kclient/{mount-syntax/{v1} mount overrides/{distro/stock/{k-stock ubuntu_latest} ms-die-on-skipped}} ms_mode/{legacy} objectstore-ec/bluestore-comp-ec-root omap_limit/10000 overrides/{frag ignorelist_health ignorelist_wrongly_marked_down osd-asserts pg_health session_timeout} ranks/3 scrub/yes standby-replay tasks/{0-check-counter 3-snaps/yes workunit/fs/misc} ubuntu_latest wsync/{yes}}

and this is ubuntu 20.04 (focal) with the stock kernel -- so this is trying to use the v1 messenger protocol, which is getting denied by the kernel driver.

cc @lxbsz

This kernel doesn't include the following commit yet:

commit 00498b994113a871a556f7ff24a4cf8a00611700
Author: Ilya Dryomov <idryomov@gmail.com>
Date:   Thu Nov 19 16:04:58 2020 +0100

    libceph: introduce connection modes and ms_mode option
    
    msgr2 supports two connection modes: crc (plain) and secure (on-wire
    encryption).  Connection mode is picked by server based on input from
    client.
    
    Introduce ms_mode option:
    
      ms_mode=legacy        - msgr1 (default)
      ms_mode=crc           - crc mode, if denied fail
      ms_mode=secure        - secure mode, if denied fail
      ms_mode=prefer-crc    - crc mode, if denied agree to secure mode
      ms_mode=prefer-secure - secure mode, if denied agree to crc mode
    
    ms_mode affects all connections, we don't separate connections to mons
    like it's done in userspace with ms_client_mode vs ms_mon_client_mode.
    
    For now the default is legacy, to be flipped to prefer-crc after some
    time.
    
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

@vshankar
Contributor Author

2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:======================================================================
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:ERROR: test_client_metrics_and_metadata (tasks.cephfs.test_mds_metrics.TestMDSMetrics)
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_e0e452b3a276271196a54dadb3ac706afad6f142/qa/tasks/cephfs/test_mds_metrics.py", line 541, in test_client_metrics_and_metadata
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:    raise RuntimeError("valid_metrics of fs1 not found!")
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:RuntimeError: valid_metrics of fs1 not found!
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:Ran 1 test in 94.801s

This failure is again related to the ubuntu 20.04 kernel driver, which sends an empty metric feature bitset:

-3014> 2024-08-13T06:28:39.288+0000 7f8659dab700 20 mds.0.server  metric specification: [{metric_flags: '0x0'}]
 -3013> 2024-08-13T06:28:39.288+0000 7f8659dab700 20 mds.0.server   entity_id: 1
 -3012> 2024-08-13T06:28:39.288+0000 7f8659dab700 20 mds.0.server   hostname: smithi094
 -3011> 2024-08-13T06:28:39.288+0000 7f8659dab700 20 mds.0.server   kernel_version: 5.4.0-192-generic
 -3010> 2024-08-13T06:28:39.288+0000 7f8659dab700 20 mds.0.server   root: /
 -3009> 2024-08-13T06:28:39.288+0000 7f8659dab700 10 mds.0.sessionmap add_session s=0x563ddda3be00 name=client.5860

It's kclient version 5.4.0, which probably does not have the metric changes, @lxbsz?

@lxbsz
Member

lxbsz commented Aug 21, 2024

2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:======================================================================
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:ERROR: test_client_metrics_and_metadata (tasks.cephfs.test_mds_metrics.TestMDSMetrics)
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_e0e452b3a276271196a54dadb3ac706afad6f142/qa/tasks/cephfs/test_mds_metrics.py", line 541, in test_client_metrics_and_metadata
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:    raise RuntimeError("valid_metrics of fs1 not found!")
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:RuntimeError: valid_metrics of fs1 not found!
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:Ran 1 test in 94.801s

This failure is again related to the ubuntu 20.04 kernel driver, which sends an empty metric feature bitset:

-3014> 2024-08-13T06:28:39.288+0000 7f8659dab700 20 mds.0.server  metric specification: [{metric_flags: '0x0'}]
 -3013> 2024-08-13T06:28:39.288+0000 7f8659dab700 20 mds.0.server   entity_id: 1
 -3012> 2024-08-13T06:28:39.288+0000 7f8659dab700 20 mds.0.server   hostname: smithi094
 -3011> 2024-08-13T06:28:39.288+0000 7f8659dab700 20 mds.0.server   kernel_version: 5.4.0-192-generic
 -3010> 2024-08-13T06:28:39.288+0000 7f8659dab700 20 mds.0.server   root: /
 -3009> 2024-08-13T06:28:39.288+0000 7f8659dab700 10 mds.0.sessionmap add_session s=0x563ddda3be00 name=client.5860

It's kclient version 5.4.0, which probably does not have the metric changes, @lxbsz?

Yeah, correct.

@vshankar
Contributor Author

2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:======================================================================
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:ERROR: test_client_metrics_and_metadata (tasks.cephfs.test_mds_metrics.TestMDSMetrics)
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_e0e452b3a276271196a54dadb3ac706afad6f142/qa/tasks/cephfs/test_mds_metrics.py", line 541, in test_client_metrics_and_metadata
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:    raise RuntimeError("valid_metrics of fs1 not found!")
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:RuntimeError: valid_metrics of fs1 not found!
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:Ran 1 test in 94.801s

This failure is again related to the ubuntu 20.04 kernel driver, which sends an empty metric feature bitset:

-3014> 2024-08-13T06:28:39.288+0000 7f8659dab700 20 mds.0.server  metric specification: [{metric_flags: '0x0'}]
 -3013> 2024-08-13T06:28:39.288+0000 7f8659dab700 20 mds.0.server   entity_id: 1
 -3012> 2024-08-13T06:28:39.288+0000 7f8659dab700 20 mds.0.server   hostname: smithi094
 -3011> 2024-08-13T06:28:39.288+0000 7f8659dab700 20 mds.0.server   kernel_version: 5.4.0-192-generic
 -3010> 2024-08-13T06:28:39.288+0000 7f8659dab700 20 mds.0.server   root: /
 -3009> 2024-08-13T06:28:39.288+0000 7f8659dab700 10 mds.0.sessionmap add_session s=0x563ddda3be00 name=client.5860

It's kclient version 5.4.0, which probably does not have the metric changes, @lxbsz?

Yeah, correct.

In that case, let's track this in redmine, as these are known issues with using the stock kernel on ubuntu 20.04.

@lxbsz
Member

lxbsz commented Aug 22, 2024

It's kclient version 5.4.0, which probably does not have the metric changes, @lxbsz?

Yeah, correct.

In that case, let's track this in redmine, as these are known issues with using the stock kernel on ubuntu 20.04.

Or should we also just skip these tests?

@vshankar
Contributor Author

It's kclient version 5.4.0, which probably does not have the metric changes, @lxbsz?

Yeah, correct.

In that case, let's track this in redmine, as these are known issues with using the stock kernel on ubuntu 20.04.

Or should we also just skip these tests?

That's fine too, but it would involve a custom quincy patch that needs to be reverted once we get back to testing on the latest distros.
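
For reference, a minimal sketch of what such a quincy-only skip guard could look like (the helper and its wiring are hypothetical, not the actual patch; the grounded fact is that the stock ubuntu 20.04 kernel, 5.4.x per the MDS log above, reports no client metrics):

    import unittest

    def skip_if_kclient_metrics_unsupported(kernel_release):
        # Stock ubuntu 20.04 ships 5.4.x, which predates kclient metrics
        # support, so the TestMDSMetrics cases cannot pass there.
        if kernel_release.startswith("5.4."):
            raise unittest.SkipTest(
                "kernel %s does not report kclient metrics" % kernel_release)

Callers would invoke this early in the affected TestMDSMetrics cases with the client's uname -r, and, as noted, the guard would have to be reverted once testing moves back to newer distros.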

@vshankar
Contributor Author

jenkins test api

1 similar comment
@vshankar
Contributor Author

vshankar commented Sep 9, 2024

jenkins test api

@vshankar
Contributor Author

vshankar commented Sep 9, 2024

Hi @vshankar. I think the following failures are related to this PR -

    2024-08-13T05:20:16.778 DEBUG:teuthology.orchestra.run.smithi028:> sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage /bin/mount -t ceph :/ /home/ubuntu/cephtest/mnt.0 -v -o norequire_active_mds,conf=/etc/ceph/ceph.conf,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync
    2024-08-13T05:20:16.821 INFO:teuthology.orchestra.run.smithi028.stdout:parsing options: rw,norequire_active_mds,conf=/etc/ceph/ceph.conf,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync
    2024-08-13T05:20:16.822 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: options "norequire_active_mds,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync".
    2024-08-13T05:20:16.822 INFO:teuthology.orchestra.run.smithi028.stdout:invalid new device string format
    2024-08-13T05:20:16.969 INFO:teuthology.orchestra.run.smithi028.stderr:mount error 22 = Invalid argument
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:parsing options: rw,norequire_active_mds,conf=/etc/ceph/ceph.conf,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: options "norequire_active_mds,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync".
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:invalid new device string format
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: resolved to: "172.21.15.28:6789,172.21.15.53:6789,172.21.15.155:6789"
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: trying mount with old device syntax: 172.21.15.28:6789,172.21.15.53:6789,172.21.15.155:6789:/
    2024-08-13T05:20:16.970 INFO:teuthology.orchestra.run.smithi028.stdout:mount.ceph: options "norequire_active_mds,norbytes,name=0,mds_namespace=cephfs,ms_mode=legacy,wsync,key=0,fsid=cc3ed6a8-5930-11ef-bcce-c7b262605968" will pass to kernel
    2024-08-13T05:20:16.972 DEBUG:teuthology.orchestra.run:got remote process result: 32
    2024-08-13T05:20:16.973 INFO:tasks.cephfs.kernel_mount:mount command failed
    2024-08-13T05:20:16.974 ERROR:teuthology.run_tasks:Saw exception from tasks.
2024-08-13T05:29:58.237 INFO:tasks.cephfs_test_runner:======================================================================
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:ERROR: test_cap_acquisition_throttle_readdir (tasks.cephfs.test_client_limits.TestClientLimits)
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:Mostly readdir acquires caps faster than the mds recalls, so the cap
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_e0e452b3a276271196a54dadb3ac706afad6f142/qa/tasks/cephfs/test_client_limits.py", line 189, in test_cap_acquisition_throttle_readdir
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:    cap_acquisition_value = self.get_session(mount_a_client_id)['cap_acquisition']['value']
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_e0e452b3a276271196a54dadb3ac706afad6f142/qa/tasks/cephfs/cephfs_test_case.py", line 257, in get_session
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:    return self._session_by_id(session_ls)[client_id]
2024-08-13T05:29:58.238 INFO:tasks.cephfs_test_runner:KeyError: '5358'
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:======================================================================
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:ERROR: test_client_metrics_and_metadata (tasks.cephfs.test_mds_metrics.TestMDSMetrics)
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_e0e452b3a276271196a54dadb3ac706afad6f142/qa/tasks/cephfs/test_mds_metrics.py", line 541, in test_client_metrics_and_metadata
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:    raise RuntimeError("valid_metrics of fs1 not found!")
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:RuntimeError: valid_metrics of fs1 not found!
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-08-13T06:29:02.519 INFO:tasks.cephfs_test_runner:Ran 1 test in 94.801s

There are more new failures in this run, and other QA runs including this PR also have the failures mentioned above -

https://pulpito.ceph.com/xiubli-2024-08-13_04:53:58-fs-wip-xiubli-testing-20240812.051138-quincy-distro-default-smithi/ https://pulpito.ceph.com/xiubli-2024-08-13_04:48:23-fs-wip-jcollin-testing-20240812.053224-quincy-distro-default-smithi/ https://pulpito.ceph.com/xiubli-2024-08-13_10:32:13-fs-wip-xiubli-testing-20240813.052545-quincy-distro-default-smithi/

So, I've been going over https://pulpito.ceph.com/xiubli-2024-08-13_04:29:48-fs-wip-xiubli-testing-20240812.103236-quincy-distro-default-smithi/ (QA tracker: https://tracker.ceph.com/issues/67495) and these failures are popping up. However, given that switching to Ubuntu is the only choice we have right now (due to the centos mess), we should just track these issues in redmine tickets and revisit them when we have the relevant distros for testing. @rishabh-d-dave

@vshankar
Contributor Author

vshankar commented Oct 7, 2024

Okay. After some delay, I'm getting back to this -- I will create quincy-specific redmine tickets and merge the set of PRs that are pending for Quincy.

@vshankar
Contributor Author

Okay. After some delay, I'm getting back to this -- I will create quincy-specific redmine tickets and merge the set of PRs that are pending for Quincy.

I'll start merging Quincy backport PRs after preparing the run wiki, if no other new failures are seen related to the PRs being tested.

@vshankar
Contributor Author

@vshankar vshankar merged commit 553d486 into ceph:quincy Oct 16, 2024

Labels

cephfs Ceph File System, tests
