
qa: fix "no orch backend set" in nfs suite#53594

Merged
vshankar merged 1 commit intoceph:mainfrom
dparmar18:wip-62870
Sep 29, 2023

Conversation

@dparmar18
Contributor

@dparmar18 dparmar18 commented Sep 22, 2023

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@vshankar
Contributor

@dparmar18 For commit 0576645, could you explain the issue and how the change fixes it?

@dparmar18
Contributor Author

Okay, so here's the story:

Post-#52708 merge, the NFS tests started failing with:

2023-09-07T13:06:57.204 INFO:teuthology.orchestra.run.smithi143.stderr:Error ENOENT: No orchestrator configured (try `ceph orch set backend`)

This happened because the cluster was instantiated the same way the other sub-suites in the fs suite do it. The problem with that is that NFS commands then fail: the cluster complains that it has no backend set. And setting the backend is not straightforward, because if we run `ceph orch set backend cephadm` in the test class itself (i.e. in setUp()), it complains:

2023-09-21T12:33:52.923 INFO:tasks.ceph.mgr.x.smithi002.stderr:  File "/usr/share/ceph/mgr/cephadm/module.py", line 3062, in _apply_service_spec
2023-09-21T12:33:52.923 INFO:tasks.ceph.mgr.x.smithi002.stderr:    raise OrchestratorError((f'The maximum number of {spec.service_type} daemons allowed with {host_count} hosts is {host_count*max_count} ({host_count}x{max_count}).'
2023-09-21T12:33:52.923 INFO:tasks.ceph.mgr.x.smithi002.stderr:orchestrator._interface.OrchestratorError: The maximum number of nfs daemons allowed with 0 hosts is 0 (0x10). This limit can be adjusted by changing the mgr/cephadm/max_count_per_host config option
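The failure mode in that log is easy to model in miniature: cephadm caps the daemons of a service type at hosts × mgr/cephadm/max_count_per_host (default 10), so with zero hosts added the cap is zero and any nfs apply is rejected. A simplified sketch of that check (function names here are illustrative, not cephadm's actual internals):

```python
# Simplified model of cephadm's per-host daemon cap check
# (see _apply_service_spec in mgr/cephadm/module.py for the real thing).

def max_daemons_allowed(host_count: int, max_count_per_host: int = 10) -> int:
    """Upper bound on daemons of one service type for the whole cluster."""
    return host_count * max_count_per_host

def check_apply(service_type: str, requested: int, host_count: int) -> None:
    """Raise if the requested daemon count exceeds the cluster-wide cap."""
    limit = max_daemons_allowed(host_count)
    if requested > limit:
        raise RuntimeError(
            f"The maximum number of {service_type} daemons allowed "
            f"with {host_count} hosts is {limit}."
        )

# With the backend set but no hosts added (cluster not bootstrapped via
# cephadm), even a single nfs daemon exceeds the limit of 0:
try:
    check_apply("nfs", requested=1, host_count=0)
except RuntimeError as e:
    print(e)
```

This is why merely setting the backend in setUp() is not enough: the orchestrator also needs hosts, which only a cephadm bootstrap provides.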

So to overcome this we need cephadm to not only set the backend but also bootstrap the cluster and add the hosts. That's why the NFS tests had been using qa/tasks/cephadm.py, i.e. the YAML below, back when they lived in the orch/cephadm suite:

tasks:
- install:
- cephadm:

Even if we add the above YAML tasks to the current setup, it fails: the cluster gets instantiated by qa/tasks/ceph.py, and then qa/tasks/cephadm.py tries to orchestrate on top of it, which fails with [0]:

'Namespace' object has no attribute 'bootstrapped'

qa/tasks/ceph.py is used by the begin dir's 0-install.yaml in the NFS sub-suite, so combining begin's steps with qa/tasks/cephadm.py is a disaster, and using begin alone is even worse.

So it is mandatory to orchestrate the cluster using the cephadm tasks, and we have already been doing that in the fs/cephadm sub-suite, whose setup mirrors what orch/cephadm used to run the NFS tests. Long story short: we can't run the NFS tests with the fs suite's usual way of constructing the cluster, and need to follow the setup in fs/cephadm and/or what orch/cephadm does.
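For illustration, a minimal sketch of a cephadm-style fragment in that spirit (the task names `install`, `cephadm`, `cephadm.shell`, and `cephfs_test_runner` are real teuthology tasks, but this exact combination and the module path are illustrative assumptions — fs/cephadm and the old orch/cephadm YAML are the authoritative references, not this sketch):

```yaml
tasks:
- install:
- cephadm:           # bootstraps the cluster, adds hosts, sets the orch backend
- cephadm.shell:
    host.a:
      - ceph orch apply mds a   # MDSs via the orchestrator, not the roles list
- cephfs_test_runner:
    modules:
      - tasks.cephfs.test_nfs   # illustrative: the NFS test module
```

The key point is that `cephadm:` replaces the `ceph:` task from begin's 0-install.yaml, so the orchestrator backend and hosts exist before any NFS command runs.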

One more question one may ask: having the cephadm tasks is fine, but why bootstrap the MDSs using:

- cephadm.shell:
    host.a:
      - ceph orch apply mds a

and not just declare two MDSs directly in the YAML file?

Well, I did try that, and as expected the job failed [1]. Declaring the MDSs in the YAML changes how they are bootstrapped, so we end up with a default CephFS named cephfs. Some test cases in test_nfs.py create their own CephFS (so there are now two filesystems), mount with ceph-fuse (without --client_fs), and then check for a particular path when creating CephFS exports. The mount picks the default fs, i.e. cephfs, and the check fails because the path doesn't exist there. One may say we could pass --client_fs in the test cases, and we could, but that isn't worth the effort when we can just keep bootstrapping the MDSs with ceph orch apply mds a instead of declaring them in the YAML.
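To make the mount pitfall concrete, a hedged CLI sketch (the filesystem name and mountpoint are hypothetical; `ceph fs volume create` and ceph-fuse's `--client_fs` option are real, but this transcript is illustrative, not from the failed job):

```shell
# Two filesystems exist: the default "cephfs" from the YAML-declared MDSs,
# plus the one a test case creates itself:
ceph fs volume create user_test_fs

# Without --client_fs, ceph-fuse mounts the default fs ("cephfs"), so a path
# that only exists in user_test_fs appears to be missing:
ceph-fuse /mnt/test

# Pinning the mount would avoid that, at the cost of touching every such test:
ceph-fuse --client_fs user_test_fs /mnt/test
```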

To support all of this, here are two successful jobs:
http://pulpito.front.sepia.ceph.com/dparmar-2023-09-22_14:16:34-fs:nfs-wip-62870-distro-default-smithi/
http://pulpito.front.sepia.ceph.com/dparmar-2023-09-22_15:26:51-fs:nfs-wip-62870-distro-default-smithi/


[0] http://pulpito.front.sepia.ceph.com/dparmar-2023-09-21_14:36:37-fs:nfs-fix-nfs-apply-err-reporting-distro-default-smithi/
[1] http://pulpito.front.sepia.ceph.com/dparmar-2023-09-22_12:42:29-fs:nfs-wip-62870-distro-default-smithi/

@dparmar18
Contributor Author

@vshankar @adk3798 ^^

@dparmar18 dparmar18 marked this pull request as ready for review September 22, 2023 21:50
@dparmar18 dparmar18 requested review from a team and adk3798 September 22, 2023 21:50
Fixes: https://tracker.ceph.com/issues/62870
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
@dparmar18
Contributor Author

The last push made no code changes; it just added the Fixes line to the commit message.

@dparmar18
Contributor Author

Oh, I forgot to mention why I removed objectstore: a) it is not needed since we're not dealing with anything complex, and b) I'm following the blueprint of the fs/cephadm sub-suite; orch/cephadm never had it either and its tests always ran fine, which is further supporting proof. All in all, I think this is good to merge ASAP since NFS testing is currently blocked.

@vshankar
Contributor

> Okay so here's the story:
>
> basically, post #52708 merge; NFS tests started failing due to:
>
> 2023-09-07T13:06:57.204 INFO:teuthology.orchestra.run.smithi143.stderr:Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
>
> […]
>
> qa/tasks/ceph.py is used in begin dir's 0-install.yaml in the NFS sub-suite therefore using begin's steps and qa/tasks/cephadm.py is a disaster while just using begin is even worse.

Looks fine till here.

> One more question one may ask: Having the cephadm tasks is fine but why bootstrap MDS clusters using `ceph orch apply mds a` and not just mention two MDSs in the yaml file itself directly?
>
> Well I did try it out and as expected the job failed [1] […]

Fair enough.

> To support all of these, here are two successful jobs:
> http://pulpito.front.sepia.ceph.com/dparmar-2023-09-22_14:16:34-fs:nfs-wip-62870-distro-default-smithi/
> http://pulpito.front.sepia.ceph.com/dparmar-2023-09-22_15:26:51-fs:nfs-wip-62870-distro-default-smithi/

👍 Nice work @dparmar18

@dparmar18
Contributor Author

Can this be merged? This is blocking some PRs from being tested.

@vshankar
Contributor

> can this be merged? this is blocking some PRs to be tested

Running this through QA - will be merged soon (subset test).

@dparmar18
Contributor Author

https://pulpito.ceph.com/?branch=wip-vshankar-testing-20230926.081818

2023-09-26T16:43:04.340 DEBUG:teuthology.orchestra.run.smithi186:> sudo /home/ubuntu/cephtest/cephadm --image quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:a2e911cf76140ce8227d2acb6dc462b727acb78c pull
2023-09-26T16:43:04.513 INFO:teuthology.orchestra.run.smithi186.stderr:Pulling container image quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:a2e911cf76140ce8227d2acb6dc462b727acb78c...
2023-09-26T16:43:45.465 INFO:teuthology.orchestra.run.smithi186.stderr:Non-zero exit code 125 from /usr/bin/docker run --rm --ipc=host --stop-signal=SIGTERM --ulimit nofile=1048576 --net=host --entrypoint ceph --init -e CONTAINER_IMAGE=quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:a2e911cf76140ce8227d2acb6dc462b727acb78c -e NODE_NAME=smithi186 quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:a2e911cf76140ce8227d2acb6dc462b727acb78c --version
2023-09-26T16:43:45.466 INFO:teuthology.orchestra.run.smithi186.stderr:ceph: stderr docker: Error response from daemon: failed to create task for container: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: bpf_prog_query(BPF_CGROUP_DEVICE) failed: invalid argument: unknown.
2023-09-26T16:43:45.466 INFO:teuthology.orchestra.run.smithi186.stderr:Traceback (most recent call last):
2023-09-26T16:43:45.466 INFO:teuthology.orchestra.run.smithi186.stderr:  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2023-09-26T16:43:45.469 INFO:teuthology.orchestra.run.smithi186.stderr:    return _run_code(code, main_globals, None,
2023-09-26T16:43:45.469 INFO:teuthology.orchestra.run.smithi186.stderr:  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2023-09-26T16:43:45.469 INFO:teuthology.orchestra.run.smithi186.stderr:    exec(code, run_globals)
2023-09-26T16:43:45.469 INFO:teuthology.orchestra.run.smithi186.stderr:  File "/tmp/tmpp8p6emph.cephadm.build/__main__.py", line 8193, in <module>
2023-09-26T16:43:45.469 INFO:teuthology.orchestra.run.smithi186.stderr:  File "/tmp/tmpp8p6emph.cephadm.build/__main__.py", line 8181, in main
2023-09-26T16:43:45.470 INFO:teuthology.orchestra.run.smithi186.stderr:  File "/tmp/tmpp8p6emph.cephadm.build/__main__.py", line 1644, in _default_image
2023-09-26T16:43:45.470 INFO:teuthology.orchestra.run.smithi186.stderr:  File "/tmp/tmpp8p6emph.cephadm.build/__main__.py", line 4083, in command_pull
2023-09-26T16:43:45.470 INFO:teuthology.orchestra.run.smithi186.stderr:  File "/tmp/tmpp8p6emph.cephadm.build/cephadmlib/decorators.py", line 27, in _require_image
2023-09-26T16:43:45.470 INFO:teuthology.orchestra.run.smithi186.stderr:  File "/tmp/tmpp8p6emph.cephadm.build/__main__.py", line 1635, in _infer_image
2023-09-26T16:43:45.470 INFO:teuthology.orchestra.run.smithi186.stderr:  File "/tmp/tmpp8p6emph.cephadm.build/__main__.py", line 4136, in command_inspect_image
2023-09-26T16:43:45.470 INFO:teuthology.orchestra.run.smithi186.stderr:  File "/tmp/tmpp8p6emph.cephadm.build/cephadmlib/container_types.py", line 400, in run
2023-09-26T16:43:45.470 INFO:teuthology.orchestra.run.smithi186.stderr:  File "/tmp/tmpp8p6emph.cephadm.build/cephadmlib/call_wrappers.py", line 307, in call_throws
2023-09-26T16:43:45.470 INFO:teuthology.orchestra.run.smithi186.stderr:RuntimeError: Failed command: /usr/bin/docker run --rm --ipc=host --stop-signal=SIGTERM --ulimit nofile=1048576 --net=host --entrypoint ceph --init -e CONTAINER_IMAGE=quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:a2e911cf76140ce8227d2acb6dc462b727acb78c -e NODE_NAME=smithi186 quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:a2e911cf76140ce8227d2acb6dc462b727acb78c --version: docker: Error response from daemon: failed to create task for container: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: bpf_prog_query(BPF_CGROUP_DEVICE) failed: invalid argument: unknown.
2023-09-26T16:43:45.470 INFO:teuthology.orchestra.run.smithi186.stderr:
2023-09-26T16:43:45.489 DEBUG:teuthology.orchestra.run:got remote process result: 1
2023-09-26T16:43:45.490 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_54e62bcbac4e53d9685e08328b790d3b20d71cae/teuthology/contextutil.py", line 30, in nested
    vars.append(enter())
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_a2e911cf76140ce8227d2acb6dc462b727acb78c/qa/tasks/cephadm.py", line 433, in pull_image
    run.wait(
  File "/home/teuthworker/src/git.ceph.com_teuthology_54e62bcbac4e53d9685e08328b790d3b20d71cae/teuthology/orchestra/run.py", line 479, in wait
    proc.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_54e62bcbac4e53d9685e08328b790d3b20d71cae/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_54e62bcbac4e53d9685e08328b790d3b20d71cae/teuthology/orchestra/run.py", line 181, in _raise_for_status
    raise CommandFailedError(
teuthology.exceptions.CommandFailedError: Command failed on smithi186 with status 1: 'sudo /home/ubuntu/cephtest/cephadm --image quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:a2e911cf76140ce8227d2acb6dc462b727acb78c pull'

This shouldn't have occurred.

@vshankar
Contributor

> https://pulpito.ceph.com/dparmar-2023-09-26_19:06:28-fs:nfs-wip-62870-distro-default-smithi/

I see this passes, but both of my fs:nfs runs fail which likely means we are doing something different :)

@dparmar18
Contributor Author

> https://pulpito.ceph.com/dparmar-2023-09-26_19:06:28-fs:nfs-wip-62870-distro-default-smithi/
>
> I see this passes, but both of my fs:nfs runs fail which likely means we are doing something different :)

The error doesn't relate to the code. Some issue with the branch, I guess? Rebuilding the branch might help?

@vshankar
Contributor

> > https://pulpito.ceph.com/dparmar-2023-09-26_19:06:28-fs:nfs-wip-62870-distro-default-smithi/
> >
> > I see this passes, but both of my fs:nfs runs fail which likely means we are doing something different :)
>
> The error doesn't relate to the code. Some issue with the branch i guess? Rebuilding the branch might help?

The branch is just a bunch of PRs built in Shaman; not sure what can go wrong with that (only one nfs-related change).

@vshankar
Contributor

heh - https://pulpito.ceph.com/vshankar-2023-09-26_14:09:41-fs-wip-vshankar-testing-20230926.081818-testing-default-smithi/7402468/

The run from the full fs suite passed.

Not sure if it's related to the distro in use. rhel_8 passes but not the centos or ubuntu in my run. Could you please check @dparmar18?

@dparmar18
Contributor Author

> heh - https://pulpito.ceph.com/vshankar-2023-09-26_14:09:41-fs-wip-vshankar-testing-20230926.081818-testing-default-smithi/7402468/
> the from the full fs suite passed.
>
> Not sure if its related to the distro in use. rhel_8 passes but no the centos or ubuntu in my run. Could you please check @dparmar18?

I'm currently away so I won't be able to check atm, but the three runs I had were on Ubuntu 20.04, CentOS 9 and RHEL 8, and all of them passed:

Ubuntu: https://pulpito.ceph.com/dparmar-2023-09-26_19:06:28-fs:nfs-wip-62870-distro-default-smithi/

Centos: https://pulpito.ceph.com/dparmar-2023-09-22_15:26:51-fs:nfs-wip-62870-distro-default-smithi/

Rhel: https://pulpito.ceph.com/dparmar-2023-09-22_14:16:34-fs:nfs-wip-62870-distro-default-smithi/

Seems like something went wrong with the builds

@vshankar
Contributor

> Seems like something went wrong with the builds

Looks more like an infra-related issue to me.

@vshankar vshankar merged commit 6d8679e into ceph:main Sep 29, 2023
vshankar added a commit to vshankar/ceph that referenced this pull request Oct 7, 2023
* refs/pull/53594/head:
	qa: fix "no orch backend set" in nfs suite

Reviewed-by: Adam King <adking@redhat.com>