Bug #69899

open

Test failure: test_exports_on_mgr_restart (tasks.cephfs.test_nfs.TestNFS)

Added by Kumar Hemanth about 1 year ago. Updated about 1 year ago.

Status:
New
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

https://pulpito.ceph.com/hyelloji-2025-01-15_07:12:26-fs-wip-hemanth1-testing-2025-01-13-1558-reef-distro-default-smithi/8078229/

2025-01-16T17:59:38.202 INFO:tasks.cephfs_test_runner:Test export availability on restarting mgr. ... ERROR
2025-01-16T17:59:38.203 INFO:tasks.cephfs_test_runner:
2025-01-16T17:59:38.203 INFO:tasks.cephfs_test_runner:======================================================================
2025-01-16T17:59:38.203 INFO:tasks.cephfs_test_runner:ERROR: test_exports_on_mgr_restart (tasks.cephfs.test_nfs.TestNFS)
2025-01-16T17:59:38.203 INFO:tasks.cephfs_test_runner:Test export availability on restarting mgr.
2025-01-16T17:59:38.204 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2025-01-16T17:59:38.204 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2025-01-16T17:59:38.204 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_35aa12e8113161a135a96fd9b5a767dd61a532c4/qa/tasks/cephfs/test_nfs.py", line 503, in test_exports_on_mgr_restart
2025-01-16T17:59:38.204 INFO:tasks.cephfs_test_runner:    self._load_module("cephadm")
2025-01-16T17:59:38.204 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_35aa12e8113161a135a96fd9b5a767dd61a532c4/qa/tasks/mgr/mgr_test_case.py", line 176, in _load_module
2025-01-16T17:59:38.204 INFO:tasks.cephfs_test_runner:    cls.wait_until_true(has_restarted, timeout=30)
2025-01-16T17:59:38.204 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_35aa12e8113161a135a96fd9b5a767dd61a532c4/qa/tasks/ceph_test_case.py", line 347, in wait_until_true
2025-01-16T17:59:38.204 INFO:tasks.cephfs_test_runner:    raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
2025-01-16T17:59:38.204 INFO:tasks.cephfs_test_runner:tasks.ceph_test_case.TestTimeoutError: Timed out after 30s and 0 retries
2025-01-16T17:59:38.204 INFO:tasks.cephfs_test_runner:
2025-01-16T17:59:38.204 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2025-01-16T17:59:38.204 INFO:tasks.cephfs_test_runner:Ran 16 tests in 588.660s
2025-01-16T17:59:38.204 INFO:tasks.cephfs_test_runner:
2025-01-16T17:59:38.204 INFO:tasks.cephfs_test_runner:FAILED (errors=1)
2025-01-16T17:59:38.205 INFO:tasks.cephfs_test_runner:
2025-01-16T17:59:38.205 INFO:tasks.cephfs_test_runner:======================================================================
2025-01-16T17:59:38.205 INFO:tasks.cephfs_test_runner:ERROR: test_exports_on_mgr_restart (tasks.cephfs.test_nfs.TestNFS)
2025-01-16T17:59:38.205 INFO:tasks.cephfs_test_runner:Test export availability on restarting mgr.
2025-01-16T17:59:38.205 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2025-01-16T17:59:38.205 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2025-01-16T17:59:38.205 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_35aa12e8113161a135a96fd9b5a767dd61a532c4/qa/tasks/cephfs/test_nfs.py", line 503, in test_exports_on_mgr_restart
2025-01-16T17:59:38.205 INFO:tasks.cephfs_test_runner:    self._load_module("cephadm")
2025-01-16T17:59:38.205 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_35aa12e8113161a135a96fd9b5a767dd61a532c4/qa/tasks/mgr/mgr_test_case.py", line 176, in _load_module
2025-01-16T17:59:38.206 INFO:tasks.cephfs_test_runner:    cls.wait_until_true(has_restarted, timeout=30)
2025-01-16T17:59:38.206 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_35aa12e8113161a135a96fd9b5a767dd61a532c4/qa/tasks/ceph_test_case.py", line 347, in wait_until_true
2025-01-16T17:59:38.206 INFO:tasks.cephfs_test_runner:    raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
2025-01-16T17:59:38.206 INFO:tasks.cephfs_test_runner:tasks.ceph_test_case.TestTimeoutError: Timed out after 30s and 0 retries
2025-01-16T17:59:38.206 INFO:tasks.cephfs_test_runner:
2025-01-16T17:59:38.206 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_42294faf23294f557d349df1bd69657772519729/teuthology/run_tasks.py", line 112, in run_tasks
    manager.__enter__()
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_35aa12e8113161a135a96fd9b5a767dd61a532c4/qa/tasks/cephfs_test_runner.py", line 211, in task
    raise RuntimeError("Test failure: {0}".format(", ".join(bad_tests)))
RuntimeError: Test failure: test_exports_on_mgr_restart (tasks.cephfs.test_nfs.TestNFS)
2025-01-16T17:59:38.381 ERROR:teuthology.util.sentry: Sentry event: https://sentry.ceph.com/organizations/ceph/?query=8b59cad8622645adb9c9e500629b397e
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_42294faf23294f557d349df1bd69657772519729/teuthology/run_tasks.py", line 112, in run_tasks
    manager.__enter__()
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_35aa12e8113161a135a96fd9b5a767dd61a532c4/qa/tasks/cephfs_test_runner.py", line 211, in task
    raise RuntimeError("Test failure: {0}".format(", ".join(bad_tests)))
RuntimeError: Test failure: test_exports_on_mgr_restart (tasks.cephfs.test_nfs.TestNFS)
Actions #1

Updated by Milind Changire about 1 year ago

  • Assignee set to Dhairya Parmar
Actions #2

Updated by Dhairya Parmar about 1 year ago · Edited

@Kumar Hemanth

The cephadm module failed to load in time. I have not faced this before; what PRs were part of wip-hemanth1-testing-2025-01-13-1558-reef? Looks like an intermittent failure.

Actions #3

Updated by Nitzan Mordechai about 1 year ago

/a/yuriw-2025-02-12_15:34:28-orch-wip-yuri4-testing-2025-02-08-0943-reef-distro-default-smithi/8126631

Actions #4

Updated by Venky Shankar about 1 year ago

  • Description updated (diff)

@Dhairya Parmar - Could you please update the actual cephadm load error?

Actions #5

Updated by Dhairya Parmar about 1 year ago · Edited

Venky Shankar wrote in #note-4:

@Dhairya Parmar - Could you please update the actual cephadm load error?

The stack trace points to cls.wait_until_true(has_restarted, timeout=30) in the method _load_module() (part of qa/tasks/mgr/mgr_test_case.py). Looking at the condition that is failing:

def has_restarted():
    mgr_map = cls.mgr_cluster.get_mgr_map()
    done = mgr_map['active_gid'] != initial_gid and mgr_map['available']
    if done:
        log.debug("Restarted after module load (new active {0}/{1})".format(
            mgr_map['active_name'], mgr_map['active_gid']))
    return done

It is basically looking for a change in the mgr active_gid, and the mgr (mgr.x in this case) should be available. The active_gid and mgr availability before the ceph mgr module enable cephadm --force are:

2025-01-16T17:58:59.941 INFO:teuthology.orchestra.run.smithi107.stdout:    "active_gid": 14318
2025-01-16T17:58:59.942 INFO:teuthology.orchestra.run.smithi107.stdout:    "available": true

After enabling cephadm, the active_gid remains the same:

2025-01-16T17:59:01.564 INFO:teuthology.orchestra.run.smithi107.stdout:    "active_gid": 14318
2025-01-16T17:59:01.565 INFO:teuthology.orchestra.run.smithi107.stdout:    "available": true

We have a 30s timeout with a sleep period of 5 seconds per iteration. In the second and third iterations the active_gid didn't change (and neither did the availability), but in later iterations the active_gid did change while the availability flipped (now false):

2025-01-16T17:59:19.021 INFO:teuthology.orchestra.run.smithi107.stdout:    "active_gid": 15112
2025-01-16T17:59:19.021 INFO:teuthology.orchestra.run.smithi107.stdout:    "available": false
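Plugging those log values into the condition makes the failure mode concrete (a minimal standalone sketch, not the test code itself):

```python
import json

# Evaluate has_restarted()'s condition against the mgr map values seen in
# the log at 17:59:19: the gid changed, but the mgr is not yet available.
mgr_map = json.loads('{"active_gid": 15112, "available": false}')
initial_gid = 14318  # active_gid before `ceph mgr module enable cephadm --force`

done = mgr_map['active_gid'] != initial_gid and mgr_map['available']
print(done)  # False: gid changed, but 'available' is false, so the AND fails
```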

Therefore done = mgr_map['active_gid'] != initial_gid and mgr_map['available'] never becomes true, and hence we see:

2025-01-16T17:59:38.204 INFO:tasks.cephfs_test_runner:    raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
2025-01-16T17:59:38.204 INFO:tasks.cephfs_test_runner:tasks.ceph_test_case.TestTimeoutError: Timed out after 30s and 0 retries
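For reference, the retry semantics described above can be mirrored with a minimal polling helper (a sketch only; the real implementation lives in qa/tasks/ceph_test_case.py and differs in detail):

```python
import time


class TestTimeoutError(RuntimeError):
    pass


def wait_until_true(condition, timeout, period=5):
    """Poll condition() every `period` seconds; raise TestTimeoutError once
    `timeout` seconds have elapsed without it ever returning True."""
    elapsed = 0
    while not condition():
        if elapsed >= timeout:
            raise TestTimeoutError("Timed out after {0}s".format(elapsed))
        time.sleep(period)
        elapsed += period
    return elapsed
```

With timeout=30 and period=5 this gives at most six checks of has_restarted(); since the mgr never became available within that window, the helper raised.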

Why the mgr daemon is not available is the question; I'm not sure why that is. @Venky Shankar, do you need me to dig deeper?

Actions #6

Updated by Venky Shankar about 1 year ago

It's possible that this is related to https://tracker.ceph.com/issues/68747

Actions #7

Updated by Venky Shankar about 1 year ago

  • Related to Bug #68747: fs:upgrade failure due to ceph-mgr not being ready added
Actions #8

Updated by Venky Shankar about 1 year ago

Dhairya, suggest checking if this is an instance of the related tracker I added.

Actions #9

Updated by Dhairya Parmar about 1 year ago

Venky Shankar wrote in #note-8:

Dhairya, suggest checking if this is an instance of the related tracker I added.

I'm not seeing any similarity as far as the teuthology logs are concerned; checking the mgr logs for better insights.

Actions #10

Updated by Venky Shankar about 1 year ago

  • Project changed from CephFS to Orchestrator

Dhairya Parmar wrote in #note-9:

Venky Shankar wrote in #note-8:

Dhairya, suggest checking if this is an instance of the related tracker I added.

I'm not seeing any similarity as far as the teuthology logs are concerned; checking the mgr logs for better insights.

Yeah - this is unrelated to cephfs. Moving this under orch project for @Adam King to have a look.

Actions #11

Updated by Venky Shankar about 1 year ago

  • Assignee changed from Dhairya Parmar to Adam King
Actions #12

Updated by Venky Shankar about 1 year ago

  • Related to deleted (Bug #68747: fs:upgrade failure due to ceph-mgr not being ready)