Bug #71239

closed

list_realms regression after change to ConfigStore

Added by Casey Bodley 11 months ago. Updated 9 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
% Done:
0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

A ceph api test started timing out after the merge of https://github.com/ceph/ceph/pull/62398, which switched one librados read request from the default librados client to the one owned by ConfigStore. ConfigStore's client was way behind in osdmaps and needed two retries before the request completed, often triggering a 45-second client timeout.

API test failure console output:

FAIL: test_get_realms (tasks.mgr.dashboard.test_rgw.RgwSiteTest)
Traceback (most recent call last):
  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/dashboard/test_rgw.py", line 86, in test_get_realms
    self.assertStatus(200)
  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/dashboard/helper.py", line 507, in assertStatus
    self.assertEqual(self._resp.status_code, status)
AssertionError: 500 != 200

> ip netns list
> sudo ip link delete ceph-brx
Cannot find device "ceph-brx" 


Related issues 1 (1 open, 0 closed)

Related to RADOS - Bug #71261: osdc: idle librados client can lead to Objecter getting way behind in osdmaps (Status: Pending Backport, Assignee: Nitzan Mordechai)

Actions #1

Updated by Casey Bodley 11 months ago

using https://github.com/ceph/ceph/pull/63141 to capture debug logs

Actions #2

Updated by Casey Bodley 11 months ago

https://github.com/ceph/ceph/pull/63126 disables the 'RgwSiteTest' cases that trigger this for now. we'll need to reenable them when validating the fix

Actions #4

Updated by Casey Bodley 11 months ago

  • Related to Bug #71261: osdc: idle librados client can lead to Objecter getting way behind in osdmaps added
Actions #5

Updated by Casey Bodley 10 months ago

  • Assignee deleted (Casey Bodley)
  • Priority changed from Immediate to Normal

lowering prio for the following reasons:
  • these rgw changes aren't on tentacle and won't be backported
  • the timeout occurs in the jenkins vstart environment, which does not reflect a real cluster
  • this amount of osdmap churn seems unusual for normal clusters
  • for rgw, this only affects an admin api, so it is not visible to clients
  • http clients should retry at least once, even if the ceph-mgr api doesn't, so a single timeout isn't likely to break applications
  • Objecter does appear to be misbehaving, but i don't see evidence of a recent regression there

the underlying Objecter issue is tracked in https://tracker.ceph.com/issues/71261
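
to illustrate the point about http clients retrying, here is a minimal Python sketch of a caller that retries once after a timeout. `call_with_retry` and `flaky_get_realms` are hypothetical names made up for this example; they are not part of the dashboard tests or the ceph-mgr api. The stand-in request simulates a single timeout followed by a 200 response, so no cluster is needed to run it.

```python
import time


def call_with_retry(request, retries=1, delay=0.0, retry_on=(TimeoutError,)):
    """Call request(); on a timeout, try again up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return request()
        except retry_on:
            if attempt >= retries:
                raise  # out of retries: surface the timeout to the caller
            attempt += 1
            time.sleep(delay)


# Hypothetical stand-in for a GET on the realms endpoint:
# the first call times out, the second one succeeds.
calls = []


def flaky_get_realms():
    calls.append(1)
    if len(calls) == 1:
        raise TimeoutError("simulated client timeout")
    return 200


status = call_with_retry(flaky_get_realms)
```

with `retries=1` the single simulated timeout is absorbed and the caller sees the successful response; with `retries=0` the same timeout would propagate, which is roughly the situation the api test was in.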

Actions #6

Updated by Ernesto Puerta 10 months ago

  • Description updated (diff)
Actions #7

Updated by Casey Bodley 10 months ago

  • Status changed from New to Triaged
Actions #8

Updated by Casey Bodley 10 months ago

  • Assignee set to Casey Bodley
Actions #9

Updated by Casey Bodley 9 months ago

  • Status changed from Triaged to Resolved