Bug #71239
list_realms regression after change to ConfigStore (status: closed)
Description
A Ceph API test started timing out after the merge of https://github.com/ceph/ceph/pull/62398, which changed one librados read request from the default librados client to the one owned by ConfigStore. ConfigStore's client was far behind in osdmaps and needed two retries before completing, often triggering the 45-second client timeout.
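For context, the failure mode can be modeled abstractly: a client whose cached osdmap is stale must catch up on maps and resend the op before it completes, and a client that has been idle for a long time pays that cost on its first request. A minimal Python sketch of that model (all names and numbers here are illustrative, not the librados/Objecter API):

```python
# Illustrative model only: not the librados/Objecter API.
def submit_read(client_epoch, cluster_epoch, maps_per_fetch=40):
    """Return the number of resends needed before the op completes.

    Each resend represents the client fetching a batch of newer osdmaps
    and re-targeting the op; a client far behind needs several rounds,
    which can push latency past a request timeout.
    """
    resends = 0
    while client_epoch < cluster_epoch:
        # catch up by at most one batch of maps per round
        client_epoch = min(client_epoch + maps_per_fetch, cluster_epoch)
        resends += 1
    return resends

print(submit_read(client_epoch=100, cluster_epoch=100))  # -> 0 (up to date)
print(submit_read(client_epoch=420, cluster_epoch=500))  # -> 2 (two retries)
```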
API test failure console output:
FAIL: test_get_realms (tasks.mgr.dashboard.test_rgw.RgwSiteTest)
Traceback (most recent call last):
  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/dashboard/test_rgw.py", line 86, in test_get_realms
    self.assertStatus(200)
  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/dashboard/helper.py", line 507, in assertStatus
    self.assertEqual(self._resp.status_code, status)
AssertionError: 500 != 200
Updated by Casey Bodley 11 months ago
Using https://github.com/ceph/ceph/pull/63141 to capture debug logs.
Updated by Casey Bodley 11 months ago
https://github.com/ceph/ceph/pull/63126 disables the 'RgwSiteTest' cases that trigger this for now; we'll need to re-enable them when validating the fix.
Updated by Casey Bodley 11 months ago
Captured debug logs from the Objecter in https://jenkins.ceph.com/job/ceph-api/95379/artifact/build/out/radosgw.8000.log and analyzed them in https://gist.github.com/cbodley/aeb273e19ab26dda20f1072126c890a2.
Updated by Casey Bodley 11 months ago
- Related to Bug #71261: osdc: idle librados client can lead to Objecter getting way behind in osdmaps added
Updated by Casey Bodley 10 months ago
- Assignee deleted (Casey Bodley)
- Priority changed from Immediate to Normal
- these rgw changes aren't on tentacle and won't be backported
- it times out in the jenkins vstart environment, which does not reflect a real cluster
- this amount of osdmap churn seems unusual for normal clusters
- for rgw, this only affects an admin API, so it is not visible to clients
- http clients should retry at least once, even if the ceph-mgr API doesn't, so a single timeout is unlikely to break applications
- the Objecter does appear to be misbehaving, but I don't see evidence of a recent regression there
The underlying Objecter issue is tracked in https://tracker.ceph.com/issues/71261.
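The retry expectation noted above ("http clients should retry at least once") can be sketched as a small Python helper. This is a generic pattern, not code from the ceph-mgr dashboard; `retry_on_timeout` and `flaky_get` are hypothetical names, and the simulated endpoint stands in for an mgr API call that stalls once while its librados client catches up on osdmaps:

```python
import time

def retry_on_timeout(fn, attempts=2, backoff=1.0):
    """Call fn(), retrying on TimeoutError up to `attempts` total tries.

    A single slow response -- e.g. an endpoint stalled while a stale
    librados client catches up on osdmaps -- succeeds on the retry.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the timeout
            time.sleep(backoff * (attempt + 1))

# Simulate an endpoint that times out once, then recovers.
calls = {"n": 0}
def flaky_get():
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("read timed out after 45s")
    return 200

status = retry_on_timeout(flaky_get, attempts=2, backoff=0)
print(status)  # -> 200
```

With this pattern on the client side, a one-off 45-second stall produces a slow request rather than a hard failure.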