Bug #71239
list_realms regression after change to ConfigStore (status: closed)
Description
A Ceph API test started timing out after the merge of https://github.com/ceph/ceph/pull/62398, which changed one librados read request from the default librados client to the one owned by ConfigStore. ConfigStore's client was far behind in osdmaps and needed two retries before completing, often triggering the 45-second client timeout.
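For context, the failure mode can be modeled abstractly: a client whose cached osdmap is stale must catch up on maps and resend the op before it completes, and a client that has been idle for a long time pays that cost on its first request. A minimal Python sketch of that model (all names and numbers here are illustrative, not the librados/Objecter API):

```python
# Illustrative model only: not the librados/Objecter API.
def submit_read(client_epoch, cluster_epoch, maps_per_fetch=40):
    """Return the number of resends needed before the op completes.

    Each resend represents the client fetching a batch of newer osdmaps
    and re-targeting the op; a client far behind needs several rounds,
    which can push latency past a request timeout.
    """
    resends = 0
    while client_epoch < cluster_epoch:
        # catch up by at most one batch of maps per round
        client_epoch = min(client_epoch + maps_per_fetch, cluster_epoch)
        resends += 1
    return resends

print(submit_read(client_epoch=100, cluster_epoch=100))  # -> 0 (up to date)
print(submit_read(client_epoch=420, cluster_epoch=500))  # -> 2 (two retries)
```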
API test failure console output:
FAIL: test_get_realms (tasks.mgr.dashboard.test_rgw.RgwSiteTest)
Traceback (most recent call last):
  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/dashboard/test_rgw.py", line 86, in test_get_realms
    self.assertStatus(200)
  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/dashboard/helper.py", line 507, in assertStatus
    self.assertEqual(self._resp.status_code, status)
AssertionError: 500 != 200
Updated by Casey Bodley 11 months ago
Using https://github.com/ceph/ceph/pull/63141 to capture debug logs.
Updated by Casey Bodley 11 months ago
https://github.com/ceph/ceph/pull/63126 disables the 'RgwSiteTest' cases that trigger this for now; we'll need to re-enable them when validating the fix.
Updated by Casey Bodley 11 months ago
Captured debug logs from the Objecter in https://jenkins.ceph.com/job/ceph-api/95379/artifact/build/out/radosgw.8000.log and analyzed them in https://gist.github.com/cbodley/aeb273e19ab26dda20f1072126c890a2.
Updated by Casey Bodley 11 months ago
- Related to Bug #71261: osdc: idle librados client can lead to Objecter getting way behind in osdmaps added
Updated by Casey Bodley 10 months ago
- Assignee deleted (Casey Bodley)
- Priority changed from Immediate to Normal
- these rgw changes aren't on tentacle and won't be backported
- it times out in the jenkins vstart environment, which does not reflect a real cluster
- this amount of osdmap churn seems unusual for normal clusters
- for rgw, this only affects an admin API, so it is not visible to clients
- http clients should retry at least once, even if the ceph-mgr API doesn't, so a single timeout is unlikely to break applications
- the Objecter does appear to be misbehaving, but I don't see evidence of a recent regression there
The underlying Objecter issue is tracked in https://tracker.ceph.com/issues/71261.
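The retry expectation noted above ("http clients should retry at least once") can be sketched as a small Python helper. This is a generic pattern, not code from the ceph-mgr dashboard; `retry_on_timeout` and `flaky_get` are hypothetical names, and the simulated endpoint stands in for an mgr API call that stalls once while its librados client catches up on osdmaps:

```python
import time

def retry_on_timeout(fn, attempts=2, backoff=1.0):
    """Call fn(), retrying on TimeoutError up to `attempts` total tries.

    A single slow response -- e.g. an endpoint stalled while a stale
    librados client catches up on osdmaps -- succeeds on the retry.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the timeout
            time.sleep(backoff * (attempt + 1))

# Simulate an endpoint that times out once, then recovers.
calls = {"n": 0}
def flaky_get():
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("read timed out after 45s")
    return 200

status = retry_on_timeout(flaky_get, attempts=2, backoff=0)
print(status)  # -> 200
```

With this pattern on the client side, a one-off 45-second stall produces a slow request rather than a hard failure.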