Conversation
Replication lag comparison between pre-fix and post-fix. Multisite cluster configuration: [pre-fix and post-fix graphs attached]
```cpp
class RGWRESTConn
{
  /* the endpoint is not able to connect if the timestamp is not real_clock::zero */
  using endpoint_status_map = std::unordered_map<std::string, std::atomic<ceph::real_time>>;
```
This is a comment @cbodley made in this PR on the same changes, which have been moved and updated here.
endpoints_status could probably just be a std::map<std::string, ceph::real_time> that only stores the failed endpoints. std::unordered_map is probably overkill when we only expect a handful of endpoints per zone. boost::container::flat_map could be a good alternative to reduce memory allocations.
The reasons why I still chose std::unordered_map to maintain the status of each and every endpoint are the following:
- the endpoints_status map itself is created when the RGWRESTConn is instantiated and won't be updated after that, hence the map structure itself doesn't need to be thread-safe.
- however, the timestamp/status of each endpoint might be accessed simultaneously by multiple threads, so it should be handled in a thread-safe way. In order to reduce the lock granularity, I chose to use std::atomic for each endpoint instead of locking the entire map. Since the majority of operations on the map are read-only, this reduces the overall time spent on locking.
- std::atomic is neither copyable nor movable, hence I'm not able to use boost::container::flat_map here.
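The pattern described above can be sketched roughly like this. This is an illustrative sketch, not the actual RGWRESTConn code: `EndpointTracker`, `mark_failed`, and `is_usable` are made-up names, and `std::chrono::system_clock::time_point` stands in for `ceph::real_time`.

```cpp
#include <atomic>
#include <chrono>
#include <string>
#include <unordered_map>
#include <vector>

// stand-in for ceph::real_time in this sketch
using real_time = std::chrono::system_clock::time_point;

class EndpointTracker {
  // the map itself is fully built at construction and never resized after,
  // so only the per-endpoint atomic values need to be thread-safe
  std::unordered_map<std::string, std::atomic<real_time>> status;

 public:
  explicit EndpointTracker(const std::vector<std::string>& endpoints) {
    for (const auto& ep : endpoints) {
      // epoch (real_time{}) means "never failed"
      status.try_emplace(ep, real_time{});
    }
  }

  // record a connection failure against one endpoint; no map mutation,
  // just an atomic store on the pre-existing entry
  void mark_failed(const std::string& ep) {
    if (auto it = status.find(ep); it != status.end()) {
      it->second.store(std::chrono::system_clock::now(),
                       std::memory_order_relaxed);
    }
  }

  // usable if it never failed, or if the last failure is older than `expire`
  bool is_usable(const std::string& ep,
                 std::chrono::seconds expire = std::chrono::seconds(2)) const {
    auto it = status.find(ep);
    if (it == status.end()) {
      return true;
    }
    auto failed_at = it->second.load(std::memory_order_relaxed);
    return failed_at == real_time{} ||
           std::chrono::system_clock::now() - failed_at >= expire;
  }
};
```

Since std::atomic is neither copyable nor movable, the entries are created in place with try_emplace, which is also why a node-based container works here where boost::container::flat_map would not.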
jenkins test make check
jenkins test ceph API tests
jenkins test make check
@cbodley
tested on a 3x3 setup with RGWs dedicated to sync,
graph is sync rate from PRI to SEC, bottom scale is seconds:
at the 1400-second mark, 1 RGW of the 3 was stopped and syncing continued with 2 RGWs out of 3
at the 3000-second mark, another RGW was stopped and syncing continued with 1 RGW out of 3
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
fad79d4 to 7014f69
rebased
7014f69 to cb7af19
jenkins test make check
does this PR need
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
jenkins test api
@jzhu116-bloomberg regarding multisite functional tests, there's a README in src/test/rgw/test_multi.md for running the tests locally from a ceph build directory. The test cases themselves are in src/test/rgw/rgw_multi/tests.py; see test_object_sync() for a simple example. src/test/rgw/rgw_multi/multisite.py contains the abstractions for these tests. A central concept of these tests are the 'checkpoints'. Does that make sense?
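The 'checkpoint' idea mentioned above can be sketched as a simple polling loop. This is an illustrative sketch, not the actual rgw_multi code; `wait_for_checkpoint` and `is_caught_up` are made-up names (the real helpers live in src/test/rgw/rgw_multi/multisite.py).

```python
import time

def wait_for_checkpoint(is_caught_up, timeout=60, interval=2):
    """Illustrative sketch of a sync 'checkpoint': poll until the target
    zone reports it has caught up with the source, or fail on timeout.
    `is_caught_up` is a hypothetical callable querying sync status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_caught_up():
            return
        time.sleep(interval)
    raise TimeoutError('sync checkpoint not reached within %ds' % timeout)
```

A test would write objects in one zone, then call a checkpoint against each peer zone before asserting the objects exist there.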
01dd685 to e467909
Added some multisite test cases trying to cover as much of the code paths changed in this PR as possible. In these new test cases, the 2nd rgw instance in each zone is brought down before the test and brought up after the test. All the newly added test cases are under
All the newly added test cases passed in my local run.
All the newly added test cases worked well until I pulled the main branch, where the test case "test_role_sync" fails in
…quests to them when appropriate Signed-off-by: Juan Zhu <jzhu4@dev-10-34-20-139.pw1.bcc.bloomberg.com>
9612399 to 3ee43f0
Rebased and all
src/test/rgw/rgw_multi/tests.py (Outdated)
```python
if not stop_2nd_rgw(zonegroup):
    log.warning('Skipping the test - test_bucket_create_rgw_down: '
                'more than one rgw needed in any one or multiple zone(s)')
    return
```
can use SkipTest so these get reported as a skip instead of a pass, ex:
raise SkipTest("test_multi_period_incremental_sync skipped. Requires 3 or more zones in master zonegroup.")
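For reference, a minimal sketch of that pattern. The guard function and its argument are stand-ins here (the real check calls the PR's stop_2nd_rgw()); SkipTest is imported from unittest, though nose's SkipTest behaves the same way.

```python
from unittest import SkipTest

def require_2nd_rgw(stopped):
    """Guard sketch: raising SkipTest makes the runner report a skip
    instead of a silent pass when the topology can't support the test.
    `stopped` stands in for the result of stop_2nd_rgw(zonegroup)."""
    if not stopped:
        raise SkipTest('test_bucket_create_rgw_down skipped. '
                       'Requires more than one rgw per zone.')
```

The difference matters in CI: a `return` makes a topology gap look like a passing test, while a skip is visible in the report.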
src/test/rgw/rgw_multi/tests.py (Outdated)
```python
zonegroup_conns = ZonegroupConns(zonegroup)
buckets, _ = create_bucket_per_zone(zonegroup_conns, 2)
zonegroup_meta_checkpoint(zonegroup)

for zone in zonegroup_conns.zones:
    assert check_all_buckets_exist(zone, buckets)

start_2nd_rgw(zonegroup)
```
consider using try/finally to make sure the gateways always get restarted, even if the test fails above
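A sketch of the suggested shape; all three callables are illustrative stand-ins, not the actual test helpers.

```python
def run_with_2nd_rgw_down(stop_rgws, start_rgws, test_body):
    """Illustrative wrapper: stop the extra gateways, run the test body,
    and restart the gateways in `finally` so they come back up even if
    the body raises an assertion or error."""
    stop_rgws()
    try:
        test_body()
    finally:
        start_rgws()
```

Without the finally, one failed test would leave the gateways down and poison every subsequent test in the run.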
verified that the tests pass locally, albeit slowly, with
Signed-off-by: Juan Zhu <jzhu4@dev-10-34-20-139.pw1.bcc.bloomberg.com>
3ee43f0 to 9b36e31
The heavier the traffic, the better the performance improvement.
it seems that would cause at most a 2-second delay to each test case after restarting its rgw? i'd guess that more significant delays are caused by the fact that radosgw doesn't drop the data/mdlog shard locks it holds on shutdown, so nobody can make progress on those for the next rgw_sync_lease_period=2 minutes. i'll run this through qa. if we see that it adds too much to the runtime of rgw/multisite jobs, we might look for ways to simplify the tests
the runtimes were 3h38m and 4h53m, but all of the rgw_down tests were skipped because we only run a single rgw per zone. cc @smanjara @yuvalif, do you think we should add a rgw/multisite job that runs two gateways per zone so we can exercise this? i just worry that it'll take much longer. the rest of the rgw suite tends to complete in under an hour, so i often approve/merge things before the multisite jobs even complete
we could add two gateways per zone in a separate yaml and have those tests run periodically at some interval, like during point releases. we may not have to run them for every multisite pr approval, I think?
okay, so it would live outside of qa/suites/rgw. there's already a qa/suites/rgw-multisite-upgrade subsuite that we had to move out of qa/suites/upgrade in #42869 because the multisite tests fail consistently. i guess we could add qa/suites/rgw-multisite for more comprehensive coverage, but we might want to wait until the existing tests can pass reliably
sounds good. about the present multisite failures, I updated the result here: #54442
thanks @smanjara; any objection to merging this as-is? i was able to validate the new tests locally with test_multi.py
sure, please go ahead. thanks!
thanks @jzhu116-bloomberg, and thanks @mkogan1 for the help with testing
| << " diff=" << diff << dendl; | ||
|
|
||
| static constexpr uint32_t CONN_STATUS_EXPIRE_SECS = 2; | ||
| if (diff >= CONN_STATUS_EXPIRE_SECS) { |
@jzhu116-bloomberg .. could you please explain what it means if connection_status is not expired here (i.e., diff < 2 sec). Shouldn't it be considered a valid endpoint?
This check added here breaks cloud-transition code and perhaps the cloud-sync module too - https://tracker.ceph.com/issues/65251
@soumyakoduri this tracks endpoints that recently returned a connection error so we can retry against other endpoints. 'not expired' here means that it failed recently enough that we shouldn't try again
however, this logic is not very good at differentiating between connection errors and other http errors. it just checks for EIO, which turns out to be the default case in rgw_http_error_to_errno()
@yuvalif and i discussed this recently in #55527, and changed some pubsub ops to return ERR_SERVICE_UNAVAILABLE instead of EIO to work around that
i would like to find some other default error code for rgw_http_error_to_errno(), but it's not clear what that might break in multisite
for your cloud transition testing, can you tell which earlier request is getting mapped to EIO?
for your cloud transition testing, can you tell which earlier request is getting mapped to EIO?
Thanks Casey.
For simple object transition case, I can think of requests such as target bucket creation (https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_lc_tier.cc#L1217) and fetching HEAD object (https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_lc_tier.cc#L1282) which may fail. This is expected at times and hence the errors are ignored. The errors returned by the cloud endpoint for these ops may include EIO.
There may be a few more such cases with multipart uploads to the cloud.
i would like to find some other default error code for rgw_http_error_to_errno(), but it's not clear what that might break in multisite
if we can't prevent these cloud transition ops from returning errors that map to EIO, removing EIO from rgw_http_error_to_errno() seems like the only viable solution. i pushed #56704 as an example. @smanjara do you think we could reasonably validate such a change with multisite testing at this point?
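To illustrate the direction being discussed: map the well-known http statuses to specific errnos and make the fall-through default something other than -EIO, so generic http failures are no longer mistaken for connection errors by the retry logic. This is a hypothetical sketch, not the actual rgw_http_error_to_errno() table or the change in #56704; the specific status/errno pairs here are assumptions.

```cpp
#include <cerrno>

// hypothetical sketch, not the real rgw_http_error_to_errno():
// the point is only that the default case avoids -EIO
static int http_error_to_errno_sketch(int http_status) {
  switch (http_status) {
    case 400: return -EINVAL;
    case 401:
    case 403: return -EACCES;
    case 404: return -ENOENT;
    case 409: return -ECANCELED;
    case 503: return -EBUSY;      // service unavailable
    default:  return -EREMOTEIO;  // anything but -EIO, so callers that
                                  // treat EIO as a connection failure
                                  // don't mark the endpoint down
  }
}
```

With a non-EIO default, a cloud endpoint returning an unexpected http error would no longer trip the endpoint-down bookkeeping that this PR adds.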
@smanjara i've added that to a test batch already. i'll share the results and we can go through them together
Thanks Casey and Shilpa

Maintain endpoints connectable status and retry the requests to them when appropriate.
Fixes: https://tracker.ceph.com/issues/62710
Signed-off-by: Jane Zhu jzhu116@bloomberg.net