rgw multisite: data sync optimizations by cbodley · Pull Request #34094 · ceph/ceph

cbodley · 2020-03-20T18:49:15Z

This contains a series of related data sync optimizations which allow sites with a large backlog of stale datalog entries to recover much faster.

Data sync obligations (whether from datalog entries, error repo retries, or async notifications) were previously unbounded, meaning that bucket sync had to catch up completely before they could be retired. Each obligation now carries a timestamp, allowing them to be retired (or ignored) once bucket sync status reaches that timestamp. The new cls_cmpomap module from #33982 allows the error repo to store these timestamps in omap values, and update/remove entries with atomic timestamp comparisons.

DataSyncShardCR now contains a cache of its associated bucket shards. This cache remembers the latest timestamp from the bucket-shard's last sync, and uses that in DataSyncSingleEntryCR to quickly filter out obligations that we've already satisfied. This cache also detects racing calls to DataSyncSingleEntryCR and avoids duplicating bucket sync on the same entry. The cache is invalidated when the data sync lease expires.

Bucket sync leases are removed now that we don't duplicate calls to bucket sync within the same radosgw instance. Bucket sync is already covered by the lease of the data sync shard that spawned it, so different radosgws are similarly protected. The bucket sync status objects now use cls_version to detect racing writes in the event that a data sync shard lease expires (or the admin runs bucket sync or bucket sync init).

Checklist

References tracker ticket
Updates documentation if necessary
Includes tests for new functionality or reproducer for bug

Show available Jenkins commands

jenkins retest this please
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard backend
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox

cbodley · 2020-03-26T14:43:23Z

rebased

cbodley · 2020-03-26T15:19:04Z

planning to add some simple unit test cases for the Cache class

yehudasa · 2020-03-26T16:03:01Z

@cbodley I started reviewing but rebase seems to have clobbered it. I'll need to re-start I think.

cbodley · 2020-03-26T16:07:49Z

@cbodley I started reviewing but rebase seems to have clobbered it. I'll need to re-start I think.

apologies! i won't force-push again until reviews are done

yehudasa

@cbodley see my comments

src/rgw/rgw_cr_rados.cc

src/cls/cmpomap/ops.h

src/cls/cmpomap/server.cc

src/include/buffer.h

src/rgw/rgw_sync_error_repo.cc

yehudasa · 2020-03-27T12:40:36Z

src/rgw/rgw_data_sync.cc

      }
    }
+    if (progress) {
+      *progress = *std::min_element(shard_progress.begin(), shard_progress.end());


@cbodley: not sure if I fully understand this. What if a shard had no changes therefore it's completely up to date? What will be its timestamp?

for each shard, we'll either get the timestamp we read from its sync status, or the timestamp we last wrote to its sync status. the sync status will either store a) the last timestamp incremental sync wrote, b) the timestamp when full sync started, or c) an empty timestamp if those events happened before upgrade

if we get empty timestamps, we lose the state->progress_timestamp filtering optimization in DataSyncSingleEntry, and we'll retry RunBucketSourcesSync each time state->counter changes - this is effectively the same behavior we got from marker_tracker->need_retry() before

but eventually these timestamps will fill in. and in the general case, we should be seeing changes to these bucket shards, because that's what the datalog entry implies

it's also worth noting how this timestamp strategy ties into the planned 'sync fairness' work, where we'll be expiring our mdlog and datalog leases somewhat regularly so other gateways have a chance to take over

when a datalog lease is pending this expiration, we'll stop any associated bucket sync. by tracking the timestamps of their progress, we can potentially retire the datalog/error-repo entries that spawned them. that way the next gateway that gets the lease won't have to retry it

src/rgw/rgw_bucket_sync_cache.h

src/rgw/rgw_data_sync.cc

cbodley · 2020-03-30T21:38:07Z

addressed review comments, squashed, and rebased on top of the changes in #33982

cbodley · 2020-04-01T20:44:25Z

added some test fixes for ceph_test_cls_cmpomap and got a completely green run in http://pulpito.ceph.com/cbodley-2020-04-01_19:06:18-rgw-wip-cbodley-testing-distro-basic-smithi/

coroutines that want to sleep should just call RGWCoroutine::wait() Signed-off-by: Casey Bodley <cbodley@redhat.com>

async notifications are just hints, and don't imply an obligation to sync the bucket shard. if we fail to sync, don't write it to the error repo for retry. we'll see the change later when processing the datalog Signed-off-by: Casey Bodley <cbodley@redhat.com>