Skip to content

rgw multisite: teach 'bucket sync checkpoint' about log generations#41455

Merged
cbodley merged 3 commits intoceph:wip-rgw-multisite-reshardfrom
cbodley:wip-rgw-multisite-reshard-gen-checkpoint
May 25, 2021
Merged

rgw multisite: teach 'bucket sync checkpoint' about log generations#41455
cbodley merged 3 commits intoceph:wip-rgw-multisite-reshardfrom
cbodley:wip-rgw-multisite-reshard-gen-checkpoint

Conversation

@cbodley
Copy link
Contributor

@cbodley cbodley commented May 20, 2021

radosgw-admin bucket sync checkpoint is used by multisite tests to wait for a bucket sync to catch up with a source zone. sync on a resharded bucket is not caught up until the sync status' generation matches the source zone's latest_gen

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox


static void get_bucket_instance_ids(const RGWBucketInfo& bucket_info,
int shard_id,
int num_shards, int shard_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you pass "num_shards" if this could be fetched from "bucket_info"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function was hard-coded to use num_shards from the current_index layout, but this function is called by open_bucket_index() which takes an arbitrary rgw::bucket_index_layout_generation so we need to use the shards from that index layout instead of current_index

this is relevent now because we can request RGWOp_BILog_Info for any log generation; we convert that to an index layout with log_to_index_layout() and pass that to RGWRados::get_bucket_stats() which calls into open_bucket_index()

without this change, we hit an assert in RGWRados::get_bucket_stats() because open_bucket_index() returns a different number of bucket_objs (based on the given index's num_shards) vs bucket_instance_ids (based on current_index num_shards)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. I should have changed this as part of the bilog pr.

ldpp_dout(dpp, -1) << "failed to fetch remote log markers: " << cpp_strerror(r) << dendl;
return r;
}
r = markers.from_string(result.max_marker, -1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this change related to the generation fix?
you call "from_string" on the result of the function in 2 places. isn't it batter to keep it inside "rgw_read_remote_bilog_info()" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the callers of rgw_read_remote_bilog_info() now need more than just info.max_marker, so i changed it to return rgw_bucket_index_marker_info directly instead of BucketIndexShardsManager. but i can change it to return both so the callers don't have to duplicate the marker parsing

cbodley added 2 commits May 24, 2021 12:16
Signed-off-by: Casey Bodley <cbodley@redhat.com>
knock out a TODO that was causing this assertion failure in
RGWRados::get_bucket_stats() after a reshard:

  ceph_assert(headers.size() == bucket_instance_ids.size());

Signed-off-by: Casey Bodley <cbodley@redhat.com>
@cbodley cbodley force-pushed the wip-rgw-multisite-reshard-gen-checkpoint branch from e40b86a to 079c4de Compare May 24, 2021 16:21
@cbodley
Copy link
Contributor Author

cbodley commented May 24, 2021

thanks @yuvalif, updated

poll on rgw_read_bucket_full_sync_status() until
full_status.incremental_gen catches up to the latest_gen we got from
rgw_read_remote_bilog_info()

Signed-off-by: Casey Bodley <cbodley@redhat.com>
@cbodley cbodley force-pushed the wip-rgw-multisite-reshard-gen-checkpoint branch from 079c4de to c253bf2 Compare May 24, 2021 18:01
Comment on lines +133 to +137
if (full_status.incremental_gen > latest_gen) {
ldpp_dout(dpp, 1) << "bucket sync caught up with source:\n"
<< " local gen: " << full_status.incremental_gen << '\n'
<< " remote gen: " << latest_gen << dendl;
return 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added this last bit in case we overshoot the remote's starting generation, so we don't go on and try to compare markers from different generations

@cbodley cbodley merged commit c253bf2 into ceph:wip-rgw-multisite-reshard May 25, 2021
@cbodley cbodley deleted the wip-rgw-multisite-reshard-gen-checkpoint branch May 25, 2021 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants