Skip to content

rgw: Redimension bucket sync cache to include optional generation#45122

Closed
adamemerson wants to merge 1 commit intoceph:wip-rgw-multisite-reshardfrom
adamemerson:wip-multisite-reshard-cache-generation
Closed

rgw: Redimension bucket sync cache to include optional generation#45122
adamemerson wants to merge 1 commit intoceph:wip-rgw-multisite-reshardfrom
adamemerson:wip-multisite-reshard-cache-generation

Conversation

@adamemerson
Copy link
Contributor

@adamemerson adamemerson commented Feb 22, 2022

The alternative would be to compare generations and throw out older
generation/no generation if we have a (newer) one.

But if we have the potential for older generations and blank
generations coming up on error retry, then we have to keep them
around.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

The alternative would be to compare generations and throw out older
generation/no generation if we have a (newer) one.

But if we have the potential for older generations and blank
generations coming up on error retry, then we have to keep them
around.

Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
@adamemerson adamemerson requested a review from cbodley February 22, 2022 19:57
@github-actions github-actions bot added the rgw label Feb 22, 2022
@cbodley
Copy link
Contributor

cbodley commented Mar 16, 2022

i took this commit into #45417 with an additional change, can you please review?

@mattbenjamin
Copy link
Contributor

Where do blank generations come from? In discussion w/Yuval, the topic of generation "reset" also came up--is that actually needed?

@cbodley
Copy link
Contributor

cbodley commented Mar 16, 2022

Where do blank generations come from?

empty generations come from 'data full sync', which walks through a listing of remote bucket metadata and tries to run sync on each. it doesn't get any information about log generations from that list of bucket metadata keys, so it passed an 'empty' generation number into bucket sync

in RGWSyncBucketCR (https://github.com/ceph/ceph/blob/wip-rgw-multisite-reshard/src/rgw/rgw_data_sync.cc#L5368-L5383) we treat an empty generation as a request for the 'current' generation of that bucket's sync status - just because that's the only generation that 'bucket incremental sync' can ever make progress on

In discussion w/Yuval the topic of generation "reset" also came up--is that actually needed?

not sure what this means

@cbodley
Copy link
Contributor

cbodley commented Mar 16, 2022

In discussion w/Yuval, the topic of generation "reset" also came up--is that actually needed?

@yuvalif was that related to bucket sync disable/enable? do you suspect there's a problem with empty gen vs. gen=0 there?

@yuvalif
Copy link
Contributor

yuvalif commented Mar 16, 2022

In discussion w/Yuval, the topic of generation "reset" also came up--is that actually needed?

@yuvalif was that related to bucket sync disable/enable? do you suspect there's a problem with empty gen vs. gen=0 there?

was wondering about the case where we have incremental sync with some gen going on, and then we disable+enable sync.
this results in a full sync with no generation. if we set the gen number of the full sync to zero, how would we know how to compare these entries with the existing incremental sync enties?

@cbodley
Copy link
Contributor

cbodley commented Mar 16, 2022

was wondering about the case where we have incremental sync with some gen going on, and then we disable+enable sync. this results in a full sync with no generation

let's be careful to differentiate between 'bucket full sync' and 'data full sync'. bucket sync disable/enable is cycling through the Incremental -> Init -> Full -> Incremental states of bucket sync, not data sync

remember this logic in the Init state that fetches the remote's latest log generation and markers, and records them as the position where Incremental should resume after Full. so bucket sync uses this sync status object to keep track of which generation it's on

@cbodley
Copy link
Contributor

cbodley commented Mar 17, 2022

@adamemerson unfortunately my approach in #45417 (comment) isn't going to work, so this PR may be the best we can do

my main concern with this one is that it allows us to spawn two concurrent syncs on the same thing (one with a real gen, and one with empty), now that they're tracked by separate cache entries

when the 'bucket sync cache' was added, it allowed us to detect racing calls to RGWDataSyncSingleEntryCR and avoid spawning sync on both. because of this, we were able to able remove the use of per-bucket-shard cls_locks from bucket sync

however, along with that change to remove locking, i did at least make sure that all the sync status writes were using cls_version, so i believe that racing bucket syncs should be okay in general. they may duplicate some work, but the first time they try to record their progress in the sync status object, one of them will fail with ECANCELED

so let's take this fix, and try to exercise these races to make sure this untested error path does what we expect?

cc @yuvalif @mattbenjamin

@cbodley
Copy link
Contributor

cbodley commented Mar 21, 2022

merged into wip-rgw-multisite-reshard, thanks @adamemerson

@cbodley cbodley closed this Mar 21, 2022
@adamemerson adamemerson deleted the wip-multisite-reshard-cache-generation branch September 15, 2022 21:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants