rgw: sync fairness bidding #45958

Merged
cbodley merged 8 commits into ceph:main from ofriedma:wip-ofriedma-sync-fairness-bidding
May 15, 2023

Conversation

@ofriedma
Contributor

@ofriedma ofriedma commented Apr 19, 2022

multisite metadata sync fairness

This commit allows multiple RGWs to participate in multisite metadata sync.
Previously, a single RGW would acquire all of the sync locks.
The feature uses a bidding algorithm.
For each shard, the RGW picks a random number from 0 to shard count - 1 (a shuffled vector) and uses it as its bid_amount for that shard's lock.
Each RGW's bid vector is sent to its peers using watch/notify (based on RADOS watch-notify).
Whenever an RGW tries to take a lock, it compares its own bid for that shard against the bids of the other RGWs; if it holds the highest bid, it attempts to acquire the lock.

Important configs:

rgw_sync_work_period - how long an RGW syncs before it sends an unlock (especially important at startup, when a single RGW may hold all of the locks)
rgw_sync_lease_period - not new in this commit but relevant to it: how many seconds the RGW asks RADOS to keep the lock, which matters mainly on failure, since a downed RGW automatically loses its locks

Thank you, Eric, for writing most of the code and for the algorithm.

Fixes: https://tracker.ceph.com/issues/41230
Signed-off-by: J. Eric Ivancich ivancich@redhat.com
Signed-off-by: Or Friedmann ofriedma@ibm.com
Signed-off-by: Casey Bodley cbodley@redhat.com

@cbodley
Contributor

cbodley commented Apr 20, 2022

please make sure that we preserve the original commits and their attributions (Author/Signed-off-by)

@ofriedma
Contributor Author

Because the code base has changed a lot, I will need to rework the commits; it was difficult to preserve them. Once I have reasonable commits, I will of course add the Signed-off-by lines from the original ones.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@djgalloway djgalloway changed the base branch from master to main May 25, 2022 20:00
@ofriedma ofriedma force-pushed the wip-ofriedma-sync-fairness-bidding branch from 5d4dd5b to cb3355e on August 4, 2022 22:32
@ofriedma ofriedma marked this pull request as ready for review August 4, 2022 22:32
@ofriedma ofriedma requested review from a team as code owners August 4, 2022 22:32
Member

@ivancich ivancich left a comment

I reviewed the non-co-routine portion of the code and approve it.

I believe other reviewers are better-suited to review the co-routine components.

This represents an amazing effort on Or's behalf and I'm very appreciative. Thanks, Or!!!

Contributor

@cbodley cbodley left a comment

it might be better to split out the cls_lock changes (at the very least, into its own commit), so they can be more easily reviewed by the rbd team

Comment on lines +157 to +158
static std::mutex bids_mtx;
static ObjectBidMap bids;
Contributor

i'm concerned about the use of global state here. shouldn't these bids be specific to one rados object? if so, this map should be stored in the object's xattr with the rest of lock_info_t

Contributor

Yeah, I don't understand this as well. It appears to be specific to one object, but because there is a single map holding data for all objects a mutex is needed? And then, even if bids doesn't need to persist between OSD restarts (why -- because data that helps to ensure fairness is safe to discard?), I think we still need to consider what happens when the primary moves to a different OSD and then returns back after a while? Wouldn't bids go stale?

Contributor

This...appears to be a per-process bid map and associated mutex. This approach seems entirely untenable. The osd which handles invocation of an object class is the current primary for the pg containing that object and object classes do not get notified when the primary changes since they are categorically not supposed to maintain any state outside of their rados objects.

  • What handles removing the bid object map when this specific osd ceases to be the primary for one of these objects?
  • How does this ensure that the in-memory state actually reflects the most recent updates to the object if the primary changed briefly and the new primary handled one of these calls?
  • How would this state get updated if the OSD's copy of the object is recovered?

I'll try to take a closer look at this tomorrow or Monday, but I strongly discourage you from trying to be clever with stashing state in memory. It seems like you could pretty easily use the existing object class api to actually store each object's state within its own omap entries. If performance is demonstrated to be a problem with that approach and no other simpler solution seems viable, it might be possible to add some object class owned ephemeral state to the ObjectContext, but that would be much more complicated.

Contributor

@mattbenjamin mattbenjamin Dec 2, 2022

inline

> This...appears to be a per-process bid map and associated mutex. This approach seems entirely untenable. The osd which handles invocation of an object class is the current primary for the pg containing that object and object classes do not get notified when the primary changes since they are categorically not supposed to maintain any state outside of their rados objects.
>
> * What handles removing the bid object map when this specific osd ceases to be the primary for one of these objects?
>
> * How does this ensure that the in-memory state actually reflects the most recent updates to the object if the primary changed briefly and the new primary handled one of these calls?
>
> * How would this state get updated if the OSD's copy of the object is recovered?
>
> I'll try to take a closer look at this tomorrow or Monday, but I strongly discourage you from trying to be clever with stashing state in memory. It seems like you could pretty easily use the existing object class api to actually store each object's state within its own omap entries. If performance is demonstrated to be a problem with that approach and no other simpler solution seems viable, it might be possible to add some object class owned ephemeral state to the ObjectContext, but that would be much more complicated.

"Completely untenable" seems rather overdone. And the suggestion to transform potentially transient/discardable/reconstructable state into costly, transactional state seems bad advice. (And apparently that would conflict with RADOS semantics some of the time.)

From what I understand, a key point is that the bid state is valuable but not critical--the state is a set of hints, basically. As I observed elsewhere, it's probably indeed intended to be per pg, and not "per process," which would suggest partitioning the map and its mutex by pg, maybe? It seems to me the biggest gap currently is in not bounding the size or lifetime of the state in the map(s). I agree that since the stored state is apparently not critical, it is apparently ok to allow it to be stale (as in your pg primary change/change-back and object recovery scenarios). OTOH, it would probably be attractive if something could induce this module to drop state for a pg entirely in that event.

Matt

Contributor

IOW, no one is (unreasonably) trying to "be clever." Folks are trying to implement an efficient work-sharing mechanism that is, weakly, stateful. I think @cbodley's suggestion to stash some specific data in ObjectContext was made on the assumption that it would be easy, and not intrusive. I buy that that isn't the case, but statefulness of the type intended here is not overreach, it's just ordinary application of standard art.

Member

@gregsfortytwo gregsfortytwo Dec 2, 2022

The librados interface is designed around objects and the osd is set up to maintain durable state around those objects; and to handle their consistency. Nothing in any interface is set up to handle in-memory state or for the object_class api to be stateful separate from what it writes out to objects. A "weakly stateful" Ceph interface doesn't exist.

Now, if this is actually a critical capability to develop, we can talk about that. But trying to shoehorn it in through an accident of the class implementation is indeed not tenable. There needs to be either a fundamentally different solution or a fundamentally different approach to getting it into the OSD, with a new system that has defined semantics and a proper interface for dealing with...all the things the osd does.

Member

That said, there is a mechanism for users of the osd to communicate between each other and send around state. It's watch-notify. Can this bid-and-lock algorithm be constructed that way, instead?

Contributor

thanks for the feedback!

It's watch-notify. Can this bid-and-lock algorithm be constructed that way, instead?

good point, let's try to build a design around that. multisite already uses watch-notify to coordinate bilog trimming, so that could serve as an example

Contributor

I had a quick discussion with @mattbenjamin and it sounds like the crux of the issue is that you can't update a hypothetical persistent version of this state in the event that the lock operation fails. Is that a reasonable characterization of the issue?

IIRC, object class writes can return payloads now. We could get around this by modifying the cls_lock behavior to always succeed if the client sent a bid (or add a differently named method with that behavior) and instead indicate success via the returned payload.

Member

IIRC, object class writes can return payloads now.

Correct. It looks like on the client side you have to set the flag OPERATION_RETURNVEC, and then just fill "outdata" in the class (same as for a read method), and the OSD checks it's less than osd_max_write_op_reply_len and passes it back as part of the result. This looks to be used by rgw/cls_fifo_legacy.cc and rgw/rgw_notify.cc, plus examples in cls_hello.

Lock(const std::string& _n) : name(_n),
flags(0),
bid_amount(-1),
bid_duration(utime_t())
Contributor

nit: this is the same as default-constructing bid_duration, so we can remove it from the constructor's member initializer list entirely


entity_inst_t origin;
int r = cls_get_request_origin(hctx, &origin);
ceph_assert(r == 0);
Contributor

please log and return these errors, instead of asserting on them. same with the cls_get_obj_info() call below. unexpected errors here should never crash the osd

Contributor

@cbodley cbodley left a comment

nice work! 👍

Comment on lines +806 to +807
class RGWBiddedRadosLockCR : public RGWSimpleRadosLockCR {

RGWGetBidFunc get_bid_func;
Contributor

this inheritance structure doesn't make a lot of sense. BiddedRadosLock inherits SimpleRadosLock for its member variables, but overrides all of its logic. and the only difference is this RGWGetBidFunc piece

instead of creating this new Bidded subclass, maybe we just add an optional<RGWLockBid> argument to SimpleRadosLock? then ContinuousLeaseCR would pass std::nullopt, and RebiddableLeaseCR would pass the result of its bid func

Comment on lines +1493 to +1507
class RGWRebiddableLeaseCR : public RGWContinuousLeaseCR {

protected:

std::atomic<bool> releasing_lock = { false };
RGWGetBidFunc get_bid_func;
Contributor

std::function can be a very useful abstraction, but it comes with a cost. whenever the lambda capture is bigger than ~24 bytes, it has to allocate memory. full_bid_func = [this, bid_func, bid_duration_secs]() -> RGWLockBid { is always going to allocate because it captures another function. passing that function around will allocate again every time it's copied

in this case, we don't really need the abstraction - we're defining RGWBidManagerCR just below. we should be able to take a pointer to that and call into it directly

Contributor

there's also the fu2 (more efficient std::function alternative) by @adamemerson ?

Contributor

It's not more efficient, the main reason for pulling it in is for unique_function. (For functions that aren't copyable.)

Contributor Author

We will not use std::function here at all; the solution that @cbodley suggested should work just fine. Maybe we can also remove the shuffle mechanism: from my tests it doesn't improve the lock distribution over time (I mean the RGW would need to shuffle only once, to initialize the bids), and it makes the solution a little bit more complex.

Comment on lines +1462 to +1467
// NOTE: once full_sync is also switched to bidded lease, these two
// sets can be merged
Contributor

full_sync already switched, right?

Comment on lines +1946 to +1954
if(ceph::coarse_mono_clock::now() >= work_period_end)
{
ldpp_dout(sync_env->dpp, 20) << "metadata sync work period end releasing lock" << dendl;
yield rebiddable_lease_cr->release_lock();
}
Contributor

it doesn't look like we're checking this until we've already finished all the work. i feel like the best place for this check is at the very top of the work loop above:

for (log_iter = log_entries.begin(); log_iter != log_entries.end() && !done_with_period; ++log_iter) {
  if (ceph::coarse_mono_clock::now() >= work_period_end) {

if we see that we're past the work period, we should stop spawning more work. but we can't drop the lock immediately - first we need to wait for any pending work to finish, then write the updated sync status. it doesn't look like metadata sync ever flushes the sync status, but here's an example from data sync: https://github.com/ceph/ceph/blob/e9d361f6/src/rgw/rgw_data_sync.cc#L4080-L4084

this approach should minimize how far we overshoot the work period. does that make sense? do you think that's consistent with your design?

Comment on lines +1553 to +1565
} else if (r == R_WORK_COMPLETE) {
ldpp_dout(sync_env->dpp, 20) << "RGWMetaSyncShardCR::" << __func__ <<
" incremental_sync work complete" << dendl;
return 0;
} else if (r == R_WORK_PARTIAL) {
ldpp_dout(sync_env->dpp, 20) << "RGWMetaSyncShardCR::" << __func__ <<
" incremental_sync work partially complete" << dendl;
continue; // loop back up and try for more work
}
Contributor

@cbodley cbodley Aug 5, 2022

can you help me understand what we're trying to do differently between these two cases? if i understand correctly, there's very little difference between return 0 and continue here

return set_cr_done() would complete the coroutine successfully, and i agree that's not what we want to do here. RGWMetaSyncShardControlCR spawned us, and it doesn't retry on success, so nothing would ever run sync on this shard again

return 0 just acts like a yield that leaves the coroutine scheduled. that would let RGWCoroutinesManager::run() go and process any other scheduled coroutines, then call us right back

what do you think should happen here? should R_WORK_COMPLETE wait for a bit before trying to grab the lock again? or should both cases retry right away?

Comment on lines +1053 to +1073
int RGWBidManagerCR::operate(const DoutPrefixProvider *dpp)
{
if (aborted) {
return set_cr_done();
}
reenter(this) {
while (!going_down) {
yield wait(utime_t(recalc_interval_secs, 0));
bids.shuffle();

}
return set_state(RGWCoroutine_Done);
Contributor

this bid manager coroutine is simple and straightforward, but it complicates RGWRemoteMetaLog::run_sync() with an extra background coroutine. that's passing a raw RGWBidManagerCR pointer into RGWMetaSyncCR, but i don't think we can guarantee that the former coroutine will outlive the latter. to make that safe, RGWMetaSyncCR would need to hold a reference with intrusive_ptr<RGWBidManagerCR>, as well as the intrusive_ptr<RGWCoroutinesStack> it was spawned on

instead of this coroutine, what if the shuffle was just part of the 'bid function'?

class BidFunc {
  ceph::timespan interval;
  ceph::timespan bid_duration;
  ceph::coarse_mono_time last_shuffle;
  rados::cls::lock::BidSet<int32_t> bids;
 public:
  BidFunc(...);

  RGWLockBid operator()(int shard_id, ceph::coarse_mono_time now) {
    if (last_shuffle + interval < now) {
      bids.shuffle();
      last_shuffle = now;
    }
    return {bids.get_bid(shard_id), bid_duration};
  }
};

this could just live on the stack in RGWRemoteMetaLog::run_sync(), so wouldn't need any special memory management

Comment on lines +157 to +160
template<typename T>
class BidSet {

static_assert(std::is_integral<T>::value, "Integral required.");
Contributor

do we really need a template here if the functions all require int32_t bid_amount?

Comment on lines +179 to +186
void shuffle() {
std::uniform_int_distribution<T> dist(0, bids.size() - 1);
std::random_device urandom("/dev/urandom");
T count = bids.size();
for (T i = count - 1; i >= 1; --i) {
T swap = dist(urandom);
if (i != swap) {
std::swap(bids[i], bids[swap]);
Contributor

@cbodley cbodley Aug 6, 2022

here's a nice 3-line example of a random shuffle from rgw_lc.cc:

  std::random_device rd;
  std::default_random_engine rng{rd()};
  std::shuffle(v.begin(), v.end(), rng);

we can avoid reopening the random device on every call by doing it once in the constructor to seed a std::default_random_engine member variable. don't forget to #include <random> and <algorithm>

Comment on lines +987 to +997
// polling loop; as long as lock is busy wait and poll again
do {
Contributor

RGWContinuousLeaseCR doesn't have a polling loop like this, it just fails and returns EBUSY up through RGWMetaSyncShardCR, and RGWMetaSyncShardControlCR keeps retrying with backoff. it shouldn't be necessary to have a retry loop here in RGWRebiddableLeaseCR

otherwise, the rest of this coroutine looks essentially the same as RGWContinuousLeaseCR except that it calls RGWBiddedRadosLockCR instead of RGWSimpleRadosLockCR

in another comment, i suggested combining RGWBiddedRadosLockCR into RGWSimpleRadosLockCR by using an optional<RGWLockBid> argument. could we do the same with RGWRebiddableLeaseCR and RGWContinuousLeaseCR, except with an optional<BidFunc>?

Comment on lines +1665 to +1670
while (!rebiddable_lease_cr->is_locked()) {
if (rebiddable_lease_cr->is_done()) {
Contributor

didn't full_sync() already wait on this lock acquisition above on line 1637?

maybe this part inside the do-while loop should just check if (!rebiddable_lease_cr->is_locked()) { for an early exit like you do below in incremental_sync()?

@yuvalif yuvalif self-requested a review August 10, 2022 16:31
@smanjara smanjara self-requested a review August 10, 2022 17:05
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@mattbenjamin mattbenjamin mentioned this pull request Sep 26, 2022
@trociny
Contributor

trociny commented Sep 26, 2022

The tracker ticket https://tracker.ceph.com/issues/41230 is still pointing to the closed #28119. Should the ticket be updated to point to this PR?

@ofriedma ofriedma force-pushed the wip-ofriedma-sync-fairness-bidding branch from cb3355e to 5f71198 on October 24, 2022 13:18
@cbodley
Contributor

cbodley commented Apr 20, 2023

> @cbodley In my view, this commit tries to introduce multiple RGWs syncing metadata in a "loop" mode; that is, every RGW has a chance to do metadata sync, but only one RGW works at a time. Is that correct?

@caisan multisite sync has always used this polling strategy with cls_lock on log shards. the problem is that the first radosgw to start up usually gets all of the shard locks. the goal of this PR is to spread those locks evenly across all participating radosgws

@smanjara smanjara force-pushed the wip-ofriedma-sync-fairness-bidding branch from 3c3b4d5 to 917b392 on April 20, 2023 17:14
@caisan

caisan commented Apr 21, 2023

> @caisan multisite sync has always used this polling strategy with cls_lock on log shards. the problem is that the first radosgw to start up usually gets all of the shard locks. the goal of this PR is to spread those locks evenly across all participating radosgws

For example, without a bid, would the lease lock acquire all of the shard locks? @cbodley

can_adjust_marker = true;
/* grab lock */

if (!sync_env->bid_manager->is_highest_bidder(shard_id)) {
Could is_highest_bidder(shard_id) be invoked before spawning RGWMetaSyncShardControlCR? I mean, if the RGW is not the highest bidder, it's not necessary to spawn RGWMetaSyncShardControlCR. @smanjara @cbodley

Contributor

RGWMetaSyncShardControlCR is where the retry loop lives. we want to run even if we aren't the highest bidder yet

@mkogan1
Contributor

mkogan1 commented Apr 24, 2023

Metadata sync performance test results comparing before and after the commits in this PR.
Multisite env with 3x3 RGWs dedicated to syncing.
The workload is s3 put-bucket-tagging on 1000 buckets in the primary multisite zone, measuring how long until the secondary zone reports "metadata is caught up with master"
(top in the screenshots below):

time echo -n {1..999} | xargs -P$(( $(numactl -N 0 -- nproc) / 1 )) -L1 -d' ' -i sh -c 'echo $1 ; set -x ; nice aws --endpoint-url http://0:8000 s3api put-bucket-tagging  --bucket  hsbench-1-$(printf "%012d" $1) --tagging "TagSet=[{Key=organization,Value=marketing}]" ; set +x' sh {} ; date

on the bottom left is metadata sync status of the secondary:
while sleep 1; do date ; nice sudo ./bin/radosgw-admin sync status | grep metadata ; done
and on the bottom right is lock distribution of the secondary:
while sleep 1; do date ; pwsh -c 'sudo ./bin/rados ls -p us-west.rgw.log | grep mdlog | foreach -Parallel { sudo ./bin/rados lock info $_ -p us-west.rgw.log sync_lock } | convertFrom-json | foreach {$_.lockers.name} | group | select Count, Name' ; done

with sync-fairness: caught up after 5 sec
(screenshot)

without sync-fairness: caught up after 31 sec
(screenshot)

@smanjara
Contributor

> metadata sync performance test results comparing before and after the commits in the PR, MS env with 3x3 dedicated to synching RGWs workload is s3 put-bucket-tagging on 1000 buckets in the primary MS zone, and measure how long until the secondary MS zone reaches metadata is caught up with master (top in the screenshots below):
>
> time echo -n {1..999} | xargs -P$(( $(numactl -N 0 -- nproc) / 1 )) -L1 -d' ' -i sh -c 'echo $1 ; set -x ; nice aws --endpoint-url http://0:8000 s3api put-bucket-tagging  --bucket  hsbench-1-$(printf "%012d" $1) --tagging "TagSet=[{Key=organization,Value=marketing}]" ; set +x' sh {} ; date
>
> on the bottom left is metadata sync status of the secondary: while sleep 1; do date ; nice sudo ./bin/radosgw-admin sync status | grep metadata ; done and on the bottom right is lock distribution of the secondary: while sleep 1; do date ; pwsh -c 'sudo ./bin/rados ls -p us-west.rgw.log | grep mdlog | foreach -Parallel { sudo ./bin/rados lock info $_ -p us-west.rgw.log sync_lock } | convertFrom-json | foreach {$_.lockers.name} | group | select Count, Name' ; done
>
> with sync-fairness: caught up after 5 sec (screenshot)
>
> without sync-fairness: caught up after 31 sec (screenshot)

thanks @mkogan1

Contributor

@cbodley cbodley left a comment

there are a few leftover whitespace changes, but otherwise looks good. great work @ofriedma @smanjara

@ofriedma
Contributor Author

> there are a few leftover whitespace changes, but otherwise looks good. great work @ofriedma @smanjara

Thank you @cbodley @smanjara

@caisan

caisan commented May 10, 2023

@ofriedma @cbodley hi, I just wonder if this PR supports RGW instance scale-out? I mean, can every newly joined RGW sync instance share the metadata sync workload? This would be significant for improving sync performance.

cbodley and others added 8 commits May 10, 2023 11:24
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
    multisite metadata sync fairness

    This commit allows multiple RGWs to participate in multisite metadata sync.
    Previously, a single RGW would acquire all of the sync locks.
    The feature uses a bidding algorithm.
    For each shard, the RGW picks a random number from 0 to shard count - 1 (a shuffled vector) and uses it as its bid_amount for that shard's lock.
    Each RGW's bid vector is sent to its peers using watch/notify (based on RADOS watch-notify).
    Whenever an RGW tries to take a lock, it compares its own bid for that shard against the bids of the other RGWs; if it holds the highest bid, it attempts to acquire the lock.

    Important configs:

    rgw_sync_work_period - how long an RGW syncs before it sends an unlock (especially important at startup, when a single RGW may hold all of the locks)
    rgw_sync_lease_period - not new in this commit but relevant to it: how many seconds the RGW asks RADOS to keep the lock, which matters mainly on failure, since a downed RGW automatically loses its locks

Fixes: https://tracker.ceph.com/issues/41230
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
Signed-off-by: Or Friedmann <ofriedma@ibm.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
If we are outbid, exit after updating sync status.

Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
…) control flow

- check for is_highest_bidder() before even attempting to take the lock
- don't block on RGWMetaSyncShardNotifyCR()
- other minor fixes

Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
@smanjara smanjara force-pushed the wip-ofriedma-sync-fairness-bidding branch from 917b392 to 410246f on May 10, 2023 15:26
@ljflores ljflores removed the needs-qa label May 10, 2023
@ljflores
Member

It looks like there are still changes going on with this PR. If it's ready for testing please re-add "needs-qa".

@cbodley cbodley removed the core label May 10, 2023
@smanjara
Contributor

Some known failures, including the crash seen in https://tracker.ceph.com/issues/57905, are unrelated to this PR.

https://pulpito.ceph.com/cbodley-2023-05-10_20:34:35-rgw-wip-shilpa-test-sync-fairness-distro-default-smithi/

@cbodley I'll go ahead and merge if you have no objection.

@cbodley cbodley merged commit 8c26848 into ceph:main May 15, 2023
@cbodley
Contributor

cbodley commented May 15, 2023

thanks to @mattbenjamin @idryomov @athanatos @gregsfortytwo for all the design review and feedback

smanjara pushed a commit to smanjara/ceph that referenced this pull request Sep 7, 2023
…s-bidding

rgw: sync fairness bidding

Resolves rhbz#1740782
Reviewed-by: J. Eric Ivancich <ivancich@redhat.com>
Reviewed-by: Casey Bodley <cbodley@redhat.com>
(cherry picked from commit 8c26848)
mkogan1 pushed a commit to mkogan1/ceph that referenced this pull request Jan 18, 2024
…s-bidding

rgw: sync fairness bidding

Resolves rhbz#1740782
Reviewed-by: J. Eric Ivancich <ivancich@redhat.com>
Reviewed-by: Casey Bodley <cbodley@redhat.com>
(cherry picked from commit 8c26848)