Conversation
please make sure that we preserve the original commits and their attributions (Author/Signed-off-by)

Because the code base has changed a lot I will need to rework the commits; it was difficult to preserve them as-is. Once I have made reasonable commits I will of course add the Signed-off-by of the original ones.

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Force-pushed 5d4dd5b to cb3355e
ivancich
left a comment
I reviewed the non-co-routine portion of the code and approve it.
I believe other reviewers are better-suited to review the co-routine components.
This represents an amazing effort on Or's part and I'm very appreciative. Thanks, Or!!!
cbodley
left a comment
it might be better to split out the cls_lock changes (at the very least, into its own commit), so they can be more easily reviewed by the rbd team
src/cls/lock/cls_lock.cc
Outdated
static std::mutex bids_mtx;
static ObjectBidMap bids;
i'm concerned about the use of global state here. shouldn't these bids be specific to one rados object? if so, this map should be stored in the object's xattr with the rest of lock_info_t
Yeah, I don't understand this as well. It appears to be specific to one object, but because there is a single map holding data for all objects a mutex is needed? And then, even if bids doesn't need to persist between OSD restarts (why -- because data that helps to ensure fairness is safe to discard?), I think we still need to consider what happens when the primary moves to a different OSD and then returns back after a while? Wouldn't bids go stale?
This...appears to be a per-process bid map and associated mutex. This approach seems entirely untenable. The osd which handles invocation of an object class is the current primary for the pg containing that object and object classes do not get notified when the primary changes since they are categorically not supposed to maintain any state outside of their rados objects.
- What handles removing the bid object map when this specific osd ceases to be the primary for one of these objects?
- How does this ensure that the in-memory state actually reflects the most recent updates to the object if the primary changed briefly and the new primary handled one of these calls?
- How would this state get updated if the OSD's copy of the object is recovered?
I'll try to take a closer look at this tomorrow or Monday, but I strongly discourage you from trying to be clever with stashing state in memory. It seems like you could pretty easily use the existing object class api to actually store each object's state within its own omap entries. If performance is demonstrated to be a problem with that approach and no other simpler solution seems viable, it might be possible to add some object class owned ephemeral state to the ObjectContext, but that would be much more complicated.
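For illustration only, a minimal sketch of the omap-based approach being suggested (the method name, omap key, and payload format here are made up, not taken from this PR):

#include <cerrno>
#include "objclass/objclass.h"
#include "include/encoding.h"

// hypothetical cls method: persist a caller's bid in the object's own omap so
// it lives with the object (surviving primary changes and recovery) instead of
// in a process-global map guarded by a mutex
static int set_bid(cls_method_context_t hctx, ceph::buffer::list *in,
                   ceph::buffer::list *out)
{
  int32_t bid;
  try {
    auto iter = in->cbegin();
    ceph::decode(bid, iter);
  } catch (const ceph::buffer::error&) {
    return -EINVAL;
  }

  ceph::buffer::list val;
  ceph::encode(bid, val);
  int r = cls_cxx_map_set_val(hctx, "lock.bid", &val);
  if (r < 0) {
    CLS_LOG(1, "ERROR: failed to persist bid: r=%d", r);
    return r;
  }
  return 0;
}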
"Completely untenable" seems rather overdone. And the suggestion to transform potentially transient/discardable/reconstructable state into costly, transactional state seems bad advice. (And apparently that would conflict with RADOS semantics some of the time.)
From what I understand, a key point is that the bid state is valuable but not critical--the state is a set of hints, basically. As I observed elsewhere, it's probably intended to be per pg, not "per process," which would suggest partitioning the map and its mutex by pg, maybe? To me the biggest gap currently is that neither the size nor the lifetime of the state in the map(s) is bounded, but I agree that since the stored state is apparently not critical, it is apparently OK to let it go stale (as in your pg primary change/change-back and object recovery scenarios). OTOH, it would probably be attractive if something could induce this module to drop state for a pg entirely in that event.
Matt
IOW, no one is (unreasonably) trying to "be clever." Folks are trying to implement an efficient work-sharing mechanism that is, weakly, stateful. I think @cbodley's suggestion to stash some specific data in ObjectContext was made on the assumption that it would be easy, and not intrusive. I buy that that isn't the case, but statefulness of the type intended here is not overreach, it's just ordinary application of standard art.
The librados interface is designed around objects and the osd is set up to maintain durable state around those objects; and to handle their consistency. Nothing in any interface is set up to handle in-memory state or for the object_class api to be stateful separate from what it writes out to objects. A "weakly stateful" Ceph interface doesn't exist.
Now, if this is actually a critical capability to develop, we can talk about that. But trying to shoehorn it in through an accident of the class implementation is indeed not tenable. There needs to be either a fundamentally different solution or a fundamentally different approach to getting it into the OSD, with a new system that has defined semantics and a proper interface for dealing with...all the things the osd does.
That said, there is a mechanism for users of the osd to communicate with each other and send around state. It's watch-notify. Can this bid-and-lock algorithm be constructed that way, instead?
thanks for the feedback!
It's watch-notify. Can this bid-and-lock algorithm be constructed that way, instead?
good point, let's try to build a design around that. multisite already uses watch-notify to coordinate bilog trimming, so that could serve as an example
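As a very rough illustration of that direction (not the PR's code; the control object name, helper, and payload handling are made up), each gateway could watch a shared control object and broadcast its bid vector with notify:

#include <string>
#include <rados/librados.hpp>

// hypothetical watcher: receives other gateways' bid vectors via notify
class BidWatcher : public librados::WatchCtx2 {
 public:
  librados::IoCtx* ioctx = nullptr;
  std::string oid;                  // shared control object, e.g. "sync.bids"

  void handle_notify(uint64_t notify_id, uint64_t cookie,
                     uint64_t notifier_id, librados::bufferlist& bl) override {
    // decode and remember the sender's bids from bl (elided), then ack
    librados::bufferlist reply;
    ioctx->notify_ack(oid, notify_id, cookie, reply);
  }
  void handle_error(uint64_t cookie, int err) override {
    // the watch broke; the caller should re-establish it
  }
};

// register the watch once, then broadcast our own bid vector to all watchers
int start_and_broadcast(librados::IoCtx& ioctx, const std::string& oid,
                        librados::bufferlist& my_bids, BidWatcher& watcher,
                        uint64_t* handle) {
  watcher.ioctx = &ioctx;
  watcher.oid = oid;
  int r = ioctx.watch2(oid, handle, &watcher);
  if (r < 0) {
    return r;
  }
  librados::bufferlist replies;
  return ioctx.notify2(oid, my_bids, 10000 /* ms timeout */, &replies);
}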
I had a quick discussion with @mattbenjamin and it sounds like the crux of the issue is that you can't update a hypothetical persistent version of this state in the event that the lock operation fails. Is that a reasonable characterization of the issue?
IIRC, object class writes can return payloads now. We could get around this by modifying the cls_lock behavior to always succeed if the client sent a bid (or add a differently named method with that behavior) and instead indicate success via the returned payload.
IIRC, object class writes can return payloads now.
Correct. It looks like on the client side you have to set the flag OPERATION_RETURNVEC, and then just fill "outdata" in the class (same as for a read method), and the OSD checks it's less than osd_max_write_op_reply_len and passes it back as part of the result. This looks to be used by rgw/cls_fifo_legacy.cc and rgw/rgw_notify.cc, plus examples in cls_hello.
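For reference, the cls-side shape could look roughly like this (the method name and reply format are hypothetical; the client would need to set librados::OPERATION_RETURNVEC to receive the payload):

#include "objclass/objclass.h"
#include "include/encoding.h"

// hypothetical "always succeeds" variant: the op returns 0 even when the bid
// loses, and the real outcome is encoded into the reply payload
static int lock_with_bid(cls_method_context_t hctx, ceph::buffer::list *in,
                         ceph::buffer::list *out)
{
  bool granted = false;
  // ... decode the request, compare the bid against the stored lock_info_t,
  // and take the lock if we won, setting granted accordingly (elided) ...

  ceph::encode(granted, *out);  // caller reads this instead of the return code
  return 0;                     // keep the write op successful either way
}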
src/cls/lock/cls_lock_client.h
Outdated
Lock(const std::string& _n) : name(_n),
  flags(0),
  bid_amount(-1),
  bid_duration(utime_t())
nit: this is the same as default-constructing bid_duration, so we can remove it from the constructor's member initializer list entirely
src/cls/lock/cls_lock.cc
Outdated
entity_inst_t origin;
int r = cls_get_request_origin(hctx, &origin);
ceph_assert(r == 0);
please log and return these errors, instead of asserting on them. same with the cls_get_obj_info() call below. unexpected errors here should never crash the osd
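Concretely, the quoted call site could handle the error like this (sketch only; log level and message are illustrative):

entity_inst_t origin;
int r = cls_get_request_origin(hctx, &origin);
if (r < 0) {
  CLS_LOG(1, "ERROR: cls_get_request_origin() failed: r=%d", r);
  return r;  // propagate the error instead of ceph_assert(), so the osd never crashes
}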
src/rgw/rgw_cr_rados.h
Outdated
class RGWBiddedRadosLockCR : public RGWSimpleRadosLockCR {
  ...
  RGWGetBidFunc get_bid_func;
this inheritance structure doesn't make a lot of sense. BiddedRadosLock inherits SimpleRadosLock for its member variables, but overrides all of its logic. and the only difference is this RGWGetBidFunc piece
instead of creating this new Bidded subclass, maybe we just add an optional<RGWLockBid> argument to SimpleRadosLock? then ContinuousLeaseCR would pass std::nullopt, and RebiddableLeaseCR would pass the result of its bid func
src/rgw/rgw_cr_rados.h
Outdated
class RGWRebiddableLeaseCR : public RGWContinuousLeaseCR {
  ...
protected:
  ...
  std::atomic<bool> releasing_lock = { false };
  RGWGetBidFunc get_bid_func;
std::function can be a very useful abstraction, but it comes with a cost. whenever the lambda capture is bigger than ~24 bytes, it has to allocate memory. full_bid_func = [this, bid_func, bid_duration_secs]() -> RGWLockBid { is always going to allocate because it captures another function. passing that function around will allocate again every time it's copied
in this case, we don't really need the abstraction - we're defining RGWBidManagerCR just below. we should be able to take a pointer to that and call into it directly
there's also fu2 (a more efficient std::function alternative) by @adamemerson ?
It's not more efficient; the main reason for pulling it in is for unique_function. (For functions that aren't copyable.)
We will not use std::function here at all; the solution that @cbodley suggested should work just fine. Maybe we can also remove the shuffle mechanism: from my tests it doesn't improve the lock distribution over time (I mean the rgw would only need to shuffle once, to initialize the bids), and it makes the solution a little bit more complex.
src/rgw/rgw_sync.cc
Outdated
// NOTE: once full_sync is also switched to bidded lease, these two
// sets can be merged
full_sync already switched, right?
src/rgw/rgw_sync.cc
Outdated
if(ceph::coarse_mono_clock::now() >= work_period_end)
{
  ldpp_dout(sync_env->dpp, 20) << "metadata sync work period end releasing lock" << dendl;
  yield rebiddable_lease_cr->release_lock();
}
it doesn't look like we're checking this until we've already finished all the work. i feel like the best place for this check is at the very top of the work loop above:
for (log_iter = log_entries.begin(); log_iter != log_entries.end() && !done_with_period; ++log_iter) {
  if (ceph::coarse_mono_clock::now() >= work_period_end) {

if we see that we're past the work period, we should stop spawning more work. but we can't drop the lock immediately - first we need to wait for any pending work to finish, then write the updated sync status. it doesn't look like metadata sync ever flushes the sync status, but here's an example from data sync: https://github.com/ceph/ceph/blob/e9d361f6/src/rgw/rgw_data_sync.cc#L4080-L4084
this approach should minimize how far we overshoot the work period. does that make sense? do you think that's consistent with your design?
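In plain, non-coroutine terms, the control flow being suggested is roughly this (illustrative C++ only; none of these names come from the PR):

#include <chrono>
#include <deque>
#include <functional>

// illustrative only: stop scheduling new entries once the work period ends,
// drain what was already spawned, persist progress, then give up the lock
void run_work_period(std::deque<std::function<void()>> log_entries,
                     std::chrono::steady_clock::time_point work_period_end)
{
  for (auto& entry : log_entries) {
    if (std::chrono::steady_clock::now() >= work_period_end) {
      break;               // past the period: don't spawn any more work
    }
    entry();               // process/spawn one log entry
  }
  // 1. wait for any pending spawned work to finish (elided)
  // 2. write the updated sync status so progress isn't lost
  // 3. only then release the lock so another gateway can take over
}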
src/rgw/rgw_sync.cc
Outdated
} else if (r == R_WORK_COMPLETE) {
  ldpp_dout(sync_env->dpp, 20) << "RGWMetaSyncShardCR::" << __func__ <<
    " incremental_sync work complete" << dendl;
  return 0;
} else if (r == R_WORK_PARTIAL) {
  ldpp_dout(sync_env->dpp, 20) << "RGWMetaSyncShardCR::" << __func__ <<
    " incremental_sync work partially complete" << dendl;
  continue; // loop back up and try for more work
}
can you help me understand what we're trying to do differently between these two cases? if i understand correctly, there's very little difference between return 0 and continue here
return set_cr_done() would complete the coroutine successfully, and i agree that's not what we want to do here. RGWMetaSyncShardControlCR spawned us, and it doesn't retry on success, so nothing would ever run sync on this shard again
return 0 just acts like a yield that leaves the coroutine scheduled. that would let RGWCoroutinesManager::run() go and process any other scheduled coroutines, then call us right back
what do you think should happen here? should R_WORK_COMPLETE wait for a bit before trying to grab the lock again? or should both cases retry right away?
src/rgw/rgw_cr_rados.cc
Outdated
int RGWBidManagerCR::operate(const DoutPrefixProvider *dpp)
{
  if (aborted) {
    return set_cr_done();
  }
  reenter(this) {
    while (!going_down) {
      yield wait(utime_t(recalc_interval_secs, 0));
      bids.shuffle();
      ...
    }
    return set_state(RGWCoroutine_Done);
this bid manager coroutine is simple and straightforward, but it complicates RGWRemoteMetaLog::run_sync() with an extra background coroutine. that's passing a raw RGWBidManagerCR pointer into RGWMetaSyncCR, but i don't think we can guarantee that the former coroutine will outlive the latter. to make that safe, RGWMetaSyncCR would need to hold a reference with intrusive_ptr<RGWBidManagerCR>, as well as the intrusive_ptr<RGWCoroutinesStack> it was spawned on
instead of this coroutine, what if the shuffle was just part of the 'bid function'?
class BidFunc {
  ceph::timespan interval;
  ceph::timespan bid_duration;
  ceph::coarse_mono_time last_shuffle;
  rados::cls::lock::BidSet<int32_t> bids;
 public:
  BidFunc(...);
  RGWLockBid operator()(int shard_id, ceph::coarse_mono_time now) {
    if (last_shuffle + interval < now) {
      bids.shuffle();
      last_shuffle = now;
    }
    return {bids.get_bid(shard_id), bid_duration};
  }
};

this could just live on the stack in RGWRemoteMetaLog::run_sync(), so wouldn't need any special memory management
src/cls/lock/cls_lock_client.h
Outdated
template<typename T>
class BidSet {
  ...
  static_assert(std::is_integral<T>::value, "Integral required.");
do we really need a template here if the functions all require int32_t bid_amount?
src/cls/lock/cls_lock_client.h
Outdated
void shuffle() {
  std::uniform_int_distribution<T> dist(0, bids.size() - 1);
  std::random_device urandom("/dev/urandom");
  T count = bids.size();
  for (T i = count - 1; i >= 1; --i) {
    T swap = dist(urandom);
    if (i != swap) {
      std::swap(bids[i], bids[swap]);
here's a nice 3-line example of a random shuffle from rgw_lc.cc:
std::random_device rd;
std::default_random_engine rng{rd()};
std::shuffle(v.begin(), v.end(), rng);

we can avoid reopening the random device on every call by doing it once in the constructor to seed a std::default_random_engine member variable. don't forget to #include <random> and <algorithm>
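For instance, a shuffler that seeds its engine once in the constructor might look like this (class and member names are illustrative only):

#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// illustrative only: seed the engine once instead of reopening /dev/urandom
// on every shuffle() call
class ShuffledBids {
  std::vector<int32_t> bids;
  std::default_random_engine rng;
 public:
  explicit ShuffledBids(size_t count)
      : bids(count), rng(std::random_device{}()) {
    for (size_t i = 0; i < count; ++i) {
      bids[i] = static_cast<int32_t>(i);
    }
  }
  void shuffle() {
    std::shuffle(bids.begin(), bids.end(), rng);
  }
  int32_t get_bid(size_t shard) const {
    return bids[shard % bids.size()];
  }
};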
src/rgw/rgw_cr_rados.cc
Outdated
// polling loop; as long as lock is busy wait and poll again
do {
RGWContinuousLeaseCR doesn't have a polling loop like this, it just fails and returns EBUSY up through RGWMetaSyncShardCR, and RGWMetaSyncShardControlCR keeps retrying with backoff. it shouldn't be necessary to have a retry loop here in RGWRebiddableLeaseCR
otherwise, the rest of this coroutine looks essentially the same as RGWContinuousLeaseCR except that it calls RGWBiddedRadosLockCR instead of RGWSimpleRadosLockCR
in another comment, i suggested combining RGWBiddedRadosLockCR into RGWSimpleRadosLockCR by using an optional<RGWLockBid> argument. could we do the same with RGWRebiddableLeaseCR and RGWContinuousLeaseCR, except with an optional<BidFunc>?
src/rgw/rgw_sync.cc
Outdated
while (!rebiddable_lease_cr->is_locked()) {
  if (rebiddable_lease_cr->is_done()) {
didn't full_sync() already wait on this lock acquisition above on line 1637?
maybe this part inside the do-while loop should just check if (!rebiddable_lease_cr->is_locked()) { for an early exit like you do below in incremental_sync()?
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

The tracker ticket https://tracker.ceph.com/issues/41230 is still pointing to the closed #28119. Should the ticket be updated to point to this PR?
Force-pushed cb3355e to 5f71198
@caisan multisite sync has always used this polling strategy with cls_lock on log shards. the problem is that the first radosgw to start up usually gets all of the shard locks. the goal of this PR is to spread those locks evenly across all participating radosgws
Force-pushed 3c3b4d5 to 917b392
for example, will the lease lock get all shard locks if there is no bid? @cbodley
can_adjust_marker = true;
/* grab lock */
...
if (!sync_env->bid_manager->is_highest_bidder(shard_id)) {
RGWMetaSyncShardControlCR is where the retry loop lives. we want to run even if we aren't the highest bidder yet
metadata sync performance test results comparing before and after the commits in the PR, on the bottom left is metadata sync status of the secondary:

thanks @mkogan1
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
multisite metadata sync fairness
The approach of this commit is to allow multiple RGWs to participate in the multisite metadata sync.
Before this commit, only a single RGW caught all of the sync locks.
This feature uses a bidding algorithm.
Each RGW randomizes the numbers from 0 to the shard count and, for each shard, picks one number at random as its bid_amount.
The bid vector each RGW holds is sent to the other RGWs using watch-notify (based on RADOS watch-notify).
Each time an RGW tries to lock a shard, it compares its bid for that lock with the bids of the other RGWs; only if the current RGW has the highest bid does it try to acquire the lock.
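To make the description above concrete, here is a minimal, self-contained sketch of the bidding idea (illustrative only; these are not the PR's actual classes, and the transport for exchanging bids is elided):

#include <algorithm>
#include <cstdint>
#include <map>
#include <numeric>
#include <random>
#include <string>
#include <vector>

// each gateway keeps one random bid per shard; bid vectors are exchanged
// between gateways (in the PR, over RADOS watch-notify) and a gateway only
// attempts the shard's lock when its own bid is the highest it has seen
struct BidTable {
  std::vector<int32_t> my_bids;                       // my bid per shard
  std::map<std::string, std::vector<int32_t>> peers;  // peer id -> their bids

  explicit BidTable(size_t num_shards) : my_bids(num_shards) {
    std::iota(my_bids.begin(), my_bids.end(), 0);
    std::default_random_engine rng{std::random_device{}()};
    std::shuffle(my_bids.begin(), my_bids.end(), rng);
  }

  // called when a peer's bid vector arrives
  void on_peer_bids(const std::string& peer, std::vector<int32_t> bids) {
    peers[peer] = std::move(bids);
  }

  // only the highest bidder for a shard tries to take its lock
  bool is_highest_bidder(size_t shard) const {
    for (const auto& [peer, bids] : peers) {
      if (shard < bids.size() && bids[shard] >= my_bids[shard]) {
        return false;
      }
    }
    return true;
  }
};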
Important configs:
rgw_sync_work_period - for how long the RGW will sync before it releases the lock (very important at the beginning, because initially only a single RGW holds all the locks)
rgw_sync_lease_period - not new to this commit, but it affects it: for how many seconds the RGW asks RADOS to keep the lock; mainly important in case of failure, so an RGW automatically loses the lock if it goes down
Fixes: https://tracker.ceph.com/issues/41230
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
Signed-off-by: Or Friedmann <ofriedma@ibm.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
If we are outbid, exit after updating sync status.

Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
…) control flow
- check for is_highest_bidder() before even attempting to take the lock
- don't block on RGWMetaSyncShardNotifyCR()
- other minor fixes

Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
Force-pushed 917b392 to 410246f
It looks like there are still changes going on with this PR. If it's ready for testing please re-add "needs-qa".

Some known failures, including the crash seen in https://tracker.ceph.com/issues/57905, are unrelated to this PR. @cbodley I'll go ahead and merge if you have no objection.

thanks to @mattbenjamin @idryomov @athanatos @gregsfortytwo for all the design review and feedback
…s-bidding rgw: sync fairness bidding

Resolves rhbz#1740782

Reviewed-by: J. Eric Ivancich <ivancich@redhat.com>
Reviewed-by: Casey Bodley <cbodley@redhat.com>
(cherry picked from commit 8c26848)


Thank you Eric for writing most of the code and for the algorithm
Fixes: https://tracker.ceph.com/issues/41230
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
Signed-off-by: Or Friedmann <ofriedma@ibm.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>