Conversation
please make sure that we preserve the original commits and their attributions (Author/Signed-off-by)

Because the code base has changed a lot I will need to rework the commits; it was difficult to preserve them as-is. Once I have made reasonable commits I will of course add the Signed-off-by of the original ones.

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Force-pushed 5d4dd5b to cb3355e
ivancich
left a comment
I reviewed the non-co-routine portion of the code and approve it.
I believe other reviewers are better-suited to review the co-routine components.
This represents an amazing effort on Or's part and I'm very appreciative. Thanks, Or!!!
cbodley
left a comment
it might be better to split out the cls_lock changes (at the very least, into its own commit), so they can be more easily reviewed by the rbd team
src/cls/lock/cls_lock.cc
Outdated
static std::mutex bids_mtx;
static ObjectBidMap bids;
i'm concerned about the use of global state here. shouldn't these bids be specific to one rados object? if so, this map should be stored in the object's xattr with the rest of lock_info_t
Yeah, I don't understand this as well. It appears to be specific to one object, but because there is a single map holding data for all objects a mutex is needed? And then, even if bids doesn't need to persist between OSD restarts (why -- because data that helps to ensure fairness is safe to discard?), I think we still need to consider what happens when the primary moves to a different OSD and then returns back after a while? Wouldn't bids go stale?
This...appears to be a per-process bid map and associated mutex. This approach seems entirely untenable. The osd which handles invocation of an object class is the current primary for the pg containing that object and object classes do not get notified when the primary changes since they are categorically not supposed to maintain any state outside of their rados objects.
- What handles removing the bid object map when this specific osd ceases to be the primary for one of these objects?
- How does this ensure that the in-memory state actually reflects the most recent updates to the object if the primary changed briefly and the new primary handled one of these calls?
- How would this state get updated if the OSD's copy of the object is recovered?
I'll try to take a closer look at this tomorrow or Monday, but I strongly discourage you from trying to be clever with stashing state in memory. It seems like you could pretty easily use the existing object class api to actually store each object's state within its own omap entries. If performance is demonstrated to be a problem with that approach and no other simpler solution seems viable, it might be possible to add some object class owned ephemeral state to the ObjectContext, but that would be much more complicated.
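For illustration only, a minimal sketch of the omap-based approach being suggested (the method name, omap key, and payload format here are made up, not taken from this PR):

#include <cerrno>
#include "objclass/objclass.h"
#include "include/encoding.h"

// hypothetical cls method: persist a caller's bid in the object's own omap so
// it lives with the object (surviving primary changes and recovery) instead of
// in a process-global map guarded by a mutex
static int set_bid(cls_method_context_t hctx, ceph::buffer::list *in,
                   ceph::buffer::list *out)
{
  int32_t bid;
  try {
    auto iter = in->cbegin();
    ceph::decode(bid, iter);
  } catch (const ceph::buffer::error&) {
    return -EINVAL;
  }

  ceph::buffer::list val;
  ceph::encode(bid, val);
  int r = cls_cxx_map_set_val(hctx, "lock.bid", &val);
  if (r < 0) {
    CLS_LOG(1, "ERROR: failed to persist bid: r=%d", r);
    return r;
  }
  return 0;
}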
"Completely untenable" seems rather overdone. And the suggestion to transform potentially transient/discardable/reconstructable state into costly, transactional state seems bad advice. (And apparently that would conflict with RADOS semantics some of the time.)
From what I understand, a key point is that the bid state is valuable but not critical--the state is a set of hints, basically. As I observed elsewhere, it's probably intended to be per pg, not "per process," which would suggest partitioning the map and its mutex by pg, maybe? To me the biggest gap currently is that neither the size nor the lifetime of the state in the map(s) is bounded, but I agree that since the stored state is apparently not critical, it is apparently OK to let it go stale (as in your pg primary change/change-back and object recovery scenarios). OTOH, it would probably be attractive if something could induce this module to drop state for a pg entirely in that event.
Matt
IOW, no one is (unreasonably) trying to "be clever." Folks are trying to implement an efficient work-sharing mechanism that is, weakly, stateful. I think @cbodley's suggestion to stash some specific data in ObjectContext was made on the assumption that it would be easy, and not intrusive. I buy that that isn't the case, but statefulness of the type intended here is not overreach, it's just ordinary application of standard art.
The librados interface is designed around objects and the osd is set up to maintain durable state around those objects; and to handle their consistency. Nothing in any interface is set up to handle in-memory state or for the object_class api to be stateful separate from what it writes out to objects. A "weakly stateful" Ceph interface doesn't exist.
Now, if this is actually a critical capability to develop, we can talk about that. But trying to shoehorn it in through an accident of the class implementation is indeed not tenable. There needs to be either a fundamentally different solution or a fundamentally different approach to getting it into the OSD, with a new system that has defined semantics and a proper interface for dealing with...all the things the osd does.
That said, there is a mechanism for users of the osd to communicate with each other and send around state. It's watch-notify. Can this bid-and-lock algorithm be constructed that way, instead?
thanks for the feedback!
It's watch-notify. Can this bid-and-lock algorithm be constructed that way, instead?
good point, let's try to build a design around that. multisite already uses watch-notify to coordinate bilog trimming, so that could serve as an example
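As a very rough illustration of that direction (not the PR's code; the control object name, helper, and payload handling are made up), each gateway could watch a shared control object and broadcast its bid vector with notify:

#include <string>
#include <rados/librados.hpp>

// hypothetical watcher: receives other gateways' bid vectors via notify
class BidWatcher : public librados::WatchCtx2 {
 public:
  librados::IoCtx* ioctx = nullptr;
  std::string oid;                  // shared control object, e.g. "sync.bids"

  void handle_notify(uint64_t notify_id, uint64_t cookie,
                     uint64_t notifier_id, librados::bufferlist& bl) override {
    // decode and remember the sender's bids from bl (elided), then ack
    librados::bufferlist reply;
    ioctx->notify_ack(oid, notify_id, cookie, reply);
  }
  void handle_error(uint64_t cookie, int err) override {
    // the watch broke; the caller should re-establish it
  }
};

// register the watch once, then broadcast our own bid vector to all watchers
int start_and_broadcast(librados::IoCtx& ioctx, const std::string& oid,
                        librados::bufferlist& my_bids, BidWatcher& watcher,
                        uint64_t* handle) {
  watcher.ioctx = &ioctx;
  watcher.oid = oid;
  int r = ioctx.watch2(oid, handle, &watcher);
  if (r < 0) {
    return r;
  }
  librados::bufferlist replies;
  return ioctx.notify2(oid, my_bids, 10000 /* ms timeout */, &replies);
}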
I had a quick discussion with @mattbenjamin and it sounds like the crux of the issue is that you can't update a hypothetical persistent version of this state in the event that the lock operation fails. Is that a reasonable characterization of the issue?
IIRC, object class writes can return payloads now. We could get around this by modifying the cls_lock behavior to always succeed if the client sent a bid (or add a differently named method with that behavior) and instead indicate success via the returned payload.
IIRC, object class writes can return payloads now.
Correct. It looks like on the client side you have to set the flag OPERATION_RETURNVEC, and then just fill "outdata" in the class (same as for a read method), and the OSD checks it's less than osd_max_write_op_reply_len and passes it back as part of the result. This looks to be used by rgw/cls_fifo_legacy.cc and rgw/rgw_notify.cc, plus examples in cls_hello.
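For reference, the cls-side shape could look roughly like this (the method name and reply format are hypothetical; the client would need to set librados::OPERATION_RETURNVEC to receive the payload):

#include "objclass/objclass.h"
#include "include/encoding.h"

// hypothetical "always succeeds" variant: the op returns 0 even when the bid
// loses, and the real outcome is encoded into the reply payload
static int lock_with_bid(cls_method_context_t hctx, ceph::buffer::list *in,
                         ceph::buffer::list *out)
{
  bool granted = false;
  // ... decode the request, compare the bid against the stored lock_info_t,
  // and take the lock if we won, setting granted accordingly (elided) ...

  ceph::encode(granted, *out);  // caller reads this instead of the return code
  return 0;                     // keep the write op successful either way
}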
src/cls/lock/cls_lock_client.h
Outdated
Lock(const std::string& _n) : name(_n),
  flags(0),
  bid_amount(-1),
  bid_duration(utime_t())
nit: this is the same as default-constructing bid_duration, so we can remove it from the constructor's member initializer list entirely
src/cls/lock/cls_lock.cc
Outdated
entity_inst_t origin;
int r = cls_get_request_origin(hctx, &origin);
ceph_assert(r == 0);
please log and return these errors, instead of asserting on them. same with the cls_get_obj_info() call below. unexpected errors here should never crash the osd
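Concretely, the quoted call site could handle the error like this (sketch only; log level and message are illustrative):

entity_inst_t origin;
int r = cls_get_request_origin(hctx, &origin);
if (r < 0) {
  CLS_LOG(1, "ERROR: cls_get_request_origin() failed: r=%d", r);
  return r;  // propagate the error instead of ceph_assert(), so the osd never crashes
}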
src/rgw/rgw_cr_rados.h
Outdated
class RGWBiddedRadosLockCR : public RGWSimpleRadosLockCR {
  ...
  RGWGetBidFunc get_bid_func;
this inheritance structure doesn't make a lot of sense. BiddedRadosLock inherits SimpleRadosLock for its member variables, but overrides all of its logic. and the only difference is this RGWGetBidFunc piece
instead of creating this new Bidded subclass, maybe we just add an optional<RGWLockBid> argument to SimpleRadosLock? then ContinuousLeaseCR would pass std::nullopt, and RebiddableLeaseCR would pass the result of its bid func
src/rgw/rgw_cr_rados.h
Outdated
class RGWRebiddableLeaseCR : public RGWContinuousLeaseCR {
  ...
protected:
  ...
  std::atomic<bool> releasing_lock = { false };
  RGWGetBidFunc get_bid_func;
std::function can be a very useful abstraction, but it comes with a cost. whenever the lambda capture is bigger than ~24 bytes, it has to allocate memory. full_bid_func = [this, bid_func, bid_duration_secs]() -> RGWLockBid { is always going to allocate because it captures another function. passing that function around will allocate again every time it's copied
in this case, we don't really need the abstraction - we're defining RGWBidManagerCR just below. we should be able to take a pointer to that and call into it directly
there's also fu2 (a more efficient std::function alternative) by @adamemerson ?
It's not more efficient; the main reason for pulling it in is for unique_function. (For functions that aren't copyable.)
We will not use std::function here at all; the solution that @cbodley suggested should work just fine. Maybe we can also remove the shuffle mechanism: from my tests it doesn't improve the lock distribution over time (I mean the rgw would only need to shuffle once, to initialize the bids), and it makes the solution a little bit more complex.
src/rgw/rgw_sync.cc
Outdated
// NOTE: once full_sync is also switched to bidded lease, these two
// sets can be merged
full_sync already switched, right?
src/rgw/rgw_sync.cc
Outdated
if(ceph::coarse_mono_clock::now() >= work_period_end)
{
  ldpp_dout(sync_env->dpp, 20) << "metadata sync work period end releasing lock" << dendl;
  yield rebiddable_lease_cr->release_lock();
}
it doesn't look like we're checking this until we've already finished all the work. i feel like the best place for this check is at the very top of the work loop above:
for (log_iter = log_entries.begin(); log_iter != log_entries.end() && !done_with_period; ++log_iter) {
  if (ceph::coarse_mono_clock::now() >= work_period_end) {

if we see that we're past the work period, we should stop spawning more work. but we can't drop the lock immediately - first we need to wait for any pending work to finish, then write the updated sync status. it doesn't look like metadata sync ever flushes the sync status, but here's an example from data sync: https://github.com/ceph/ceph/blob/e9d361f6/src/rgw/rgw_data_sync.cc#L4080-L4084
this approach should minimize how far we overshoot the work period. does that make sense? do you think that's consistent with your design?
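In plain, non-coroutine terms, the control flow being suggested is roughly this (illustrative C++ only; none of these names come from the PR):

#include <chrono>
#include <deque>
#include <functional>

// illustrative only: stop scheduling new entries once the work period ends,
// drain what was already spawned, persist progress, then give up the lock
void run_work_period(std::deque<std::function<void()>> log_entries,
                     std::chrono::steady_clock::time_point work_period_end)
{
  for (auto& entry : log_entries) {
    if (std::chrono::steady_clock::now() >= work_period_end) {
      break;               // past the period: don't spawn any more work
    }
    entry();               // process/spawn one log entry
  }
  // 1. wait for any pending spawned work to finish (elided)
  // 2. write the updated sync status so progress isn't lost
  // 3. only then release the lock so another gateway can take over
}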
src/rgw/rgw_sync.cc
Outdated
} else if (r == R_WORK_COMPLETE) {
  ldpp_dout(sync_env->dpp, 20) << "RGWMetaSyncShardCR::" << __func__ <<
    " incremental_sync work complete" << dendl;
  return 0;
} else if (r == R_WORK_PARTIAL) {
  ldpp_dout(sync_env->dpp, 20) << "RGWMetaSyncShardCR::" << __func__ <<
    " incremental_sync work partially complete" << dendl;
  continue; // loop back up and try for more work
}
can you help me understand what we're trying to do differently between these two cases? if i understand correctly, there's very little difference between return 0 and continue here
return set_cr_done() would complete the coroutine successfully, and i agree that's not what we want to do here. RGWMetaSyncShardControlCR spawned us, and it doesn't retry on success, so nothing would ever run sync on this shard again
return 0 just acts like a yield that leaves the coroutine scheduled. that would let RGWCoroutinesManager::run() go and process any other scheduled coroutines, then call us right back
what do you think should happen here? should R_WORK_COMPLETE wait for a bit before trying to grab the lock again? or should both cases retry right away?
src/rgw/rgw_cr_rados.cc
Outdated
int RGWBidManagerCR::operate(const DoutPrefixProvider *dpp)
{
  if (aborted) {
    return set_cr_done();
  }
  reenter(this) {
    while (!going_down) {
      yield wait(utime_t(recalc_interval_secs, 0));
      bids.shuffle();
      ...
    }
    return set_state(RGWCoroutine_Done);
this bid manager coroutine is simple and straightforward, but it complicates RGWRemoteMetaLog::run_sync() with an extra background coroutine. that's passing a raw RGWBidManagerCR pointer into RGWMetaSyncCR, but i don't think we can guarantee that the former coroutine will outlive the latter. to make that safe, RGWMetaSyncCR would need to hold a reference with intrusive_ptr<RGWBidManagerCR>, as well as the intrusive_ptr<RGWCoroutinesStack> it was spawned on
instead of this coroutine, what if the shuffle was just part of the 'bid function'?
class BidFunc {
  ceph::timespan interval;
  ceph::timespan bid_duration;
  ceph::coarse_mono_time last_shuffle;
  rados::cls::lock::BidSet<int32_t> bids;
 public:
  BidFunc(...);
  RGWLockBid operator()(int shard_id, ceph::coarse_mono_time now) {
    if (last_shuffle + interval < now) {
      bids.shuffle();
      last_shuffle = now;
    }
    return {bids.get_bid(shard_id), bid_duration};
  }
};

this could just live on the stack in RGWRemoteMetaLog::run_sync(), so wouldn't need any special memory management
src/cls/lock/cls_lock_client.h
Outdated
template<typename T>
class BidSet {
  ...
  static_assert(std::is_integral<T>::value, "Integral required.");
do we really need a template here if the functions all require int32_t bid_amount?
src/cls/lock/cls_lock_client.h
Outdated
void shuffle() {
  std::uniform_int_distribution<T> dist(0, bids.size() - 1);
  std::random_device urandom("/dev/urandom");
  T count = bids.size();
  for (T i = count - 1; i >= 1; --i) {
    T swap = dist(urandom);
    if (i != swap) {
      std::swap(bids[i], bids[swap]);
here's a nice 3-line example of a random shuffle from rgw_lc.cc:
std::random_device rd;
std::default_random_engine rng{rd()};
std::shuffle(v.begin(), v.end(), rng);

we can avoid reopening the random device on every call by doing it once in the constructor to seed a std::default_random_engine member variable. don't forget to #include <random> and <algorithm>
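For instance, a shuffler that seeds its engine once in the constructor might look like this (class and member names are illustrative only):

#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// illustrative only: seed the engine once instead of reopening /dev/urandom
// on every shuffle() call
class ShuffledBids {
  std::vector<int32_t> bids;
  std::default_random_engine rng;
 public:
  explicit ShuffledBids(size_t count)
      : bids(count), rng(std::random_device{}()) {
    for (size_t i = 0; i < count; ++i) {
      bids[i] = static_cast<int32_t>(i);
    }
  }
  void shuffle() {
    std::shuffle(bids.begin(), bids.end(), rng);
  }
  int32_t get_bid(size_t shard) const {
    return bids[shard % bids.size()];
  }
};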
src/rgw/rgw_cr_rados.cc
Outdated
// polling loop; as long as lock is busy wait and poll again
do {
RGWContinuousLeaseCR doesn't have a polling loop like this, it just fails and returns EBUSY up through RGWMetaSyncShardCR, and RGWMetaSyncShardControlCR keeps retrying with backoff. it shouldn't be necessary to have a retry loop here in RGWRebiddableLeaseCR
otherwise, the rest of this coroutine looks essentially the same as RGWContinuousLeaseCR except that it calls RGWBiddedRadosLockCR instead of RGWSimpleRadosLockCR
in another comment, i suggested combining RGWBiddedRadosLockCR into RGWSimpleRadosLockCR by using an optional<RGWLockBid> argument. could we do the same with RGWRebiddableLeaseCR and RGWContinuousLeaseCR, except with an optional<BidFunc>?
src/rgw/rgw_sync.cc
Outdated
while (!rebiddable_lease_cr->is_locked()) {
  if (rebiddable_lease_cr->is_done()) {
didn't full_sync() already wait on this lock acquisition above on line 1637?
maybe this part inside the do-while loop should just check if (!rebiddable_lease_cr->is_locked()) { for an early exit like you do below in incremental_sync()?
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

The tracker ticket https://tracker.ceph.com/issues/41230 is still pointing to the closed #28119. Should the ticket be updated to point to this PR?
Force-pushed cb3355e to 5f71198
@caisan multisite sync has always used this polling strategy with cls_lock on log shards. the problem is that the first radosgw to start up usually gets all of the shard locks. the goal of this PR is to spread those locks evenly across all participating radosgws
Force-pushed 3c3b4d5 to 917b392
for example, will the lease lock get all shard locks if there is no bid? @cbodley
can_adjust_marker = true;
/* grab lock */
...
if (!sync_env->bid_manager->is_highest_bidder(shard_id)) {
RGWMetaSyncShardControlCR is where the retry loop lives. we want to run even if we aren't the highest bidder yet
metadata sync performance test results comparing before and after the commits in the PR, on the bottom left is metadata sync status of the secondary:

thanks @mkogan1
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
multisite metadata sync fairness
The approach of this commit is to allow multiple RGWs to participate in the multisite metadata sync.
Before this commit, only a single RGW caught all of the sync locks.
This feature uses a bidding algorithm.
Each RGW randomizes the numbers from 0 to the shard count and, for each shard, picks one number at random as its bid_amount.
The bid vector each RGW holds is sent to the other RGWs using watch-notify (based on RADOS watch-notify).
Each time an RGW tries to lock a shard, it compares its bid for that lock with the bids of the other RGWs; only if the current RGW has the highest bid does it try to acquire the lock.
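To make the description above concrete, here is a minimal, self-contained sketch of the bidding idea (illustrative only; these are not the PR's actual classes, and the transport for exchanging bids is elided):

#include <algorithm>
#include <cstdint>
#include <map>
#include <numeric>
#include <random>
#include <string>
#include <vector>

// each gateway keeps one random bid per shard; bid vectors are exchanged
// between gateways (in the PR, over RADOS watch-notify) and a gateway only
// attempts the shard's lock when its own bid is the highest it has seen
struct BidTable {
  std::vector<int32_t> my_bids;                       // my bid per shard
  std::map<std::string, std::vector<int32_t>> peers;  // peer id -> their bids

  explicit BidTable(size_t num_shards) : my_bids(num_shards) {
    std::iota(my_bids.begin(), my_bids.end(), 0);
    std::default_random_engine rng{std::random_device{}()};
    std::shuffle(my_bids.begin(), my_bids.end(), rng);
  }

  // called when a peer's bid vector arrives
  void on_peer_bids(const std::string& peer, std::vector<int32_t> bids) {
    peers[peer] = std::move(bids);
  }

  // only the highest bidder for a shard tries to take its lock
  bool is_highest_bidder(size_t shard) const {
    for (const auto& [peer, bids] : peers) {
      if (shard < bids.size() && bids[shard] >= my_bids[shard]) {
        return false;
      }
    }
    return true;
  }
};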
Important configs:
rgw_sync_work_period - for how long the RGW will sync before it releases the lock (very important at the beginning, because initially only a single RGW holds all the locks)
rgw_sync_lease_period - not new to this commit, but it affects it: for how many seconds the RGW asks RADOS to keep the lock; mainly important in case of failure, so an RGW automatically loses the lock if it goes down
Fixes: https://tracker.ceph.com/issues/41230
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
Signed-off-by: Or Friedmann <ofriedma@ibm.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
If we are outbid, exit after updating sync status.

Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
…) control flow
- check for is_highest_bidder() before even attempting to take the lock
- don't block on RGWMetaSyncShardNotifyCR()
- other minor fixes

Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
Force-pushed 917b392 to 410246f
It looks like there are still changes going on with this PR. If it's ready for testing please re-add "needs-qa".

Some known failures, including the crash seen in https://tracker.ceph.com/issues/57905, are unrelated to this PR. @cbodley I'll go ahead and merge if you have no objection.

thanks to @mattbenjamin @idryomov @athanatos @gregsfortytwo for all the design review and feedback
…s-bidding rgw: sync fairness bidding

Resolves rhbz#1740782

Reviewed-by: J. Eric Ivancich <ivancich@redhat.com>
Reviewed-by: Casey Bodley <cbodley@redhat.com>
(cherry picked from commit 8c26848)


Thank you Eric for writing most of the code and for the algorithm
Fixes: https://tracker.ceph.com/issues/41230
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
Signed-off-by: Or Friedmann <ofriedma@ibm.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>