Skip to content

mon: update PaxosService::cached_first_committed in PaxosService::maybe_trim()#19397

Merged
liewegas merged 1 commit intoceph:masterfrom
xxhdx1985126:master
Jan 16, 2018
Merged

mon: update PaxosService::cached_first_committed in PaxosService::maybe_trim()#19397
liewegas merged 1 commit intoceph:masterfrom
xxhdx1985126:master

Conversation

@xxhdx1985126
Copy link
Copy Markdown
Member

@xxhdx1985126 xxhdx1985126 commented Dec 8, 2017

This should be able to avoid interleaving execution of Paxos::commit_finish() and
check_sub(), which could lead to unexpected failure of ceph-mon.

Fixes: http://tracker.ceph.com/issues/11332
Signed-off-by: Xuehan Xu xuxuehan@360.cn
Signed-off-by: yupeng chen chenyupeng-it@360.cn

@liewegas liewegas changed the title Monitor: queue "send_incremental" in waiting_for_finished_proposal when OSDMonitor isn't active mon: queue "send_incremental" in waiting_for_finished_proposal when OSDMonitor isn't active Dec 9, 2017
Copy link
Copy Markdown
Member

@liewegas liewegas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better fix is to change hte check_osdmap_sub call to return (and do nothing) if not active, and then call check_osdmap_subs() in on_active(). the subscribe is already recording that the client wants the map; i don't think add a new ad hoc async way of sending the map is a good way to get there.

This might be a good time to check that the other check_sub methods also bail out if not active (and check their subs when they become active).

@liewegas
Copy link
Copy Markdown
Member

liewegas commented Dec 9, 2017

@jecluis want to take a look at this? i wonder if instead of calling the check_subs() from update_from_paxos() it should always happen in on_active() instead?

@xxhdx1985126
Copy link
Copy Markdown
Member Author

@liewegas Thanks for your reply:-)
I read the source code and found that "send_incremental" is not only called in check_osdmap_sub, but also in the process of some other events such as osd boot, osd failure and so on, in which cases the "send_incremental" operation is not recorded, so I think it might still be necessary to add this async callback. Is this right? Thanks:-)

@liewegas
Copy link
Copy Markdown
Member

For those other paths the service is active. It looks like it's just the handle_subscribe case where we aren't checking if the service is active.

@xxhdx1985126 xxhdx1985126 changed the title mon: queue "send_incremental" in waiting_for_finished_proposal when OSDMonitor isn't active mon: check_osdmap_sub to return when OSDMonitor is not active Dec 15, 2017
@xxhdx1985126 xxhdx1985126 force-pushed the master branch 2 times, most recently from 5ae93e9 to 6f3f156 Compare December 15, 2017 06:32
@xxhdx1985126 xxhdx1985126 reopened this Dec 19, 2017
@xxhdx1985126 xxhdx1985126 changed the title mon: check_osdmap_sub to return when OSDMonitor is not active mon: check_osdmap_sub to be called in "on_active" when OSDMonitor is not active Dec 19, 2017
@xxhdx1985126
Copy link
Copy Markdown
Member Author

xxhdx1985126 commented Dec 19, 2017

I just modified this PR as suggested, and it looks fine in my local test. Please review this PR again, thanks:-)

@xxhdx1985126 xxhdx1985126 force-pushed the master branch 4 times, most recently from 1135c32 to 0a4cc8b Compare December 20, 2017 04:14
@xxhdx1985126
Copy link
Copy Markdown
Member Author

Hi, everyone. It seems that the failure of docs build is not caused by this PR:

Reading package lists...
E: Failed to fetch http://mirror.cs.uchicago.edu/ubuntu-toolchain-r/dists/xenial/main/binary-i386/Packages 404 Not Found
E: Failed to fetch http://mirror.yandex.ru/mirrors/launchpad/ubuntu-toolchain-r/dists/xenial/main/i18n/Translation-en 404 Not Found [IP: 213.180.204.183 80]
E: Some index files failed to download. They have been ignored, or old ones used instead.

@xxhdx1985126
Copy link
Copy Markdown
Member Author

Um, Hi, everyone. Please review this PR, thanks:-)

@xxhdx1985126 xxhdx1985126 changed the title mon: check_osdmap_sub to be called in "on_active" when OSDMonitor is not active mon: check_osdmap_sub to be called in "on_active" when OSDMonitor is not active; common: manually count SimpleLRU's size instead of relying on list::size() Dec 30, 2017
Comment thread src/common/simple_cache.hpp Outdated

void trim_cache() {
while (lru.size() > max_size) {
while (size > max_size) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're using "size" field as a substitute of "lru.size()" only here, so it'd be safer and easier to make it local for this function, something like

size_t size = lru.size();
while (size > max_size) {
   contents.erase(lru.back().first);
   lru.pop_back();
   size--;
}

That way you'll save 8 bytes of memory and won't slow down _add() and _clear().

Copy link
Copy Markdown
Contributor

@tchaikov tchaikov Jan 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's not true. size is updated in _add() and clear(). but I'd suggest check __GLIBCXX__ and _GLIBCXX_USE_CXX11_ABI for this workaround. so we keep tracking the size only if libstdc++ with old ABI is used. like

#if defined(__GLIBCXX__) && !_GLIBCXX_USE_CXX11_ABI
#define SLOW_LIST_SIZE
#endif
....

#ifdef SLOW_LIST_SIZE
size_t _size;
void inc_size() {
  ++_size;
}
void dec_size() {
  --_size;
}
size_t size() const {
  return _size;
}
#else
void inc_size() {}
void dec_size() {}
size_t size() const {
  return contents.size();
}
#endif
...

void _add(K key, V&& value) {
  ...
  inc_size();
}

what do you think?

/// edited per @xxhdx1985126 's comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tchaikov yes, the value itself is updated in _add() and _clear(), but the actual use of it is made in trim_cache() (also, I think it's missing an update after a call to lru.splice(). My point is - I think that with a such usage pattern, having it count manually will leave extra code that'll need to be removed once we're sure that list<>.size() is no longer broken.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tchaikov Thanks for your advise.
By the way, do mean "#if defined(GLIBCXX) && !_GLIBCXX_USE_CXX11_ABI" instead of "#if defined(GLIBCXX) && _GLIBCXX_USE_CXX11_ABI"? I thought it's when _GLIBCXX_USE_CXX11_ABI is 0 that we need a manual count. Right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@branch-predictor oh, thanks, it makes sense!

@xxhdx1985126 ahh, right! updated the code snippet.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@branch-predictor Thanks for your advise:-)
I think, since trim_cache() is called for every omap header creation, it could still lead to performance regression when doing massive file/directory creation even if we call list::size() outside the "while" block . Is this right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xxhdx1985126 Only if you're doing really massive file/directory creation and lru's size is large enough (say, 4096 entries or more). Still, if you do just single size() call per creation, it will be overshadowed by everything else around. Even that can be reduced by calling trim_cache() randomly (with, say, 33% probability).

@xxhdx1985126
Copy link
Copy Markdown
Member Author

xxhdx1985126 commented Jan 6, 2018

@liewegas We are testing @branch-predictor's way:-)

@xxhdx1985126 xxhdx1985126 force-pushed the master branch 2 times, most recently from c123dd3 to 8a18ecf Compare January 6, 2018 02:55
@xxhdx1985126 xxhdx1985126 changed the title mon: reschedule all "check_sub()"s to be called in on_active() when PaxosService is not active; common: manually count SimpleLRU's size instead of relying on list::size() mon: reschedule all "check_sub()"s to be called in on_active() when PaxosService is not active Jan 6, 2018
@xxhdx1985126
Copy link
Copy Markdown
Member Author

xxhdx1985126 commented Jan 6, 2018

@branch-predictor @liewegas I've move the SimpleLRU size modification to #19813 , please review it when you have time:-) Thanks:-)

@tchaikov
Copy link
Copy Markdown
Contributor

tchaikov commented Jan 8, 2018

i think, instead of guarding check_subs() using is_active(), a better approach might be to update the cached_first_committed in PaxosService::maybe_trim(), and make this variable an atomic<version_t>. the available epochs in [first_committed, last_committed] are always valid even when a trim is being proposed.

also, this avoids the weird is_leader() checkings due to different call paths in leader and peons.

@xxhdx1985126
Copy link
Copy Markdown
Member Author

@tchaikov Hi, thanks for the advise:-)
Do you mean cached_first_committed should be modified before the actual commit happens?

@tchaikov
Copy link
Copy Markdown
Contributor

tchaikov commented Jan 8, 2018

yes, i expected it's a controversial change. but after pondering over this, i think it's okay.

@liewegas
Copy link
Copy Markdown
Member

liewegas commented Jan 8, 2018

I think updating first_committed in Paxos::trim() makes perfect sense. There shouldn't be any need for a std::atomic, though, since trim() itself runs under the mon lock...?

@tchaikov
Copy link
Copy Markdown
Contributor

tchaikov commented Jan 8, 2018

There shouldn't be any need for a std::atomic

agreed. i missed that.

@xxhdx1985126
Copy link
Copy Markdown
Member Author

xxhdx1985126 commented Jan 9, 2018

@liewegas @tchaikov Hi, sorry, I don't quite follow you. first_committed seems not relevant to our problem which is caused by non-synchronized accesses to cached_first_committed. Do you mean that we should update first_committed in Paxos::trim() besides updating cached_first_committed in PaxosService::maybe_trim()?

@tchaikov
Copy link
Copy Markdown
Contributor

tchaikov commented Jan 9, 2018

@xxhdx1985126 sorry, should be PaxosService::maybe_trim(). i updated my comment. it has nothing to do with Paxos::trim().

@liewegas
Copy link
Copy Markdown
Member

liewegas commented Jan 9, 2018

Basically add cached_first_committed = trim_to;, right?

@tchaikov
Copy link
Copy Markdown
Contributor

tchaikov commented Jan 9, 2018

@liewegas exactly. something like

diff --git a/src/mon/PaxosService.cc b/src/mon/PaxosService.cc
index de732c3223..e2b9376bdc 100644
--- a/src/mon/PaxosService.cc
+++ b/src/mon/PaxosService.cc
@@ -399,6 +399,7 @@ void PaxosService::maybe_trim()
   MonitorDBStore::TransactionRef t = paxos->get_pending_transaction();
   trim(t, get_first_committed(), trim_to);
   put_first_committed(t, trim_to);
+  cached_first_committed = trim_to;

   // let the service add any extra stuff
   encode_trim_extra(t, trim_to);

…be_trim()

This should be able to avoid interleaving execution of Paxos::commit_finish() and
check_sub(), which could lead to unexpected failure of ceph-mon.

Fixes: http://tracker.ceph.com/issues/11332
Signed-off-by: Xuehan Xu <xuxuehan@360.cn>
Signed-off-by: yupeng chen <chenyupeng-it@360.cn>
@xxhdx1985126 xxhdx1985126 changed the title mon: reschedule all "check_sub()"s to be called in on_active() when PaxosService is not active mon: update PaxosService::cached_first_committed in PaxosService::maybe_trim() Jan 10, 2018
@xxhdx1985126
Copy link
Copy Markdown
Member Author

xxhdx1985126 commented Jan 10, 2018

@tchaikov @liewegas
Hi, after rethinking this appoarch, since PaxosService::maybe_trim is called from Monitor::tick() which is executed in another thread other than the monitor's main thread, I wonder whether the following scenario would happen and whether it would lead to something unacceptable once happened.

  1. PaxosService::maybe_trim() is invoked and push the cached_first_committed forward.
  2. An OSD subscribes osdmap, and is returned an incremental with the new oldest_map
  3. Some monitor went down and makes the result of the previous PaxosService::maybe_trim() fail to pass the paxos procedure and the in-flight proposal reset, which would leave an inconsistent view, between OSDs and monitors, about the oldest osdmap that the cluster should maintain.

Would this happen? And would it lead to anything bad?
Thanks:-)

@tchaikov
Copy link
Copy Markdown
Contributor

this could happen. but it does not hurt. assuming

  1. on mon.a, the leader, PaxosService::maybe_trim() moves cached_first_committed from 12 to 42, because it wants to trim the maps[12,42)
  2. OSD.n now believes that the oldest available map is osdmap.42
  3. mon.b, the peon, goes offline before the trim proposal is accepted by the quorum.
  4. accept times out, a new election will be called. and the new quorum is formed with or without mon.b
  5. monmap will be updated. in Monitor::refresh_from_paxos(), all paxos services' refresh() will be called. so cached_first_committed will be reset to the value before the trim is proposed.

@tchaikov tchaikov removed their request for review January 11, 2018 06:32
@xxhdx1985126
Copy link
Copy Markdown
Member Author

@tchaikov
Yes, and at that time, OSD.n thinks that the oldest available map is osdmap.42 while monitors thinks that to be osdmap.12. Can we confirm that this won't lead to any problem?
Thanks:-)

@tchaikov
Copy link
Copy Markdown
Contributor

i think i've replied that this "does not hurt". if you have any concerns, please just spill out. i am not sure the Socratic method is more efficient here.

@xxhdx1985126
Copy link
Copy Markdown
Member Author

@tchaikov
Sorry, I didn't understand you correctly:-)
I've no further concerns since I can't see anything bad either:-)
Thanks:-)

@xxhdx1985126
Copy link
Copy Markdown
Member Author

@tchaikov @liewegas I’ve modified this pr as suggested earlier today, please review it when you have time:)

@liewegas
Copy link
Copy Markdown
Member

@liewegas liewegas merged commit fdc3f5c into ceph:master Jan 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants