osd/PrimaryLogPG: fix potential pg-log overtrimming #23317
xiexingguo merged 1 commit into ceph:master
Conversation
Force-pushed from 800fbe3 to bc41646
Now that pg_log size is bounded (#23098), would it be more efficient to convert pg_log_t's log and dups to pre-reserved vectors or other data structures with constant-time size()?
@liewegas points out a deque would be a good match for pg_log
@jdurgin I agree that switching to deque might be the preferred option. While taking a close look at #23098, I think it is not safe, or even wrong, to drop the min_last_complete_ondisk cap when performing log trimming as 1ae5fd3 does, as we definitely don't want to trim past each peer's last_complete. Instead of doing that, I think we can figure out a better and safer way to limit the pg-log size...
As long as we are still applying last_update_ondisk it should be safe to ignore lcod; it just means that if we trim past it we'll fall back to backfill, which is basically the whole point of this PR.
@jdurgin @liewegas It took me quite a few days to figure out that it is impossible to simply switch pg_log from list to deque, vector or something similar without significantly changing the code logic. According to https://en.cppreference.com/w/cpp/container/deque, insertion at either end of a deque invalidates all of its iterators. To avoid introducing more complexity, shall we consider this an acceptable fix instead?
Force-pushed from bc41646 to 4492db8
jdurgin left a comment
yeah, in that case this fix seems ok, certainly simpler than rewriting the iterator code in PGLog.
this change breaks http://pulpito.ceph.com/kchai-2018-09-18_07:16:16-rados-wip-kefu2-testing-2018-09-18-1224-distro-basic-smithi/3037266/. could you take a look?
src/osd/PrimaryLogPG.cc
Outdated
```cpp
dout(10) << "calc_trim_to trimming to limit: " << limit << dendl;
break;
size_t i = 0;
while (it != pg_log.get_log().log.end()) {
```
src/osd/PrimaryLogPG.cc
Outdated
```cpp
}
dout(10) << "calc_trim_to " << pg_trim_to << " -> " << new_trim_to << dendl;
dout(10) << __func__ << " " << pg_trim_to << " -> "
         << new_trim_to << dendl;
```
can we simplify this loop like

```cpp
auto trim_to = pg_log.get_log().log.cbegin();
eversion_t trim_to_ver;
for (unsigned n_to_trim = 0, n_to_reserve = pg_log.get_log().log.size();
     n_to_trim < osd_pg_log_trim_max && n_to_reserve >= target;
     ++n_to_trim, --n_to_reserve) {
  if (trim_to == pg_log.get_log().log.cend()) {
    break;
  }
  trim_to_ver = trim_to->version;
  if (trim_to_ver >= limit) {
    break;
  }
  ++trim_to;
}
```

so we can stop the loop earlier once trim_to_ver reaches the limit.
also, i think we can update pg_log.get_log().approx_size() to take advantage of the O(1) implementation of list::size(), as we've moved to C++17 and GCC-7 is required.
if approx_size() is accurate, we probably don't need to fuss over this anymore.
@tchaikov This change needs to be backported to mimic
the same also applies to mimic.
sorry, i forgot that the libstdc++ shipped along with the GCC-7 packaged by SCLs is still using the old ABI. so it has O(n) list::size().
Force-pushed from 4492db8 to 570970b
src/osd/PrimaryLogPG.cc
Outdated
```cpp
  new_trim_to = rit->version;
}
if (i >= num_to_trim && candidate == eversion_t()) {
  candidate = it->version; // maybe
```
could rename candidate to by_n_to_trim, as it just serves as another criterion for choosing the upper bound of trimming. so, we could improve the code like:

```cpp
eversion_t by_n_to_keep;                     // start from tail
eversion_t by_n_to_trim = eversion_t::max(); // start from head
for (size_t i = 0; it != pg_log.get_log().log.end(); ++it, ++rit) {
  i++;
  if (i >= target && by_n_to_keep == eversion_t()) {
    by_n_to_keep = rit->version;
  }
  if (i >= num_to_trim && by_n_to_trim == eversion_t::max()) {
    by_n_to_trim = it->version;
    if (by_n_to_keep != eversion_t()) {
      break;
    }
  }
}
if (by_n_to_keep == eversion_t()) {
  return;
}
pg_trim_to = std::min({by_n_to_keep, by_n_to_trim, limit});
```

please note the parameter s of PGLog::IndexedLog::trim() is an inclusive bound, so we need to use >= when comparing i with target.
Force-pushed from 83286a3 to d4234ee
```cpp
if (by_n_to_keep == eversion_t()) {
  return;
}
```
num_to_trim here is the upper bound of the number of entries to trim. now, the only concern is that the actual number to trim could be less than osd_pg_log_trim_min; in that case, by_n_to_trim is eversion_t::max().
not sure how often that happens, or if it really happens in the real world. if we care about this and want to avoid it, we need to note down i when by_n_to_keep is set, along with the total number of entries in pg_log.get_log().log, to calculate the number to trim in case by_n_to_trim is not set.
I don't understand. Wouldn't pg_trim_to also be hard-capped at by_n_to_keep or head?
it has nothing to do with by_n_to_keep, but is related to the number to trim. there is a chance that we never reach num_to_trim when iterating through the log; in that case, the number to trim could be less than osd_pg_log_trim_min.
your PR tries to avoid "overtrimming". my concern is about too small a batch size for trimming.
I see... I guess we can leave it as it is? Because there is no O(1) way to get the log size, and that is why this (complicated) fix is needed...
In ceph#21580 I set a trap to catch some weird and random segfaults, and in a recent QA run I was able to observe it being triggered by one of the test cases, see:

```
http://qa-proxy.ceph.com/teuthology/xxg-2018-07-30_05:25:06-rados-wip-hb-peers-distro-basic-smithi/2837916/teuthology.log
```

The root cause is that there may be holes in the log versions, so the approx_size() method will (almost) always overestimate the actual number of log entries. As a result, we are at risk of overtrimming log entries.

ceph#18338 reveals a probably simpler way to fix the above problem, but unfortunately it can also cause a big performance regression, hence this PR.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Force-pushed from d4234ee to 3654d56
I also took a close look at the failure case at http://qa-proxy.ceph.com/teuthology/xxg-2018-09-21_02:02:01-rados-wip-fix-polog-overtrim-distro-basic-smithi/3047956/, which revealed that bluestore was starting to drop callbacks (e.g., on_commit) from high-level callers. I guess it is caused by #22739?
@xiexingguo I found the reason and updated http://tracker.ceph.com/issues/36099. Can you confirm this? Thanks
https://tracker.ceph.com/issues/36239 created, to backport this.