osd/PrimaryLogPG: fix potential pg-log overtrimming #23317
xiexingguo merged 1 commit into ceph:master
Conversation
Force-pushed from 800fbe3 to bc41646
Now that pg_log size is bounded (#23098), would it be more efficient to convert pg_log_t's log and dups to pre-reserved vectors or other data structures with constant-time size()?
@liewegas points out a deque would be a good match for pg_log
@jdurgin I agree that switching to deque might be the preferred option. While taking a close look at #23098, I think it is not safe, or even wrong, to drop the min_last_complete_ondisk cap when performing log trimming as 1ae5fd3 does, as we definitely don't want to trim past each peer's last_complete. Instead of doing that, I think we can figure out a better and safer way to limit the pg-log size...
As long as we are still applying last_update_ondisk it should be safe to ignore lcod; it just means that if we trim past it we'll fall back to backfill, which is basically the whole point of this PR.
@jdurgin @liewegas It took me quite a few days to figure out that it is impossible to simply switch pg_log from list to deque, vector or something similar without significantly changing the code logic. According to https://en.cppreference.com/w/cpp/container/deque, insertion at either end of a deque invalidates all of its iterators. To avoid introducing more complexity, shall we consider this an acceptable fix instead?
Force-pushed from bc41646 to 4492db8
jdurgin left a comment
yeah, in that case this fix seems ok, certainly simpler than rewriting the iterator code in PGLog.
this change breaks http://pulpito.ceph.com/kchai-2018-09-18_07:16:16-rados-wip-kefu2-testing-2018-09-18-1224-distro-basic-smithi/3037266/. could you take a look?
src/osd/PrimaryLogPG.cc
Outdated
```cpp
dout(10) << "calc_trim_to trimming to limit: " << limit << dendl;
break;
size_t i = 0;
while (it != pg_log.get_log().log.end()) {
```
src/osd/PrimaryLogPG.cc
Outdated
```cpp
}
dout(10) << "calc_trim_to " << pg_trim_to << " -> " << new_trim_to << dendl;
dout(10) << __func__ << " " << pg_trim_to << " -> "
         << new_trim_to << dendl;
```
can we simplify this loop like

```cpp
auto trim_to = pg_log.get_log().log.cbegin();
eversion_t trim_to_ver;
for (unsigned n_to_trim = 0, n_to_reserve = pg_log.get_log().log.size();
     n_to_trim < osd_pg_log_trim_max && n_to_reserve >= target;
     ++n_to_trim, --n_to_reserve) {
  if (trim_to == pg_log.get_log().log.cend()) {
    break;
  }
  trim_to_ver = trim_to->version;
  if (trim_to_ver >= limit) {
    break;
  }
  ++trim_to;
}
```

so we can stop the loop earlier once trim_to_ver reaches the limit.
also, i think we can update pg_log.get_log().approx_size() to take advantage of the O(1) implementation of list::size(), as we've moved to C++17 and GCC-7 is required.
if approx_size() is accurate, we probably don't need to fuss over this anymore.
@tchaikov This change needs to be backported to mimic
the same also applies to mimic.
sorry, i forgot that the libstdc++ shipped along with the GCC-7 packaged by SCLs is still using the old ABI. so it has O(n) list::size().
Force-pushed from 4492db8 to 570970b
src/osd/PrimaryLogPG.cc
Outdated
```cpp
  new_trim_to = rit->version;
}
if (i >= num_to_trim && candidate == eversion_t()) {
  candidate = it->version; // maybe
```
could rename candidate to by_n_to_trim, as it just serves as another criterion for choosing the upper bound of trimming. so, we could improve the code like:

```cpp
eversion_t by_n_to_keep;                     // start from tail
eversion_t by_n_to_trim = eversion_t::max(); // start from head
for (size_t i = 0; it != pg_log.get_log().log.end(); ++it, ++rit) {
  i++;
  if (i >= target && by_n_to_keep == eversion_t()) {
    by_n_to_keep = rit->version;
  }
  if (i >= num_to_trim && by_n_to_trim == eversion_t::max()) {
    by_n_to_trim = it->version;
    if (by_n_to_keep != eversion_t()) {
      break;
    }
  }
}
if (by_n_to_keep == eversion_t()) {
  return;
}
pg_trim_to = std::min({by_n_to_keep, by_n_to_trim, limit});
```

please note the parameter s of PGLog::IndexedLog::trim() is an inclusive bound, so we need to use >= when comparing i with target.
Force-pushed from 83286a3 to d4234ee
```cpp
if (by_n_to_keep == eversion_t()) {
  return;
}
```
num_to_trim here is the upper bound of the number of entries to trim. now, the only concern is that the actual number to trim could be less than osd_pg_log_trim_min; in that case, by_n_to_trim is eversion_t::max().
not sure how often that happens, or if it really happens in the real world. if we care about this and want to avoid it, we need to note down i when by_n_to_keep is set, along with the total number of entries in pg_log.get_log().log, to calculate the number to trim in case by_n_to_trim is not set.
I don't understand. Wouldn't pg_trim_to also be hard-capped at by_n_to_keep or head?
it has nothing to do with by_n_to_keep, but is related to the number to trim. there is a chance that we never reach num_to_trim when iterating through the log; in that case, the number to trim could be less than osd_pg_log_trim_min.
your PR tries to avoid "overtrimming". my concern is about too small a batch size for trimming.
I see... I guess we can leave it as it is? Because there is no O(1) way to get the log size, and that is why this (complicated) fix is needed...
In ceph#21580 I set a trap to catch some weird and random segfaults, and in a recent QA run I was able to observe it being triggered by one of the test cases, see:

```
http://qa-proxy.ceph.com/teuthology/xxg-2018-07-30_05:25:06-rados-wip-hb-peers-distro-basic-smithi/2837916/teuthology.log
```

The root cause is that there may be holes in the log versions, so the approx_size() method will (almost) always overestimate the actual number of log entries. As a result, we are at risk of overtrimming log entries.

ceph#18338 reveals a probably simpler way to fix the above problem, but unfortunately it can also cause a big performance regression, hence this PR.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Force-pushed from d4234ee to 3654d56
I also took a close look at the failure case at http://qa-proxy.ceph.com/teuthology/xxg-2018-09-21_02:02:01-rados-wip-fix-polog-overtrim-distro-basic-smithi/3047956/, which revealed that bluestore was starting to drop callbacks (e.g., on_commit) from high-level callers. I guess it is caused by #22739?
@xiexingguo I found the reason and updated http://tracker.ceph.com/issues/36099. Can you confirm this? Thanks
https://tracker.ceph.com/issues/36239 created, to backport this.