
osd/PrimaryLogPG: fix potential pg-log overtrimming#23317

Merged
xiexingguo merged 1 commit into ceph:master from xiexingguo:wip-fix-polog-overtrim
Sep 21, 2018

Conversation

@xiexingguo
Member

In #21580 I set a trap to catch some weird
and random segmentation faults, and in a recent QA run I observed it being
successfully triggered by one of the test cases; see:

http://qa-proxy.ceph.com/teuthology/xxg-2018-07-30_05:25:06-rados-wip-hb-peers-distro-basic-smithi/2837916/teuthology.log

The root cause is that there may be holes in the log versions, so the
approx_size() method will (almost) always overestimate the actual number of log entries.
As a result, we run the risk of overtrimming log entries.

#18338 suggests a probably simpler way
to fix the above problem, but unfortunately it can also cause a big performance regression,
hence this PR.

Signed-off-by: xie xingguo xie.xingguo@zte.com.cn

@xiexingguo xiexingguo force-pushed the wip-fix-polog-overtrim branch 2 times, most recently from 800fbe3 to bc41646 on July 30, 2018 12:17
@jdurgin jdurgin requested a review from neha-ojha July 30, 2018 14:15
@jdurgin
Member

jdurgin commented Jul 30, 2018

Now that pg_log size is bounded (#23098) would it be more efficient to convert pg_log_t's log and dups to pre-reserved vectors or other constant-size()-time data structures?

@jdurgin
Member

jdurgin commented Jul 30, 2018

@liewegas points out that a deque would be a good match for pg_log

@xiexingguo
Member Author

xiexingguo commented Jul 31, 2018

@jdurgin I agree that switching to a deque might be the preferred option.

Taking a closer look at #23098, I think it is unsafe, or even wrong, to drop the min_last_complete_ondisk cap when performing log trimming as 1ae5fd3 does, since we definitely don't want to trim past each peer's last_complete.

Instead of doing that, I think we can figure out a better and safer way to limit the pg-log size...

@liewegas
Member

As long as we are still applying last_update_on_disk it should be safe to ignore lcod; it just means that if we trim past it, we'll fall back to backfill, which is basically the whole point of this PR.

@xiexingguo
Member Author

@jdurgin @liewegas It took me quite a few days to figure out that it is impossible to simply switch pg_log from a list to a deque, vector, or something similar without substantially changing the code logic.

According to https://en.cppreference.com/w/cpp/container/deque, due to the iterator invalidation caused by push_back, complete_to and rollback_info_trimmed_to_riter would likely both get invalidated each time we append a new log entry to a deque-based log container (the same holds for vector).

To avoid introducing more complexity, shall we consider this an acceptable fix instead?

@xiexingguo xiexingguo force-pushed the wip-fix-polog-overtrim branch from bc41646 to 4492db8 on August 20, 2018 07:07
Member

@jdurgin jdurgin left a comment


yeah, in that case this fix seems ok, certainly simpler than rewriting the iterator code in PGLog.

@tchaikov
Contributor

```
dout(10) << "calc_trim_to trimming to limit: " << limit << dendl;
break;
size_t i = 0;
while(it != pg_log.get_log().log.end()) {
```
Contributor

add a space after while

```
}
dout(10) << "calc_trim_to " << pg_trim_to << " -> " << new_trim_to << dendl;
dout(10) << __func__ << " " << pg_trim_to << " -> "
         << new_trim_to << dendl;
```
Contributor

can we simplify this loop like:

```
auto trim_to = pg_log.get_log().log.cbegin();
eversion_t trim_to_ver;
for (unsigned n_to_trim = 0, n_to_reserve = pg_log.get_log().log.size();
     n_to_trim < osd_pg_log_trim_max && n_to_reserve >= target;
     ++n_to_trim, --n_to_reserve) {
  if (trim_to == pg_log.get_log().log.cend()) {
    break;
  }
  trim_to_ver = trim_to->version;
  if (trim_to_ver >= limit) {
    break;
  }
  ++trim_to;
}
```

so we can stop the loop early once trim_to_ver reaches the limit.

Contributor

also, I think we can update pg_log.get_log().approx_size() to take advantage of the O(1) implementation of list::size(), since we've moved to C++17 and GCC-7 is required.

if approx_size() is accurate, we probably don't need to fuss over this anymore.

Member Author

@tchaikov This change needs to be backported to mimic

Contributor

the same also applies to mimic.

Contributor

in master:

ceph/src/CMakeLists.txt

Lines 137 to 146 in eac8014

```
if(CMAKE_COMPILER_IS_GNUCXX AND
    CMAKE_CXX_COMPILER_VERSION VERSION_LESS 5.1)
  # this is not always correct, as one can use clang with libstdc++ or
  # use old gcc with new libstdc++, but it covers the most cases.
  #
  # libstdc++ 4.9 has O(n) list::size(), and its regex is buggy
  message(SEND_ERROR "performance regression is expected due to an O(n) "
    "implementation of 'std::list::size()' in libstdc++ older than 5.1.0, "
    "Please use GCC 5.1 and up.")
endif()
```

in mimic:

ceph/src/CMakeLists.txt

Lines 184 to 193 in d46e684

```
if(CMAKE_COMPILER_IS_GNUCXX AND
    CMAKE_CXX_COMPILER_VERSION VERSION_LESS 5.1)
  # this is not always correct, as one can use clang with libstdc++ or
  # use old gcc with new libstdc++, but it covers the most cases.
  #
  # libstdc++ 4.9 has O(n) list::size(), and its regex is buggy
  message(SEND_ERROR "performance regression is expected due to an O(n) "
    "implementation of 'std::list::size()' in libstdc++ older than 5.1.0, "
    "Please use GCC 5.1 and up.")
endif()
```

Contributor

sorry, I forgot that the libstdc++ shipped with the GCC-7 packaged by SCLs still uses the old ABI, so it has an O(n) list::size().

```
new_trim_to = rit->version;
}
if (i >= num_to_trim && candidate == eversion_t()) {
candidate = it->version; // maybe
```
Contributor

@tchaikov tchaikov Sep 20, 2018

could we rename candidate to by_n_to_trim? It just serves as another criterion for choosing the upper bound of trimming. So we could improve the code like:

```
eversion_t by_n_to_keep; // start from tail
eversion_t by_n_to_trim = eversion_t::max(); // start from head
for (size_t i = 0; it != pg_log.get_log().log.end(); ++it, ++rit) {
  i++;
  if (i >= target && by_n_to_keep == eversion_t()) {
    by_n_to_keep = rit->version;
  }
  if (i >= num_to_trim && by_n_to_trim == eversion_t::max()) {
    by_n_to_trim = it->version;
    if (by_n_to_keep != eversion_t()) {
      break;
    }
  }
}

if (by_n_to_keep == eversion_t()) {
  return;
}

pg_trim_to = std::min({by_n_to_keep, by_n_to_trim, limit});
```

please note that the parameter s of PGLog::IndexedLog::trim() is an inclusive bound, so we need to use >= when comparing i with target.

@xiexingguo xiexingguo force-pushed the wip-fix-polog-overtrim branch from 83286a3 to d4234ee on September 20, 2018 08:45

```
if (by_n_to_keep == eversion_t()) {
  return;
}
```
Contributor

@tchaikov tchaikov Sep 20, 2018

num_to_trim here is the upper bound of the number of entries to trim. Now the only concern is that the actual number to trim could be less than osd_pg_log_trim_min; in that case, by_n_to_trim is eversion_t::max().

I am not sure how often that happens, or whether it really happens in the real world. If we care about this and want to avoid it, we need to note down i when by_n_to_keep is set, plus the total number of entries in pg_log.get_log().log, to calculate the number to trim in case by_n_to_trim is not set.

Member Author

I don't understand. Wouldn't pg_trim_to also be hard-capped at by_n_to_keep or head?

Contributor

@tchaikov tchaikov Sep 20, 2018

It has nothing to do with by_n_to_keep; it is related to the number to trim. There is a chance that we never reach num_to_trim while iterating through the log; in that case, the number to trim could be less than osd_pg_log_trim_min.

Your PR tries to avoid overtrimming; my concern is about too small a batch size for trimming.

Member Author

I see... I guess we can leave it as is? There is no O(1) way to get the log size, which is why this (complicated) fix is needed...

In ceph#21580 I set a trap to catch some weird
and random segmentation faults, and in a recent QA run I observed it being
successfully triggered by one of the test cases; see:

```
http://qa-proxy.ceph.com/teuthology/xxg-2018-07-30_05:25:06-rados-wip-hb-peers-distro-basic-smithi/2837916/teuthology.log
```

The root cause is that there may be holes in the log versions, so the
approx_size() method will (almost) always overestimate the actual number of log entries.
As a result, we run the risk of overtrimming log entries.

ceph#18338 suggests a probably simpler way
to fix the above problem, but unfortunately it can also cause a big performance regression,
hence this PR.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
@xiexingguo xiexingguo force-pushed the wip-fix-polog-overtrim branch from d4234ee to 3654d56 on September 20, 2018 09:20
@xiexingguo xiexingguo merged commit 3c7c8c9 into ceph:master Sep 21, 2018
@xiexingguo xiexingguo deleted the wip-fix-polog-overtrim branch September 21, 2018 06:44
@xiexingguo
Member Author

I also took a close look at the failure case at http://qa-proxy.ceph.com/teuthology/xxg-2018-09-21_02:02:01-rados-wip-fix-polog-overtrim-distro-basic-smithi/3047956/, which revealed that bluestore was starting to drop callbacks (e.g., on_commit) from high-level callers.

I guess it is caused by #22739?

@majianpeng
Member

@xiexingguo I found the reason and updated http://tracker.ceph.com/issues/36099. Can you confirm this? Thanks.

@neha-ojha
Member

https://tracker.ceph.com/issues/36239 created to backport this.
