osd/PrimaryLogPG: refresh local last-complete iter for async_recovery_targets properly by xiexingguo · Pull Request #21598 · ceph/ceph

xiexingguo · 2018-04-23T11:58:14Z

Upon overwriting an already-missing object on async_recovery_targets,
besides updating the missing set, async_recovery_targets should also
call reset_complete_to() to reset the complete-to iter and
advance last_complete properly, as the old log entries which we used
to rely on for recovery are now being replaced by newly committed ones.

Since the recovery process of an async_recovery_target can be arbitrarily
delayed until we hit the osd_max_pg_log_entries limit, the main benefit
of an extra call to reset_complete_to() here is to make sure the primary
performs log-trim in time. Otherwise the accumulation of huge amounts of
obsolete log entries can noticeably slow the system down.

Signed-off-by: xiexingguo <xie.xingguo@zte.com.cn>
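
To illustrate the commit message above, here is a minimal toy model of the idea, not Ceph's actual `PGLog` code. `ToyEntry`, `ToyPGLog`, `append()` and `last_complete_after_overwrite()` are hypothetical names invented for this sketch, versions are plain integers rather than `eversion_t`, and the `reset_complete_to()` here only mimics the behaviour described in the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <list>
#include <map>
#include <string>

// Toy stand-ins for illustration only; these are NOT Ceph's real types.
struct ToyEntry {
  uint64_t version;
  std::string oid;
};

struct ToyPGLog {
  std::list<ToyEntry> entries;                // log entries, ascending version
  std::map<std::string, uint64_t> missing;    // oid -> version still needed
  std::list<ToyEntry>::iterator complete_to;  // first entry not yet complete
  uint64_t last_complete = 0;                 // newest version known complete

  // Oldest version any missing object still needs; 0 means nothing is missing.
  uint64_t oldest_need() const {
    uint64_t need = 0;
    for (const auto& kv : missing)
      need = (need == 0) ? kv.second : std::min(need, kv.second);
    return need;
  }

  // Walk from the head of the log to the first entry at or past the oldest
  // needed version, and set last_complete to the entry just before it.
  void reset_complete_to() {
    const uint64_t need = oldest_need();
    last_complete = 0;
    complete_to = entries.begin();
    while (complete_to != entries.end() &&
           (need == 0 || complete_to->version < need)) {
      last_complete = complete_to->version;
      ++complete_to;
    }
  }

  // Record a new write. If the object was already missing, it now needs the
  // new version instead of the old one; `refresh` models the extra call
  // proposed in this PR.
  void append(uint64_t version, const std::string& oid, bool refresh) {
    entries.push_back({version, oid});
    auto it = missing.find(oid);
    if (it != missing.end()) {
      it->second = version;
      if (refresh)
        reset_complete_to();
    }
  }
};

// Overwrite the missing object "b" and report last_complete afterwards.
uint64_t last_complete_after_overwrite(bool refresh) {
  ToyPGLog log;
  log.append(1, "a", false);
  log.append(2, "b", false);
  log.append(3, "c", false);
  log.missing["b"] = 2;      // "b" still needs version 2
  log.reset_complete_to();   // last_complete is now 1
  log.append(4, "b", refresh);
  return log.last_complete;
}
```

In this toy, calling `last_complete_after_overwrite(false)` leaves `last_complete` stuck behind the stale entry for "b", while passing `true` lets it advance past the entries that the overwrite made obsolete, which is what would allow log trimming to proceed.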

xiexingguo · 2018-04-23T12:09:39Z

@jdurgin

neha-ojha

This will probably be useful. Needs to be run against the rados suite to ensure that there are no side effects.

xiexingguo · 2018-04-25T00:53:35Z

@tchaikov

This change is currently triggering crashes. Please drop it from your test branch. :-(

xiexingguo · 2018-04-25T08:33:44Z

@neha-ojha

I appended another fix to address the crashes from @tchaikov's test branch, and the newest test results seem to be positive:

http://pulpito.ceph.com/xxg-2018-04-25_06:14:51-rados:thrash-wip-refresh-lc-distro-basic-smithi/

http://pulpito.ceph.com/xxg-2018-04-25_04:38:44-rados-wip-refresh-lc-distro-basic-smithi/

Mind taking another look?

neha-ojha · 2018-04-25T20:16:58Z

src/osd/ECBackend.cc

  }
  clear_temp_objs(op.temp_removed);
-  dout(30) << __func__ << " missing before " << get_parent()->get_log().get_missing().get_items() << dendl;
+  get_parent()->log_operation(


Is there a reason why we want to move the dout of "missing before"?

Dropped that cosmetic change (I was trying to wrap that line to 80 chars, but I think it is fine to leave it as it is...)

neha-ojha · 2018-04-26T17:20:18Z

src/osd/ReplicatedBackend.cc

    update_snaps = true;
  }

+  parent->update_stats(m->pg_stats);


Moving update_stats() may not be required either? Looks good overall.

The problem is that if we update pg_missing first, **reset_complete_to()** will probably try to move **complete-to** iter to the newest log entry which we haven't persisted into local pg_log list (we haven't called **log_operation()** yet) and hence trigger the coredump below:

```
     0> 2018-04-24 15:42:47.270 7fb4694d5700 -1 /build/ceph-13.0.2-1706-g01c4f53/src/osd/PGLog.h: In function 'void PGLog::reset_complete_to(pg_info_t*)' thread 7fb4694d5700 time 2018-04-24 15:42:47.270648
/build/ceph-13.0.2-1706-g01c4f53/src/osd/PGLog.h: 778: FAILED assert(log.complete_to != log.log.end())

 ceph version 13.0.2-1706-g01c4f53 (01c4f53802f46966433ab1d453afac6db7e5b707) mimic (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fb494648f82]
 2: (()+0x2e5157) [0x7fb494649157]
 3: (non-virtual thunk to PrimaryLogPG::add_local_next_event(pg_log_entry_t const&)+0x478) [0x55d70493aff8]
 4: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0x855) [0x55d704a3ead5]
```

See also: http://qa-proxy.ceph.com/teuthology/kchai-2018-04-24_14:54:28-rados-wip-kefu-testing-2018-04-24-1145-distro-basic-mira/2434437/

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
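
The ordering problem described in this commit message can be sketched with a small toy, again not Ceph's real code. `can_reach_need()` and `handle_write()` are hypothetical helpers invented for this illustration; the log is reduced to a list of integer versions, and "appending to the log" stands in for what the text calls `log_operation()`.

```cpp
#include <cassert>
#include <cstdint>
#include <list>

// Returns true if a front-to-back walk finds an entry at or past `need`.
// Returning false corresponds to the iterator running off the end of the
// log, the condition that the assert in the trace guards against.
bool can_reach_need(const std::list<uint64_t>& log_versions, uint64_t need) {
  for (uint64_t v : log_versions)
    if (v >= need)
      return true;
  return false;
}

// Hypothetical handling of one new write to an already-missing object.
// `log_first` selects whether the new entry is appended to the local log
// before or after the missing set is updated and complete-to is refreshed.
bool handle_write(std::list<uint64_t> log_versions, uint64_t new_version,
                  bool log_first) {
  if (log_first)
    log_versions.push_back(new_version);  // stands in for log_operation()

  // The missing object now needs new_version, so the refresh has to find
  // an entry at or past that version in the local log.
  const bool ok = can_reach_need(log_versions, new_version);

  if (!log_first)
    log_versions.push_back(new_version);  // too late for the refresh above
  return ok;
}
```

With a log holding versions 1 to 3 and a new write at version 4, `handle_write({1, 2, 3}, 4, false)` reports that the walk cannot find the needed entry, whereas `handle_write({1, 2, 3}, 4, true)` succeeds because the entry is already in the log when the refresh runs.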

tchaikov · 2018-04-30T13:52:33Z

http://pulpito.ceph.com/kchai-2018-04-30_00:59:17-rados-wip-kefu-testing-2018-04-29-1248-distro-basic-smithi/2454381/

Could be relevant.

liewegas · 2018-04-30T14:49:20Z

This looks like it was causing scrub errors in my run.

liewegas · 2018-05-20T15:24:01Z

yep, still causing scrub errors http://pulpito.ceph.com/sage-2018-05-18_18:17:59-rados-wip-sage3-testing-2018-05-18-1124-distro-basic-smithi/

xiexingguo added the core label Apr 23, 2018

xiexingguo changed the title ~~osd/PrimaryLogPG: refresh local last-complete iter for async_recovery…~~ osd/PrimaryLogPG: refresh local last-complete iter for async_recovery_targets properly Apr 23, 2018

xiexingguo force-pushed the wip-refresh-lc branch from 4e8f33c to 218caca Compare April 23, 2018 12:13

liewegas added this to the mimic milestone Apr 23, 2018

liewegas requested a review from neha-ojha April 23, 2018 14:00

liewegas added the bug-fix label Apr 23, 2018

neha-ojha approved these changes Apr 23, 2018

neha-ojha added the needs-qa label Apr 23, 2018

tchaikov added the wip-kefu-testing label Apr 24, 2018

xiexingguo added DNM and removed needs-qa wip-kefu-testing labels Apr 25, 2018

xiexingguo removed the DNM label Apr 25, 2018

neha-ojha reviewed Apr 25, 2018

xiexingguo force-pushed the wip-refresh-lc branch from 01c628c to bddf91e Compare April 26, 2018 10:41

neha-ojha reviewed Apr 26, 2018

xiexingguo force-pushed the wip-refresh-lc branch from bddf91e to 2125ee3 Compare April 27, 2018 00:27

xiexingguo added the needs-qa label Apr 27, 2018

liewegas added wip-sage-testing wip-sage2-testing and removed wip-sage-testing wip-sage2-testing labels Apr 27, 2018

tchaikov added the wip-kefu-testing label Apr 28, 2018

tchaikov removed the wip-kefu-testing label Apr 30, 2018

xiexingguo added DNM and removed needs-qa labels May 1, 2018

xiexingguo mentioned this pull request May 2, 2018

osd: calc_min_last_complete_ondisk() should use actingset #21508

Closed

liewegas changed the base branch from master to mimic May 3, 2018 18:17

liewegas changed the base branch from mimic to master May 16, 2018 21:10

liewegas added wip-sage3-testing DNM and removed DNM wip-sage3-testing labels May 16, 2018

liewegas removed this from the mimic milestone May 20, 2018

xiexingguo closed this Aug 28, 2018

xiexingguo deleted the wip-refresh-lc branch August 28, 2018 01:33

xiexingguo wants to merge 2 commits into ceph:master from xiexingguo:wip-refresh-lc