Skip to content

osd/PrimaryLogPG: refresh local last-complete iter for async_recovery_targets properly#21598

Closed
xiexingguo wants to merge 2 commits intoceph:masterfrom
xiexingguo:wip-refresh-lc
Closed

osd/PrimaryLogPG: refresh local last-complete iter for async_recovery_targets properly#21598
xiexingguo wants to merge 2 commits intoceph:masterfrom
xiexingguo:wip-refresh-lc

Conversation

@xiexingguo
Copy link
Copy Markdown
Member

@xiexingguo xiexingguo commented Apr 23, 2018

Upon overwritting an already-missing object on async_recovery_targets,
besides updating the missing set async_recovery_targets should
call reset_complete_to() simultaneously to reset complete-to iter and
advance last_complete properly, as those old log entries which we used
to rely on for recovery are now getting replaced by new committed ones.

Since the recovery process of an async_recovery_target can be arbitrarily
delayed until we hit the osd_max_pg_log_entries limit, the main befenit
of an extra call to reset_complete_to() here is to make sure primary perform
log-trim in time. Otherwise accumulating of huge amounts of obsolete log
entries can slow the system down obviously.

Signed-off-by: xiexingguo xie.xingguo@zte.com.cn

@xiexingguo xiexingguo changed the title osd/PrimaryLogPG: refresh local last-complete iter for async_recovery… osd/PrimaryLogPG: refresh local last-complete iter for async_recovery_targets properly Apr 23, 2018
@xiexingguo
Copy link
Copy Markdown
Member Author

@jdurgin

…_targets properly

Upon overwritting an already-missing object on async_recovery_targets,
besides updating the **missing** set async_recovery_targets should
call reset_complete_to() simultaneously to reset **complete-to** iter and
advance **last_complete** properly, as those old log entries which we used
to rely on for recovery are now getting replaced by new **committed** ones.

Since the recovery process of an async_recovery_target can be arbitrarily
delayed until we hit the osd_max_pg_log_entries limit, the main befenit
of an extra call to reset_complete_to() here is to make sure primary perform
log-trim in time. Otherwise accumulating of huge amounts of **obsolete** log
entries can slow the system down obviously.

Signed-off-by: xiexingguo <xie.xingguo@zte.com.cn>
@liewegas liewegas added this to the mimic milestone Apr 23, 2018
@liewegas liewegas requested a review from neha-ojha April 23, 2018 14:00
Copy link
Copy Markdown
Member

@neha-ojha neha-ojha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will probably be useful. Needs to be run against the rados suite to ensure that there are no side effects.

@xiexingguo
Copy link
Copy Markdown
Member Author

@tchaikov

This change is currently triggering crashes. Please drop it from your test branch. :-(

@xiexingguo
Copy link
Copy Markdown
Member Author

@neha-ojha

I appended another fix to address the crashes from @tchaikov 's test branch and the newest test results seem to be positive:

http://pulpito.ceph.com/xxg-2018-04-25_06:14:51-rados:thrash-wip-refresh-lc-distro-basic-smithi/

http://pulpito.ceph.com/xxg-2018-04-25_04:38:44-rados-wip-refresh-lc-distro-basic-smithi/

Mind taking another glance at?

@xiexingguo xiexingguo removed the DNM label Apr 25, 2018
}
clear_temp_objs(op.temp_removed);
dout(30) << __func__ << " missing before " << get_parent()->get_log().get_missing().get_items() << dendl;
get_parent()->log_operation(
Copy link
Copy Markdown
Member

@neha-ojha neha-ojha Apr 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we want to move the dout of "missing before"?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dropped that cosmetic change ( I was trying to wrap that line into 80 chars, but I think it is fine to leave as it is...)

update_snaps = true;
}

parent->update_stats(m->pg_stats);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update_stats() may not be required to move as well? looks good overall.

The problem is that if we update pg_missing first, **reset_complete_to()**
will probably try to move **complete-to** iter to the newest log entry
which we haven't persisted into local pg_log list (we haven't called
**log_operation()** yet) and hence trigger the coredump below:
```
0> 2018-04-24 15:42:47.270 7fb4694d5700 -1 /build/ceph-13.0.2-1706-g01c4f53/src/osd/PGLog.h: In function 'void PGLog::reset_complete_to(pg_info_t*)' thread 7fb4694d5700 time 2018-04-24 15:42:47.270648
/build/ceph-13.0.2-1706-g01c4f53/src/osd/PGLog.h: 778: FAILED assert(log.complete_to != log.log.end())

 ceph version 13.0.2-1706-g01c4f53 (01c4f53802f46966433ab1d453afac6db7e5b707) mimic (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fb494648f82]
 2: (()+0x2e5157) [0x7fb494649157]
 3: (non-virtual thunk to PrimaryLogPG::add_local_next_event(pg_log_entry_t const&)+0x478) [0x55d70493aff8]
 4: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0x855) [0x55d704a3ead5]
```

See also:
http://qa-proxy.ceph.com/teuthology/kchai-2018-04-24_14:54:28-rados-wip-kefu-testing-2018-04-24-1145-distro-basic-mira/2434437/

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
@tchaikov
Copy link
Copy Markdown
Contributor

@liewegas
Copy link
Copy Markdown
Member

This looks like itw as causing scrub errors in my run

@xiexingguo xiexingguo added DNM and removed needs-qa labels May 1, 2018
@liewegas liewegas changed the base branch from master to mimic May 3, 2018 18:17
@liewegas liewegas changed the base branch from mimic to master May 16, 2018 21:10
@liewegas
Copy link
Copy Markdown
Member

@liewegas liewegas removed this from the mimic milestone May 20, 2018
@xiexingguo xiexingguo closed this Aug 28, 2018
@xiexingguo xiexingguo deleted the wip-refresh-lc branch August 28, 2018 01:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants