mon: Create a local copy of scrub_state to avoid a crash#59050
mon: Create a local copy of scrub_state to avoid a crash#59050
Conversation
|
jenkins test make check |
|
I don't quite understand how bootstrap could interfere with scrub. Can you expand on that relationship a bit? Does it actually make sense for bootstrap and scrub to occur at the same time? |
|
I think the problem is actually about // called by bootstrap(), or on leader|peon -> electing
void Monitor::_reset()
{
dout(10) << __func__ << dendl;
// disable authentication
{
std::lock_guard l(auth_lock);
authmon()->_set_mon_num_rank(0, 0);
}
// ...
scrub_reset();
// ...which can happen when a scrub is ongoing.
void Monitor::handle_scrub(MonOpRequestRef op)
{
MMonScrub *m = static_cast<MMonScrub*>(op->get_req());
dout(10) << __func__ << " " << *m << dendl;
switch (m->op) {
case MMonScrub::OP_SCRUB:
{
if (!is_peon())
break;
// ...
case MMonScrub::OP_RESULT:
{
if (!is_leader())
break;but this doesn't prevent the race with |
Thanks Radek for replying the same. For specific to test case below are the commands those are hitting above stacktrace ceph mon set election_strategy disallow |
Sorry not previous one below is the right stacktrace |
src/mon/Monitor.cc
Outdated
| // it might be possible the scrub_state is reset via bootstrap | ||
| // code path and daemon is crashed. | ||
| std::shard_ptr<ScrubState> local_scrub_state = scrub_state; | ||
| if (nullptr == local_scrub_state) { |
There was a problem hiding this comment.
scrub_version and scrub_results are also modified by Monitor::scrub_reset(), so similar race affects them.
Perhaps we could group these members to e.g. ScrubContext and manage a pointer with the atomic shared ptr. Alternatively, a lock can be used but – as the scrub can take some time – better to have a dedicated one (like Monitor::auth_lock).
|
jenkins test make check |
rzarzynski
left a comment
There was a problem hiding this comment.
Generally looks good; just some nits.
src/mon/Monitor.cc
Outdated
| return -EBUSY; | ||
| if (scrub_con.load() != nullptr) { | ||
| std::shared_ptr<ScrubContext> local_con = scrub_con.load(); | ||
| if (local_con) { |
There was a problem hiding this comment.
local_con sounds like local connection but I believe the idea was about context. Perhaps local_context or even local_ctx would do.
src/mon/Monitor.h
Outdated
| scrub_state.reset(); | ||
| } | ||
| }; | ||
| std::atomic<std::shared_ptr<ScrubContext>> scrub_con; ///< keeps track of scrub context |
There was a problem hiding this comment.
scrub_context or scrub_ctx.
src/mon/Monitor.h
Outdated
| std::shared_ptr<ScrubState> scrub_state; ///< keeps track of current scrub | ||
|
|
||
| struct ScrubContext { | ||
| std::shared_ptr<ScrubState> scrub_state; ///< keeps track of current scrub |
There was a problem hiding this comment.
Do we really need this to be a pointer?
src/mon/Monitor.cc
Outdated
| if (!scrub_result.empty()) { | ||
| clog->info() << "scrub already in progress"; | ||
| return -EBUSY; | ||
| if (scrub_con.load() != nullptr) { |
There was a problem hiding this comment.
Do we need the double check? How about:
| if (scrub_con.load() != nullptr) { | |
| if (std::shared_ptr<ScrubContext> local_ctx = scrub_con.load(); scrub_ctx.load() != nullptr) { |
0782521 to
1e1601a
Compare
|
jenkins test make check |
src/mon/Monitor.cc
Outdated
| if (!scrub_result.empty()) { | ||
| clog->info() << "scrub already in progress"; | ||
| return -EBUSY; | ||
| if (std::shared_ptr<ScrubContext> local_ctx = scrub_con.load(); scrub_con.load() != |
There was a problem hiding this comment.
There are 2 loads from scrub_con; they can race. I think the intention was to compare just local_ctx:
| if (std::shared_ptr<ScrubContext> local_ctx = scrub_con.load(); scrub_con.load() != | |
| if (std::shared_ptr<ScrubContext> local_ctx = scrub_con.load(); local_ctx) { |
|
jenkins test make check |
rzarzynski
left a comment
There was a problem hiding this comment.
LGTM! Thank you, @mohit84!
src/mon/Monitor.h
Outdated
| scrub_result.clear(); | ||
| } | ||
| }; | ||
| std::atomic<std::shared_ptr<ScrubContext>> scrub_con; ///< keeps track of scrub_context |
There was a problem hiding this comment.
Hey @mohit84 there is a build failure related to this line:
/usr/include/c++/11/atomic: In instantiation of ‘struct std::atomic<std::shared_ptr<Monitor::ScrubContext> >’:
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.3.0-5886-gc279fd8b/rpm/el9/BUILD/ceph-19.3.0-5886-gc279fd8b/src/mon/Monitor.h:360:46: required from here
/usr/include/c++/11/atomic:211:21: error: static assertion failed: std::atomic requires a trivially copyable type
211 | static_assert(__is_trivially_copyable(_Tp),
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/11/atomic:211:21: note: ‘false’ evaluates to false
make[2]: *** [src/mon/CMakeFiles/mon.dir/build.make:398: src/mon/CMakeFiles/mon.dir/NVMeofGwMap.cc.o] Error 1
There was a problem hiding this comment.
There was a problem hiding this comment.
As per error it seems std::atomic only works with trivially copyable type and std::shared_ptr is not trivially copyable. As per cpp reference for c++20 it should work fine but not sure why it is not working?
@rzarzynski
Do you think we need to use mutex instead of using atomic to avoid the same?
There was a problem hiding this comment.
I think the reason is buggy / laggy compiler.
-- The CXX compiler identification is GNU 11.2.0
-- The C compiler identification is GNU 11.2.0
...
/usr/include/c++/11/atomic: In instantiation of 'struct std::atomic<std::shared_ptr<Monitor::ScrubContext> >':
/build/ceph-19.3.0-5886-gc4040e92/src/mon/Monitor.h:360:46: required from here
/usr/include/c++/11/atomic:211:21: error: static assertion failed: std::atomic requires a trivially copyable type
211 | static_assert(__is_trivially_copyable(_Tp),
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
godbolt.org shows GCC 11 is unable to compile even:
std::atomic<std::shared_ptr<int>> atomic_ptr_test;complaining about non-trivial copy-constructibility. I bet the reason is GCC simply lacks the partial specialization for shared_ptr<T>. GCC 12 is fine.
The workaround could be https://www.boost.org/doc/libs/1_75_0/libs/smart_ptr/doc/html/smart_ptr.html#atomic_shared_ptr
There was a problem hiding this comment.
Thanks, I have changed the code and use boost::atomic_shared_ptr.
|
jenkins test make check |
|
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
|
jenkins test make check |
rzarzynski
left a comment
There was a problem hiding this comment.
Just nits, LGTM otherwise.
src/mon/Monitor.cc
Outdated
|
|
||
| boost::shared_ptr<ScrubContext> local_ctx = scrub_con.load(); | ||
| if (!local_ctx) { | ||
| ++errors; |
src/mon/Monitor.h
Outdated
| scrub_result.clear(); | ||
| } | ||
| }; | ||
| boost::atomic_shared_ptr<ScrubContext> scrub_con; ///< keeps track of scrub_context |
There was a problem hiding this comment.
nit: is there any reason for not naming it as scrub_ctx? _con suggest rather a connection which is misleading.
|
@mohit84: it looks there is some (trivial) problem from the renaming: |
| scrub_result.clear(); | ||
| } | ||
| }; | ||
| boost::atomic_shared_ptr<ScrubContext> scrub_ctx; ///< keeps track of scrub_context |
src/mon/Monitor.cc
Outdated
| if (scrub_state->finished) { | ||
| local_ctx->scrub_result.clear(); | ||
| if (local_ctx->scrub_state.finished) | ||
| const utime_t lat = ceph_clock_now() - scrub_state->start; |
There was a problem hiding this comment.
Oh, here is the residue.
| const utime_t lat = ceph_clock_now() - scrub_state->start; | |
| const utime_t lat = ceph_clock_now() - local_ctx->scrub_state->start; |
Thanks for pointing out the build error. |
|
jenkins test make check |
|
jenkins test make check |
…Context
During handle of scrub operation if scrub_state is reset via
bootstrap then a process might be crash.
Solution: Encapsulate all scrub related objects into a single
ScrubContext and manage it as via boost::atomic_shared_ptr.
Fixes: https://tracker.ceph.com/issues/67270
Credits: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
|
jenkins test make check |
2 similar comments
|
jenkins test make check |
|
jenkins test make check |
|
jenkins test make check arm64 |
|
jenkins test make check arm64 |
During handle of scrub operation if scrub_state is reset via bootstrap then a process might be crash.
Solution: Create a local copy of scrub_state shared_ptr. By doing this the refcounter will be increment
and object will not be destroyed in that scope.
Fixes: https://tracker.ceph.com/issues/67270
Credits: Radoslaw Zarzynski rzarzyns@redhat.com
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an
xbetween the brackets:[x]. Spaces and capitalization matter when checking off items this way.Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard cephadmjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume toxjenkins test windowsjenkins test rook e2e