Bug #61651: entries.begin()->version > info.last_update
Status: Closed
Description
The OSD is failing with:
DEBUG 2023-06-12 18:03:25,954 [shard 0] osd - op_trim_to = <null>, op_roll_forward_to = <null>
DEBUG 2023-06-12 18:03:25,954 [shard 1] osd - pg_epoch 281 pg[2.4( empty local-lis/les=13/14 n=0 ec=13/13 lis/c=13/13 les/c/f=14/14/0 sis=13) [2,3] r=0 lpr=13 crt=0'0 mlcod 0'0 active+clean PeeringState::recalc_readable_until peer osd.3 ruub 426.395782471s
DEBUG 2023-06-12 18:03:25,954 [shard 1] osd - pg_epoch 281 pg[2.4( empty local-lis/les=13/14 n=0 ec=13/13 lis/c=13/13 les/c/f=14/14/0 sis=13) [2,3] r=0 lpr=13 crt=0'0 mlcod 0'0 active+clean PeeringState::recalc_readable_until readable_until[_ub] 426.395782471s (sent 434.396270752s)
DEBUG 2023-06-12 18:03:25,954 [shard 1] osd - log is not dirty
DEBUG 2023-06-12 18:03:25,954 [shard 1] osd - peering_event(id=96136, detail=PeeringEvent(from=? pgid=97.1a sent=280 requested=280 evt=epoch_sent: 280 epoch_requested: 280 NullEvt +create_info)): have_pg
DEBUG 2023-06-12 18:03:25,954 [shard 1] osd - peering_event(id=96136, detail=PeeringEvent(from=? pgid=97.1a sent=280 requested=280 evt=epoch_sent: 280 epoch_requested: 280 NullEvt +create_info)): pg present
ERROR 2023-06-12 18:03:25,955 [shard 0] none - /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-4374-g0625ecb2/rpm/el8/BUILD/ceph-18.0.0-4374-g0625ecb2/src/osd/PeeringState.cc:4038 : In function 'bool PeeringState::append_log_entries_update_missing(mempool::osd_pglog::list<pg_log_entry_t>&, ObjectStore::Transaction&, std::optional<eversion_t>, std::optional<eversion_t>)', ceph_assert(%s)
entries.begin()->version > info.last_update
DEBUG 2023-06-12 18:03:25,955 [shard 1] osd - peering_event(id=96617, detail=PeeringEvent(from=3 pgid=2.4 sent=274 requested=274 evt=epoch_sent: 274 epoch_requested: 274 MLeaseAck epoch 274 from osd.3 pg_lease_ack(ruub 426.395782471s))): submitting ctx
DEBUG 2023-06-12 18:03:25,955 [shard 1] osd - ShardServices::dispatch_context_transaction: empty transaction
DEBUG 2023-06-12 18:03:25,955 [shard 1] osd - do_peering_event handling epoch_sent: 280 epoch_requested: 280 NullEvt +create_info for pg: 97.1a
Aborting on shard 0.
Backtrace:
0# gsignal in /lib64/libc.so.6
1# abort in /lib64/libc.so.6
2# ceph::__ceph_assert_fail(char const*, char const*, int, char const*) in ceph-osd
3# PeeringState::append_log_entries_update_missing(std::__cxx11::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t)22, pg_log_entry_t> > const&, ceph::os::Transaction&, std::optional<eversion_t>, std::optional<eversion_t>) in ceph-osd
4# crimson::osd::PG::do_update_log_missing(boost::intrusive_ptr<MOSDPGUpdateLogMissing>, crimson::local_shared_foreign_ptr<seastar::shared_ptr<crimson::net::Connection> >) in ceph-osd
5# 0x0000560D90B9150C in ceph-osd
6# 0x0000560D90B93320 in ceph-osd
7# 0x0000560D90B94932 in ceph-osd
8# crimson::osd::LogMissingRequest::with_pg(crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::PG>) in ceph-osd
9# seastar::noncopyable_function<crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::_future<crimson::errorated_future_marker<void> > (seastar::future<boost::intrusive_ptr<crimson::osd::PG> >&&)>::direct_vtable_for<seastar::future<boost::intrusive_ptr<crimson::osd::PG> >::then_wrapped_maybe_erase<false, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::_future<crimson::errorated_future_marker<void> >, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<crimson::osd::PG> > >::safe_then<crimson::osd::PGShardManager::run_with_pg_maybe_wait<crimson::osd::LogMissingRequest>(crimson::osd::LogMissingRequest::IRef)::{lambda(crimson::osd::PerShardState&, crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::LogMissingRequest>)#1}::operator()(crimson::osd::PerShardState&, crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::LogMissingRequest>) const::{lambda(boost::intrusive_ptr<crimson::osd::PG>)#2}, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::pass_further>(crimson::osd::PGShardManager::run_with_pg_maybe_wait<crimson::osd::LogMissingRequest>(crimson::osd::LogMissingRequest::IRef)::{lambda(crimson::osd::PerShardState&, crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::LogMissingRequest>)#1}::operator()(crimson::osd::PerShardState&, crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::LogMissingRequest>) const::{lambda(boost::intrusive_ptr<crimson::osd::PG>)#2}&&, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::pass_further&&)::{lambda(crimson::osd::PGShardManager::run_with_pg_maybe_wait<crimson::osd::LogMissingRequest>(auto:1::IRef)::{lambda(crimson::osd::PerShardState&, crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::LogMissingRequest>)#1}::operator()(crimson::osd::PerShardState&, crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::LogMissingRequest>) const::{lambda(boost::intrusive_ptr<crimson::osd::PG>)#2})#1}>(crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<crimson::osd::PG> > >::safe_then<crimson::osd::PGShardManager::run_with_pg_maybe_wait<crimson::osd::LogMissingRequest>(crimson::osd::LogMissingRequest::IRef)::{lambda(crimson::osd::PerShardState&, crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::LogMissingRequest>)#1}::operator()(crimson::osd::PerShardState&, crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::LogMissingRequest>) const::{lambda(boost::intrusive_ptr<crimson::osd::PG>)#2}, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::pass_further>(crimson::osd::PGShardManager::run_with_pg_maybe_wait<crimson::osd::LogMissingRequest>(crimson::osd::LogMissingRequest::IRef)::{lambda(crimson::osd::PerShardState&, crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::LogMissingRequest>)#1}::operator()(crimson::osd::PerShardState&, crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::LogMissingRequest>) const::{lambda(boost::intrusive_ptr<crimson::osd::PG>)#2}&&, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > 
>::pass_further&&)::{lambda(crimson::osd::PGShardManager::run_with_pg_maybe_wait<crimson::osd::LogMissingRequest>(auto:1::IRef)::{lambda(crimson::osd::PerShardState&, crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::LogMissingRequest>)#1}::operator()(crimson::osd::PerShardState&, crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::LogMissingRequest>) const::{lambda(boost::intrusive_ptr<crimson::osd::PG>)#2})#1}&&)::{lambda(seastar::future<boost::intrusive_ptr<crimson::osd::PG> >&&)#1}>::call(seastar::noncopyable_function<crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::_future<crimson::errorated_future_marker<void> > (seastar::future<boost::intrusive_ptr<crimson::osd::PG> >&&)> const*, seastar::future<boost::intrusive_ptr<crimson::osd::PG> >&&) in ceph-osd
10# seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::_future<crimson::errorated_future_marker<void> > (seastar::future<boost::intrusive_ptr<crimson::osd::PG> >&&)>, seastar::future<boost::intrusive_ptr<crimson::osd::PG> >::then_wrapped_nrvo<crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::_future<crimson::errorated_future_marker<void> >, seastar::noncopyable_function<crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::_future<crimson::errorated_future_marker<void> > (seastar::future<boost::intrusive_ptr<crimson::osd::PG> >&&)> >(seastar::noncopyable_function<crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::_future<crimson::errorated_future_marker<void> > (seastar::future<boost::intrusive_ptr<crimson::osd::PG> >&&)>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<125> > >::_future<crimson::errorated_future_marker<void> > (seastar::future<boost::intrusive_ptr<crimson::osd::PG> >&&)>&, seastar::future_state<boost::intrusive_ptr<crimson::osd::PG> >&&)#1}, boost::intrusive_ptr<crimson::osd::PG> >::run_and_dispose() in ceph-osd
11# 0x0000560D9D07DE2B in ceph-osd
rados_api_tests:
https://pulpito.ceph.com/matan-2023-06-12_17:29:11-crimson-rados-wip-matanb-crimson-testing-bug-61504-v3-distro-crimson-smithi/7301383/
rbd_api_tests:
https://pulpito.ceph.com/matan-2023-06-12_17:29:11-crimson-rados-wip-matanb-crimson-testing-bug-61504-v3-distro-crimson-smithi/7301385
rbd_api_tests_old_format:
https://pulpito.ceph.com/matan-2023-06-12_17:29:11-crimson-rados-wip-matanb-crimson-testing-bug-61504-v3-distro-crimson-smithi/7301394/
https://pulpito.ceph.com/matan-2023-06-12_17:29:11-crimson-rados-wip-matanb-crimson-testing-bug-61504-v3-distro-crimson-smithi/7301396/
grep "is failing for" to find the relevant osd crash.
Updated by Matan Breizman almost 3 years ago
Updated by Matan Breizman over 2 years ago
- Status changed from New to In Progress
- Assignee set to Matan Breizman
- Pull request ID set to 52192
Updated by Samuel Just over 2 years ago
I think an interesting question is why we only see this reordering with LogMissingRequest messages. If this were simply a bug in messenger send ordering, I'd expect to see it with MOSDRepOp messages as well.
ClientRequest has two pipeline stages that are relevant here: Process (OrderedExclusivePhaseT) and WaitRepop (OrderedConcurrentPhaseT). The idea is that we stay in the Process stage until all of the work of processing the op up to and including sending the MOSDRepOp (or LogMissingRequest) is complete (and associated futures resolve). We then go into the WaitRepop stage to wait for the responses to come back without blocking other ops from processing or waiting.
To that end, PG::do_osd_ops returns two futures, one representing the processing portion (including sending the requests to the replicas) and the second representing the full completion of the op. ClientRequest::do_process waits for the first before moving to the WaitRepop stage:
return pg->do_osd_ops(m, conn, obc, op_info, snapc).safe_then_unpack_interruptible(
  [this, pg, &ihref](auto submitted, auto all_completed) mutable {
    return submitted.then_interruptible([this, pg, &ihref] {
      return ihref.enter_stage<interruptor>(client_pp(*pg).wait_repop, *this);
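In other words, the contract looks roughly like this, as a minimal sketch in plain std::future terms (the type and function names are invented for illustration and are not crimson's actual interruptible/errorator types): the caller blocks on submitted while still in the exclusive Process stage, and only afterwards waits on all_completed, which is what the concurrent WaitRepop stage does:

// Minimal sketch with plain std::future; the names here are invented for
// illustration and are not crimson's actual interruptible/errorator types.
#include <future>

struct RepOpFutures {
  std::future<void> submitted;      // resolves once the replica messages were sent
  std::future<void> all_completed;  // resolves once all replica acks arrived
};

RepOpFutures do_osd_ops_sketch() {
  std::promise<void> sent, acked;
  RepOpFutures f{sent.get_future(), acked.get_future()};
  sent.set_value();   // pretend the MOSDRepOp / LogMissingRequest send finished
  acked.set_value();  // pretend the replica acks already came back
  return f;
}

int main() {
  auto [submitted, all_completed] = do_osd_ops_sketch();
  submitted.get();     // still inside the exclusive Process stage
  all_completed.get(); // only now, in the concurrent WaitRepop stage
}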
PG::do_osd_ops invokes PG::do_osd_ops_execute with success and failure callbacks, but the return value comes directly from PG::do_osd_ops_execute. I think the problem is in the error pathway at the end of do_osd_ops_execute:
}, OpsExecuter::osd_op_errorator::all_same_way(
  [rollbacker, failure_func_ptr]
  (const std::error_code& e) mutable {
    return PG::do_osd_ops_iertr::make_ready_future<pg_rep_op_fut_t<Ret>>(
      seastar::now(),
      e.value() == ENOENT ? (*failure_func_ptr)(e) :
      rollbacker.rollback_obc_if_modified(e).then_interruptible(
        [e, failure_func_ptr] {
          return (*failure_func_ptr)(e);
        }));
}));
Notice that the first of the two futures returned is seastar::now() -- that corresponds to the submitted future in the previous snippet. The return value of (*failure_func_ptr)(e) (chained after the rollback, if any) ends up in all_completed. I think this is our problem: (*failure_func_ptr)(e) contains the logic that sends the LogMissingRequest message to the replicas, so that send won't happen until the WaitRepop stage, which isn't exclusive, so successive ops can reorder. The failure path thus packs two different things into the failure function: sending the LogMissingRequest messages and waiting for their responses. Those two things need to be split up the way they are on the success path.
The way PG::do_osd_ops's two overloads interact with do_osd_ops_execute is probably going to need to be rethought -- the MOSDRepOp and LogMissingRequest pathways are too asymmetric at the moment.
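Roughly, in plain std::future terms (invented names, not crimson's API, and not necessarily what the eventual fix looks like), the difference between the current error-path shape and the suggested split is:

// Illustration only: helper/promise names are made up and this is not the
// actual change; it just mirrors the shape of the two returned futures.
#include <future>

struct RepOpFutures {
  std::future<void> submitted;      // must cover everything ordered under Process
  std::future<void> all_completed;  // may finish later, in the concurrent WaitRepop stage
};

// Current error path: submitted is an already-ready future (seastar::now()),
// while sending the LogMissingRequest *and* waiting for its acks are both
// bundled into all_completed, so the send can slip out of the exclusive stage
// and reorder against later ops.
RepOpFutures error_path_today() {
  std::promise<void> noop, send_and_wait;
  noop.set_value();           // nothing guaranteed before leaving Process
  send_and_wait.set_value();  // stand-in for send + wait-for-acks, done later
  return {noop.get_future(), send_and_wait.get_future()};
}

// Suggested split: resolve submitted only once the send has happened, the way
// the success path already does; leave only the ack wait to all_completed.
RepOpFutures error_path_suggested() {
  std::promise<void> sent, acked;
  sent.set_value();   // stand-in for "LogMissingRequest sent to replicas"
  acked.set_value();  // stand-in for "all replica acks received"
  return {sent.get_future(), acked.get_future()};
}

int main() {
  auto fixed = error_path_suggested();
  fixed.submitted.get();      // ready only because the (stand-in) send completed
  fixed.all_completed.get();  // waited for from the concurrent stage
}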
Updated by Yingxin Cheng over 2 years ago
https://pulpito.ceph.com/yingxin-2023-11-27_09:03:31-crimson-rados-wip-yingxin-crimson-make-crosscore-send-ordered-distro-default-smithi/7467522/
https://pulpito.ceph.com/yingxin-2023-11-27_09:03:31-crimson-rados-wip-yingxin-crimson-make-crosscore-send-ordered-distro-default-smithi/7467528/
https://pulpito.ceph.com/yingxin-2023-11-27_09:03:31-crimson-rados-wip-yingxin-crimson-make-crosscore-send-ordered-distro-default-smithi/7467545/
Updated by Matan Breizman about 2 years ago
- Status changed from In Progress to Resolved
- Pull request ID changed from 52192 to 54287
Updated by Upkeep Bot 8 months ago
- Merge Commit set to 3d760f113bd9f3a2e513474d34227633cd06f1ea
- Fixed In set to v18.0.0-7550-g3d760f113bd
- Released In set to v19.2.0~1161
- Upkeep Timestamp set to 2025-07-12T07:38:46+00:00