Bug #70501

Backfill: PeeringState::Recovered assert !ps->needs_recovery()

Added by Matan Breizman about 1 year ago. Updated 4 days ago.

Status:
Fix Under Review
Priority:
Urgent
Category:
-
Target version:
-
% Done:

0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

http://qa-proxy.ceph.com/teuthology/matan-2025-03-16_14:34:08-crimson-rados-wip-matanb-crimson-complete-write-distro-crimson-smithi/8193834/teuthology.log

Found on a newly added test case, "radosbench-even-higher-concurrency", combined with "short_pg_log":

short_pg_log clusters/{fixed-2} objectstore/bluestore thrashers/default thrashosds-health workloads/radosbench-even-higher-concurrency

osd.2:

DEBUG 2025-03-16 15:37:02,790 [shard 1:main] osd - do_peering_event handling epoch_sent: 280 epoch_requested: 280 Backfilled for pg: 7.d
DEBUG 2025-03-16 15:37:02,790 [shard 2:main] osd - do_peering_event handling epoch_sent: 280 epoch_requested: 280 MQuery 7.a from 3 query_epoch 280 query: query(info 0'0 epoch_sent 280) for pg: 7.a
DEBUG 2025-03-16 15:37:02,790 [shard 1:main] osd -  pg_epoch 280 pg[7.d( v 280'1726 (251'976,280'1726] local-lis/les=253/254 n=22 ec=232/232 lis/c=253/239 les/c/f=254/240/0 sis=253) [2,1,3]/[2,1] backfill=[3] r=0 lpr=253 pi=[239,253)/1 pct=251'976 lua=251'976 crt=280'1726 lcod 0'0 mlcod 0'0 active+undersized+remapped+backfilling  PG::cancel_local_background_io_reservation: 
DEBUG 2025-03-16 15:37:02,790 [shard 1:main] osd - ShardServices::local_cancel_reservation: sending to singleton pgid 7.d
DEBUG 2025-03-16 15:37:02,790 [shard 2:main] osd -  pg_epoch 280 pg[7.a( v 279'1869 (274'1824,279'1869] local-lis/les=253/254 n=39 ec=232/232 lis/c=253/239 les/c/f=254/240/0 sis=280) [3,2,0] r=1 lpr=280 pi=[239,280)/1 crt=279'1869 unknown NOTIFY pruub 187.193038940s@ PeeringState::update_history advanced history from ec=232/232 lis/c=253/239 les/c/f=254/240/0 sis=280
DEBUG 2025-03-16 15:37:02,790 [shard 1:main] osd -  pg_epoch 280 pg[7.d( v 280'1726 (251'976,280'1726] local-lis/les=253/254 n=22 ec=232/232 lis/c=253/239 les/c/f=254/240/0 sis=253) [2,1,3]/[2,1] backfill=[3] r=0 lpr=253 pi=[239,253)/1 pct=251'976 lua=251'976 crt=280'1726 lcod 0'0 mlcod 0'0 active+undersized+remapped+backfilling  PG::send_cluster_message: message MBackfillReserve(7.d RELEASE e280/280) v5 to 3 share_map_update false
DEBUG 2025-03-16 15:37:02,790 [shard 2:main] osd -  pg_epoch 280 pg[7.a( v 279'1869 (274'1824,279'1869] local-lis/les=253/254 n=39 ec=232/232 lis/c=253/239 les/c/f=254/240/0 sis=280) [3,2,0] r=1 lpr=280 pi=[239,280)/1 crt=279'1869 unknown NOTIFY PeeringState::fulfill_info sending info
INFO  2025-03-16 15:37:02,790 [shard 1:main] osd -  pg_epoch 280 pg[7.d( v 280'1726 (251'976,280'1726] local-lis/les=253/254 n=22 ec=232/232 lis/c=253/239 les/c/f=254/240/0 sis=253) [2,1,3]/[2,1] backfill=[3] r=0 lpr=253 pi=[239,253)/1 pct=251'976 lua=251'976 crt=280'1726 lcod 0'0 mlcod 0'0 active+undersized+remapped+backfilling exit Started/Primary/Active/Backfilling 6.308049 15 0.016434
INFO  2025-03-16 15:37:02,790 [shard 1:main] osd - Exiting state: Started/Primary/Active/Backfilling, entered at 2025-03-16T15:36:56.482721+0000, 0.016434 spent on 15 events
INFO  2025-03-16 15:37:02,790 [shard 1:main] osd -  pg_epoch 280 pg[7.d( v 280'1726 (251'976,280'1726] local-lis/les=253/254 n=22 ec=232/232 lis/c=253/239 les/c/f=254/240/0 sis=253) [2,1,3]/[2,1] backfill=[3] r=0 lpr=253 pi=[239,253)/1 pct=251'976 lua=251'976 crt=280'1726 lcod 0'0 mlcod 0'0 active+undersized+remapped enter Started/Primary/Active/Recovered
DEBUG 2025-03-16 15:37:02,790 [shard 2:main] osd - log is not dirty
DEBUG 2025-03-16 15:37:02,790 [shard 2:main] osd - log is not dirty
INFO  2025-03-16 15:37:02,790 [shard 1:main] osd - Entering state: Started/Primary/Active/Recovered
DEBUG 2025-03-16 15:37:02,790 [shard 1:main] osd -  pg_epoch 280 pg[7.d( v 280'1726 (251'976,280'1726] local-lis/les=253/254 n=22 ec=232/232 lis/c=253/239 les/c/f=254/240/0 sis=253) [2,1,3]/[2,1] backfill=[3] r=0 lpr=253 pi=[239,253)/1 pct=251'976 lua=251'976 crt=280'1726 lcod 0'0 mlcod 0'0 active+undersized+remapped PeeringState::needs_recovery osd.3 has 1 missing
DEBUG 2025-03-16 15:37:02,791 [shard 2:main] osd - 0x5030007234a0 PeeringEvent<T>::complete_rctx: peering_event(id=16008, detail=PeeringEvent(from=3 pgid=7.a sent=280 requested=280 evt=epoch_sent: 280 epoch_requested: 280 MQuery 7.a from 3 query_epoch 280 query: query(info 0'0 epoch_sent 280))): submitting ctx
DEBUG 2025-03-16 15:37:02,791 [shard 2:main] osd - OSDSingletonState::dispatch_context_transaction: do_transaction ...
ERROR 2025-03-16 15:37:02,791 [shard 1:main] none - /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.0.0-520-gd45b51e8/rpm/el9/BUILD/ceph-20.0.0-520-gd45b51e8/src/osd/PeeringState.cc:5890 : In function 'PeeringState::Recovered::Recovered(boost::statechart::state<PeeringState::Recovered, PeeringState::Active>::my_context)', ceph_assert(%s)
!ps->needs_recovery()
DEBUG 2025-03-16 15:37:02,791 [shard 2:main] alienstore - do_transaction_no_callbacks
Aborting on shard 1.
Backtrace:
 3# ceph::__ceph_assert_fail(ceph::assert_data const&) in ceph-osd
 4# PeeringState::Recovered::Recovered(boost::statechart::state<PeeringState::Recovered, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context) in ceph-osd
 5# boost::statechart::state<PeeringState::Recovered, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr<PeeringState::Active> const&, boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>&) in ceph-osd
 6# boost::statechart::state<PeeringState::Recovered, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::deep_construct(boost::intrusive_ptr<PeeringState::Active> const&, boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>&) in ceph-osd
 7# boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PeeringState::Backfilling, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::transit_impl<PeeringState::Recovered, PeeringState::PeeringMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&) in ceph-osd
 8# boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PeeringState::Backfilling, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::transit<PeeringState::Recovered>() in ceph-osd
 9# PeeringState::Backfilling::react(PeeringState::Backfilled const&) in ceph-osd
10# boost::statechart::detail::reaction_result boost::statechart::custom_reaction<PeeringState::Backfilled>::react<PeeringState::Backfilling, boost::statechart::event_base, void const*>(PeeringState::Backfilling&, boost::statechart::event_base const&, void const* const&) in ceph-osd
11# boost::statechart::detail::reaction_result boost::statechart::simple_state<PeeringState::Backfilling, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list<boost::statechart::custom_reaction<PeeringState::Backfilled>, boost::statechart::custom_reaction<DeferBackfill>, boost::statechart::custom_reaction<PeeringState::UnfoundBackfill>, boost::statechart::custom_reaction<RemoteReservationRejectedTooFull>, boost::statechart::custom_reaction<RemoteReservationRevokedTooFull>, boost::statechart::custom_reaction<RemoteReservationRevoked>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, boost::statechart::simple_state<PeeringState::Backfilling, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PeeringState::Backfilling, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*) in ceph-osd
12# boost::statechart::simple_state<PeeringState::Backfilling, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*) in ceph-osd
13# boost::statechart::detail::send_function<boost::statechart::detail::state_base<std::allocator<boost::statechart::none>, boost::statechart::detail::rtti_policy>, boost::statechart::event_base, void const*>::operator()() in ceph-osd
14# boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&) in ceph-osd
15# boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&) in ceph-osd
16# PeeringState::handle_event(boost::statechart::event_base const&, PeeringCtx*) in ceph-osd
17# crimson::osd::PG::do_peering_event(PGPeeringEvent&, PeeringCtx&) in ceph-osd
18# crimson::osd::PeeringEvent<crimson::osd::LocalPeeringEvent>::with_pg(crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::PG>)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const in ceph-osd

Actions #2

Updated by Laura Flores 11 months ago · Edited

/a/lflores-2025-04-17_15:29:43-crimson-rados-wip-lflores-testing-2-2025-04-11-1133-distro-default-smithi/8246088/

osd.2

Actions #3

Updated by Matan Breizman 11 months ago · Edited

  • Priority changed from Normal to High

https://pulpito.ceph.com/matan-2025-04-29_13:40:47-crimson-rados-wip-matanb-crimson-testing-april-29-distro-crimson-smithi/8264692/
osd.3

short_pg_log ... workloads/radosbench-high-concurrency

Looking at pg 5.1

Up set: [3,0,2]
Acting set: [3,2]

When backfilling osd.0:

2025-04-29T15:16:21.599 INFO:tasks.thrashosds.thrasher:in_osds:  [0, 1, 2, 3] out_osds:  [] dead_osds:  [0] live_osds:  [1, 3, 2]
2025-04-29T15:16:21.599 INFO:tasks.thrashosds.thrasher:choose_action: min_in 4 min_out 0 min_live 2 min_dead 0 chance_down 0.40
2025-04-29T15:16:21.599 INFO:tasks.thrashosds.thrasher:Reviving osd 0

We got an OP_BACKFILL_FINISH_ACK on osd.3 (the primary), which in turn moves the state to RequestDone (in backfill_target_done()) and makes us exit the Waiting state.
The issue is that, when the backfill is marked Done, the peer osd.0 (which is not in the acting set) still had 1 missing object, which triggered the assertion.
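
For reference, a condensed sketch of the assertion site, paraphrased from the backtrace and the PeeringState.cc error line above (names simplified, not verbatim source):

// src/osd/PeeringState.cc (exact line varies by build): entering
// Started/Primary/Active/Recovered assumes no peer is still missing objects.
PeeringState::Recovered::Recovered(my_context ctx)
  : my_base(ctx)
{
  PeeringState *ps = context<PeeringMachine>().state;  // simplified accessor
  ceph_assert(!ps->needs_recovery());  // fires: the backfill peer still has 1 missing
  // ...
}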

DEBUG 2025-04-29 15:17:12,396 [shard 1:main] osd -  pg_epoch 95 pg[5.1( v 95'737 (94'727,95'737] local-lis/les=79/80 n=33 ec=71/71 lis/c=79/71 les/c/f=80/72/0 sis=79) [3,0,2]/[3,2] backfill=[0] r=0 lpr=79 pi=[71,79)/1 pct=94'730 lua=77'608 crt=95'737 lcod 94'729 mlcod 94'729 active+undersized+remapped+backfilling  RecoveryBackend::handle_backfill: 
DEBUG 2025-04-29 15:17:12,396 [shard 1:main] osd -  pg_epoch 95 pg[5.1( v 95'737 (94'727,95'737] local-lis/les=79/80 n=33 ec=71/71 lis/c=79/71 les/c/f=80/72/0 sis=79) [3,0,2]/[3,2] backfill=[0] r=0 lpr=79 pi=[71,79)/1 pct=94'730 lua=77'608 crt=95'737 lcod 94'729 mlcod 94'729 active+undersized+remapped+backfilling  RecoveryBackend::handle_backfill_finish_ack: 
DEBUG 2025-04-29 15:17:12,396 [shard 1:main] osd - BackfillState::StateHelper: exit crimson::osd::BackfillState::Waiting
DEBUG 2025-04-29 15:17:12,396 [shard 1:main] osd - pg[5.1( v 95'737 (94'727,95'737] local-lis/les=79/80 n=33 ec=71/71 lis/c=79/71 les/c/f=80/72/0 sis=79) [3,0,2]/[3,2] backfill=[0] r=0 lpr=79 pi=[71,79)/1 pct=94'730 lua=77'608 crt=95'737 lcod 94'729 mlcod 94'729 active+undersized+remapped+backfilling  BackfillState::StateHelper: enter crimson::osd::BackfillState::Done
INFO  2025-04-29 15:17:12,396 [shard 1:main] osd - pg[5.1( v 95'737 (94'727,95'737] local-lis/les=79/80 n=33 ec=71/71 lis/c=79/71 les/c/f=80/72/0 sis=79) [3,0,2]/[3,2] backfill=[0] r=0 lpr=79 pi=[71,79)/1 pct=94'730 lua=77'608 crt=95'737 lcod 94'729 mlcod 94'729 active+undersized+remapped+backfilling  BackfillState::Done::Done: backfill is done
INFO  2025-04-29 15:17:12,396 [shard 1:main] osd - PerShardState::start_operation, peering_event(id=16778619, detail=PeeringEvent(from=3 pgid=5.1 sent=95 requested=95 evt=epoch_sent: 95 epoch_requested: 95 Backfilled))
DEBUG 2025-04-29 15:17:12,396 [shard 1:main] osd - 0x503000dac2a0 RecoverySubRequest::with_pg: background_recovery_sub(id=16778618, detail=pg_backfill(finish_ack 5.1 e 95/95 lb MIN)): complete
DEBUG 2025-04-29 15:17:12,396 [shard 1:main] ms - [0x51100030b640 osd.3(cluster) v2:172.21.15.152:6803/1582786528 >> osd.0 v2:172.21.15.47:6803/4239440132] got 1572 do_send() from core 0 -- osd_repop(client.4579.0:13220 5.e e95/90 5:74d06e98:::benchmark_data_smithi047_42855_object1045:head v 95'759, pct=94'749) v3
DEBUG 2025-04-29 15:17:12,396 [shard 1:main] ms - [0x51100030b640 osd.3(cluster) v2:172.21.15.152:6803/1582786528 >> osd.0 v2:172.21.15.47:6803/4239440132] --> #1572 === osd_repop(client.4579.0:13220 5.e e95/90 5:74d06e98:::benchmark_data_smithi047_42855_object1045:head v 95'759, pct=94'749) v3 (112)
DEBUG 2025-04-29 15:17:12,396 [shard 1:main] osd - background_recovery_sub(id=16778618, detail=pg_backfill(finish_ack 5.1 e 95/95 lb MIN)): exit
DEBUG 2025-04-29 15:17:12,397 [shard 1:main] osd - 0x0 LocalPeeringEvent::start: peering_event(id=16778619, detail=PeeringEvent(from=3 pgid=5.1 sent=95 requested=95 evt=epoch_sent: 95 epoch_requested: 95 Backfilled)): start
DEBUG 2025-04-29 15:17:12,397 [shard 1:main] osd - 0x0 PeeringEvent<T>::with_pg: start
DEBUG 2025-04-29 15:17:12,397 [shard 1:main] osd - 0x503000b9d530 PeeringEvent<T>::with_pg: 0x503000b9d530 peering_event(id=16778619, detail=PeeringEvent(from=3 pgid=5.1 sent=95 requested=95 evt=epoch_sent: 95 epoch_requested: 95 Backfilled)): pg present
DEBUG 2025-04-29 15:17:12,397 [shard 1:main] osd - do_peering_event handling epoch_sent: 95 epoch_requested: 95 Backfilled for pg: 5.1
DEBUG 2025-04-29 15:17:12,397 [shard 1:main] osd -  pg_epoch 95 pg[5.1( v 95'737 (94'727,95'737] local-lis/les=79/80 n=33 ec=71/71 lis/c=79/71 les/c/f=80/72/0 sis=79) [3,0,2]/[3,2] backfill=[0] r=0 lpr=79 pi=[71,79)/1 pct=94'730 lua=77'608 crt=95'737 lcod 94'729 mlcod 94'729 active+undersized+remapped+backfilling  PG::cancel_local_background_io_reservation: 
DEBUG 2025-04-29 15:17:12,397 [shard 1:main] osd -  pg_epoch 95 pg[5.1( v 95'737 (94'727,95'737] local-lis/les=79/80 n=33 ec=71/71 lis/c=79/71 les/c/f=80/72/0 sis=79) [3,0,2]/[3,2] backfill=[0] r=0 lpr=79 pi=[71,79)/1 pct=94'730 lua=77'608 crt=95'737 lcod 94'729 mlcod 94'729 active+undersized+remapped+backfilling  PG::cancel_local_background_io_reservation: 
DEBUG 2025-04-29 15:17:12,397 [shard 1:main] osd - ShardServices::local_cancel_reservation: sending to singleton pgid 5.1
DEBUG 2025-04-29 15:17:12,397 [shard 1:main] osd -  pg_epoch 95 pg[5.1( v 95'737 (94'727,95'737] local-lis/les=79/80 n=33 ec=71/71 lis/c=79/71 les/c/f=80/72/0 sis=79) [3,0,2]/[3,2] backfill=[0] r=0 lpr=79 pi=[71,79)/1 pct=94'730 lua=77'608 crt=95'737 lcod 94'729 mlcod 94'729 active+undersized+remapped+backfilling  PG::send_cluster_message: message MBackfillReserve(5.1 RELEASE e95/95) v5 to 0 share_map_update false
INFO  2025-04-29 15:17:12,398 [shard 1:main] osd -  pg_epoch 95 pg[5.1( v 95'737 (94'727,95'737] local-lis/les=79/80 n=33 ec=71/71 lis/c=79/71 les/c/f=80/72/0 sis=79) [3,0,2]/[3,2] backfill=[0] r=0 lpr=79 pi=[71,79)/1 pct=94'730 lua=77'608 crt=95'737 lcod 94'729 mlcod 94'729 active+undersized+remapped+backfilling exit Started/Primary/Active/Backfilling 0.519735 3 0.005763
INFO  2025-04-29 15:17:12,398 [shard 1:main] osd - Exiting state: Started/Primary/Active/Backfilling, entered at 2025-04-29T15:17:11.878349+0000, 0.005763 spent on 3 events
INFO  2025-04-29 15:17:12,398 [shard 1:main] osd -  pg_epoch 95 pg[5.1( v 95'737 (94'727,95'737] local-lis/les=79/80 n=33 ec=71/71 lis/c=79/71 les/c/f=80/72/0 sis=79) [3,0,2]/[3,2] backfill=[0] r=0 lpr=79 pi=[71,79)/1 pct=94'730 lua=77'608 crt=95'737 lcod 94'729 mlcod 94'729 active+undersized+remapped enter Started/Primary/Active/Recovered
INFO  2025-04-29 15:17:12,398 [shard 1:main] osd - Entering state: Started/Primary/Active/Recovered
DEBUG 2025-04-29 15:17:12,398 [shard 1:main] osd -  pg_epoch 95 pg[5.1( v 95'737 (94'727,95'737] local-lis/les=79/80 n=33 ec=71/71 lis/c=79/71 les/c/f=80/72/0 sis=79) [3,0,2]/[3,2] backfill=[0] r=0 lpr=79 pi=[71,79)/1 pct=94'730 lua=77'608 crt=95'737 lcod 94'729 mlcod 94'729 active+undersized+remapped PeeringState::needs_recovery osd.0 has 1 missing
DEBUG 2025-04-29 15:17:12,398 [shard 1:main] osd -  pg_epoch 95 pg[5.1( v 95'737 (94'727,95'737] local-lis/les=79/80 n=33 ec=71/71 lis/c=79/71 les/c/f=80/72/0 sis=79) [3,0,2]/[3,2] backfill=[0] r=0 lpr=79 pi=[71,79)/1 pct=94'730 lua=77'608 crt=95'737 lcod 94'729 mlcod 94'729 active+undersized+remapped PeeringState::needs_recovery osd.0 has 1 missing
ERROR 2025-04-29 15:17:12,398 [shard 1:main] none - /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.0.0-1701-g3130b187/rpm/el9/BUILD/ceph-20.0.0-1701-g3130b187/src/osd/PeeringState.cc:6178 : In function 'PeeringState::Recovered::Recovered(boost::statechart::state<PeeringState::Recovered, PeeringState::Active>::my_context)', ceph_assert(%s)
!ps->needs_recovery()

Looking at osd.0, we're missing pg prefixes in the PGRecovery log messages; addressed here: https://github.com/ceph/ceph/pull/62847

DEBUG 2025-04-29 15:14:27,417 [shard 1:main] osd - update_peers_last_backfill: new_last_backfill=MAX
INFO  2025-04-29 15:14:27,417 [shard 1:main] osd - update_peers_last_backfill: peer 3 num_objects now 54 / 54   <--- this is where we send OP_BACKFILL_FINISH_ACK to primary (3), which pg?

Actions #4

Updated by Matan Breizman 11 months ago

  • Assignee set to Matan Breizman
Actions #5

Updated by Matan Breizman 10 months ago

  • Priority changed from High to Normal

Lowering priority due to the low failure frequency

Actions #6

Updated by Matan Breizman 8 months ago

Release flavor, seastore
https://pulpito.ceph.com/teuthology-2025-07-12_21:04:03-crimson-rados-main-distro-crimson-release-smithi/8382592
osd.1:

ERROR 2025-07-13 00:24:45,103 [shard 0:main] osd - /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.3.0-1560-g569ca195/rpm/el9/BUILD/ceph-20.3.0-1560-g569ca195/src/osd/PeeringState.cc:6279 : In function 'PeeringState::Recovered::Recovered(boost::statechart::state<PeeringState::Recovered, PeeringState::Active>::my_context)', ceph_assert(!ps->needs_recovery())

ERROR 2025-07-13 00:24:45,103 [shard 0:main] osd - Aborting Got SIGABRT on shard 0 - Stopping all shards
ERROR 2025-07-13 00:24:46,405 [shard 0:main] osd - Got SIGABRT on shard 0
Backtrace:
  0# 0x00007F55C268C0FC in /lib64/libc.so.6
 1# raise in /lib64/libc.so.6
 2# abort in /lib64/libc.so.6
 3# ceph::__ceph_assert_fail(char const*, char const*, int, char const*) in ceph-osd
 4# ceph::__ceph_assert_fail(ceph::assert_data const&) in ceph-osd
 5# crimson::osd::PGActivationBlocker::stop() in ceph-osd
 6# PGLog::write_log_and_missing(ceph::os::Transaction&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >*, pg_log_t&, coll_t const&, ghobject_t const&, pg_missing_set<true> const&, bool, bool*, DoutPrefixProvider const*) in ceph-osd
 7# crimson::osd::PG::do_peering_event(PGPeeringEvent&, PeeringCtx&) in ceph-osd

Actions #7

Updated by Matan Breizman 8 months ago

  • Tags set to RelWithDebInfo
Actions #10

Updated by Matan Breizman 7 months ago

  • Tags changed from RelWithDebInfo to RelWithDebInfo, frequent-failure
  • Assignee deleted (Matan Breizman)
  • Priority changed from Normal to High
Actions #12

Updated by Matan Breizman 5 months ago

https://pulpito.ceph.com/matan-2025-11-06_07:47:13-crimson-rados-wip-kautilya-renable-testmigration-distro-crimson-debug-smithi/8586652

osd.0:

ERROR 2025-11-06 10:08:17,659 [shard 2:main] osd - /ceph/rpmbuild/BUILD/ceph-20.3.0-3795-gcd87e5ba/src/osd/PeeringState.cc:6398 : In function 'PeeringState::Recovered::Recovered(boost::statechart::state<PeeringState::Recovered, PeeringState::Active>::my_context)', ceph_assert(!ps->needs_recovery())
 3# ceph::__ceph_assert_fail(char const*, char const*, int, char const*) in ceph-osd
 4# ceph::__ceph_assert_fail(ceph::assert_data const&) in ceph-osd
 5# crimson::osd::PGActivationBlocker::stop() in ceph-osd
 6# PGStateHistory::dump(ceph::Formatter*) const in ceph-osd
 7# crimson::osd::PG::do_peering_event(PGPeeringEvent&, PeeringCtx&) in ceph-osd
 8# seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<crimson::interruptible::interruptor<crimson::osd::IOInterruptCondition>::async<crimson::osd::PeeringEvent<crimson::osd::LocalPeeringEvent>::with_pg(crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::PG>)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}, crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, seastar::future<void> > >(crimson::osd::PeeringEvent<crimson::osd::LocalPeeringEvent>::with_pg(crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::PG>)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}&&)::{lambda()#1}>(seastar::thread_attributes, crimson::osd::PeeringEvent<crimson::osd::LocalPeeringEvent>::with_pg(crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::PG>)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}&&)::{lambda()#1}>::call(seastar::noncopyable_function<void ()> const*) in ceph-osd

Actions #13

Updated by MOHIT AGRAWAL 4 months ago

I have triggered the same job 20 times, but the issue is not reproducible on the latest crimson-debug build.

teuthology-suite -m smithi -s crimson-rados -p 50 --ceph-repo https://github.com/ceph/ceph.git --suite-repo https://github.com/mohit84/ceph.git --suite-branch run_radosbench -S 093fdbeba20252909bbd39822b408f18385ee7e7 --flavor crimson-debug --filter-all seastore-segmented,short_pg_log,thrash_seastore_radosbench,radosbench-high-concurrency -N 10

The results are below:
https://pulpito.ceph.com/moagrawa-2025-11-07_05:36:16-crimson-rados-main-distro-crimson-debug-smithi/
https://pulpito.ceph.com/moagrawa-2025-11-07_07:26:10-crimson-rados-main-distro-crimson-debug-smithi/

Actions #14

Updated by Jose J Palacios Perez 4 months ago · Edited

Probably a duplicate, even though the backtrace is slightly different: https://pulpito.ceph.com/jjperez-2025-11-11_12:44:02-crimson-rados-wip-perezjos-crimson-only-11-11-2025-PR65726-distro-crimson-debug-smithi/8595690

osd.3

kernel callstack: 0xffffffffffffff80 0xffffffffa4fc4bb3 0xffffffffa4fc4f40 0xffffffffa4fcbe14 0xffffffffa4fcc1a7 0xffffffffa4d0f9e1 0xffffffffa4d0fd86 0xffffffffa4d10b77 0xffffffffa4d117ea 0xffffffffa4d11c66 0xffffffffa5939f8f 0xffffffffa5a0012b
ERROR 2025-11-11 22:45:40,334 [shard 0:main] osd - Got SIGABRT on shard 0
Backtrace:
  0# 0x00007F6CB803EA60 in /lib64/libc.so.6
 1# 0x00007F6CB808C0FC in /lib64/libc.so.6
 2# raise in /lib64/libc.so.6
 3# abort in /lib64/libc.so.6
 4# ceph::__ceph_assert_fail(ceph::assert_data const&) in ceph-osd
 5# PeeringState::Recovered::Recovered(boost::statechart::state<PeeringState::Recovered, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context) in ceph-osd
 6# PeeringState::Active::Active(boost::statechart::state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::my_context) in ceph-osd
 7# PeeringState::Active::Active(boost::statechart::state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::my_context) in ceph-osd
 8# PeeringState::Active::Active(boost::statechart::state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::my_context) in ceph-osd
 9# PeeringState::pre_submit_op(hobject_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, eversion_t) in ceph-osd
10# PeeringState::Backfilling::react(PeeringState::Backfilled const&) in ceph-osd
11# PeeringState::Active::Active(boost::statechart::state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::my_context) in ceph-osd
12# PeeringState::Active::Active(boost::statechart::state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::my_context) in ceph-osd
13# PeeringState::Active::Active(boost::statechart::state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::my_context) in ceph-osd
14# crimson::osd::BackfillState::Enqueuing::all_emptied(PrimaryBackfillInterval const&, std::map<pg_shard_t, ReplicaBackfillInterval, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ReplicaBackfillInterval> > > const&) const in ceph-osd
15# non-virtual thunk to crimson::osd::PG::on_change(ceph::os::Transaction&) in ceph-osd
16# non-virtual thunk to crimson::osd::PG::on_change(ceph::os::Transaction&) in ceph-osd
17# std::chrono::operator<<(std::ostream&, std::chrono::duration<long, std::ratio<1l, 1000000000l> > const&) in ceph-osd
18# crimson::osd::PG::do_peering_event(PGPeeringEvent&, PeeringCtx&) in ceph-osd
19# crimson::osd::PeeringEvent<crimson::osd::LocalPeeringEvent>::with_pg(crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::PG>) in ceph-osd
20# _ZN7seastar20noncopyable_functionIFN7crimson13interruptible27interruptible_future_detailINS1_3osd20IOInterruptConditionENS_6futureIvEEEEvEE17direct_vtable_forIZNS7_4thenIZNS8_18then_interruptibleIZZNS4_12PeeringEventINS4_18RemotePeeringEventEE7with_pgERNS4_13ShardServicesEN5boost13intrusive_ptrINS4_2PGEEEENKUlvE_clEvEUlvE0_EEDaOT_EUlvE_S8_EET0_SQ_EUlDpOT_E_E7destroyEPNS_8internal25noncopyable_function_baseE in ceph-osd
21# _ZN7seastar20noncopyable_functionIFN7crimson13interruptible27interruptible_future_detailINS1_3osd20IOInterruptConditionENS_6futureIvEEEEvEE17direct_vtable_forIZNS7_4thenIZNS8_18then_interruptibleIZZNS4_12PeeringEventINS4_18RemotePeeringEventEE7with_pgERNS4_13ShardServicesEN5boost13intrusive_ptrINS4_2PGEEEENKUlvE_clEvEUlvE0_EEDaOT_EUlvE_S8_EET0_SQ_EUlDpOT_E_E7destroyEPNS_8internal25noncopyable_function_baseE in ceph-osd
22# _ZN7seastar20noncopyable_functionIFNS_6futureIvEEvEE17direct_vtable_forIZNS2_4thenIZN7crimson13interruptible27interruptible_future_detailINS7_3osd20IOInterruptConditionES2_E18then_interruptibleIZZZNSA_12PeeringEventINSA_18RemotePeeringEventEE7with_pgERNSA_13ShardServicesEN5boost13intrusive_ptrINSA_2PGEEEENKUlvE_clEvENKUlvE0_clEvEUlvE0_EEDaOT_EUlvE_S2_EET0_SR_EUlDpOT_E_E4callEPKS4_ in ceph-osd
23# _ZN7seastar20noncopyable_functionIFNS_6futureIvEEvEE17direct_vtable_forIZNS2_4thenIZN7crimson13interruptible27interruptible_future_detailINS7_3osd20IOInterruptConditionES2_E18then_interruptibleIZZZNSA_12PeeringEventINSA_18RemotePeeringEventEE7with_pgERNSA_13ShardServicesEN5boost13intrusive_ptrINSA_2PGEEEENKUlvE_clEvENKUlvE0_clEvEUlvE0_EEDaOT_EUlvE_S2_EET0_SR_EUlDpOT_E_E4callEPKS4_ in ceph-osd
24# seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<crimson::interruptible::interruptor<crimson::osd::IOInterruptCondition>::async<crimson::osd::PeeringEvent<crimson::osd::LocalPeeringEvent>::with_pg(crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::PG>)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}, crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, seastar::future<void> > >(crimson::osd::PeeringEvent<crimson::osd::LocalPeeringEvent>::with_pg(crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::PG>)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}&&)::{lambda()#1}>(seastar::thread_attributes, crimson::osd::PeeringEvent<crimson::osd::LocalPeeringEvent>::with_pg(crimson::osd::ShardServices&, boost::intrusive_ptr<crimson::osd::PG>)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}&&)::{lambda()#1}>::call(seastar::noncopyable_function<void ()> const*) in ceph-osd
25# std::noskipws(std::ios_base&) in ceph-osd
26# std::noskipws(std::ios_base&) in ceph-osd
Actions #15

Updated by Matan Breizman 4 months ago · Edited

osd.3:

http://qa-proxy.ceph.com/teuthology/matan-2025-11-13_09:09:15-crimson-rados-wip-build-crimson-osd-kefu-distro-crimson-debug-smithi/8601174/teuthology.log

 4# ceph::__ceph_assert_fail(ceph::assert_data const&) at /usr/src/debug/ceph-20.3.0-4088.g4e0abfb0.el9.x86_64/src/crimson/common/assert.cc:15
 5# PeeringState::Recovered::Recovered(boost::statechart::state<PeeringState::Recovered, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context) at /usr/src/debug/ceph-20.3.0-4088.g4e0abfb0.el9.x86_64/src/osd/PeeringState.cc:6398 (discriminator 2)
 6# boost::statechart::state<PeeringState::Recovered, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr<PeeringState::Active> const&, boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>&) at /usr/src/debug/ceph-20.3.0-4088.g4e0abfb0.el9.x86_64/redhat-linux-build/boost/include/boost/statechart/state.hpp:89 (discriminator 4)
 7# boost::statechart::state<PeeringState::Recovered, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::deep_construct(boost::intrusive_ptr<PeeringState::Active> const&, boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>&) at /usr/src/debug/ceph-20.3.0-4088.g4e0abfb0.el9.x86_64/redhat-linux-build/boost/include/boost/statechart/state.hpp:80
 8# boost::statechart::detail::inner_constructor<boost::mpl::l_item<mpl_::long_<1l>, PeeringState::Recovered, boost::mpl::l_end>, boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator> >::construct(boost::intrusive_ptr<PeeringState::Active> const&, boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>&) at /usr/src/debug/ceph-20.3.0-4088.g4e0abfb0.el9.x86_64/redhat-linux-build/boost/include/boost/statechart/detail/constructor.hpp:94
 9# boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PeeringState::Backfilling, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::transit<PeeringState::Recovered>() at /usr/src/debug/ceph-20.3.0-4088.g4e0abfb0.el9.x86_64/redhat-linux-build/boost/include/boost/statechart/simple_state.hpp:311
10# PeeringState::Backfilling::react(PeeringState::Backfilled const&) at /usr/src/debug/ceph-20.3.0-4088.g4e0abfb0.el9.x86_64/src/osd/PeeringState.cc:5602
11# boost::statechart::detail::result_utility::get_result(boost::statechart::detail::safe_reaction_result) at /usr/src/debug/ceph-20.3.0-4088.g4e0abfb0.el9.x86_64/redhat-linux-build/boost/include/boost/statechart/result.hpp:110
12# boost::statechart::detail::reaction_result boost::statechart::simple_state<PeeringState::Backfilling, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list<boost::statechart::custom_reaction<PeeringState::Backfilled>, boost::statechart::custom_reaction<DeferBackfill>, boost::statechart::custom_reaction<PeeringState::UnfoundBackfill>, boost::statechart::custom_reaction<RemoteReservationRejectedTooFull>, boost::statechart::custom_reaction<RemoteReservationRevokedTooFull>, boost::statechart::custom_reaction<RemoteReservationRevoked>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, boost::statechart::simple_state<PeeringState::Backfilling, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PeeringState::Backfilling, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*) at /usr/src/debug/ceph-20.3.0-4088.g4e0abfb0.el9.x86_64/redhat-linux-build/boost/include/boost/statechart/simple_state.hpp:818
13# boost::statechart::simple_state<PeeringState::Backfilling, PeeringState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*) at /usr/src/debug/ceph-20.3.0-4088.g4e0abfb0.el9.x86_64/redhat-linux-build/boost/include/boost/statechart/simple_state.hpp:493
14# boost::statechart::detail::send_function<boost::statechart::detail::state_base<std::allocator<boost::statechart::none>, boost::statechart::detail::rtti_policy>, boost::statechart::event_base, void const*>::operator()() at /usr/src/debug/ceph-20.3.0-4088.g4e0abfb0.el9.x86_64/redhat-linux-build/boost/include/boost/statechart/state_machine.hpp:87 (discriminator 1)
15# boost::statechart::detail::safe_reaction_result boost::statechart::null_exception_translator::operator()<boost::statechart::detail::send_function<boost::statechart::detail::state_base<std::allocator<boost::statechart::none>, boost::statechart::detail::rtti_policy>, boost::statechart::event_base, void const*>, boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::exception_event_handler>(boost::statechart::detail::send_function<boost::statechart::detail::state_base<std::allocator<boost::statechart::none>, boost::statechart::detail::rtti_policy>, boost::statechart::event_base, void const*>, boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::exception_event_handler) at /usr/src/debug/ceph-20.3.0-4088.g4e0abfb0.el9.x86_64/redhat-linux-build/boost/include/boost/statechart/null_exception_translator.hpp:33

Actions #16

Updated by Matan Breizman 4 months ago

  • Assignee set to Shraddha Agrawal

Hey Shraddha, can you please take a look at this one?
@Ronen Friedman is joining Crimson and this could be a good opportunity to investigate this tracker together!
Thanks!

Actions #17

Updated by Shraddha Agrawal 3 months ago · Edited

Scheduled the test 5 times to try to replicate the failure; all passed successfully on latest main (with normal_pg_log instead of short_pg_log). See: https://pulpito.ceph.com/shraddhaag-2025-12-08_10:22:12-crimson-rados-main-distro-crimson-debug-smithi/.

Command I used to schedule these tests:

teuthology-suite -m gibba -s crimson-rados -p 90 -l 2 -N 9 --flavor crimson-debug --force-priority --sha1 6ce249e0e13e12a74d5c855ed12d6b50671977c9 --filter "crimson-rados/thrash/{0-crimson_install 0-size-min-size-overrides/3-size-2-min-size 1-pg-log-overrides/short_pg_log 2-recovery-overrides/{default} clusters/{crimson-fixed-2} crimson-supported-all-distro/centos_latest crimson_qa_overrides deploy/ceph objectstore/crimson_bluestore thrashers/default thrashosds-health workloads/radosbench" --distro centos --distro-version 9 --suite-branch main --suite-repo https://github.com/ceph/ceph.git --filter-out "workloads/radosbench-high-concurrency" 

Trying again to reproduce, see:

Actions #18

Updated by Shraddha Agrawal 3 months ago

shraddhaag-2025-12-04_08:39:06-crimson-rados-main-distro-crimson-debug-smithi/8641104

2025-12-04T09:29:40.722 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~303s
2025-12-04T09:29:40.722 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK! unmounting mounts and killing all daemons
2025-12-04T09:29:41.226 INFO:tasks.ceph.osd.0:Sent signal 15
2025-12-04T09:29:41.227 INFO:tasks.ceph.osd.1:Sent signal 15
2025-12-04T09:29:41.227 INFO:tasks.ceph.osd.3:Sent signal 15
2025-12-04T09:29:41.228 INFO:tasks.ceph.mon.a:Sent signal 15
2025-12-04T09:29:41.228 INFO:tasks.ceph.mgr.x:Sent signal 15
2025-12-04T09:29:41.228 INFO:tasks.daemonwatchdog.daemon_watchdog:Killing thrasher OSDThrasher
2025-12-04T09:29:41.229 INFO:tasks.ceph.mon.a.smithi187.stderr:2025-12-04T09:29:41.226+0000 7f1bf1a60640 -1 received  signal: Terminated from /usr/bin/python3 /bin/daemon-helper kill ceph-mon -f --cluster ceph -i a  (PID: 35135) UID: 0
2025-12-04T09:29:41.229 INFO:tasks.ceph.mon.a.smithi187.stderr:2025-12-04T09:29:41.226+0000 7f1bf1a60640 -1 mon.a@0(leader) e1 *** Got Signal Terminated ***
2025-12-04T09:29:42.992 DEBUG:teuthology.orchestra.run.smithi187:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph pg dump --format=json
2025-12-04T09:29:43.114 INFO:teuthology.orchestra.run.smithi187.stderr:2025-12-04T09:29:43.108+0000 7ff283707640 -1 WARNING: the following dangerous and experimental features are enabled: crimson
2025-12-04T09:31:43.032 DEBUG:teuthology.orchestra.run:got remote process result: 124
2025-12-04T09:31:43.038 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
DEBUG 2025-12-04 09:24:32,838 [shard 2:main] osd -  pg_epoch 224 pg[7.8( v 223'147 (222'143,223'147] local-lis/les=219/220 n=123 ec=196/196 lis/c=219/0 les/c/f=220/0/0 sis=219) [1,2,3]/[2,3] backfill=[1] r=0 lpr=219 pi=[196,219)/2 lua=208'135 crt=223'147 lcod 223'146 mlcod 223'146 active+undersized+degraded+remapped PeeringState::needs_recovery osd.1 has 2 missing
ERROR 2025-12-04 09:24:32,838 [shard 2:main] osd - /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.3.0-4434-g8611241d/rpm/el9/BUILD/ceph-20.3.0-4434-g8611241d/src/osd/PeeringState.cc:6415 : In function 'PeeringState::Recovered::Recovered(boost::statechart::state<PeeringState::Recovered, PeeringState::Active>::my_context)', ceph_assert(!ps->needs_recovery())
Actions #19

Updated by Aishwarya Mathuria 2 months ago

https://pulpito.ceph.com/amathuri-2026-01-19_08:34:57-crimson-rados:thrash_pg_grow-main-distro-debug-trial/7514
osd.1:

DEBUG 2026-01-19 09:04:58,574 [shard 0:main] ms - [0x7cb11c118700 osd.1(cluster) v2:10.20.193.148:6801/1231154630@63972 >> osd.0 v2:10.20.193.148:6811/2690313804] GOT AckFrame: seq=226398
DEBUG 2026-01-19 09:04:58,574 [shard 1:main] osd -  pg_epoch 588 pg[9.3f( v 588'5413 (588'5409,588'5413] local-lis/les=587/588 n=5 ec=584/521 lis/c=587/584 les/c/f=588/585/0 sis=587) [1,2,0]/[1,2,3] backfill=[0] r=0 lpr=587 pi=[584,587)/1 pct=588'5412 lua=585'5388 crt=588'5413 lcod 588'5411 mlcod 588'5411 active+remapped PeeringState::needs_recovery osd.0 has 1 missing
ERROR 2026-01-19 09:04:58,574 [shard 1:main] osd - /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.3.0-4750-g0a5a6195/rpm/el9/BUILD/ceph-20.3.0-4750-g0a5a6195/src/osd/PeeringState.cc:6415 : In function 'PeeringState::Recovered::Recovered(boost::statechart::state<PeeringState::Recovered, PeeringState::Active>::my_context)', ceph_assert(!ps->needs_recovery())

Actions #20

Updated by Aishwarya Mathuria 2 months ago

https://pulpito.ceph.com/amathuri-2026-01-19_15:49:24-crimson-rados:thrash_pg_grow-main-distro-debug-trial/7587
osd.0:

ERROR 2026-01-19 16:05:37,870 [shard 3:main] osd - /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.3.0-4750-g0a5a6195/rpm/el9/BUILD/ceph-20.3.0-4750-g0a5a6195/src/osd/PeeringState.cc:6415 : In function 'PeeringState::Recovered::Recovered(boost::statechart::state<PeeringState::Recovered, PeeringState::Active>::my_context)', ceph_assert(!ps->needs_recovery()) 

DEBUG 2026-01-19 16:05:37,870 [shard 2:main] osd -  pg_epoch 273 pg[4.5b( v 270'2449 (270'2445,270'2449] local-lis/les=272/273 n=240 ec=269/175 lis/c=272/269 les/c/f=273/270/0 sis=272) [3,0,1]/[0,1,2] backfill=[3] r=0 lpr=272 pi=[269,272)/1 crt=270'2449 mlcod 0'0 active+remapped+backfilling PeeringState::update_calc_stats degraded 0
DEBUG 2026-01-19 16:05:37,870 [shard 2:main] osd -  pg_epoch 273 pg[4.5b( v 270'2449 (270'2445,270'2449] local-lis/les=272/273 n=240 ec=269/175 lis/c=272/269 les/c/f=273/270/0 sis=272) [3,0,1]/[0,1,2] backfill=[3] r=0 lpr=272 pi=[269,272)/1 crt=270'2449 mlcod 0'0 active+remapped+backfilling PeeringState::update_calc_stats misplaced 240
DEBUG 2026-01-19 16:05:37,870 [shard 2:main] osd -  pg_epoch 273 pg[4.5b( v 270'2449 (270'2445,270'2449] local-lis/les=272/273 n=240 ec=269/175 lis/c=272/269 les/c/f=273/270/0 sis=272) [3,0,1]/[0,1,2] backfill=[3] r=0 lpr=272 pi=[269,272)/1 crt=270'2449 mlcod 0'0 active+remapped+backfilling PeeringState::prepare_stats_for_publish reporting purged_snaps []
DEBUG 2026-01-19 16:05:37,870 [shard 2:main] osd -  pg_epoch 273 pg[4.5b( v 270'2449 (270'2445,270'2449] local-lis/les=272/273 n=240 ec=269/175 lis/c=272/269 les/c/f=273/270/0 sis=272) [3,0,1]/[0,1,2] backfill=[3] r=0 lpr=272 pi=[269,272)/1 crt=270'2449 mlcod 0'0 active+remapped+backfilling PeeringState::prepare_stats_for_publish publish_stats_to_osd 273:3102
DEBUG 2026-01-19 16:05:37,870 [shard 2:main] osd -  pg_epoch 273 pg[4.5b( v 270'2449 (270'2445,270'2449] local-lis/les=272/273 n=240 ec=269/175 lis/c=272/269 les/c/f=273/270/0 sis=272) [3,0,1]/[0,1,2] backfill=[3] r=0 lpr=272 pi=[269,272)/1 crt=270'2449 mlcod 0'0 active+remapped+backfilling PeeringState::prepare_stats_for_publish publish_stats_to_osd 273:3102
ERROR 2026-01-19 16:05:37,870 [shard 3:main] osd - Aborting Got SIGABRT on shard 3 - Stopping all shards

Actions #21

Updated by Aishwarya Mathuria about 2 months ago

https://qa-proxy.ceph.com/teuthology/amathuri-2026-01-22_07:29:48-crimson-rados:thrash_pg_grow-main-distro-debug-trial/12183/
osd.2:

INFO  2026-01-22 08:08:36,081 [shard 0:main] osd - Exiting state: Started/Primary/Active/Backfilling, entered at 2026-01-22T08:08:33.388777+0000, 0.002935 spent on 7 events
INFO  2026-01-22 08:08:36,081 [shard 0:main] osd -  pg_epoch 469 pg[7.37( v 469'942 (465'927,469'942] local-lis/les=460/461 n=10 ec=455/390 lis/c=460/401 les/c/f=461/402/0 sis=460) [2,3,1]/[2,1,0] backfill=[3] r=0 lpr=460 pi=[401,460)/2 pct=466'933 lua=454'904 crt=469'942 lcod 466'932 mlcod 466'932 active+remapped enter Started/Primary/Active/Recovered
INFO  2026-01-22 08:08:36,081 [shard 0:main] osd - Entering state: Started/Primary/Active/Recovered
DEBUG 2026-01-22 08:08:36,081 [shard 0:main] osd -  pg_epoch 469 pg[7.37( v 469'942 (465'927,469'942] local-lis/les=460/461 n=10 ec=455/390 lis/c=460/401 les/c/f=461/402/0 sis=460) [2,3,1]/[2,1,0] backfill=[3] r=0 lpr=460 pi=[401,460)/2 pct=466'933 lua=454'904 crt=469'942 lcod 466'932 mlcod 466'932 active+remapped PeeringState::needs_recovery osd.3 has 1 missing
ERROR 2026-01-22 08:08:36,081 [shard 0:main] osd - /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.3.0-4750-g0a5a6195/rpm/el9/BUILD/ceph-20.3.0-4750-g0a5a6195/src/osd/PeeringState.cc:6415 : In function 'PeeringState::Recovered::Recovered(boost::statechart::state<PeeringState::Recovered, PeeringState::Active>::my_context)', ceph_assert(!ps->needs_recovery())
Actions #23

Updated by MOHIT AGRAWAL about 2 months ago

https://pulpito.ceph.com/moagrawa-2026-01-29_04:55:52-crimson-rados-main-distro-default-trial/24904
moagrawa-2026-01-29_04:55:52-crimson-rados-main-distro-default-trial/24904/remote/trial175/log/ceph-osd.2.log.gz

Actions #24

Updated by MOHIT AGRAWAL about 2 months ago

https://pulpito.ceph.com/moagrawa-2026-01-29_04:55:52-crimson-rados-main-distro-default-trial/24951
/teuthology/moagrawa-2026-01-29_04:55:52-crimson-rados-main-distro-default-trial/24951/remote/trial034/log
ceph-osd.1.log.gz

Actions #26

Updated by Jose J Palacios Perez about 1 month ago

Actions #27

Updated by Aishwarya Mathuria about 1 month ago

/a/amathuri-2026-02-12_10:22:28-crimson-rados:thrash-wip-amat-fix-74504-distro-debug-trial/46154
osd.2:

ERROR 2026-02-12 11:27:01,821 [shard 1:main] osd - /ceph/rpmbuild/BUILD/ceph-20.3.0-5155-ge9085d06/src/osd/PeeringState.cc:6415 : In function 'PeeringState::Recovered::Recovered(boost::statechart::state<PeeringState::Recovered, PeeringState::Active>::my_context)', ceph_assert(!ps->needs_recovery())

Actions #30

Updated by Xuehan Xu 15 days ago · Edited

I think I've just found the cause of this issue. https://pulpito.ceph.com/xuxuehan-2026-03-06_14:37:54-crimson-rados-wip-seastore-background-trans-cc-opt-new-distro-debug-trial/90874/

DEBUG 2026-03-06 15:22:01,100 [shard 0:main] osd -  pg_epoch 236 pg[4.75( v 236'253 (231'248,236'253] local-lis/les=235/236 n=53 ec=232/149 lis/c=235/181 les/c/f=236/182/0 sis=235) [3,2,1]/[3,1,0] backfill=[2] r=0 lpr=235 pi=[181,235)/1 lua=231'251 crt=236'253 lcod 236'252 mlcod 236'252 active+remapped+backfilling  PGRecovery::enqueue_push: obj=4:af66fdfb:::benchmark_data_trial068_49753_object3000:head v=172'115 peers=[2]
......
DEBUG 2026-03-06 15:22:01,195 [shard 0:main] osd -  pg_epoch 236 pg[4.75( v 236'253 (231'248,236'253] local-lis/les=235/236 n=52 ec=232/149 lis/c=235/181 les/c/f=236/182/0 sis=235) [3,2,1]/[3,1,0] backfill=[2] r=0 lpr=235 pi=[181,235)/1 lua=231'251 crt=236'253 lcod 236'252 mlcod 236'252 active+remapped+backfilling  ReplicatedBackend::submit_transaction: object 4:af66fdfb:::benchmark_data_trial068_49753_object3000:head
DEBUG 2026-03-06 15:22:01,200 [shard 0:main] osd -  pg_epoch 236 pg[4.75( v 236'253 (231'248,236'253] local-lis/les=235/236 n=52 ec=232/149 lis/c=235/181 les/c/f=236/182/0 sis=235) [3,2,1]/[3,1,0] backfill=[2] r=0 lpr=235 pi=[181,235)/1 lua=231'251 crt=236'253 lcod 236'252 mlcod 236'252 active+remapped+backfilling PeeringState::append_log append_log log((231'248,236'253], crt=236'253) [236'254 (172'115) delete   4:af66fdfb:::benchmark_data_trial068_49753_object3000:head by client.4696.0:12156 2026-03-06T15:22:01.091069+0000 0 ObjectCleanRegions clean_offsets: [65536~18446744073709486079], clean_omap: false, new_object: false]
......
DEBUG 2026-03-06 15:22:02,575 [shard 0:main] osd -  pg_epoch 237 pg[4.75( v 236'254 (231'249,236'254] local-lis/les=235/236 n=52 ec=232/149 lis/c=235/181 les/c/f=236/182/0 sis=235) [3,2,1]/[3,1,0] backfill=[2] r=0 lpr=235 pi=[181,235)/1 lua=231'251 crt=236'254 lcod 236'253 mlcod 236'253 active+remapped+backfilling  PGRecovery::update_peers_last_backfill: obj=4:af66fdfb:::benchmark_data_trial068_49753_object3000:head v=0'0 target=2
......
DEBUG 2026-03-06 15:22:02,577 [shard 0:main] osd - pg[4.75( v 236'254 (231'249,236'254] local-lis/les=235/236 n=52 ec=232/149 lis/c=235/181 les/c/f=236/182/0 sis=235) [3,2,1]/[3,1,0] backfill=[2] r=0 lpr=235 pi=[181,235)/1 lua=231'251 crt=236'254 lcod 236'253 mlcod 236'253 active+remapped+backfilling  BackfillState::Waiting::react::ObjectPushed: Waiting::react() on ObjectPushed; evt.object=4:af66fdfb:::benchmark_data_trial068_49753_object3000:head
DEBUG 2026-03-06 15:22:02,577 [shard 0:main] osd - pg[4.75( v 236'254 (231'249,236'254] local-lis/les=235/236 n=52 ec=232/149 lis/c=235/181 les/c/f=236/182/0 sis=235) [3,2,1]/[3,1,0] backfill=[2] r=0 lpr=235 pi=[181,235)/1 lua=231'251 crt=236'254 lcod 236'253 mlcod 236'253 active+remapped+backfilling  BackfillState::ProgressTracker::complete_to: obj=4:af66fdfb:::benchmark_data_trial068_49753_object3000:head
......
DEBUG 2026-03-06 15:22:02,851 [shard 0:main] osd -  pg_epoch 237 pg[4.75( v 236'254 (231'249,236'254] local-lis/les=235/236 n=52 ec=232/149 lis/c=235/181 les/c/f=236/182/0 sis=235) [3,2,1]/[3,1,0] backfill=[2] r=0 lpr=235 pi=[181,235)/1 lua=231'251 crt=236'254 lcod 236'253 mlcod 236'253 active+remapped PeeringState::needs_recovery osd.2 has 1 missing, first missing oid=4:af66fdfb:::benchmark_data_trial068_49753_object3000:head

As shown above, pg 4.75 is being backfilled and has enqueued benchmark_data_trial068_49753_object3000 for push. However, a client DELETE for benchmark_data_trial068_49753_object3000 comes in and gets applied to the primary before the push happens, which effectively cancels the push.

So the push of benchmark_data_trial068_49753_object3000 completes without ever issuing PeeringState::on_peer_recover, which leaves the missing entry for benchmark_data_trial068_49753_object3000 in the peer info.
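
For context, here is a simplified paraphrase (not verbatim Ceph source) of the bookkeeping that PeeringState::on_peer_recover performs on the primary when a push completes normally. Because the DELETE cancels the push, this path is skipped and the peer_missing entry survives into Recovered:

void PeeringState::on_peer_recover(pg_shard_t peer,
                                   const hobject_t &soid,
                                   const eversion_t &version)
{
  // Simplified: the real function also publishes stats.
  peer_missing[peer].got(soid, version);   // drop the peer's missing entry
  missing_loc.add_location(soid, peer);    // the peer now holds the object
}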

Actions #31

Updated by Shraddha Agrawal 12 days ago

@Xuehan Xu I had found the exact same thing and raised this PR to fix the issue: https://github.com/ceph/ceph/pull/67488.

But as the review suggests, this probably needs a bit more work.

Actions #32

Updated by Xuehan Xu 12 days ago

  • Pull request ID set to 67488
Actions #33

Updated by Xuehan Xu 12 days ago

  • Status changed from New to Fix Under Review
Actions #34

Updated by Jose J Palacios Perez 5 days ago

Actions #35

Updated by Matan Breizman 5 days ago

  • Priority changed from High to Urgent
Actions #36

Updated by Shraddha Agrawal 4 days ago · Edited

Logs: https://pastebin.com/raw/X1Pchq9Z

Investigation:

TLDR: the above seems to be happening due to a stale peer_missing entry for an object. While this object is being backfilled, an incoming client request deletes it. This causes the backfill to abort early, as the object is deleted from the primary before it could be backfilled. But neither the backfill nor the client delete removes the stale peer_missing entry.

Details:

1. Backfill finds our object (through scan_for_backfill_primary). [This holds an obc read lock on the object.]
2. In Enqueuing::Enqueuing:

  • check >= primary_bi.begin: the peer is ahead of, or at the same position as, the primary: call update_on_peers.
  • check < primary_bi.begin: the peer has an object that the primary doesn't: call remove_on_peers.

In our case we call update_on_peers, because the object has not yet been deleted by the client request.
3. In update_on_peers, we call enqueue_push.
4. enqueue_push adds the object to peer_missing and calls recover_object.
5. In recover_object, we find that the object does not exist on the primary and abort recovery. [This holds an obc read lock on the object.]

While this is happening, after step 1 and before step 4, a client request comes in and deletes the object on the primary. Since the client request arrives before enqueue_push is called, is_degraded_or_backfilling_object does not block it.
After step 4 but before step 5, the client delete (having removed the object on the primary) discovers the object is being backfilled and invokes enqueue_delete_for_backfill, which sends MOSDPGBackfillRemove to the peers. MOSDPGBackfillRemove does not clear the peer_missing entries, which results in the abort above (modeled in the sketch below).
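
To make the interleaving concrete, here is a toy model of the race (hypothetical names, not Ceph code); compiled and run, the final assert aborts exactly like the ceph_assert does:

#include <cassert>
#include <set>
#include <string>

struct ToyPrimary {
  std::set<std::string> objects{"obj"};  // objects present on the primary
  std::set<std::string> peer_missing;    // primary's view of the backfill peer

  void enqueue_push(const std::string &o) {   // step 4
    peer_missing.insert(o);
  }
  void client_delete(const std::string &o) {  // lands between steps 4 and 5
    objects.erase(o);
    // enqueue_delete_for_backfill() sends MOSDPGBackfillRemove to the peer,
    // but peer_missing is NOT cleared here.
  }
  void recover_object(const std::string &o) { // step 5
    if (!objects.count(o))
      return;  // object gone on primary: abort, on_peer_recover() never runs
    peer_missing.erase(o);  // normal completion path
  }
  bool needs_recovery() const { return !peer_missing.empty(); }
};

int main() {
  ToyPrimary p;
  p.enqueue_push("obj");
  p.client_delete("obj");
  p.recover_object("obj");
  assert(!p.needs_recovery());  // aborts: mirrors ceph_assert(!ps->needs_recovery())
}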

Possible solutions:

1. Hold a lock from when the object is found in the scan until it is recovered, thus blocking the client operation. (Less preferred, as it is blocking.)
2. In the client's delete path, send MOSDPGRecoveryDelete instead of MOSDPGBackfillRemove, which will clear the peer_missing entries. (Most preferred; see the sketch after this list.)
3. In recover_delete, if the object doesn't exist on the primary, issue MOSDPGBackfillRemove to clear the peer_missing entries. (Partial fix in: https://github.com/ceph/ceph/pull/67488)
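
Whichever message is chosen, the common effect needed is illustrated below (hypothetical helper, simplified names, not Ceph code): the delete path must also drop the primary's stale peer_missing entry.

#include <set>
#include <string>

// Hypothetical sketch of the fix's core: when an object pending backfill is
// deleted, clear the primary's peer_missing entry for it in addition to
// telling the peer to remove its copy (MOSDPGRecoveryDelete for option 2,
// MOSDPGBackfillRemove for option 3).
void on_delete_during_backfill(std::set<std::string> &peer_missing,
                               const std::string &oid)
{
  peer_missing.erase(oid);  // keeps needs_recovery() from seeing a ghost entry
}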
