Bug #66316 (closed)

AsyncReserver crash - !queue_pointers.count(item) && !in_progress.count(item)

Added by Samuel Just almost 2 years ago. Updated 5 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
Fixed In: v19.3.0-3744-gd812176759
Released In: v20.2.0~2388
Upkeep Timestamp: 2025-11-01T01:17:55+00:00

Description


...
ERROR 2024-05-31 03:39:33,444 [shard 0:main] none - /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-4013-g149e0d82/rpm/el9/BUILD/ceph-19.0.0-4013-g149e0d82/src/common/AsyncReserver.h:264 : In function 'void AsyncReserver<T, F>::request_reservation(T, Context*, unsigned int, Context*) [with T = spg_t; F = crimson::osd::OSDSingletonState::DirectFinisher]', ceph_assert(%s)
!queue_pointers.count(item) && !in_progress.count(item)
INFO  2024-05-31 03:39:33,444 [shard 1:main] osd -  pg_epoch 300 pg[3.c( v 280'496 (0'0,280'496] local-lis/les=290/291 n=5 ec=16/16 lis/c=290/290 les/c/f=291/291/0 sis=297) [] r=-1 lpr=300 pi=[290,297)/1 crt=280'496 lcod 0'0 mlcod 0'0 unknown NOTIFY exit Started/Stray 0.000149 0 0.000000
Aborting on shard 0.
Backtrace:
...
 0# 0x00007FC15748B94C in /lib64/libc.so.6
 1# raise in /lib64/libc.so.6
 2# abort in /lib64/libc.so.6
 3# ceph::__ceph_assert_fail(ceph::assert_data const&) in ceph-osd
 4# AsyncReserver<spg_t, crimson::osd::OSDSingletonState::DirectFinisher>::request_reservation(spg_t, Context*, unsigned int, Context*) in ceph-osd
 5# crimson::osd::ShardServices::local_request_reservation(spg_t, Context*, unsigned int, Context*)::{lambda(crimson::osd::OSDSingletonState&, Context*, Context*)#1}::operator()(crimson::osd::OSDSingletonState&, Context*, Context*) const in ceph-osd
 6# void std::__invoke_impl<void, crimson::osd::ShardServices::local_request_reservation(spg_t, Context*, unsigned int, Context*)::{lambda(crimson::osd::OSDSingletonState&, Context*, Context*)#1}, crimson::osd::OSDSingletonState&, Context*, Context*>(std::__invoke_other, crimson::osd::ShardServices::local_request_reservation(spg_t, Context*, unsigned int, Context*)::{lambda(crimson::osd::OSDSingletonState&, Context*, Context*)#1}&&, crimson::osd::OSDSingletonState&, Context*&&, Context*&&) in ceph-osd
 7# decltype(auto) std::__apply_impl<crimson::osd::ShardServices::local_request_reservation(spg_t, Context*, unsigned int, Context*)::{lambda(crimson::osd::OSDSingletonState&, Context*, Context*)#1}, std::tuple<crimson::osd::OSDSingletonState&, Context*, Context*>, 0ul, 1ul, 2ul>(crimson::osd::ShardServices::local_request_reservation(spg_t, Context*, unsigned int, Context*)::{lambda(crimson::osd::OSDSingletonState&, Context*, Context*)#1}&&, std::tuple<crimson::osd::OSDSingletonState&, Context*, Context*>&&, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul>) in ceph-osd
 8# seastar::sharded<crimson::osd::OSDSingletonState>::invoke_on<crimson::osd::ShardServices::local_request_reservation(spg_t, Context*, unsigned int, Context*)::{lambda(crimson::osd::OSDSingletonState&, Context*, Context*)#1}, Context*, Context*, seastar::future<void> >(unsigned int, seastar::smp_submit_to_options, crimson::osd::ShardServices::local_request_reservation(spg_t, Context*, unsigned int, Context*)::{lambda(crimson::osd::OSDSingletonState&, Context*, Context*)#1}&&, Context*&&, Context*&&)::{lambda()#1}::operator()() in ceph-osd
 9# seastar::future<void> seastar::futurize<void>::invoke<seastar::sharded<crimson::osd::OSDSingletonState>::invoke_on<crimson::osd::ShardServices::local_request_reservation(spg_t, Context*, unsigned int, Context*)::{lambda(crimson::osd::OSDSingletonState&, Context*, Context*)#1}, Context*, Context*, seastar::future<void> >(unsigned int, seastar::smp_submit_to_options, crimson::osd::ShardServices::local_request_reservation(spg_t, Context*, unsigned int, Context*)::{lambda(crimson::osd::OSDSingletonState&, Context*, Context*)#1}&&, Context*&&, Context*&&)::{lambda()#1}&>(crimson::osd::ShardServices::local_request_reservation(spg_t, Context*, unsigned int, Context*)::{lambda(crimson::osd::OSDSingletonState&, Context*, Context*)#1}&&) in ceph-osd
10# seastar::smp_message_queue::async_work_item<seastar::sharded<crimson::osd::OSDSingletonState>::invoke_on<crimson::osd::ShardServices::local_request_reservation(spg_t, Context*, unsigned int, Context*)::{lambda(crimson::osd::OSDSingletonState&, Context*, Context*)#1}, Context*, Context*, seastar::future<void> >(unsigned int, seastar::smp_submit_to_options, crimson::osd::ShardServices::local_request_reservation(spg_t, Context*, unsigned int, Context*)::{lambda(crimson::osd::OSDSingletonState&, Context*, Context*)#1}&&, Context*&&, Context*&&)::{lambda()#1}>::run_and_dispose() in ceph-osd
11# 0x000000000B7CA260 in ceph-osd
12# 0x000000000B7E44FA in ceph-osd
13# 0x000000000B8854F8 in ceph-osd
14# 0x000000000B886B46 in ceph-osd
15# 0x000000000B5637C2 in ceph-osd
16# 0x000000000B56413E in ceph-osd
17# main in ceph-osd

https://pulpito.ceph.com/sjust-2024-05-31_02:08:00-crimson-rados:thrash-wip-sjust-crimson-testing-2024-05-29-distro-default-smithi/7735328
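
For context, the failing assert encodes AsyncReserver's invariant that a reservation may not be requested for an item that is already queued or already holds a slot. A minimal sketch of that invariant follows; queue_pointers and in_progress are the member names from the assert itself, everything else (types, the int key) is simplified for illustration:

    #include <cassert>
    #include <set>

    // Simplified model of the invariant asserted at AsyncReserver.h:264.
    struct ToyReserver {
      std::set<int> queue_pointers;  // items queued, waiting for a slot
      std::set<int> in_progress;     // items currently holding a reservation

      void request_reservation(int item) {
        // A second request for an item that is already queued or granted is
        // a caller bug, so it aborts rather than being silently deduplicated.
        assert(!queue_pointers.count(item) && !in_progress.count(item));
        queue_pointers.insert(item);
      }

      void cancel_reservation(int item) {
        queue_pointers.erase(item);
        in_progress.erase(item);
      }
    };

Two request_reservation() calls for the same pg without an intervening cancel reproduce the abort in the log above, which is the double-submission race discussed in the comments below.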

Actions #1

Updated by Samuel Just almost 2 years ago

  • Subject changed from crimson: AsyncReserver crash on interval change to crimson: AsyncReserver crash
Actions #2

Updated by Samuel Just almost 2 years ago

This is probably happening because PG::request_local_background_io_reservation and friends do not wait for the future to resolve. Fortunately, these days we handle peering events under seastar::thread (see PG::do_peering_event), so it should be OK to simply block.
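
A minimal sketch of that blocking approach, assuming the reservation call returns a future (the helper name below is hypothetical): inside a seastar::thread, future::get() suspends the thread rather than the reactor, so waiting for the cross-core submission to resolve is safe.

    #include <chrono>
    #include <seastar/core/future.hh>
    #include <seastar/core/sleep.hh>
    #include <seastar/core/thread.hh>

    // Hypothetical stand-in for the cross-core submission performed by
    // ShardServices::local_request_reservation(); it returns a future that
    // resolves once the request has been queued on the singleton's core.
    seastar::future<> submit_reservation_to_singleton() {
      return seastar::sleep(std::chrono::milliseconds(1));
    }

    // Because PG::do_peering_event runs the handler under seastar::thread
    // (via seastar::async), the caller can block on the returned future
    // with get() instead of discarding it.
    seastar::future<> handle_peering_event_sketch() {
      return seastar::async([] {
        submit_reservation_to_singleton().get();  // suspends this
                                                  // seastar::thread only
      });
    }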

Actions #3

Updated by Samuel Just almost 2 years ago

sjust-2024-06-14_20:06:29-crimson-rados:thrash-wip-sjust-crimson-testing-2024-06-14-distro-default-smithi/7757018/osd_logs

Actions #4

Updated by Matan Breizman almost 2 years ago

  • Subject changed from crimson: AsyncReserver crash to AsyncReserver crash - !queue_pointers.count(item) && !in_progress.count(item)
Actions #6

Updated by Samuel Just over 1 year ago

  • Pull request ID set to 58464
Actions #7

Updated by Samuel Just over 1 year ago

  • Status changed from New to Fix Under Review
Actions #8

Updated by Samuel Just over 1 year ago

The above fix is part of the story, but the other half is that continuations submitted to the singleton instance need to be executed in order (in particular, map advances can cause a reservation cancel and requeue in the same peering event sequence). Will come up with a mechanism to ensure that next.
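
One possible shape for such a mechanism (illustrative only, not necessarily what the PR implements): chain every submission onto the tail of a single future, so continuations run strictly in submission order and a cancel followed by a requeue cannot be reordered.

    #include <utility>
    #include <seastar/core/future.hh>

    // Hypothetical ordered-submission queue; each call to submit()
    // schedules func after everything submitted before it.
    class ordered_submitter {
      seastar::future<> _tail = seastar::make_ready_future<>();
    public:
      template <typename Func>
      void submit(Func&& func) {
        // then() runs func only after the previous tail resolves, so the
        // cancel/requeue pair from one peering event keeps its order.
        _tail = std::move(_tail).then(std::forward<Func>(func));
      }
    };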

Actions #9

Updated by Samuel Just over 1 year ago

Updated PR with (hopefully) complete fix.

Actions #10

Updated by Matan Breizman over 1 year ago

  • Status changed from Fix Under Review to Resolved

No new instances since the fix was merged.

Actions #11

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to d8121767596a8ced6cd21ca37d4cc971cc9690b6
  • Fixed In set to v19.3.0-3744-gd8121767596
  • Upkeep Timestamp set to 2025-07-11T01:38:27+00:00
Actions #12

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v19.3.0-3744-gd8121767596 to v19.3.0-3744-gd812176759
  • Upkeep Timestamp changed from 2025-07-11T01:38:27+00:00 to 2025-07-14T22:43:28+00:00
Actions #13

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~2388
  • Upkeep Timestamp changed from 2025-07-14T22:43:28+00:00 to 2025-11-01T01:17:55+00:00