Bug #73249


osd/MissingLoc.h: FAILED ceph_assert(0 == "unexpected need for missing item")

Added by Laura Flores 6 months ago. Updated about 2 months ago.

Status: Fix Under Review
Priority: Normal
Assignee: Bill Scales
Category: EC Pools
Target version: -
% Done: 0%
Source:
Backport: tentacle
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID: 65747
Tags (freeform): backport_processed
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

description: rados/thrash-erasure-code/{ceph clusters/{fixed-4} ec_optimizations/ec_optimizations_on
fast/fast mon_election/classic msgr-failures/osd-dispatch-delay objectstore/{bluestore/{alloc$/{avl}
base mem$/{normal-1} onode-segment$/{none} write$/{random/{compr$/{yes$/{snappy}}
random}}}} rados recovery-overrides/{more-active-recovery} supported-random-distro$/{centos_latest}
thrashers/default thrashosds-health workloads/ec-rados-plugin=jerasure-k=8-m=6-crush}

/a/teuthology-2025-09-21_20:00:25-rados-main-distro-default-smithi/8513319

2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.3.0-3186-g12208e07/rpm/el9/BUILD/ceph-20.3.0-3186-g12208e07/src/osd/MissingLoc.h: 242: FAILED ceph_assert(0 == "unexpected need for missing item")
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr:
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr: ceph version 20.3.0-3186-g12208e07 (12208e072faa66b12f712bc07f4245801ee9ebdb) tentacle (dev - RelWithDebInfo)
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x5620435be951]
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr: 2: ceph-osd(+0x41b85d) [0x56204355985d]
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr: 3: (PeeringState::activate(ceph::os::Transaction&, unsigned int, PeeringCtxWrapper&)+0x12bf) [0x562043a54b4f]
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr: 4: (PeeringState::Active::Active(boost::statechart::state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::my_context)+0x268) [0x562043a81bf8]
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr: 5: ceph-osd(+0x140aeeb) [0x562044548eeb]
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr: 6: ceph-osd(+0x94f525) [0x562043a8d525]
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr: 7: ceph-osd(+0x6ed510) [0x56204382b510]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 8: (PeeringState::activate_map(PeeringCtx&)+0xf0) [0x562043a41520]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 9: (PG::handle_activate_map(PeeringCtx&, unsigned int)+0x51) [0x56204383c3e1]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 10: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x875) [0x5620437b4cf5]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 11: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x267) [0x5620437c00b7]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 12: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x52) [0x562043a2c1a2]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x8bc) [0x5620437d60cc]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x23a) [0x562043d1c94a]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 15: ceph-osd(+0xbdef04) [0x562043d1cf04]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 16: /lib64/libc.so.6(+0x8a3b2) [0x7fcba868a3b2]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 17: /lib64/libc.so.6(+0x10f430) [0x7fcba870f430]

Another case: /a/skanta-2025-09-18_23:59:11-rados-wip-bharath4-testing-2025-09-18-1250-distro-default-smithi/8511020


Related issues 3 (3 open, 0 closed)

Related to RADOS - Bug #56779: crash: void MissingLoc::add_active_missing(const pg_missing_t&): assert(0 == "unexpected need for missing item") - New

Related to RADOS - Bug #56895: crash: void MissingLoc::add_active_missing(const pg_missing_t&): assert(0 == "unexpected need for missing item") - New

Copied to RADOS - Backport #73511: tentacle: osd/MissingLoc.h: FAILED ceph_assert(0 == "unexpected need for missing item") - In Progress (Bill Scales)
Actions #1

Updated by Laura Flores 6 months ago · Edited

/a/skanta-2025-09-18_23:59:11-rados-wip-bharath4-testing-2025-09-18-1250-distro-default-smithi/8511013
/a/skanta-2025-09-20_10:25:07-rados-wip-bharath4-testing-2025-09-18-1250-distro-default-smithi/8512291

Actions #2

Updated by Laura Flores 6 months ago

  • Assignee set to Bill Scales
Actions #3

Updated by Bill Scales 6 months ago

I'm looking at this one

Actions #4

Updated by Laura Flores 6 months ago · Edited

Scheduled 15x of the same test to help determine repeatability: http://pulpito.ceph.com/lflores-2025-09-25_19:19:28-rados-wip-bharath4-testing-2025-09-18-1250-distro-default-smithi/

./teuthology/virtualenv/bin/teuthology-suite -v -m smithi -c wip-bharath4-testing-2025-09-18-1250 -r skanta-2025-09-18_23:59:11-rados-wip-bharath4-testing-2025-09-18-1250-distro-default-smithi -p 60 -N 15 --filter-all "rados/thrash-erasure-code-isa/{arch/x86_64 ceph clusters/{fixed-4} ec_optimizations/ec_optimizations_on mon_election/connectivity msgr-failures/osd-dispatch-delay objectstore/{bluestore/{alloc$/{stupid} base mem$/{normal-1} onode-segment$/{512K-onoff} write$/{v1/{compr$/{no$/{no}} v1}}}} rados recovery-overrides/{more-async-partial-recovery} supported-random-distro$/{ubuntu_latest} thrashers/pggrow_host thrashosds-health workloads/ec-rados-plugin=isa-k=10-m=4}" 
Actions #5

Updated by Bill Scales 6 months ago

Looked at /a/teuthology-2025-09-21_20:00:25-rados-main-distro-default-smithi/8513319 and I understand what has gone wrong; it is a problem in optimized EC only. It is more likely to occur with wider EC (e.g. the 8+6 this test was using) and is much more likely to happen in teuthology than in the real world, because teuthology compresses its error injects much closer together, which means there is very little I/O between each inject.

A stray shard that was previously in recovery is applying pwlc to advance last_update but does not update its missing list (it can't, because it has not been given the log entries it missed due to partial writes). Later this shard becomes active and its out-of-date missing list becomes a problem. The fix is probably to not advance last_update. This looks similar to 8370475, which we thought was fixed by a different change, but clearly isn't.

Still need to screen the other two instances - we've seen many different bugs in optimized EC cause this assert, so there is no guarantee that they are the same issue.
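The failure mode described above can be sketched in a few lines of Python. This is purely illustrative: the `Shard` class, `apply_pwlc`, and the activation check are invented simplifications, not the actual Ceph code.

```python
# Illustrative sketch (not Ceph code): pwlc advances last_update past partial
# writes, but carries no log entries, so a stale missing-list entry survives
# until activation trips the "unexpected need" assert.

class Shard:
    def __init__(self, last_update, missing):
        self.last_update = last_update        # newest log entry this shard knows of
        self.last_complete = last_update
        self.missing = dict(missing)          # object -> version we think we need

    def apply_pwlc(self, pwlc_version):
        # pwlc lets the shard skip ahead past partial writes it never received,
        # but without log entries self.missing cannot be corrected.
        if pwlc_version > self.last_update:
            self.last_update = pwlc_version

def activate(shard, authoritative_need):
    # On activation, each missing item's 'need' is checked against the
    # authoritative log; a stale entry is the "unexpected need" case.
    for obj, need in shard.missing.items():
        if need != authoritative_need[obj]:
            raise AssertionError("unexpected need for missing item")

# Shard in recovery: thinks it needs object X at version 560.
shard = Shard(last_update=548, missing={"X": 560})
# Partial writes 569..570 skipped this shard; pwlc advances last_update to 570,
# but write 569 also touched X, so the authoritative need is now 569.
shard.apply_pwlc(570)
try:
    activate(shard, authoritative_need={"X": 569})
except AssertionError as e:
    print(e)  # unexpected need for missing item
```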

Actions #6

Updated by Laura Flores 6 months ago

We still need to determine whether this exists in tentacle.

Actions #7

Updated by Bill Scales 6 months ago

Confirmed that the two other issues are exactly the same issue.

This is a bug in the FastEC code; it only affects FastEC configurations. The bug is present in Tentacle and main.

Of the 15 repeats of this test, 1 hit an infrastructure issue, 13 did not hit this assert, and 1 did (8520700), again because of exactly the same issue. That suggests the issue is reproducible, but it might take 15+ runs to reproduce.

Actions #8

Updated by Bill Scales 6 months ago

  • Category set to EC Pools
  • Pull request ID set to 65747

I've reviewed 5 instances of this problem; all are identical. A shard that is in recovery/backfill (and hence has last_complete != last_update and a non-empty missing list) is using pwlc to advance last_update for writes that this shard missed, all of which were partial writes that did not affect this shard. The problem with just advancing last_update is that the missing list is not updated: if object X was already on the missing list and one or more of the partial writes was for object X, then we need to update the missing list to reflect the need for a newer version of X.

The fix is to avoid applying pwlc in this scenario, which forces the primary to send the log and update the missing list. A potential fix is coded in https://github.com/ceph/ceph/pull/65747; it still needs more testing to check for regressions.
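The guard described here can be sketched as follows. This is an illustrative Python sketch of the condition only, with hypothetical field names; the actual change lives in the PR above.

```python
from dataclasses import dataclass, field

@dataclass
class Shard:
    last_update: int
    last_complete: int
    missing: dict = field(default_factory=dict)   # object -> needed version

def maybe_apply_pwlc(shard, pwlc_version):
    # Decline pwlc for a shard that is mid-recovery (non-empty missing list
    # and last_complete != last_update); the primary is then forced to send
    # log entries, which also bring the missing list up to date.
    mid_recovery = bool(shard.missing) and shard.last_complete != shard.last_update
    if mid_recovery:
        return False
    shard.last_update = max(shard.last_update, pwlc_version)
    return True

# A clean shard can safely jump ahead via pwlc...
clean = Shard(last_update=548, last_complete=548)
assert maybe_apply_pwlc(clean, 570) and clean.last_update == 570
# ...a recovering shard with missing objects must take the log path instead.
recovering = Shard(last_update=548, last_complete=540, missing={"X": 560})
assert not maybe_apply_pwlc(recovering, 570) and recovering.last_update == 548
```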

Actions #9

Updated by Bill Scales 6 months ago

  • Backport set to tentacle
Actions #10

Updated by Bill Scales 6 months ago

Teuthology tests of the first fix in the PR show that it fixes the test that was repeated 15 times, but other tests are regressed by the fix. A second version of the fix has been coded that is a bit more selective about when pwlc is and is not applied to a shard in recovery (with missing items and last_complete != last_update). A small teuthology run has completed with no issues; a larger run is now in progress.

Actions #11

Updated by Bill Scales 5 months ago

  • Copied to Backport #73511: tentacle: osd/MissingLoc.h: FAILED ceph_assert(0 == "unexpected need for missing item") added
Actions #12

Updated by Bill Scales 5 months ago

  • Tags (freeform) set to backport_processed
Actions #13

Updated by Radoslaw Zarzynski 5 months ago

  • Status changed from New to Fix Under Review

scrub note: fix already approved, in QA.

Actions #14

Updated by Laura Flores 5 months ago

PR needed to be rebased; was added to a new QA batch.

Actions #15

Updated by Laura Flores 4 months ago

QA in progress here: https://tracker.ceph.com/issues/73711

Actions #16

Updated by Laura Flores 4 months ago

Laura Flores wrote in #note-15:

QA in progress here: https://tracker.ceph.com/issues/73711

Same update

Actions #17

Updated by Laura Flores 4 months ago

/a/skanta-2025-11-18_05:30:29-rados-wip-bharath10-testing-2025-11-18-0557-distro-default-smithi/8609005

Actions #18

Updated by Radoslaw Zarzynski 4 months ago · Edited

The PR got the TESTED label but a comment has popped out: https://github.com/ceph/ceph/pull/65747#issuecomment-3552023260.

CC: @Bill Scales

Actions #19

Updated by Laura Flores 4 months ago

  • Related to Bug #56779: crash: void MissingLoc::add_active_missing(const pg_missing_t&): assert(0 == "unexpected need for missing item") added
Actions #20

Updated by Laura Flores 4 months ago

  • Related to Bug #56895: crash: void MissingLoc::add_active_missing(const pg_missing_t&): assert(0 == "unexpected need for missing item") added
Actions #21

Updated by Bill Scales 4 months ago

  • Status changed from Fix Under Review to In Progress

Looking at /teuthology/skanta-2025-11-05_00:00:19-rados-wip-bharath7-testing-2025-11-04-1337-distro-default-smithi/8584435/, the fix has improved things but it still isn't right:

In epoch 1053, osd 15(0) has osd 13(2) as a stray shard with log entries 0'0,1030'568 and last_complete 962'548. At the end of the peering cycle it sends its info and pwlc (but not log entries) to advance the head to 1050'570. Both writes 569 and 570 were partial writes that did not update that shard; however, write 569 was to an object that 13(2) has marked as missing. Advancing the head without providing a log that would allow the missing list to be updated means that 13(2) has an out-of-date missing list, which causes this unexpected-need assert in a later peering cycle where 13(2) has become part of the acting set again.

We shouldn't have tried to update the log head using pwlc because this was a stray shard.
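The refined condition described in this comment can be sketched as a simple predicate. This is illustrative Python with invented names, not the actual Ceph code.

```python
def should_apply_pwlc(is_stray, has_missing, last_complete, last_update):
    # A stray shard cannot be sent the log entries needed to keep its missing
    # list consistent, so it must never have its head advanced via pwlc.
    if is_stray:
        return False
    # A mid-recovery shard (missing items, last_complete != last_update) must
    # also take the log path so the missing list is updated alongside the head.
    if has_missing and last_complete != last_update:
        return False
    return True

# osd 13(2) in the scenario above: a stray, mid-recovery shard -> decline pwlc.
assert not should_apply_pwlc(is_stray=True, has_missing=True,
                             last_complete=548, last_update=568)
# A fully caught-up member of the acting set may apply it.
assert should_apply_pwlc(is_stray=False, has_missing=False,
                         last_complete=570, last_update=570)
```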

Actions #22

Updated by Radoslaw Zarzynski 4 months ago

scrub note: the PR is tested but we are not merging it right now; awaiting continuation.

Actions #23

Updated by Aishwarya Mathuria 4 months ago

/a/yuriw-2025-12-03_15:44:36-rados-wip-yuri5-testing-2025-12-02-1256-distro-default-smithi/8639578

Actions #24

Updated by Radoslaw Zarzynski 3 months ago

scrub note: bump up.

Actions #25

Updated by Laura Flores 3 months ago

Rework in progress.

Actions #26

Updated by Radoslaw Zarzynski 2 months ago

  • Status changed from In Progress to Fix Under Review
Actions #27

Updated by Radoslaw Zarzynski 2 months ago

Bump up.

Actions #28

Updated by Radoslaw Zarzynski about 2 months ago

Bump up.
