Bug #73249
osd/MissingLoc.h: FAILED ceph_assert(0 == "unexpected need for missing item")
Description
description: rados/thrash-erasure-code/{ceph clusters/{fixed-4} ec_optimizations/ec_optimizations_on fast/fast mon_election/classic msgr-failures/osd-dispatch-delay objectstore/{bluestore/{alloc$/{avl} base mem$/{normal-1} onode-segment$/{none} write$/{random/{compr$/{yes$/{snappy}} random}}}} rados recovery-overrides/{more-active-recovery} supported-random-distro$/{centos_latest} thrashers/default thrashosds-health workloads/ec-rados-plugin=jerasure-k=8-m=6-crush}
/a/teuthology-2025-09-21_20:00:25-rados-main-distro-default-smithi/8513319
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.3.0-3186-g12208e07/rpm/el9/BUILD/ceph-20.3.0-3186-g12208e07/src/osd/MissingLoc.h: 242: FAILED ceph_assert(0 == "unexpected need for missing item")
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr:
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr: ceph version 20.3.0-3186-g12208e07 (12208e072faa66b12f712bc07f4245801ee9ebdb) tentacle (dev - RelWithDebInfo)
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x5620435be951]
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr: 2: ceph-osd(+0x41b85d) [0x56204355985d]
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr: 3: (PeeringState::activate(ceph::os::Transaction&, unsigned int, PeeringCtxWrapper&)+0x12bf) [0x562043a54b4f]
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr: 4: (PeeringState::Active::Active(boost::statechart::state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::my_context)+0x268) [0x562043a81bf8]
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr: 5: ceph-osd(+0x140aeeb) [0x562044548eeb]
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr: 6: ceph-osd(+0x94f525) [0x562043a8d525]
2025-09-21T20:50:22.888 INFO:tasks.ceph.osd.3.smithi142.stderr: 7: ceph-osd(+0x6ed510) [0x56204382b510]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 8: (PeeringState::activate_map(PeeringCtx&)+0xf0) [0x562043a41520]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 9: (PG::handle_activate_map(PeeringCtx&, unsigned int)+0x51) [0x56204383c3e1]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 10: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x875) [0x5620437b4cf5]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 11: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x267) [0x5620437c00b7]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 12: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x52) [0x562043a2c1a2]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x8bc) [0x5620437d60cc]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x23a) [0x562043d1c94a]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 15: ceph-osd(+0xbdef04) [0x562043d1cf04]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 16: /lib64/libc.so.6(+0x8a3b2) [0x7fcba868a3b2]
2025-09-21T20:50:22.889 INFO:tasks.ceph.osd.3.smithi142.stderr: 17: /lib64/libc.so.6(+0x10f430) [0x7fcba870f430]
Another case: /a/skanta-2025-09-18_23:59:11-rados-wip-bharath4-testing-2025-09-18-1250-distro-default-smithi/8511020
Updated by Laura Flores 6 months ago · Edited
/a/skanta-2025-09-18_23:59:11-rados-wip-bharath4-testing-2025-09-18-1250-distro-default-smithi/8511013
/a/skanta-2025-09-20_10:25:07-rados-wip-bharath4-testing-2025-09-18-1250-distro-default-smithi/8512291
Updated by Laura Flores 6 months ago · Edited
Scheduled 15x of the same test to help determine repeatability: http://pulpito.ceph.com/lflores-2025-09-25_19:19:28-rados-wip-bharath4-testing-2025-09-18-1250-distro-default-smithi/
./teuthology/virtualenv/bin/teuthology-suite -v -m smithi -c wip-bharath4-testing-2025-09-18-1250 -r skanta-2025-09-18_23:59:11-rados-wip-bharath4-testing-2025-09-18-1250-distro-default-smithi -p 60 -N 15 --filter-all "rados/thrash-erasure-code-isa/{arch/x86_64 ceph clusters/{fixed-4} ec_optimizations/ec_optimizations_on mon_election/connectivity msgr-failures/osd-dispatch-delay objectstore/{bluestore/{alloc$/{stupid} base mem$/{normal-1} onode-segment$/{512K-onoff} write$/{v1/{compr$/{no$/{no}} v1}}}} rados recovery-overrides/{more-async-partial-recovery} supported-random-distro$/{ubuntu_latest} thrashers/pggrow_host thrashosds-health workloads/ec-rados-plugin=isa-k=10-m=4}"
Updated by Bill Scales 6 months ago
Looked at /a/teuthology-2025-09-21_20:00:25-rados-main-distro-default-smithi/8513319 - I understand what has gone wrong; it is a problem in optimized EC only. It is more likely to occur with wider EC (e.g. 8+6, as this test was using) and is much more likely to happen in teuthology than in the real world, because teuthology compresses its error injects much closer together, which means there is very little I/O between each inject.
A stray shard that was previously in recovery is applying pwlc to advance last_update, but doesn't update its missing list (it can't, because it was never given the log entries it missed due to partial writes). Later this shard becomes active and its out-of-date missing list becomes a problem. The fix is probably to not advance last_update. Looks similar to 8370475, which we thought was fixed by a different change, but clearly isn't.
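To make the failure mode concrete, here is a toy model (illustrative C++ only; Version, missing_need etc. are invented for the sketch and are not Ceph types):

#include <iostream>
#include <map>
#include <string>

// Toy stand-in for Ceph's eversion_t ("epoch'version").
struct Version {
  unsigned epoch = 0, v = 0;
};

int main() {
  // Shard mid-recovery: object "X" is missing, recorded as needed at 100'5.
  Version last_update{100, 5};
  std::map<std::string, Version> missing_need{{"X", {100, 5}}};

  // A partial write 100'6 touches "X" but skips this shard. pwlc lets the
  // shard advance last_update even though it never received the log entry,
  // so the stale 100'5 entry for "X" survives.
  last_update = Version{100, 6};

  // Later the shard is active again; the primary computes need 100'6 for
  // "X" from its own log, the two views disagree, and that divergence is
  // what trips ceph_assert(0 == "unexpected need for missing item").
  Version primary_need{100, 6};
  if (missing_need["X"].v != primary_need.v) {  // simplified comparison
    std::cout << "last_update is " << last_update.epoch << "'" << last_update.v
              << " but the missing list still wants 100'" << missing_need["X"].v
              << " for X (primary expects 100'" << primary_need.v << ")\n";
  }
  return 0;
}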
Still need to screen the other two instances - we've seen many different bugs in optimized EC cause this assert, so there is no guarantee that they are the same issue.
Updated by Laura Flores 6 months ago
Need to determine if this exists in tentacle yet.
Updated by Bill Scales 6 months ago
Confirmed that the two other instances are exactly the same issue.
This is a bug in the FastEC code; it only affects FastEC configurations. The bug is present in both tentacle and main.
For the rerun of 15 repeats of this test, there was 1 infrastructure issue, 13 tests did not hit this assert, and 1 did (8520700), caused by exactly the same issue. That suggests the issue is reproducible, but might take 15+ runs to reproduce.
Updated by Bill Scales 6 months ago
- Category set to EC Pools
- Pull request ID set to 65747
I've reviewed 5 instances of this problem and all are identical: a shard that is in recovery/backfill (and hence has last_complete != last_update and a non-empty missing list) is using pwlc to advance last_update for writes that this shard missed, but which were all partial writes that did not affect this shard. The problem with just advancing last_update is that the missing list is not updated: if object X was already on the missing list and one or more of the partial writes was for object X, then we need to update the missing list to reflect the need for a newer version of X.
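For contrast, here is what keeping the missing list consistent requires when the log entries are actually shipped (toy C++ with assumed semantics, not Ceph's actual pg_missing_t code); a pwlc-only advance skips exactly this step:

#include <map>
#include <string>

struct Version { unsigned epoch = 0, v = 0; };

// A missed write: the object it touched and the version it produced.
struct LogEntry {
  std::string oid;
  Version ver;
};

// need: version the shard must recover to; have: version it actually holds.
struct MissingItem {
  Version need, have;
};

// Replaying a missed log entry keeps the missing list in step with the
// advancing log head: an already-missing object gets its need bumped.
void apply_missed_entry(std::map<std::string, MissingItem>& missing,
                        const LogEntry& e) {
  auto it = missing.find(e.oid);
  if (it != missing.end())
    it->second.need = e.ver;              // object already missing: need the newer version
  else
    missing[e.oid] = {e.ver, Version{}};  // newly divergent object
}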
The fix is to avoid applying pwlc in this scenario; this forces the primary to send the log and update the missing list. A potential fix is coded in https://github.com/ceph/ceph/pull/65747, but it still needs more testing to check for regressions.
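As a hedged sketch of what that guard might look like (invented names and toy types; not the PR's actual code):

struct Eversion { unsigned epoch = 0, v = 0; };

struct ShardInfo {
  Eversion last_update, last_complete;
  bool missing_empty = true;
};

// A shard in recovery/backfill has last_complete != last_update and a
// non-empty missing list. Applying pwlc to such a shard advances
// last_update without updating the missing list, so skip it; the primary
// is then forced to send the log, which brings the missing list up to date.
bool may_apply_pwlc(const ShardInfo& s) {
  return s.missing_empty &&
         s.last_complete.epoch == s.last_update.epoch &&
         s.last_complete.v == s.last_update.v;
}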
Updated by Bill Scales 6 months ago
Teuthology tests of the first fix in the PR show that it fixes the test that was repeated 15 times, but that the fix regresses other tests. A second version of the fix has been coded which is a bit more selective about when pwlc is and is not applied to a shard in recovery (one with missing items and last_complete != last_update). A small teuthology run has completed with no issues; a larger run is now in progress.
Updated by Bill Scales 5 months ago
- Copied to Backport #73511: tentacle: osd/MissingLoc.h: FAILED ceph_assert(0 == "unexpected need for missing item") added
Updated by Radoslaw Zarzynski 5 months ago
- Status changed from New to Fix Under Review
scrub note: fix already approved, in QA.
Updated by Laura Flores 5 months ago
PR needed to be rebased; was added to a new QA batch.
Updated by Laura Flores 4 months ago
QA in progress here: https://tracker.ceph.com/issues/73711
Updated by Laura Flores 4 months ago
Laura Flores wrote in #note-15:
QA in progress here: https://tracker.ceph.com/issues/73711
Same update
Updated by Laura Flores 4 months ago
/a/skanta-2025-11-18_05:30:29-rados-wip-bharath10-testing-2025-11-18-0557-distro-default-smithi/8609005
Updated by Radoslaw Zarzynski 4 months ago · Edited
The PR got the TESTED label, but a comment has popped up: https://github.com/ceph/ceph/pull/65747#issuecomment-3552023260.
CC: @Bill Scales
Updated by Laura Flores 4 months ago
- Related to Bug #56779: crash: void MissingLoc::add_active_missing(const pg_missing_t&): assert(0 == "unexpected need for missing item") added
Updated by Laura Flores 4 months ago
- Related to Bug #56895: crash: void MissingLoc::add_active_missing(const pg_missing_t&): assert(0 == "unexpected need for missing item") added
Updated by Bill Scales 4 months ago
- Status changed from Fix Under Review to In Progress
Looking at /teuthology/skanta-2025-11-05_00:00:19-rados-wip-bharath7-testing-2025-11-04-1337-distro-default-smithi/8584435/ the fix has improved things, but it still isn't right:
In epoch 1053, osd 15(0) has osd 13(2) as a stray shard with log entries 0'0,1030'568 and last_complete 962'548. At the end of the peering cycle it sends its info and pwlc (but not log entries) to advance the head to 1050'570. Both writes 569 and 570 were partial writes that did not update that shard; however, write 569 was to an object that 13(2) has marked as missing. Advancing the head without providing a log that would allow the missing list to be updated means that 13(2) has an out-of-date missing list, which causes this unexpected-need assert in a later peering cycle where 13(2) has become part of the acting set again.
We shouldn't have tried to update the log head using pwlc, because this was a stray shard.
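A minimal sketch of the refined guard this implies (my assumption about the direction of the fix, with invented names; not the actual PR code):

struct Eversion { unsigned epoch = 0, v = 0; };

struct ShardInfo {
  Eversion last_update, last_complete;
  bool missing_empty = true;
  bool is_stray = false;   // shard is not in the current up/acting set
};

// On top of the earlier mid-recovery check, refuse to apply pwlc to a
// stray shard entirely: a stray (like 13(2) above) is never sent the log
// entries that would keep its missing list in step with an advancing head.
bool may_apply_pwlc(const ShardInfo& s) {
  if (s.is_stray)
    return false;
  return s.missing_empty &&
         s.last_complete.epoch == s.last_update.epoch &&
         s.last_complete.v == s.last_update.v;
}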
Updated by Radoslaw Zarzynski 4 months ago
scrub note: the PR is tested, but we are not merging it right now; awaiting continuation.
Updated by Aishwarya Mathuria 4 months ago
/a/yuriw-2025-12-03_15:44:36-rados-wip-yuri5-testing-2025-12-02-1256-distro-default-smithi/8639578
Updated by Radoslaw Zarzynski 2 months ago
- Status changed from In Progress to Fix Under Review