osd: Optimized EC missing list not updated on recovering shard#65747

Open
bill-scales wants to merge 1 commit into ceph:main from bill-scales:issue_73249

Conversation

@bill-scales
Contributor

@bill-scales bill-scales commented Oct 1, 2025

Shards that are recovering (last_complete != last_update) use pwlc to advance last_update for writes that did not affect the shard. However, simply incrementing last_update means that the primary doesn't send the shard the log entries it missed, and consequently the shard cannot update its missing list.

If the shard is already missing object X at version V1 and there was a partial write at V2 that did not update the shard, it does not need to retain the log entry, but it does need to update the missing list to say it needs V2 rather than V1. This ensures all shards report a need for an object at the same version and avoids an assert in MissingLoc::add_active_missing when the primary is trying to combine the missing lists from all the shards to work out what has to be recovered.

The fix is to avoid applying pwlc when last_complete != last_update; this forces the primary to send the log to the recovering shard, which can then update its missing list (discarding the log entries, as they are partial writes).

Fixes: https://tracker.ceph.com/issues/73249

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
You must only issue one Jenkins command per comment; Jenkins does not understand comments with more than one command.

@bill-scales bill-scales requested a review from a team as a code owner October 1, 2025 15:10
@github-actions github-actions bot added the core label Oct 1, 2025
@bill-scales bill-scales marked this pull request as draft October 1, 2025 15:10
@bill-scales
Contributor Author

I think this is a fix for https://tracker.ceph.com/issues/73249. Doing some local testing to check for other regressions, and waiting for shaman to recover so I can kick off some teuthology tests.

@bill-scales
Contributor Author

Test results so far for the latest version of the fix:

https://pulpito.ceph.com/billscales-2025-10-08_07:00:25-rados:thrash-erasure-code-optimizations-main-distro-default-smithi/
https://pulpito.ceph.com/billscales-2025-10-08_10:00:53-rados:thrash-erasure-code-optimizations-main-distro-default-smithi/

Lots of failed jobs due to scrubbing generating false data-inconsistency warnings, a few infrastructure issues, one recreate of the bug that Alex has fixed, one recreate of the stats bug that Jon has fixed, and one failed run due to an over-aggressive error inject.

@bill-scales bill-scales requested a review from aainscow October 9, 2025 07:34
@bill-scales bill-scales marked this pull request as ready for review October 9, 2025 07:35
Contributor

@aainscow aainscow left a comment


LGTM

@bill-scales
Contributor Author

https://pulpito.ceph.com/billscales-2025-10-09_06:16:15-rados:thrash-erasure-code-optimizations-main-distro-default-smithi/

Other than 38 scrub data-inconsistency reports on otherwise successful runs, which Jon has a fix for, a clean run.

Contributor

@rzarzynski rzarzynski left a comment


LGTM, just a nit.

    stats_last_update[from] = oinfo.last_update;
  - update_peer_info(from, oinfo);
  + update_peer_info(from, oinfo, BEFORE_ACTIVATE);


ACK.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@yuriw
Contributor

yuriw commented Oct 29, 2025

@bill-scales pls rebase

Shards that are recovering (last_complete != last_update) that
skip transactions and log entries because of partial writes are
using pwlc to advance last_update. However simply incrementing
last_update is not sufficient - there are scenarios where the
needed version of a missing object has to be updated.

If the shard is already missing object X at version V1 and there was
a partial write at V2 that did not update the shard, it does not need
to retain the log entry, but it does need to update the missing list
to say it needs V2 rather than V1. This ensures all shards report
a need for an object at the same version and avoids an assert in
MissingLoc::add_active_missing when the primary is trying to
combine the missing lists from all the shards to work out what has
to be recovered. Avoiding applying pwlc during the early phase
of the peering process ensures the missing list gets updated.

However if a shard is not missing object X and there was a partial
write at V2 that did not update the shard then at the end of peering
it is still necessary to advance last_update by applying pwlc. This
ensures that in later peering cycles the code does not change its
mind and think the shard is now missing object X.

The fix is to be more sophisticated about when pwlc can be used
to advance last_update for a recovering shard. The code now
passes in a parameter indicating whether we are in the early
(pre activate) or later phase of peering. This also means that
additional calls to apply_pwlc are needed when peering gets to
activating and is searching for missing to make updates that were
not made earlier.

Fixes: https://tracker.ceph.com/issues/73249
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
@bill-scales
Contributor Author

Rebased. Addressed a trivial merge conflict, plus s/peering_state/peering_state_t/g to address a minor review comment.

@bill-scales
Contributor Author

jenkins test make check arm64

@ljflores
Member

Testing in here: https://tracker.ceph.com/issues/73711


@JonBailey1993
Contributor

Hey @bill-scales,

We came across an instance of what looks to be https://tracker.ceph.com/issues/73249 during qa testing, which I believe is the issue you were fixing with this.

Teuthology run:
/a/skanta-2025-11-05_00:00:19-rados-wip-bharath7-testing-2025-11-04-1337-distro-default-smithi/8584435

Call stack:

 ceph version 20.3.0-3873-g2b46a960 (2b46a960734d02294090e4c816345268f495655a) tentacle (dev - RelWithDebInfo)
 1: /lib64/libc.so.6(+0x3ea60) [0x7fe168c3ea60]
 2: /lib64/libc.so.6(+0x8c0fc) [0x7fe168c8c0fc]
 3: raise()
 4: abort()
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x17f) [0x56411b2bbf08]
 6: ceph-osd(+0x41b6d3) [0x56411b2576d3]
 7: (PeeringState::activate(ceph::os::Transaction&, unsigned int, PeeringCtxWrapper&)+0x12af) [0x56411b74f49f]
 8: (PeeringState::Active::Active(boost::statechart::state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::my_context)+0x268) [0x56411b77c108]
 9: ceph-osd(+0x140274b) [0x56411c23e74b]
 10: ceph-osd(+0x94b8f5) [0x56411b7878f5]
 11: ceph-osd(+0x6ea3d0) [0x56411b5263d0]
 12: (PeeringState::activate_map(PeeringCtx&)+0xf0) [0x56411b732610]
 13: (PG::handle_activate_map(PeeringCtx&, unsigned int)+0x51) [0x56411b5372a1]
 14: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x875) [0x56411b4afb05]
 15: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x267) [0x56411b4baec7]
 16: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x52) [0x56411b7269a2]
 17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x8bc) [0x56411b4d0f6c]
 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x23a) [0x56411ba1854a]
 19: ceph-osd(+0xbdcb04) [0x56411ba18b04]
 20: /lib64/libc.so.6(+0x8a3b2) [0x7fe168c8a3b2]
 21: /lib64/libc.so.6(+0x10f430) [0x7fe168d0f430]

Could you please take a look?

@ljflores
Member

Add the "needs-qa" label back when the above comment is addressed.

@bill-scales
Contributor Author

Test failure during the QA run shows that this fix has improved things but still isn't quite right. A bit more tuning is required...

@rzarzynski
Contributor

@bill-scales: DNM for now?

@ljflores ljflores added the DNM label Jan 27, 2026
@ljflores
Member

Added DNM for now- let's add "needs-qa" when ready!
