osd: Optimized EC missing list not updated on recovering shard#65747

Open
bill-scales wants to merge 1 commit into ceph:main from bill-scales:issue_73249

Conversation

@bill-scales
Contributor

@bill-scales bill-scales commented Oct 1, 2025

Shards that are recovering (last_complete != last_update) use pwlc to advance last_update for writes that did not affect the shard. However, simply incrementing last_update means that the primary doesn't send the shard the log entries it missed, and consequently the shard cannot update its missing list.

If the shard is already missing object X at version V1 and there was a partial write at V2 that did not update the shard, it does not need to retain the log entry, but it does need to update the missing list to say it needs V2 rather than V1. This ensures all shards report a need for an object at the same version and avoids an assert in MissingLoc::add_active_missing when the primary is trying to combine the missing lists from all the shards to work out what has to be recovered.

The fix is to avoid applying pwlc when last_complete != last_update; this forces the primary to send the log to the recovering shard, which can then update its missing list (discarding the log entries, as they are partial writes).

Fixes: https://tracker.ceph.com/issues/73249

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
You must only issue one Jenkins command per comment; Jenkins does not understand comments with more than one command.

@bill-scales bill-scales requested a review from a team as a code owner October 1, 2025 15:10
@github-actions github-actions bot added the core label Oct 1, 2025
@bill-scales bill-scales marked this pull request as draft October 1, 2025 15:10
@bill-scales
Contributor Author

I think this is a fix for https://tracker.ceph.com/issues/73249. Doing some local testing to check for other regressions, and waiting for shaman to recover so I can kick off some teuthology tests.

@bill-scales
Contributor Author

Test results so far for the latest version of the fix:

https://pulpito.ceph.com/billscales-2025-10-08_07:00:25-rados:thrash-erasure-code-optimizations-main-distro-default-smithi/
https://pulpito.ceph.com/billscales-2025-10-08_10:00:53-rados:thrash-erasure-code-optimizations-main-distro-default-smithi/

Lots of failed jobs due to scrubbing generating false data-inconsistency warnings, a few infrastructure issues, one recreate of the bug that Alex has fixed, one recreate of the stats bug that Jon has fixed, and one failed run due to an over-aggressive error inject.

@bill-scales bill-scales requested a review from aainscow October 9, 2025 07:34
@bill-scales bill-scales marked this pull request as ready for review October 9, 2025 07:35
Contributor

@aainscow aainscow left a comment


LGTM

@bill-scales
Contributor Author

https://pulpito.ceph.com/billscales-2025-10-09_06:16:15-rados:thrash-erasure-code-optimizations-main-distro-default-smithi/

Other than 38 scrub data-inconsistency reports on otherwise successful runs, which Jon has a fix for, a clean run.

Contributor

@rzarzynski rzarzynski left a comment


LGTM, just a nit.

    stats_last_update[from] = oinfo.last_update;
  - update_peer_info(from, oinfo);
  + update_peer_info(from, oinfo, BEFORE_ACTIVATE);


ACK.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@yuriw
Contributor

yuriw commented Oct 29, 2025

@bill-scales pls rebase

Shards that are recovering (last_complete != last_update) that
skip transactions and log entries because of partial writes are
using pwlc to advance last_update. However simply incrementing
last_update is not sufficient - there are scenarios where the
needed version of a missing object has to be updated.

If the shard is already missing object X at version V1 and there was
a partial write at V2 that did not update the shard, it does not need
to retain the log entry, but it does need to update the missing list
to say it needs V2 rather than V1. This ensures all shards report
a need for an object at the same version and avoids an assert in
MissingLoc::add_active_missing when the primary is trying to
combine the missing lists from all the shards to work out what has
to be recovered. Avoiding applying pwlc during the early phase
of the peering process ensures the missing list gets updated.

However if a shard is not missing object X and there was a partial
write at V2 that did not update the shard then at the end of peering
it is still necessary to advance last_update by applying pwlc. This
ensures that in later peering cycles the code does not change its
mind and think the shard is now missing object X.

The fix is to be more sophisticated about when pwlc can be used
to advance last_update for a recovering shard. The code now
passes in a parameter indicating whether we are in the early
(pre activate) or later phase of peering. This also means that
additional calls to apply_pwlc are needed when peering gets to
activating and is searching for missing to make updates that were
not made earlier.

Fixes: https://tracker.ceph.com/issues/73249
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
@bill-scales
Contributor Author

Rebased. Addressed a trivial merge conflict, plus s/peering_state/peering_state_t/g to address a minor review comment.

@bill-scales
Contributor Author

jenkins test make check arm64

@ljflores
Member

Testing in here: https://tracker.ceph.com/issues/73711


@JonBailey1993
Contributor

Hey @bill-scales,

We came across an instance of what looks to be https://tracker.ceph.com/issues/73249 during qa testing, which I believe is the issue you were fixing with this.

Teuthology run:
/a/skanta-2025-11-05_00:00:19-rados-wip-bharath7-testing-2025-11-04-1337-distro-default-smithi/8584435

Call stack:

 ceph version 20.3.0-3873-g2b46a960 (2b46a960734d02294090e4c816345268f495655a) tentacle (dev - RelWithDebInfo)
 1: /lib64/libc.so.6(+0x3ea60) [0x7fe168c3ea60]
 2: /lib64/libc.so.6(+0x8c0fc) [0x7fe168c8c0fc]
 3: raise()
 4: abort()
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x17f) [0x56411b2bbf08]
 6: ceph-osd(+0x41b6d3) [0x56411b2576d3]
 7: (PeeringState::activate(ceph::os::Transaction&, unsigned int, PeeringCtxWrapper&)+0x12af) [0x56411b74f49f]
 8: (PeeringState::Active::Active(boost::statechart::state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::my_context)+0x268) [0x56411b77c108]
 9: ceph-osd(+0x140274b) [0x56411c23e74b]
 10: ceph-osd(+0x94b8f5) [0x56411b7878f5]
 11: ceph-osd(+0x6ea3d0) [0x56411b5263d0]
 12: (PeeringState::activate_map(PeeringCtx&)+0xf0) [0x56411b732610]
 13: (PG::handle_activate_map(PeeringCtx&, unsigned int)+0x51) [0x56411b5372a1]
 14: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x875) [0x56411b4afb05]
 15: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x267) [0x56411b4baec7]
 16: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x52) [0x56411b7269a2]
 17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x8bc) [0x56411b4d0f6c]
 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x23a) [0x56411ba1854a]
 19: ceph-osd(+0xbdcb04) [0x56411ba18b04]
 20: /lib64/libc.so.6(+0x8a3b2) [0x7fe168c8a3b2]
 21: /lib64/libc.so.6(+0x10f430) [0x7fe168d0f430]

Could you please take a look?

@ljflores
Member

Add the "needs-qa" label back when the above comment is addressed.

@bill-scales
Contributor Author

Test failure during the QA run shows that this fix has improved things but still isn't quite right. A bit more tuning is required...

@rzarzynski
Contributor

@bill-scales: DNM for now?

@ljflores ljflores added the DNM label Jan 27, 2026
@ljflores
Member

Added DNM for now- let's add "needs-qa" when ready!
