Skip to content

os/bluestore: Fix problem with EC + elastic shared blob#62816

Closed
aclamk wants to merge 2 commits intoceph:mainfrom
aclamk:aclamk-bs-fix-reshard-bug-70390
Closed

os/bluestore: Fix problem with EC + elastic shared blob#62816
aclamk wants to merge 2 commits intoceph:mainfrom
aclamk:aclamk-bs-fix-reshard-bug-70390

Conversation

@aclamk
Copy link
Contributor

@aclamk aclamk commented Apr 15, 2025

It looks like ESB exposed problems with _do_remove and ExtentMap::update() functions.

Fixes: https://tracker.ceph.com/issues/70390

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands

<< std::dec << " hit new spanning blob " << *p << dendl;
request_reshard(p->blob_start(), p->blob_end());
must_reshard = true;
} else if (p->blob->is_spanning() && p->logical_end() > end) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a feeling that resharding doesn't 100% guarantee that spanning blob goes away. It might happen or not depending on shards layout.
Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resharding does not guarantee that spanning blob goes away.
But resharding guarantees that extent will not spill over from one shard to another.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I badly formulated the statement above. My concern was rather about no guarantees that resharding eliminates "spanning" extents. I.e. ones that span shard boudary.
And indeed IMO reshard UT from a7f0829 proves my concern.
Hence I extended your PR with some additional stuff (both to implement the above UT and fix the issue), please see #65065 which IMO is a full fix for the issue.

@ifed01
Copy link
Contributor

ifed01 commented Apr 15, 2025

@aclamk it would be great if we have a reproducer for the issue in store_test or something.
Also I'd like to see RCA for https://tracker.ceph.com/issues/70390

@aclamk
Copy link
Contributor Author

aclamk commented Apr 15, 2025

@aclamk it would be great if we have a reproducer for the issue in store_test or something. Also I'd like to see RCA for https://tracker.ceph.com/issues/70390

@ifed01 @Jayaprakash-ibm is working on a teuthology job that will do such a thing.
I was unable to create direct unittest that would create problem.

@aclamk
Copy link
Contributor Author

aclamk commented Apr 23, 2025

jenkins test make check arm64

@aclamk aclamk added the aclamk-testing-phoebe bluestore testing label Apr 23, 2025
@aclamk aclamk removed the aclamk-testing-phoebe bluestore testing label Apr 24, 2025
@aclamk aclamk added the aclamk-testing-phoebe bluestore testing label May 8, 2025
@Jayaprakash-ibm
Copy link
Contributor

aclamk added 2 commits June 6, 2025 11:15
Make sure that spanning blobs are not allowed to have extents crossing
shard boundary.

Partially fixes: https://tracker.ceph.com/issues/70390

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
dirty_range used to have length = 1 byte.
This is good if whole extent is inside shard.
But this has proven not to be the case.
dirty_range(offset, length) is slower only when it crosses shard.

Partially fixes: https://tracker.ceph.com/issues/70390

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
@aclamk aclamk force-pushed the aclamk-bs-fix-reshard-bug-70390 branch from f4d361c to 4f566ea Compare June 6, 2025 11:16
@aclamk
Copy link
Contributor Author

aclamk commented Jun 18, 2025

jenkins test windows

@aclamk
Copy link
Contributor Author

aclamk commented Jun 18, 2025

jenkins test make check arm64

@ljflores ljflores requested a review from ifed01 June 24, 2025 15:58
@rzarzynski
Copy link
Contributor

@ifed01: ping.

Copy link
Contributor

@liu-chunmei liu-chunmei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@ifed01
Copy link
Contributor

ifed01 commented Aug 15, 2025

@ifed01: ping.

I made an extended fix for the issue, see #65065
The rationales behind could be found at: #62816 (comment)

@rzarzynski
Copy link
Contributor

@ifed01, @aclamk: as there is another PR, should we close this one?

@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@aclamk aclamk closed this Dec 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants