
osd: Multiple fixes to optimized EC and peering #63873

Merged
ljflores merged 42 commits into ceph:main from aainscow:ec_fixpack_pr
Jul 9, 2025

Conversation

@aainscow
Contributor

@aainscow aainscow commented Jun 11, 2025

This PR is a result of several weeks/months of teuthology tests for EC, running in a branch where the optimised EC is always on.

The commits themselves describe the individual problems that are being fixed and each fix is relatively self contained. Where multiple attempts at a fix were made, these have been squashed, but largely each commit reflects the result of a teuthology test failure.

There is one commit which is a collection of cosmetic/debug/typo fixes, to avoid spreading these out over the other commits.

Tracker: https://tracker.ceph.com/issues/71814

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@aainscow aainscow requested review from a team as code owners June 11, 2025 08:16
@aainscow aainscow requested review from bill-scales and rzarzynski and removed request for a team June 11, 2025 08:16
@aainscow aainscow changed the title osd: Multiple fixes to EC and peering osd: Multiple fixes to optimized EC and peering Jun 11, 2025
@ronen-fr
Contributor

Thanks for the clear comments (and for the code quality in general...).
LGTM. Some minor comments.

@ronen-fr
Contributor

LGTM. Minor notes still left to respond/fix, but not important enough to delay approval.

@ronen-fr ronen-fr left a comment

see remaining notes

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

aainscow added 12 commits July 1, 2025 13:03
The op.is_delete() function only returns true if the op is not ALSO
doing something else (e.g. a clone). This causes issues with clearing
the new EC extent cache.

Also improve some debug and code cleanliness.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
…ized EC.

The problem was that incorrect shard_versions were being detected
by scrub. The comment in the code explains the detailed problem
and solution.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Code changes to prevent create-then-erase of an empty shard.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Inserting the first parity buffer was causing the ro-range within the SEM to be incorrectly calculated.
Simple fix, and I have added some unit tests to defend against this error in the future.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Some client OPs are able to generate transactions which delete an
object and then write it again. This is used by the copy-from ops.

If such a write is not 4k aligned, then the new EC code was incorrectly doing
a read-modify-write on the misaligned 4k chunk.  This causes some
garbage to be written to the backend OSD, off the end of the
object.  This is only a problem if the object is later extended
without the end being written.

The problematic sequence is:

1. Create two objects (A and B) of size X and Y, where:
   X > Y, (Y % 4096) != 0

2. copy_from OP B -> A
3. Extend B without writing offset Y+1

This will result in a corrupt data buffer at Y+1 without this fix.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Scrub revealed a bug whereby the non-primary shards were being given a
version number in the OI which did not match the expected version in
the authoritative OI.

A secondary issue is that all attributes were being pushed to the non-primary
shards, whereas only OI is actually needed.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>

# Conflicts:
#	src/osd/osd_types.h
If a ReadOp from EC contains two objects where one object only reads from a single shard, but
other objects require other shards, then this bug can be hit.  The fix should make it clear
what the issue is.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
In old EC, the full stripe was always read and written.  In new EC, we only attempt
to recover the shards that were missing. If an old OSD is available, the read can
be directed there.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
…ct is recovering.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
get_all_remaining_reads() is used for two purposes:

1. If a read unexpectedly fails, recover from other shards.
2. If a shard is missing, but is allowed to be missing (typically
   due to unequal shard sizes), we rely on this function to return
   no new reads, without an error.

The test failure we saw was case (2), but I think case (1) is important.

Most of the time we probably would not notice, but if insufficient redundancy
exists without the for_recovery being set, then this will result in
recovery failing.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
This was a particular edge case whereby the need to do an encode and a decode as part of recovery
was causing EC to attempt to do a decode off the end of the shard, despite this being
unnecessary.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
@aainscow
Contributor Author

aainscow commented Jul 1, 2025

@yuriw @ljflores @SrinivasaBharath Fix pushed. I have sent to ceph-ci and I will see if I can figure out how to re-run some tests!

@ljflores
Member

ljflores commented Jul 1, 2025

@aainscow ping me if you need any help with that! I will also let Bharath know that we need to redo the batch for this. So know that it is getting tested individually here: https://tracker.ceph.com/issues/71921

@ljflores
Member

ljflores commented Jul 1, 2025

Segmentation fault in Objecter code (this one, I'm not sure, but I flagged it since your PR makes some changes to the Objecter code):
/a/yuriw-2025-06-25_18:29:16-rados-wip-yuri-testing-2025-06-25-0715-distro-default-smithi/8349961

op 4502 is a stat of 86:c5c1dece:test-rados-api-smithi167-27425-40::foo:head. It was initially routed to PG 86.3 on osd 4; then we receive epoch map 359. I think we check and find nothing has changed (we certainly don't send it again), then we get a reply from osd 4 (with result -2) and hit a segv.

Looking at osd 4, we seem to be getting 102 of these stats that are all being failed, starting with tid 4334. These all have RETRY=3, which implies it's the 3rd time this tid has been sent. It seems unlikely that the client would send so many stats for the same object, so probably the client has corrupted a linked list or something and went wrong a bit earlier? PG 86.3 is a replica 2 PG [4,6].

The changes to Objecter and OSDMap are the only client-side changes; these are generally all conditional on optimized EC pools. However, there was 0a61624, where I de-referenced a null pointer when a pool had been deleted. That bug only affected the monitors, but it's an example of how the EC changes might regress other pools.

We are continuing some investigation about whether this could be related, although at the moment we think it is not.

@aainscow @bill-scales I actually just found this same issue (the Objecter issue) on a different batch, so I think that points to this being unrelated.

See https://tracker.ceph.com/issues/71930

@ljflores
Member

ljflores commented Jul 1, 2025

OSD got stuck down and never came back up (this error is a bit more ambiguous as well, but something is clearly up with the OSD in this test, so you might be able to check the osd logs for something related to your PR).
Example:
/a/yuriw-2025-06-25_18:29:16-rados-wip-yuri-testing-2025-06-25-0715-distro-default-smithi/8349833

All the osds 0-8 get "Monitor daemon marked osd.X down" messages; I think this is expected from the "ceph osd down" command. Osds 0-5,7-8 all come back in epoch 50 with logs like "osd.7 50 state: booting -> active"; this appears to be tied to clearing noup, which is allowing the osds back into the cluster. Osd 6 gets stuck at epoch 47 and never hears about epoch 48 onward, so it never comes back into the cluster. We also see messages like this, which is the OSD seeing the new map and refusing to participate because NOUP is set:

2025-06-25T19:03:31.210+0000 7f1999267640 -1 osd.5 45 osdmap NOUP flag is set, waiting for it to clear
2025-06-25T19:03:31.382+0000 7f198be3c640 -1 osd.5 46 osdmap NOUP flag is set, waiting for it to clear
2025-06-25T19:03:32.387+0000 7f198be3c640 -1 osd.5 47 osdmap NOUP flag is set, waiting for it to clear
2025-06-25T19:03:33.390+0000 7f198be3c640 -1 osd.5 48 osdmap NOUP flag is set, waiting for it to clear

For osd 6 we only see these because it doesn't hear about epoch 48:

2025-06-25T19:03:32.248+0000 7f2136e4b640 -1 osd.6 46 osdmap NOUP flag is set, waiting for it to clear
2025-06-25T19:03:32.387+0000 7f212622a640 -1 osd.6 47 osdmap NOUP flag is set, waiting for it to clear
Mon A isn't sending osd 6 map 48, though it does send it to all the other osds. Somehow osd 6 has upset the monitor.

Also found this one in a different batch, so likely unrelated: https://tracker.ceph.com/issues/71931

@aainscow @bill-scales

@ljflores
Member

ljflores commented Jul 8, 2025

@ljflores
Member

ljflores commented Jul 8, 2025

jenkins test make check arm64

@ljflores
Member

ljflores commented Jul 8, 2025

jenkins test signed

@ljflores
Member

ljflores commented Jul 8, 2025

@aainscow can you fix this commit with a Signed-off-by line? 231ac30

    Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
@aainscow
Contributor Author

aainscow commented Jul 8, 2025

@aainscow can you fix this commit with a Signed-off-by line? 231ac30

Done, thanks. I will merge once tests pass, if I am still awake.

@aainscow aainscow requested a review from rzarzynski July 9, 2025 10:50
@aainscow
Contributor Author

aainscow commented Jul 9, 2025

All ready to go. I want to double check with @rzarzynski that he is good with the merging.

@ljflores ljflores merged commit 38578de into ceph:main Jul 9, 2025
13 checks passed

-  void add(const pg_log_entry_t& e, bool applied = true) {
+  void add(const pg_log_entry_t& e, enum NonPrimary nonprimary, bool applied, pg_info_t *info, LogEntryHandler *h) {
     mark_writeout_from(e.version);

the 'enum' should be removed in some future fix
