
osd: Multiple fixes to optimized EC and peering #63873

Merged
ljflores merged 42 commits into ceph:main from aainscow:ec_fixpack_pr
Jul 9, 2025

Conversation

@aainscow
Contributor

@aainscow aainscow commented Jun 11, 2025

This PR is a result of several weeks/months of teuthology tests for EC, running in a branch where the optimised EC is always on.

The commits themselves describe the individual problems that are being fixed and each fix is relatively self contained. Where multiple attempts at a fix were made, these have been squashed, but largely each commit reflects the result of a teuthology test failure.

There is one commit which is a collection of cosmetic/debug/typo fixes, to avoid spreading these out over the other commits.

Tracker: https://tracker.ceph.com/issues/71814

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@aainscow aainscow requested review from a team as code owners June 11, 2025 08:16
@aainscow aainscow requested review from bill-scales and rzarzynski and removed request for a team June 11, 2025 08:16
@aainscow aainscow changed the title osd: Multiple fixes to EC and peering osd: Multiple fixes to optimized EC and peering Jun 11, 2025
@ronen-fr
Contributor

Thanks for the clear comments (and for the code quality in general...).
LGTM. Some minor comments.

@ronen-fr
Contributor

LGTM. Minor notes still left to respond/fix, but not important enough to delay approval.

@ronen-fr ronen-fr left a comment

see remaining notes

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

aainscow added 12 commits July 1, 2025 13:03
The op.is_delete() function only returns true if the op is not ALSO
doing something else (e.g. a clone). This causes issues with clearing
the new EC extent cache.

Also improve some debug and code cleanliness.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
…ized EC.

The problem was that incorrect shard_versions were being detected
by scrub. The comment in the code explains the detailed problem
and solution.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Code changes to prevent create-then-erase of an empty shard.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Inserting the first parity buffer was causing the ro-range within the SEM to be incorrectly calculated.
Simple fix, and I have added some unit tests to defend against this error in the future.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Some client OPs are able to generate transactions which delete an
object and then write it again. This is used by the copy-from ops.

If such a write is not 4k aligned, then the new EC code was incorrectly doing
a read-modify-write on the misaligned 4k chunk.  This causes some
garbage to be written to the backend OSD, off the end of the
object.  This is only a problem if the object is later extended
without the end being written.

The problematic sequence is:

1. Create two objects (A and B) of size X and Y, where:
   X > Y, (Y % 4096) != 0

2. copy_from OP B -> A
3. Extend B without writing offset Y+1

This will result in a corrupt data buffer at Y+1 without this fix.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Scrub revealed a bug whereby the non-primary shards were being given a
version number in the OI which did not match the expected version in
the authoritative OI.

A secondary issue is that all attributes were being pushed to the non-primary
shards, whereas only OI is actually needed.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>

# Conflicts:
#	src/osd/osd_types.h
If a ReadOp from EC contains two objects where one object only reads from a single shard, but
other objects require other shards, then this bug can be hit.  The fix should make it clear
what the issue is.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
In old EC, the full stripe was always read and written.  In new EC, we only attempt
to recover the shards that were missing. If an old OSD is available, the read can
be directed there.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
…ct is recovering.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
get_all_remaining_reads() is used for two purposes:

1. If a read unexpectedly fails, recover from other shards.
2. If a shard is missing, but is allowed to be missing (typically
   due to unequal shard sizes), we rely on this function to return
   no new reads, without an error.

The test failure we saw was case (2), but I think case (1) is important.

Most of the time we probably would not notice, but if insufficient redundancy
exists without the for_recovery being set, then this will result in
recovery failing.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
This was a particular edge case whereby the need to do an encode and a decode as part of recovery
was causing EC to attempt to do a decode off the end of the shard, despite this being
unnecessary.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
@aainscow
Contributor Author

aainscow commented Jul 1, 2025

@yuriw @ljflores @SrinivasaBharath Fix pushed. I have sent to ceph-ci and I will see if I can figure out how to re-run some tests!

@ljflores
Member

ljflores commented Jul 1, 2025

@aainscow ping me if you need any help with that! I will also let Bharath know that we need to redo the batch for this. So know that it is getting tested individually here: https://tracker.ceph.com/issues/71921

@ljflores
Member

ljflores commented Jul 1, 2025

Segmentation fault in Objecter code (this one, I'm not sure, but I flagged it since your PR makes some changes to the Objecter code):
/a/yuriw-2025-06-25_18:29:16-rados-wip-yuri-testing-2025-06-25-0715-distro-default-smithi/8349961

op 4502 is a stat of 86:c5c1dece:test-rados-api-smithi167-27425-40::foo:head. It was initially routed to PG 86.3 on osd 4; then we receive epoch map 359. I think we check and find nothing has changed (we certainly don't send it again), then we get a reply from osd 4 (with result -2) and hit a segv.

Looking at osd 4, we seem to be getting 102 of these stats that are all being failed, starting with tid 4334. These all have RETRY=3, which implies it's the 3rd time this tid has been sent. It seems unlikely that the client would send so many stats for the same object, so probably the client has corrupted a linked list or something and went wrong a bit earlier? PG 86.3 is a replica 2 PG [4,6].

The changes to Objecter and OSDMap are the only client-side changes; these are generally all conditional on optimized EC pools. However, there was 0a61624, where I de-referenced a null pointer when a pool had been deleted. That bug only affected the monitors, but it's an example of how the EC changes might regress other pools.

We are continuing some investigation about whether this could be related, although at the moment we think it is not.

@aainscow @bill-scales I actually just found this same issue (the Objecter issue) on a different batch, so I think that points to this being unrelated.

See https://tracker.ceph.com/issues/71930

@ljflores
Member

ljflores commented Jul 1, 2025

OSD got stuck down and never came back up (this error is a bit more ambiguous as well, but something is clearly up with the OSD in this test, so you might be able to check the osd logs for something related to your PR).
Example:
/a/yuriw-2025-06-25_18:29:16-rados-wip-yuri-testing-2025-06-25-0715-distro-default-smithi/8349833

All the osds 0-8 get "Monitor daemon marked osd.X down" messages; I think this is expected from the "ceph osd down" command. Osds 0-5,7-8 all come back in epoch 50 with logs like "osd.7 50 state: booting -> active"; this appears to be tied to clearing noup, which is allowing the osds back into the cluster. Osd 6 gets stuck at epoch 47 and never hears about epoch 48 onward, so it never comes back into the cluster. We also see messages like this, which is the OSD seeing the new map and refusing to participate because NOUP is set:

2025-06-25T19:03:31.210+0000 7f1999267640 -1 osd.5 45 osdmap NOUP flag is set, waiting for it to clear
2025-06-25T19:03:31.382+0000 7f198be3c640 -1 osd.5 46 osdmap NOUP flag is set, waiting for it to clear
2025-06-25T19:03:32.387+0000 7f198be3c640 -1 osd.5 47 osdmap NOUP flag is set, waiting for it to clear
2025-06-25T19:03:33.390+0000 7f198be3c640 -1 osd.5 48 osdmap NOUP flag is set, waiting for it to clear

For osd 6 we only see these because it doesn't hear about epoch 48:

2025-06-25T19:03:32.248+0000 7f2136e4b640 -1 osd.6 46 osdmap NOUP flag is set, waiting for it to clear
2025-06-25T19:03:32.387+0000 7f212622a640 -1 osd.6 47 osdmap NOUP flag is set, waiting for it to clear
Mon A isn't sending osd 6 map 48, though it does send it to all the other osds. Somehow osd 6 has upset the monitor.

Also found this one in a different batch, so likely unrelated: https://tracker.ceph.com/issues/71931

@aainscow @bill-scales

@ljflores
Member

ljflores commented Jul 8, 2025

@ljflores
Member

ljflores commented Jul 8, 2025

jenkins test make check arm64

@ljflores
Member

ljflores commented Jul 8, 2025

jenkins test signed

@ljflores
Member

ljflores commented Jul 8, 2025

@aainscow can you fix this commit with a Signed-off-by line? 231ac30

    Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
@aainscow
Contributor Author

aainscow commented Jul 8, 2025

@aainscow can you fix this commit with a Signed-off-by line? 231ac30

Done, thanks. I will merge once tests pass, if I am still awake.

@aainscow aainscow requested a review from rzarzynski July 9, 2025 10:50
@aainscow
Contributor Author

aainscow commented Jul 9, 2025

All ready to go. I want to double check with @rzarzynski that he is good with the merging.

@ljflores ljflores merged commit 38578de into ceph:main Jul 9, 2025
13 checks passed

-  void add(const pg_log_entry_t& e, bool applied = true) {
+  void add(const pg_log_entry_t& e, enum NonPrimary nonprimary, bool applied, pg_info_t *info, LogEntryHandler *h) {
     mark_writeout_from(e.version);

the 'enum' should be removed in some future fix
