Skip to content

tentacle: Optimized Erasure Coding - Fixpack 2#65411

Merged
ljflores merged 29 commits intoceph:tentaclefrom
aainscow:wip-72561-tentacle
Sep 16, 2025
Merged

tentacle: Optimized Erasure Coding - Fixpack 2#65411
ljflores merged 29 commits intoceph:tentaclefrom
aainscow:wip-72561-tentacle

Conversation

@aainscow
Copy link
Contributor

@aainscow aainscow commented Sep 5, 2025

backport tracker: https://tracker.ceph.com/issues/72561


backport of #64501
parent tracker: https://tracker.ceph.com/issues/72077

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

@aainscow aainscow requested a review from a team as a code owner September 5, 2025 14:59
@aainscow aainscow added this to the tentacle milestone Sep 5, 2025
@aainscow aainscow mentioned this pull request Sep 5, 2025
14 tasks
bill-scales and others added 23 commits September 7, 2025 23:59
1. Refactor the code that applies pwlc to update info and log so that there
is one function rather than multiple copies of the code.

2. pwlc information is used to track shards that have not been updated by
partial writes. It is used to advance last_complete (and last_update and
the log head) to account for log entries that the shard missed. It was
only being applied if last_complete matched the range of partial writes
recorded in pwlc. When a shard has missing objects last_complete is
deliberately held before the oldest need, this stops pwlc being applied.
This is wrong - pwlc can still try and update last update and the log
head even if it cannot update last_complete.

3. When the primary receives info (and pwlc) information from OSD x(y)
it uses the pwlc information to update x(y)'s info. During backfill
there may be other shards z(y) which should also be updated using the
pwlc information.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit c6d2bb9)
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 1352868)
When recovering attributes, we read them from the first potential primary, then
if that read failures, attempt to read from another potential primary.

The problem is that the code which calculates which shards to read for a recovery
only takes into account *data* and not where the attributes are.  As such, if the
second read only required a non-primary, then the attribute read fails and the
OSD panics.

The fix is to detect this scenario and perform an empty read to that shard, which
the attribute-read code can use for attribute reads.

Code was incorrectly interpreting a failed attribute read on recovery as
meaning a "fast_read". Also, no attribute recovery would occur in this case.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 98eae78)
When the primary shard of an optimized EC pool does not have
a copy of the log it may need to repeat the GetLog peering
step twice, the first time to get a full copy of a log from
a shard that sees all log entries and then a second time
to get the "best" log from a nonprimary shard which may
have a partial log due to partial writes.

A side effect of repeating GetLog is that the missing
list is collected for both the "best" shard and the
shard that provides a full copy of the log. This later
missing list confuses later steps in the peering
process and may cause this shard to complete writes
and end up diverging from the primary. Discarding
this missing list causes Peering to behave the same as if
the GetLog step did not need to be repeated.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit d2ba0f9)
Fix 1:

These are multiple fixes that affected the same code. To simplify review
and understanding of the changes, they have been merged into a single
commit.

What happened in defect is (k,m = 6,4)

1. State is: fast_reads = true, shards 0,4,5,6,7,8 are available. Shard 1 is missing this object.
2. Shard 5 only needs zeros, so read is dropped. Other sub read message sent.
3. Object on shard 1 completes recovery (so becomes not-missing)
4. Read completes, complete notices that it only has 5 reads, so calculates what it needs to re-read.
5. Calculates it needs 0,1,4,5,6,7 - and so wants to read shard 1.
6. Code assumes that enough reads should have been performed, so refused to do another reads and instead generates an EIO.

The problem here is some "lazy" code in step (4).  What is should be doing is working out that it
can use the zero buffers and not calling get_remaining_reads().  Instead, what it attempts to do is
call get_remaining_reads() and if there is no work to do, then it assumes it has everything
already and completes the read with success.  This assumption mostly works - but in this
combination of fast_reads picking less than k shards to read from AND an object completing
recovery in parallel causes issue.

The solution is to wait for all reads to complete and then assume that any remaining zero buffers
count as completed reads.  This should then cause the plugin to declare "success"

Fix 2:

There are decodes in new EC which can occur when less than k
shards have been read.  These reads in the last stripe, where
for decoding purposes, the data past the end of the shard can
be considered zeros. EC does not read these, but instead relies
on the decode function inventing the zero buffers.

This was working correctly when fast reads were turned off, but
if such an IO was encountered with fast reads turned on the
logic was disabled and the IO returns an EIO.

This commit fixes that logic, so that if all reads have complete
and send_all_remaining_reads conveys that no new reads were
requested, then decode will still be possible.

FIX 3:

If reading the end of an object with unequally sized objects,
we pad out the end of the decode with zeros, to provide
the correct data to the plugin.

Previously, the code decided not to add the zeros to "needed"
shards.  This caused a problem where for some parity-only
decodes, an incomplete set of zeros was generated, fooling the
_decode function into thinking that the entire shard was zeros.

In the fix, we need to cope with the case where the only data
needed from the shard is the padding itself.

The comments around the new code describe the logic behind
the change.

This makes the encode-specific use case of padding out the
to-be-decoded shards unnecessary, as this is performed by the
pad_on_shards function below.

Also fixing some logic in calculating the need_set being passed
to the decode function did not contain the extra shards needed
for the decode. This need_set is actually ignored by all the
plugins as far as I know, but being wrong does not seem
helpful if its needed in the future.

Fix 4: Extend reads when recovering parity

Here is an example use case which was going wrong:
1. Start with 3+2 EC, shards 0,3,4 are 8k shard 1,2 is 4k
2. Perform a recovery, where we recover 2 and 4.  2 is missing, 4 can be copied from another OSD.
3. Recovery works out that it can do the whole recovery with shards 0,1,3. (note not 4)
4. So the "need" set is 0,1,3, the "want" set is 2,4 and the "have" set is 0,1,3,4,5
5. The logic in get_all_avail_shards then tries to work out the extents it needs - it only. looks at 2, because we "have" 4
6. Result is that we end up reading 4k on 0,1,3, then attempt to recover 8k on shard 4 from this... which clearly does not work.

Fix 5: Round up padding to 4k alignment in EC

The pad_on_shards was not aligning to 4k.  However, the decode/encode functions were. This meant that
we assigned a new buffer, then added another after - this should be faster.

Fix 6: Do not invent encode buffers before doing decode.

In this bug, during recovery, we could potentially be creating
unwanted encode buffers and using them to decode data buffers.

This fix simply removes the bad code, as there is new code above
which is already doing the correct action.

Fix 7: Fix miscompare with missing decodes.

In this case, two OSDs failed at once. One was replaced and the other was not.

This caused us to attempt to encode a missing shard while another shard was missing, which
caused a miscompare because the recovery failed to do the decode properly before doing an encode.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit c116b86)
PeeringState::proc_replica_log needs to apply pwlc before
calling PGLog so that any partial writes that have occurred
are taken into account when working out where a replica/stray
has diverged from the primary.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 6c3c0a8)
This was a failed test, where the primary concluded that all objects were present
despite one missing object on the non primary shard.

The problem was caused because the log entries are sent to the unwritten shards if that
shard is missing in order to update the version number in the missing object. However,
the log entry should not actually be added to the log.

Further testing showed there are other scenarios where log entries are sent to
unwritten shards (for example a clone + partial_write in the same transaction),
these scenarios do not want to add the log entry either.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 24cd772)
The CRCs were being invalidate at the wrong point, so the last CRC was
not being invalidated.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 564b53c)
The clone logic in the truncate was only cloning from the truncate
to the end of the pre-truncate object. If the next shard was being
truncated to a shorter length (which is common), then this shard
has a larger clone.

The rollback, however, can only be given a single range, so it was
given a range which covers all clones.  The problem is that if shard
0 is rolled back, then some empty space from the clone was copied
to shard 0.

Fix is easy - calculate the full clone length and apply to all shards, so it matches the rollback.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 5d7588c)
rollback_info_trimmed_to changes

PGLog::rewind_divergent_log was only causing the log to be marked
dirty and checkpointed if there were divergent entries. However
after a PG split it is possible that the log can be rewound
modifying crt and/or rollback_info_trimmed_to without creating
divergent entries because the entries being rolled back were
all split into the other PG.

Failing to checkpoint the log generates a window where if the OSD
is reset you can end up with crt (and rollback_info_trimmed_to) > head.
One consequence of this is asserts like
ceph_assert(rollback_info_trimmed_to == head); firing.

Fixes: https://tracker.ceph.com/issues/55141

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit d8f78ad)
…issing

Non-primary shards may not be updated because of partial writes. This means
that the OI verison for an object on these shards may be stale. An assert
in read_log_and_missing was checking that the OI version matched the have
version in a missing entry. The missing entry calculates the have version
using the prior_version from a log entry, this does not take into account
partial writes so can be ahead of the stale OI version.

Relax the assert for optimized pools to require have >= oi.version

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 74e138a)
Optimized EC pools were blocking clean_temps from clearing pg_temp
when up == acting but up_primary != acting_primary because optimized
pools sometimes use pg_temp to force a change of primary shard.

However this can block merges which require the two PGs being
merged to have the same primary. Relax clean_temps to permit
pg_temp to be cleared so long as the new primary is not a
non-primary shard.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit ce53276)
Fix bug with routing to an acting set like [None,Y,X,X]p(X)
for a 3+1 optimzed pool where osd X is representing more
than one shard. For an optimized EC pool we want it to
choose shard 3 because shard 2 is a non-primary. If we
just search the acting set for the first OSD that matches
X this will pick shard 2, so we have to convert the order
to primary's first, then find the matching OSD and then
convert this back to the normal ordering to get shard 3.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 3310f97)
Fix some unusual scenarios where peering was incorrectly
declaring that objects were missing on stray shards. When
proc_master_log rolls forward partial writes it need to
update pwlc exactly the same way as if the write had been
completed. This ensures that stray shards that were not
updated because of partial writes do not cause objects
to be incorrectly marked as missing.

The fix also means some code in GetMissing which was trying
to do a similar thing for shards that were acting,
recovering or backfilling (but not stray) can be deleted.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 83a9c0a)
Scrub detected a bug where if an object was truncated to a size where the first
shard is smaller than the chunk size (only possible for >4k chunk sizes), then
the coding shards were being aligned to the chunk size, rather than to 4k.

This fixes changes how the write plan is calculated to write the correct size.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit a39a309)
The versions on partial shards are permitted to be behind, so we need
to relax several asserts, this is another example.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 0c89e7e)
…nterval.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 3493d13)
If multiple object are being read as part of the same recovery (this happens
when recovering some snapshots) and a read fails, then some reads from other
shards will be necessary.  However, some objects may not need to read. In this
case it is only important that at least one read message is sent, rather than
one read message per object is sent.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 9f9ea6d)
…ent to an object with the whiteout flag set.

Signed-off-by: Jon Bailey <jonathan.bailey1@ibm.com>
(cherry picked from commit 89fef78)
1. proc_master_log can roll forward full-writes that have
been applied to all shards but not yet completed. Add a
new function consider_adjusting_pwlc to roll-forward
pwlc. Later partial_write can be called to process the
same writes again. This can result in pwlc being rolled
backwards. Modify partial_write so it does not undo pwlc.

2. At the end of proc_master_log we want the new
authorative view of pwlc to persist - this may be
better or worse than the stale view of pwlc held by
other shards. consider_rollback_pwlc sometimes
updated the epoch in the toversion (second value of the
range fromversion-toverison). We now always do this.
Updating toversion.epoch causes problems because this
version sometimes gets copied to last_update and
last_complete - using the wrong epoch here messes
everything up in later peering cycles. Instead we
now update fromversion.epoch. This requires changes
to apply_pwlc and an assert in Stray::react(const MInfoRec&)

3. Calling apply_pwlc at the end of proc_master_log is
too early - updating last_update and last_complete here
breaks GetMissing. We need to do this later when activating
(change to search_missing and activate)

4. proc_master_log is calling partial_write with the
wrong previous version - this causes problems after a
split when the log is sparsely populated.

5. merging PGs is not setting up pwlc correctly which
can cause issues in future peering cycles. The
pwlc can simply be reset, we need to update the epoch
to make sure this view of pwlc persists vs stale
pwlc from other shards.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 0b8593a)
…p as nop

Optimized EC pools store pgtemp with primary shards first, this was not
being taken into account by OSDMonitor::preprocess_pgtemp which meant
that the change of pgtemp from [None,2,4] to [None,4,2] for a 2+1 pool
was being rejected as a nop because the primary first encoded version
of [None,2,4] is [None,4,2].

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 00aa193)
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 1c24765)
When a shard is backfilling it gets given log entries
for partial writes even if they do not apply to the
shard. The code was updating the missing list but
discarding the log entry. This is wrong because the
update can be rolled backwards and the log entry is
required to revert the update to the missing list.
Keeping the log entry has a small but insignificant
performance impact.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 1fa5302)
aainscow and others added 6 commits September 8, 2025 00:10
There were some bugs in attribute reads during recovery in optimised
EC where the attribute read failed. There were two scenarios:

1. It was not necessary to do any further reads to recover the data. This
can happen during recovery of many shards.
2. The re-read could be honoured from non-primary shards. There are
sometimes multiple copies of the shard whcih can be used, so a failed read
on one OSD can be replaced by a read from another.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 417fb71)
Explanation of bug which is being fixed:

Log entry 204'784 is an error - "_rollback_to deleting head on smithi17019940-573 because
got ENOENT|whiteout on find_object_context" so this log entry is generated outside of EC by
PrimaryLogPG. It should be applied to all shards, however osd 13(2) was a little slow and
the update got interrupted by a new epoch so it didn't apply it. All the other shards
marked it as applied and completed (there isn't the usual interlock that EC has of making
sure all shards apply the update before any complete it).

We then processed 4 partial writes applying and completing them (they didn't update osd
13(2)), then we have a new epoch and go through peering. Peering says osd 13(2) didn't see
update 204'784 (it didn't) and therefore the error log entry and the 4 partial writes need
to be rolled back. The other shards had completed those 4 partial writes so we end up with
4 missing objects on all the shards which become unfound objects.

I think the underlying bug means that log entry 204'784 isn't really complete and may
"disappear" from the log in a subsequent peering cycle. Trying to forcefully rollback a
logged error doesn't generate a missing object or a miscompare, so the consequences of the
bug are hidden. It is however tripping up the new EC code where proc_master_log is being
much stricter about what a completed write means.

Fix:
After generating a logged error we could force the next write to EC to update metadata on
all shards even if its a partial write. This means this write won't complete unless all
shards see the logged error. This will make new EC behave the same as old EC. There is
already an interlock with EC (call_write_ordered) which is called just before generating
the log error that ensures that any in-flight writes complete before submitting the log
error. We could set a boolean flag here (at the point call_write_ordered is called is fine,
don't need to wait for the callback) to say the next write has to be to all shards. The
flag can be cleared if we generate the transactions for the next write, or we get an
on_change notification (peering will then clear up the mess)

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 4948f74)
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 846879e)
…_complete

Fix bug in apply_pwlc where the primary was advancing last_complete for a
shard doing async recorvery so that last_complete became equal to last_update
and it then thought that recovery had completed. It is only valid to advance
last_complete if it is equal to last_update.

Tidy up the logging in this function as consecutive calls to this function
often logged that it could advance on the 1st call and then that it could not
on the 2nd call. We only want one log message.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 0b19ed4)
Undo a previous attempt at a fix that made add_log_entry skip adding partial
writes to the log if the write did not update this shard. The only case where
this code path executed is when a partial write was to an object that needs
backfilling or async recovery. For async recovery we need to keep the
log entry because it is needed to update the missing list. For backfill it
doesn't harm to keep the log entry.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 9f0e883)
Shards performing backfill or async recovery receive log entries
(but not transactions) for updates to missing/yet to be backfilled
objects. These log entries get applied and completed immediately
because there is nothing that can be rolled back. This causes
pwlc to advance too early and causes problems if other shards
do not complete the update and end up rolling it backwards.

This fix sets pwlc to be invalid when such a log entry is
applied and completed and it then remains invalid until the
next interval when peering runs again. Other shards will
continue to update pwlc and any complete subset of shards
in a future interval will include at least one shard that
has continued to update pwlc

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 534fc76)
Copy link
Contributor

@bill-scales bill-scales left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Identical changes to PR 64501 except for TestECBackend.cc which also has fixes for 2 signed comparison compiler warnings

@rzarzynski
Copy link
Contributor

I agree, although there are some changes in regards to main, the backport is still trivial.

BTW: I wonder whether there is any better way to compare a backport in git. What Gemini advised, doesn't work:

$ git diff c6d2bb9f6479767927ef01f6e2871dc1bb6d0f54^..534fc76d40a86a49bfabab247d3a703cbb575e27 d2e2c7dfd8cbc188589f096586f2ba47d6fba9b4^..7ba2bf1b2b8e735ca82b9e78708a8ff3c4aac841
usage: git diff [<options>] [<commit>] [--] [<path>...]
   or: git diff [<options>] --cached [--merge-base] [<commit>] [--] [<path>...]
...

Just to the record:

$ git diff c6d2bb9f6479767927ef01f6e2871dc1bb6d0f54^..534fc76d40a86a49bfabab247d3a703cbb575e27  > /tmp/orig
2$ git diff d2e2c7dfd8cbc188589f096586f2ba47d6fba9b4^..7ba2bf1b2b8e735ca82b9e78708a8ff3c4aac841 > /tmp/new
$ diff -ur /tmp/orig /tmp/new
...
 diff --git a/src/test/osd/TestECBackend.cc b/src/test/osd/TestECBackend.cc
-index 1dd4faa6e68..ca299abd963 100644
+index 1dd4faa6e68..a554fe4c94e 100644
 --- a/src/test/osd/TestECBackend.cc
 +++ b/src/test/osd/TestECBackend.cc
-@@ -1320,6 +1320,7 @@ void test_decode(unsigned int k, unsigned int m, uint64_t chunk_size, uint64_t o
+@@ -158,12 +158,12 @@ public:
+       // for each missing shard.
+       for (auto a : available) {
+         minimum_set.insert(a);
+-        if (minimum_set.size() == data_chunk_count) {
++        if (std::cmp_equal(minimum_set.size(), data_chunk_count)) {
+           break;
+         }
+       }
+ 
+-      if (minimum_set.size() != data_chunk_count) {
++      if (std::cmp_not_equal(minimum_set.size(), data_chunk_count)) {
+         minimum_set.clear();
+         return -EIO; // Cannot recover.
+       }
+@@ -174,7 +174,6 @@ public:
+     }
+     return 0;
+   }
+-
+   [[deprecated]]
+   int minimum_to_decode(const std::set<int> &want_to_read,
+     const std::set<int> &available,
+@@ -1320,7 +1319,8 @@ void test_decode(unsigned int k, unsigned int m, uint64_t chunk_size, uint64_t o
      }
    }
  
+-  ASSERT_EQ(0, semap.decode(ec_impl, want, object_size, nullptr, true));
 +  semap.add_zero_padding_for_decode(read_request.zeros_for_decode);
-   ASSERT_EQ(0, semap.decode(ec_impl, want, object_size, nullptr, true));
++  ASSERT_EQ(0, semap.decode(ec_impl, want, object_size));
  }
  
-@@ -1476,4 +1477,4 @@ TEST(ECCommon, decode7) {
+ TEST(ECCommon, decode) {
+@@ -1476,4 +1476,4 @@ TEST(ECCommon, decode7) {
    acting_set.insert_range(shard_id_t(0), 3);
  
    test_decode(k, m, chunk_size, object_size, want, acting_set);

@ljflores
Copy link
Member

@ljflores ljflores merged commit f12c135 into ceph:tentacle Sep 16, 2025
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants