osd/ECTransaction: Remove incorrect asserts in generate_transactions#56924
osd/ECTransaction: Remove incorrect asserts in generate_transactions#56924
Conversation
Back in PR ceph#11701 when EC Overwrites were added, there was significant churn in the ECTransaction code. Several asserts were added in generate_transactions, but the code changed several times and it appears some of those asserts don't make sense in the final version of the PR. This PR removes two of those asserts. The first doesn't appear to be necessary given how the surrounding conditional block changed. The second appears to be an incorrect assert that was left over from an earlier commit that no longer should be in place given the logic that was added to handle truncate (which roughly mirrors what we do in ReplicatedBackend). It's possible the assert for obc and entry are also unnecessary, however I left those in place for now. Fixes: https://tracker.ceph.com/issues/65509 Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
|
This is currently impacting our operations, we have 2 pools inoperable because of it. It appears to be the second time this bug has taken down our cluster following a full cluster wide recovery. |
athanatos
left a comment
There was a problem hiding this comment.
Yep, that looks right to me.
|
@athanatos - do we really need QA for changes like that? |
|
@ifed01 For this change specifically, no. It's the removal of an assert and so can't actually create a failure condition. However, I prefer that core contributions go through testing regardless unless there's a specific, pressing need to merge without testing. Is there such a pressing need here? This bug is very old. @yuriw Can we get this into the next testing batch? |
@athanatos I won't comment on whether or not we've reached that threshold, but here's the failure condition that Mike mentioned in slack that might inform that decision: |
|
The most confounding part of this is the affected cluster has been in operation for 3 years without tripping this bug. I don't know what has changed in client behavior such that steady state operations triggered the bug but it caused a 52 hour outage until we could recover the cluster and stay healthy by forcibly disabling trim/discards from clients. |
I think we should convey also negligible-risk changes through QA, even for the sake of honoring the workflow.
OK, I added information about the workaround to the tracker. The ticket is High (but not Urgent), so let's backport this without delays but let's also have the due QA. |
| ceph_assert(op.truncate->first == | ||
| op.truncate->second); | ||
| ceph_assert(entry); | ||
| ceph_assert(obc); |
There was a problem hiding this comment.
A dozen lines below there is the if (obc) which seems to be always taken. Maybe it's worth a follow-up?
There was a problem hiding this comment.
@rzarzynski yeah, I considered the same thing, but ultimately decided to leave it for a future PR.
|
Hi, I was also involved in finding this. I'm trying to have it reproduced, but did not quite succeed. In the logs I find this a notable log:
The Perhaps someone has an idea what could be missing here to fully reproduce it. |
Back in PR #11701 when EC Overwrites were added, there was significant churn in the ECTransaction code. Several asserts were added in generate_transactions, but the code changed several times and it appears some of those asserts don't make sense in the final version of the PR.
This PR removes two of those asserts. The first doesn't appear to be necessary given how the surrounding conditional block changed. The second appears to be an incorrect assert that was left over from an earlier commit that no longer should be in place given the logic that was added to handle truncate (which roughly mirrors what we do in ReplicatedBackend).
It's possible the assert for entry and/or obc are also unnecessary, however I left those in place for now.
Fixes: https://tracker.ceph.com/issues/65509