crimson/osd: clear peer_missing entries for deleted objects #67488
shraddhaag wants to merge 1 commit into ceph:main
This commit fixes the abort in Recovered::Recovered. There is a race between backfill and a client delete for the same object to acquire the OBC lock. When backfill acquires the lock first, the object is recovered and then deleted by the client delete request; recovering the object clears the corresponding peer_missing entry, and we transition to the Recovered state successfully. When the client delete acquires the lock first, the object is deleted. Backfill then tries to recover the object, finds it deleted, and exits early, so the stale peer_missing entry is not cleared. In Recovered::Recovered, needs_recovery() sees this stale peer_missing entry and calls abort. This issue is fixed by clearing the stale peer_missing entries for the deleted object when the object is being recovered by recover_object(). An earlier alternative fix was to clear the stale peer_missing entries in enqueue_delete_for_backfill(), called by the client delete. That approach was abandoned because it can also lead to a race producing the same error, if the backfill completes before the entries are cleared in enqueue_delete_for_backfill().

Fixes: https://tracker.ceph.com/issues/70501
Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
Force-pushed from 3c87d3d to abffb61
amathuria
left a comment
Looks good, approving.
Thanks for the detailed explanation @shraddhaag!
jenkins test make check arm64
Matan-B
left a comment
Good progress!
I think this approach hides the issue; instead, we should make sure racing deletes are properly handled.
We can't skip recovery every time an object is not found: there could be cases where the object is missing due to an actual error rather than a delete.
// acquired the OBC lock. Clear stale peer_missing entries that were added
// by prepare_backfill_for_missing() so that needs_recovery() does not
// fail the assertion in Recovered::Recovered.
DEBUGDPP("object does not exist, clearing peer_missing: {}", pg, soid);
We should consider the case where the object is being recovered and can't (for some catastrophic reason) be found without being deleted.
This must result in an error (abort/retry). Otherwise, we always let missing objects be marked as recovered - even on non-deleted objects.
if (pg.get_peering_state().get_peer_missing(peer).is_missing(soid, &item)) {
  DEBUGDPP("clear stale peer_missing entry {} v {} for peer {}",
           pg, soid, item.need, peer);
  pg.get_peering_state().on_peer_recover(peer, soid, item.need);
ReplicatedRecoveryBackend::recover_delete already pushes MOSDPGRecoveryDelete to all peers that still have the object marked as missing.
That path is responsible for letting each peer call on_local_recover with is_delete=true, handling the reply from the peer (via handle_recovery_delete_reply), and eventually calling on_peer_recover.
I think it's wrong to skip that logic by calling on_peer_recover here. We trick peering_state into thinking "that peer recovered this object" while this is not the case.
The issue here (prior to the changes in this PR) seems to be that we intended to "Abort the recovery in this case .." (l. 46) but we didn't actually abort it, and later on PeeringState's Recovered would fail on ceph_assert(!ps->needs_recovery()); since some peers still have the object marked as missing.
I would try to understand how we can let the MOSDPGRecoveryDelete logic execute properly and/or drop/abort recovery for this object.
Alternatively, we could try making maybe_pull_missing_obj smarter by not pulling deleted objects. Currently we only skip pulls if the object is not missing; perhaps this check could be further expanded to detect deletes.
ReplicatedRecoveryBackend::maybe_pull_missing_obj(
{
if (!local_missing.is_missing(soid)) {
// object is not missing, don't pull
return seastar::make_ready_future<>();
}
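The suggestion above could look roughly like the following. This is a hedged sketch with stand-in types (MissingItem, MissingSet, a plain deleted flag), not the actual crimson API: the real code uses hobject_t, pg_missing_t, and pg_missing_item, where a delete-type missing entry is recognizable via something like pg_missing_item::is_delete().

```cpp
#include <map>
#include <string>

// Minimal stand-in types for illustration only; the real code uses
// hobject_t and pg_missing_t / pg_missing_item from Ceph's osd_types.
struct MissingItem {
  bool deleted = false;  // models a delete-type missing entry
};

struct MissingSet {
  std::map<std::string, MissingItem> items;  // keyed by object id
};

// Sketch of an expanded maybe_pull_missing_obj() check: skip the pull
// both when the object is not missing at all and when the missing entry
// records a delete (there is nothing to pull from the primary; the
// delete-recovery path should handle it instead).
enum class PullDecision { Skip, SkipDeleted, Pull };

PullDecision maybe_pull_decision(const MissingSet& local_missing,
                                 const std::string& soid) {
  auto it = local_missing.items.find(soid);
  if (it == local_missing.items.end()) {
    return PullDecision::Skip;         // object is not missing, don't pull
  }
  if (it->second.deleted) {
    return PullDecision::SkipDeleted;  // missing because deleted: no pull
  }
  return PullDecision::Pull;           // genuinely missing data: pull it
}
```

The point of the three-way result is that a delete is distinguished from a real missing object, so deleted objects never enter the pull path while genuinely missing ones still do.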
The Issue
This issue occurs when backfill of an object and a client request to delete the same object coincide. In such a case, there is a race between backfill and the client delete to acquire the OBC lock.
Consider the scenario where backfill is in progress and has already added the peer_missing entry for the given object. One of two things can then happen:
When the lock is acquired first by backfill, the object is recovered first and then deleted by the client delete request. Recovering the object clears the corresponding peer_missing entry; the client delete then sends a complete delete op to the backfilling OSD (hence deleting the object there as well), and we are able to transition to the Recovered state successfully.
When the lock is acquired first by client delete request, the object is deleted. Then backfill tries to recover the object, finds it deleted and exits early. The stale peer_missing entry is not cleared. In Recovered::Recovered, needs_recovery() sees this stale peer_missing entry and calls abort.
The below image summarizes the issue seen in https://pulpito.ceph.com/shraddhaag-2026-02-09_14:34:53-crimson-rados-main-distro-debug-trial/41478/.
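The two orderings described above can be modeled with a toy sequence. These are hypothetical minimal types for illustration, not the Ceph API: whichever side takes the OBC lock first determines whether the peer_missing entry is ever cleared.

```cpp
#include <set>
#include <string>

// Toy model of the race: one object store and one peer_missing set.
struct ToyPG {
  std::set<std::string> store;         // objects present on the OSD
  std::set<std::string> peer_missing;  // entries added by the backfill scan

  // Backfill path: recover the object if it still exists, clearing the
  // peer_missing entry on success; if the object is gone, exit early.
  // This models the pre-fix behavior that leaves a stale entry behind.
  void backfill_recover(const std::string& soid) {
    if (store.count(soid) == 0) {
      return;  // object already deleted: stale peer_missing entry remains
    }
    peer_missing.erase(soid);  // recovered: entry cleared
  }

  // Client delete path: remove the object.
  void client_delete(const std::string& soid) { store.erase(soid); }

  // Recovered::Recovered asserts that nothing is still marked missing.
  bool needs_recovery() const { return !peer_missing.empty(); }
};
```

Running backfill_recover before client_delete leaves needs_recovery() false; running client_delete first leaves the stale entry, which is exactly what trips the assertion in Recovered::Recovered.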
The Fix
This issue is fixed by clearing out the stale peer_missing entries for the deleted object when the object is discovered to have been deleted by recover_object().
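In sketch form, the fix iterates over the backfill targets when recover_object() finds the object deleted and clears any stale missing entry via the existing on_peer_recover() bookkeeping. The types below (Peer, Version, a flattened per-peer missing map) are stand-ins for crimson's pg_shard_t, eversion_t, and PeeringState, used only to show the shape of the quoted change.

```cpp
#include <map>
#include <set>
#include <string>

using Peer = int;        // stand-in for pg_shard_t
using Version = unsigned;  // stand-in for eversion_t

struct PeeringState {
  // per-peer missing map: object id -> version needed
  std::map<Peer, std::map<std::string, Version>> peer_missing;

  bool is_missing(Peer p, const std::string& soid, Version* need) const {
    auto pit = peer_missing.find(p);
    if (pit == peer_missing.end()) return false;
    auto oit = pit->second.find(soid);
    if (oit == pit->second.end()) return false;
    *need = oit->second;
    return true;
  }

  // models PeeringState::on_peer_recover: peer p no longer misses soid
  void on_peer_recover(Peer p, const std::string& soid, Version) {
    peer_missing[p].erase(soid);
  }
};

// Called when recover_object() discovers the object was deleted under us:
// clear the stale entries left behind by prepare_backfill_for_missing().
void clear_stale_peer_missing(PeeringState& ps,
                              const std::set<Peer>& backfill_targets,
                              const std::string& soid) {
  for (Peer p : backfill_targets) {
    Version need;
    if (ps.is_missing(p, soid, &need)) {
      ps.on_peer_recover(p, soid, need);
    }
  }
}
```

Only entries for the deleted object are touched; missing entries for other objects on the same peers are left intact, so legitimate recovery work is not discarded.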
The fix was tested across many runs, all of which passed successfully.
Earlier, an alternative fix was to clear the stale peer_missing entries in enqueue_delete_for_backfill(), called by the client delete. This approach was abandoned because it can also lead to a race condition resulting in the same error: the backfill could complete before the entries are cleared in enqueue_delete_for_backfill().
Fixes: https://tracker.ceph.com/issues/70501
QA Analysis
https://tracker.ceph.com/projects/crimson/wiki/MAIN#httpspulpitocephcomshraddhaag-2026-02-26_022256-crimson-rados-main-distro-debug-trial