crimson: fix several bugs causing stuck backfills#62837
Conversation
|
Failure https://tracker.ceph.com/issues/71002 -- seems to be in scan_for_backfill, but doesn't seem to be related to these patches. Tested |
4380720 to
0dcac7e
Compare
We're looking for the exact hobject_t specified. Fixes: https://tracker.ceph.com/issues/70935 Signed-off-by: Samuel Just <sjust@redhat.com>
…_for_backfill load_and_lock can return enoent if object is missing. Fixes: https://tracker.ceph.com/issues/70936 Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
…rottle This made sense prior to the addition of the scheduler. Now, we need to go through the scheduler whether there's a max or not. Signed-off-by: Samuel Just <sjust@redhat.com>
…ng macros Signed-off-by: Samuel Just <sjust@redhat.com>
…over_object_with_throttle 791772f used with_throttle here in a way which caused then_interruptible in PGRecovery::recover_object to be called outside of an interruptible context. Instead of using a wrapper taking a lambda, rephrase as an RAII releaser suitable for use in a coroutine. This avoids needing to structure with_throttle to deal correctly with both interruptible and non-interruptible contexts. Fixes: https://tracker.ceph.com/issues/70939 Signed-off-by: Samuel Just <sjust@redhat.com>
It's now unused, and these combinator style wrappers are easy to misuse and hard to read compared with RAII and coroutines. Signed-off-by: Samuel Just <sjust@redhat.com>
0dcac7e to
cd9dcf6
Compare
|
jenkins test make check arm64 |
Matan-B
left a comment
There was a problem hiding this comment.
lgtm! One question regarding the resolving bug - probably not a blocker.
| auto obc_manager = pg.obc_loader.get_obc_manager(object); | ||
| co_await pg.obc_loader.load_and_lock( | ||
| auto obc_manager = pg.obc_loader.get_obc_manager( | ||
| object, /* resolve_clone = */ false); |
There was a problem hiding this comment.
The clone_id of the clone is not necessarily a valid snapshot of the object.
Reproduces on any clone whose id doesn't happen to be a live snapshot.
Since this by no means a client originated request this, getting the exact hobject_t makes sense to me.
However, my only hesitation here is with us actually emplacing the (possibly) to-be trimmed objects into the backfill interval.
Meaning shouldn't we rely on resolve_oid here to be used as a removed snap management check?
In a scenario in which a snapshot is removed, a belonging clone won't be resolved, why should it be part of the backfill interval in this case?
I think that the described case is possible since the object here is obtained directly from disk - without any metadata on the logical state of this object. The resolve check might actually help us gain information on this object (i.e to-be trimmed).
Assuming the above is somewhat correct, without resolving the clone object, we possibly backfill a to-be trimmed clone or if trimming was already kicked - this clone object will be logically stale(leaked in way)?
There was a problem hiding this comment.
resolve_clone is really only for client reads -- the client supplied snapshot needs to be a current, actual snapshot of the object. However, the cloneid on the object may be for a removed clone or may not necessarily ever have been a valid snapshot on the object.
| // if the object does not exist here, it must have been removed | ||
| // between the collection_list_partial and here. This can happen | ||
| // for the first item in the range, which is usually last_backfill. | ||
| co_return; |
There was a problem hiding this comment.
nit: Should we add a log line here to more easily spot cases where scan_for_backfill returns earlier than expected? Could help with future investigations
There was a problem hiding this comment.
Doing that now in a follow up for https://tracker.ceph.com/issues/71003
| })); | ||
| DEBUGDPP("got throttle: {} {}", pg->get_dpp(), soid, need); | ||
| co_await pg->get_recovery_backend()->recover_object(soid, need); | ||
| co_return; |
There was a problem hiding this comment.
nit: DEBUGDPP("releasing throttle: {} {}", pg->get_dpp(), soid, need);
| }); | ||
| LOG_PREFIX(PGRecovery::recover_object_with_throttle); | ||
| DEBUGDPP("{} {}", pg->get_dpp(), soid, need); | ||
| auto releaser = co_await interruptor::make_interruptible( |
|
@athanatos, please see #62619 (comment) |
Show available Jenkins commands
jenkins test classic perfJenkins Job | Jenkins Job Definitionjenkins test crimson perfJenkins Job | Jenkins Job Definitionjenkins test signedJenkins Job | Jenkins Job Definitionjenkins test make checkJenkins Job | Jenkins Job Definitionjenkins test make check arm64Jenkins Job | Jenkins Job Definitionjenkins test submodulesJenkins Job | Jenkins Job Definitionjenkins test dashboardJenkins Job | Jenkins Job Definitionjenkins test dashboard cephadmJenkins Job | Jenkins Job Definitionjenkins test apiJenkins Job | Jenkins Job Definitionjenkins test docsReadTheDocs | Github Workflow Definitionjenkins test ceph-volume allJenkins Jobs | Jenkins Jobs Definitionjenkins test windowsJenkins Job | Jenkins Job Definitionjenkins test rook e2eJenkins Job | Jenkins Job Definition