crimson: rework request pipeline to allow for per-object processing#61005
Conversation
This pull request can no longer be automatically merged: a rebase is needed and conflicts have to be manually resolved.
Signed-off-by: Samuel Just <sjust@redhat.com>
Force-pushed 1d577fe to 5188d65
It's always possible for there to be an in-progress replica read when the replica processes a repop. It's rare in our tests because the read and the write submitted by the test client would need to overlap in time. This makes the results non-deterministic and thus a somewhat less sensitive test. Note that the space of valid results is well defined: the read would have to observe the object state before or after each of the outstanding writes; any other result, or a torn read, would be wrong. It's probably worth updating RadosModel to add such a pattern, but we can do that later.

This branch makes this race much more likely, and even observable with the existing RadosModel implementation, because it extends the obc lifetime past the point of returning the result to the client in order to ensure that it outlives the handle.

Fixes: https://tracker.ceph.com/issues/69013
Signed-off-by: Samuel Just <sjust@redhat.com>
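The validity check described above could be sketched roughly as follows. This is a hypothetical illustration (the names and structure are not RadosModel's actual code): a read racing with in-flight writes must return some complete object state, either the state before the writes or the state after one of them, and never a torn mixture.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch, not actual RadosModel code: check that a read
// observed concurrently with outstanding writes matches one of the
// complete object states (pre-write or any post-write), i.e. is not torn.
bool read_result_is_valid(
    const std::string& observed,
    const std::vector<std::string>& possible_states)
{
  for (const auto& state : possible_states) {
    if (observed == state) {
      return true;  // matches a complete pre- or post-write state
    }
  }
  return false;  // torn or otherwise impossible read
}
```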
We're going to need instance_handle to outlive exiting the pipeline stage as it will later hold a reference to an obc holding that stage. Signed-off-by: Samuel Just <sjust@redhat.com>
… exited The operation will hold a reference to the obc containing most of the pipeline stages. Signed-off-by: Samuel Just <sjust@redhat.com>
Force-pushed 5188d65 to b931baf
Signed-off-by: Samuel Just <sjust@redhat.com>
start() isn't particularly long and splitting it here isn't all that helpful. Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
…ages It's important to construct and submit log entries atomically because the submission order needs to match the entry numbering. Previously, this occurred under the pg-wide exclusive process stage. Shortly, each obc will have its own process stage, so we need to ensure that atomicity separately from the pipeline stages. Introduce a simple mutex. Signed-off-by: Samuel Just <sjust@redhat.com>
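The atomicity requirement above can be illustrated with a minimal sketch (plain std::mutex rather than the PR's seastar::shared_mutex, and hypothetical names): assigning the entry's number and submitting it happen inside one critical section, so the submission order always matches the entry numbering.

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Illustrative sketch only, not the actual crimson code: numbering a log
// entry and appending it to the submitted sequence occur under a single
// lock, so the submitted order always matches the entry numbering.
struct LogSubmitter {
  std::mutex submit_lock;
  std::uint64_t next_version = 1;
  std::vector<std::uint64_t> submitted;  // stands in for the submitted log

  void construct_and_submit() {
    std::lock_guard<std::mutex> g{submit_lock};
    // Construct the entry (assign its version) and submit it atomically.
    submitted.push_back(next_version++);
  }
};
```

Splitting the two steps across separate lock acquisitions would let two operations number themselves in one order and submit in the other.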
It's pretty short and this way all of the stage transitions are in one place. Signed-off-by: Samuel Just <sjust@redhat.com>
Force-pushed 0f0c606 to 3092e6c
jenkins test make check

I see a ~10-12% throughput increase compared with "C. fine-grained-cache" (SeaStore, single OSD, 32c, RBD) from #60654.
    /*
     * Clears unreferenced elements from the lru set [from, to]
Should we keep this option in our interface? It might be useful to clear unreferenced elements without invalidating them (in some future use case).
I'd rather not keep it without users. We can always re-add it if we need it later.
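For reference, the "clear unreferenced" semantics under discussion could be sketched like this. This is a hypothetical illustration using std::shared_ptr rather than Ceph's intrusive LRU types: entries nobody else references are dropped, while pinned entries stay in the set.

```cpp
#include <list>
#include <memory>
#include <string>

// Hypothetical sketch, not the actual crimson LRU: drop cached entries
// that only the LRU itself still references, leaving pinned entries alone.
using Entry = std::shared_ptr<std::string>;

std::size_t clear_unreferenced(std::list<Entry>& lru) {
  std::size_t cleared = 0;
  for (auto it = lru.begin(); it != lru.end();) {
    if (it->use_count() == 1) {  // only the LRU holds a reference
      it = lru.erase(it);
      ++cleared;
    } else {
      ++it;  // pinned elsewhere; keep it
    }
  }
  return cleared;
}
```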
    const OpInfo &op_info,
    std::vector<OSDOp>& ops);

    seastar::shared_mutex submit_lock;
Out of scope: should we restrict log-entry submission behind an interface that takes submit_lock itself, rather than delegating that responsibility to the next (or any future) user?
Perhaps, but it seems like it would probably just make the interface harder to use.
The main hole is that usage in snaptrim_event.cc. That should really get cleaned up in a subsequent refactor, but it can wait for now.
      co_await manager.target_state.lock_to(lock_type);
    } else {
      manager.target_state.lock_excl_sync();
      manager.target_state.obc->loading = true;
If we stay with loading, should we mark it false once loading finishes?
Not as written, needs to stay set. I could change the name, but I'd rather leave it as is for now.
Renamed to loading_started, added a comment explaining the usage and invariant.
    } wait_repop;
    };

    struct CommonOBCPipeline {
Out of scope: we should update dev/crimson/pipeline.rst in the future to reflect some of the new changes.
Forgot about that, adding a commit updating it.
    DEBUGDPP("{}.{}: entering wait_pg_ready stage",
             *pgref, *this, this_instance_id);
    ihref.enter_stage_sync(client_pp(pg).wait_pg_ready, *this);
Can you please explain why enter_stage won't work here? which order are we preserving?
Perhaps we can add a comment here?
enter_stage would work, but it's strictly less efficient: its interface allows it to block (though it never will here). We're using enter_stage_sync because the prior stage, PerShardPipeline::create_or_wait_pg, is OrderedExclusive, while CommonPGPipeline::WaitPGReady is OrderedConcurrent. Exiting an OrderedExclusive stage never blocks, and entering an OrderedConcurrent stage never blocks, so enter_stage_sync for this particular transition doesn't need to allow for blocking and is both simpler and more efficient.
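A toy model of why this transition can be synchronous (hypothetical names, not the actual crimson PipelineStage types): leaving an exclusive stage just releases it, and entering a concurrent stage only records ordering, so neither step ever has to wait.

```cpp
// Toy model only, not the real crimson pipeline types: an
// OrderedExclusive-like stage admits one op at a time (entry may have to
// wait), while an OrderedConcurrent-like stage admits every op immediately
// in arrival order (entry never waits), so the exclusive->concurrent
// transition never needs a blocking enter.
struct OrderedExclusiveStage {
  bool occupied = false;
  bool try_enter() {  // may fail, in which case a caller would have to wait
    if (occupied) return false;
    occupied = true;
    return true;
  }
  void exit() { occupied = false; }  // releasing never blocks
};

struct OrderedConcurrentStage {
  int entered = 0;
  int enter_sync() {  // always succeeds, so entry never blocks
    return entered++;  // returns this op's position within the stage
  }
};
```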
    if (existed) {
      if (manager.head_state.is_empty()) {
        ceph_assert(manager.target_state.is_empty());
Should this line be moved out of if (manager.head_state.is_empty()), given that manager.set_state_obc(manager.target_state, manager.head_state.obc) below is called unconditionally?
- adds ObjectContext::obc_pipeline
- exposes ObjectContext::obc_pipeline via ObjectContextLoader::Orderer
- allows obcs to be in the registry without being loaded
- adds ObjectContext::loading bool to signal that loading has begun

Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
…ent to use obc stages Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
…changes_and_submit Signed-off-by: Samuel Just <sjust@redhat.com>
…es_n_do_ops_effects Templating MutFunc was pretty confusing, and flush_changes_n_do_ops_effects is already closely coupled to PG::submit_transaction. Signed-off-by: Samuel Just <sjust@redhat.com>
That the log entry's version matches the object_info on the actual object is a pretty core invariant. This commit moves creating the log entry for head and populating the metadata into OpsExecuter::prepare_head_update. As a side effect, flush_clone_metadata and CloningCtx::apply_to were removed and split between prepare_head_update (portions related to the head's ssc) and flush_changes_and_submit. Signed-off-by: Samuel Just <sjust@redhat.com>
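The invariant can be sketched in miniature. These are hypothetical simplified structs (not the real pg_log_entry_t / object_info_t): stamping both the log entry and the object_info from one source of truth in a single prepare step is what keeps their versions in agreement.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplified types, not the actual Ceph structures: the
// version recorded in the head's log entry must match the version written
// into the object_info of the object itself.
struct ObjectInfo { std::uint64_t version = 0; };
struct LogEntry  { std::uint64_t version = 0; };

// prepare_head_update-style step: stamp both from one source of truth.
LogEntry prepare_head_update(ObjectInfo& oi, std::uint64_t at_version) {
  oi.version = at_version;
  LogEntry e;
  e.version = at_version;
  assert(e.version == oi.version);  // the core invariant
  return e;
}
```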
We want to emplace and initialize osd_op_params upon first write, but we don't want to fill at_version, pg_trim_to, pg_committed_to, or last_complete until prepare_transaction because we don't want to require a particular commit order any earlier than we have to. Signed-off-by: Samuel Just <sjust@redhat.com>
…x,complete_cloning_ctx We need to defer versioning the clone oi and log entry until commit time while ensuring that the clone operation occurs first in the transaction. Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
…ease Signed-off-by: Samuel Just <sjust@redhat.com>
Repops previously used PGPipeline::await_map. This is actually important, as we need them to be processed in order. However, using await_map was confusing, and using a single exclusive stage is decidedly suboptimal, as we could allow pipelining on write commit. For now, move them over to their own pipeline stage so we can remove the PGPipeline struct entirely. Later, we'll improve replica write pipelining for better replica-side write concurrency. Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
This commit updates pipeline.rst to include some basic information about how the pipeline stages now work. I've removed the explicit listing of the different stages as I'd rather readers refer to the actual implementation for those details to avoid them getting out of date. I also removed the comparison to classic as the approach has now diverged quite a bit and I feel that the ordering part is more important to focus on than the points at which processing might block. Signed-off-by: Samuel Just <sjust@redhat.com>
Force-pushed 5496e89 to dbb129c
jenkins test make check

jenkins test make check arm64
Matan-B left a comment
Thanks for adding the additional docs/comments!
This PR reworks the client IO pipeline to allow one request per object to be processed concurrently. Writes still need to serialize at submission time.
For random reads with high concurrency, this seems to be good for a ~8-10% throughput increase on a single OSD with seastore in my environment. Further refinement may well improve things further, but this branch has gotten large as is.
Teuthology run looks good, all tests passed: https://pulpito.ceph.com/sjust-2024-12-06_21:24:24-crimson-rados-wip-sjust-crimson-testing-2024-12-06-distro-default-smithi/