crimson: rework request pipeline to allow for per-object processing by athanatos · Pull Request #61005 · ceph/ceph

athanatos · 2024-12-09T19:59:07Z

This PR reworks the client IO pipeline to allow one request per object to be processed concurrently. Writes still need to serialize at submission time.

For random reads with high concurrency, this seems to be good for a ~8-10% throughput increase on a single OSD with seastore in my environment. Further refinement may well improve things further, but this branch has gotten large as is.

Teuthology run looks good, all tests passed: https://pulpito.ceph.com/sjust-2024-12-06_21:24:24-crimson-rados-wip-sjust-crimson-testing-2024-12-06-distro-default-smithi/

github-actions · 2024-12-09T20:05:32Z

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Signed-off-by: Samuel Just <sjust@redhat.com>

It's always possible for there to be an in-progress replica-read when the replica processes a repop. It's rare in our tests because the read and write submitted by the test client would need to overlap in time. This makes the results non-deterministic and thus a somewhat less sensitive test. Note, the space of valid results is well defined -- it would have to be state before or after any of the outstanding writes. Any other result or a torn read would be wrong. It's probably worth updating RadosModel to add such a pattern, but we can do that later. This branch makes this race much more likely and even observable with the existing RadosModel implementation as it extends the obc lifetime past the point of returning the result to the client in order to ensure that it outlives the handle. Fixes: https://tracker.ceph.com/issues/69013 Signed-off-by: Samuel Just <sjust@redhat.com>
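The "space of valid results" described in the commit message above can be sketched as a small checker. This is only an illustration, not RadosModel code: the function name is hypothetical, and it assumes each outstanding write replaces the whole object, so the state after a write is simply that write's payload.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not part of RadosModel or crimson.
// Assumption: every outstanding write overwrites the whole object, so the
// object state after a write equals that write's payload.
bool is_valid_racing_read(const std::string& initial,
                          const std::vector<std::string>& outstanding_writes,
                          const std::string& observed) {
  // The state before any outstanding write was applied is acceptable.
  if (observed == initial) {
    return true;
  }
  // The state immediately after any one outstanding write is acceptable.
  for (const auto& payload : outstanding_writes) {
    if (observed == payload) {
      return true;
    }
  }
  // Anything else (including a mix of two states, i.e. a torn read) is wrong.
  return false;
}
```

A read racing with writes `v1` and `v2` on an object that held `v0` may return `v0`, `v1`, or `v2` under this model; a result that mixes parts of two states would be rejected.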

Signed-off-by: Samuel Just <sjust@redhat.com>

We're going to need instance_handle to outlive exiting the pipeline stage as it will later hold a reference to an obc holding that stage. Signed-off-by: Samuel Just <sjust@redhat.com>

… exited The operation will hold a reference to the obc containing most of the pipeline stages. Signed-off-by: Samuel Just <sjust@redhat.com>

Signed-off-by: Samuel Just <sjust@redhat.com>

start() isn't particularly long and splitting it here isn't all that helpful. Signed-off-by: Samuel Just <sjust@redhat.com>

Signed-off-by: Samuel Just <sjust@redhat.com>

…ages It's important to construct and submit log entries atomically because the submission order needs to match the entry numbering. Previously, this occurred under the pg-wide exclusive process stage. Shortly, each obc will have its own process stage, so we need to ensure that atomicity separately from the pipeline stages. Introduce a simple mutex. Signed-off-by: Samuel Just <sjust@redhat.com>
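As a rough illustration of why numbering and submission belong in one critical section, here is a minimal sketch. It is not the crimson implementation: `ToyLog` and its members are hypothetical, and `std::mutex` stands in for the `seastar::shared_mutex submit_lock` that the PR adds to the PG.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a PG's log. std::mutex plays the role that
// seastar::shared_mutex submit_lock plays in the PR.
struct ToyLog {
  std::mutex submit_lock;
  std::uint64_t next_version = 1;
  std::vector<std::uint64_t> submitted;

  // Allocate the entry's version and submit it inside one critical section,
  // so that submission order always matches entry numbering.
  std::uint64_t construct_and_submit() {
    std::lock_guard<std::mutex> guard(submit_lock);
    const std::uint64_t version = next_version++;
    submitted.push_back(version);
    return version;
  }
};

// Returns true if entries were submitted in strictly increasing version order.
bool submitted_in_order(const ToyLog& log) {
  for (std::size_t i = 1; i < log.submitted.size(); ++i) {
    if (log.submitted[i] <= log.submitted[i - 1]) {
      return false;
    }
  }
  return true;
}
```

If the version were allocated under one lock and the entry submitted under another (or none), two callers could interleave so that the higher-numbered entry is submitted first, which is the situation the mutex is meant to rule out.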

It's pretty short and this way all of the stage transitions are in one place. Signed-off-by: Samuel Just <sjust@redhat.com>

athanatos · 2024-12-11T05:09:40Z

jenkins test make check

cyx1231st · 2024-12-12T08:08:50Z

I see ~10-12% throughput increase compared with "C. fine-grained-cache" (SeaStore, single OSD, 32c, RBD) from #60654.

Matan-B · 2024-12-12T10:58:37Z

src/common/intrusive_lru.h

  }

-  /*
-   * Clears unreferenced elements from the lru set [from, to]


Should we keep this option in our interface? It might be useful to clear unreferenced elements without invalidating them (in some future use case).

I'd rather not keep it without users. We can always re-add it if we need it later.

Matan-B · 2024-12-12T11:48:18Z

src/crimson/osd/pg.h

    const OpInfo &op_info,
    std::vector<OSDOp>& ops);

+  seastar::shared_mutex submit_lock;


Out of scope: should we restrict log entry submission so that submit_lock is always taken, rather than leaving that responsibility to each (future) caller?

Perhaps, but it seems like it would probably just make the interface harder to use.

The main hole is that usage in snaptrim_event.cc. That should really get cleaned up in a subsequent refactor, but it can wait for now.

src/crimson/osd/object_context.h

Matan-B · 2024-12-12T11:59:19Z

src/crimson/osd/object_context_loader.cc

    co_await manager.target_state.lock_to(lock_type);
  } else {
    manager.target_state.lock_excl_sync();
+    manager.target_state.obc->loading = true;


If we stay with loading, should we set it back to false once loading finishes?

Not as written, needs to stay set. I could change the name, but I'd rather leave it as is for now.

Renamed to loading_started, added a comment explaining the usage and invariant.

Matan-B · 2024-12-12T12:02:04Z

src/crimson/osd/osd_operation.h

  } wait_repop;
 };

+struct CommonOBCPipeline {


Out of scope: we should update dev/crimson/pipeline.rst in the future to reflect some of the new changes.

Forgot about that, adding a commit updating it.

Matan-B · 2024-12-12T13:29:11Z

src/crimson/osd/osd_operations/client_request.cc

+
+  DEBUGDPP("{}.{}: entering wait_pg_ready stage",
+	   *pgref, *this, this_instance_id);
+  ihref.enter_stage_sync(client_pp(pg).wait_pg_ready, *this);


Can you please explain why enter_stage won't work here? Which ordering are we preserving?
Perhaps we can add a comment here?

enter_stage would work, but it's strictly less efficient because its interface allows for blocking (though it never would here). We're using enter_stage_sync because the prior stage is PerShardPipeline::create_or_wait_pg (OrderedExclusive) and CommonPGPipeline::WaitPGReady is OrderedConcurrent. Exiting an OrderedExclusive stage never blocks, and entering an OrderedConcurrent stage never blocks. enter_stage_sync therefore does not need to allow for blocking on that transition, which makes it simpler and more efficient.

adding a comment
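The reasoning in the reply above can be modeled with a toy compile-time check. This is a sketch only: the `Toy*` types and `can_transition_sync` are hypothetical and are not crimson's pipeline classes. The two `false` flags come from the discussion; the `exit_may_block` value for the concurrent stage is a conservative assumption that the thread does not address.

```cpp
#include <cassert>

// Toy stage kinds; the names are illustrative, not crimson's real classes.
struct ToyOrderedExclusive {
  // Only one operation may be inside at a time, so entering may have to wait.
  static constexpr bool enter_may_block = true;
  // Stated in the discussion: exiting never blocks.
  static constexpr bool exit_may_block = false;
};

struct ToyOrderedConcurrent {
  // Stated in the discussion: entering never blocks.
  static constexpr bool enter_may_block = false;
  // Assumption: chosen conservatively, the thread does not say.
  static constexpr bool exit_may_block = true;
};

// A synchronous transition is only safe when neither leaving the current
// stage nor entering the next one can block.
template <class From, class To>
constexpr bool can_transition_sync() {
  return !From::exit_may_block && !To::enter_may_block;
}

static_assert(can_transition_sync<ToyOrderedExclusive, ToyOrderedConcurrent>(),
              "exclusive -> concurrent never blocks");
static_assert(!can_transition_sync<ToyOrderedConcurrent, ToyOrderedExclusive>(),
              "entering an exclusive stage may block");
```

Under this model, the create_or_wait_pg to wait_pg_ready transition is the exclusive-to-concurrent case, which is why a non-blocking entry call suffices there.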

src/crimson/osd/ops_executer.h

src/crimson/osd/pg.cc

cyx1231st · 2024-12-13T08:15:36Z

src/crimson/osd/object_context_loader.cc


-  if (existed) {
+  if (manager.head_state.is_empty()) {
+    ceph_assert(manager.target_state.is_empty());


Should this line be moved out of if (manager.head_state.is_empty()) because manager.set_state_obc(manager.target_state, manager.head_state.obc) below is called unconditionally ?

Yep, repushed.

src/crimson/osd/osd_operations/client_request.cc

- adds ObjectContext::obc_pipeline - exposes ObjectContext::obc_pipeline via ObjectContextLoader::Orderer - allows obcs to be in the registry without being loaded - adds ObjectContext::loading bool to signal that loading has begun Signed-off-by: Samuel Just <sjust@redhat.com>

Signed-off-by: Samuel Just <sjust@redhat.com>

…ent to use obc stages Signed-off-by: Samuel Just <sjust@redhat.com>

Signed-off-by: Samuel Just <sjust@redhat.com>

…changes_and_submit Signed-off-by: Samuel Just <sjust@redhat.com>

…es_n_do_ops_effects Templating MutFunc was pretty confusing, and flush_changes_n_do_ops_effects is already closely coupled to PG::submit_transaction. Signed-off-by: Samuel Just <sjust@redhat.com>

That the log entry's version matches the object_info on the actual object is a pretty core invariant. This commit moves creating the log entry for head and populating the metadata into OpsExecuter::prepare_head_update. As a side effect, flush_clone_metadata and CloningCtx::apply_to were removed and split between prepare_head_update (portions related to the head's ssc) and flush_changes_and_submit. Signed-off-by: Samuel Just <sjust@redhat.com>

We want to emplace and initialize osd_op_params upon first write, but we don't want to fill at_version, pg_trim_to, pg_committed_to, or last_complete until prepare_transaction because we don't want to require a particular commit order any earlier than we have to. Signed-off-by: Samuel Just <sjust@redhat.com>
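The split between early initialization and late versioning described above might look roughly like the following. This is a minimal sketch with hypothetical types (`ToyOpParams`, `ToyExecuter`), not the actual OpsExecuter or osd_op_params code, and it tracks only `at_version` out of the fields the commit message lists.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical, simplified parameters for a write operation.
struct ToyOpParams {
  bool modified = false;
  // Deliberately left unset until the transaction is prepared, so that no
  // commit order is implied any earlier than necessary.
  std::optional<std::uint64_t> at_version;
};

struct ToyExecuter {
  std::optional<ToyOpParams> params;

  // Called for each write; creates the params on the first write only.
  void on_write() {
    if (!params) {
      params.emplace();
    }
    params->modified = true;
  }

  // Fills in the version at prepare time. Returns false for a read-only
  // operation, which never created params.
  bool prepare_transaction(std::uint64_t next_version) {
    if (!params) {
      return false;
    }
    params->at_version = next_version;
    return true;
  }
};
```

The point of the design is that an operation can record that it will write without yet claiming a position in the commit order; the position is only fixed when `prepare_transaction` runs.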

…x,complete_cloning_ctx We need to defer versioning the clone oi and log entry until commit time while ensuring that the clone operation occurs first in the transaction. Signed-off-by: Samuel Just <sjust@redhat.com>

Signed-off-by: Samuel Just <sjust@redhat.com>

…ease Signed-off-by: Samuel Just <sjust@redhat.com>

Repops previously used PGPipeline::await_map. This is actually important as we need them to be processed in order. However, using await_map was confusing, and using a single exclusive stage is decidedly suboptimal as we could allow pipelining on write commit. For now, move them over to their own pipeline stage so we can remove the PGPipeline struct entirely. Later, we'll improve replica write pipelining for better replica-side write concurrency. Signed-off-by: Samuel Just <sjust@redhat.com>

Signed-off-by: Samuel Just <sjust@redhat.com>

This commit updates pipeline.rst to include some basic information about how the pipeline stages now work. I've removed the explicit listing of the different stages as I'd rather readers refer to the actual implementation for those details to avoid them getting out of date. I also removed the comparison to classic as the approach has now diverged quite a bit and I feel that the ordering part is more important to focus on than the points at which processing might block. Signed-off-by: Samuel Just <sjust@redhat.com>

athanatos · 2024-12-14T00:50:13Z

jenkins test make check

athanatos · 2024-12-14T00:50:41Z

jenkins test make check arm64

Matan-B

Thanks for adding the additional docs/comments!

athanatos requested a review from a team as a code owner December 9, 2024 19:59

github-actions bot added common crimson tests labels Dec 9, 2024

athanatos requested review from Matan-B and cyx1231st December 9, 2024 19:59

github-actions bot added the needs-rebase label Dec 9, 2024

athanatos added 2 commits December 9, 2024 12:08

common/intrusive_lru: add method to access use count

06affa6

Signed-off-by: Samuel Just <sjust@redhat.com>

crimson/osd/object_context: add formatter instance for ObjectContext

4a638e5

Signed-off-by: Samuel Just <sjust@redhat.com>

athanatos force-pushed the sjust/wip-crimson-io-3 branch from 1d577fe to 5188d65 Compare December 9, 2024 20:13

github-actions bot removed the needs-rebase label Dec 9, 2024

athanatos added 6 commits December 10, 2024 15:32

crimson/osd/pg: fix tabbing in replica_clear_repop_obc

c0da409

Signed-off-by: Samuel Just <sjust@redhat.com>

crimson/.../ops_executer: no reason to return cloning_ctx

bb2c45f

Signed-off-by: Samuel Just <sjust@redhat.com>

crimson/.../ops_executer: emplace osd_op_params in fill_op_params

87c3ea2

Signed-off-by: Samuel Just <sjust@redhat.com>

crimson/.../client_request: extend instance_handle lifetime

883d023

We're going to need instance_handle to outlive exiting the pipeline stage as it will later hold a reference to an obc holding that stage. Signed-off-by: Samuel Just <sjust@redhat.com>

crimson/.../internal_client_request: extend start() until stages have…

f7adf67

… exited The operation will hold a reference to the obc containing most of the pipeline stages. Signed-off-by: Samuel Just <sjust@redhat.com>

athanatos force-pushed the sjust/wip-crimson-io-3 branch from 5188d65 to b931baf Compare December 10, 2024 15:33

athanatos added 6 commits December 10, 2024 18:01

crimson/.../client_request: move log line to complete_request callback

adc63e2

Signed-off-by: Samuel Just <sjust@redhat.com>

crimson: inline SnapTrimObjSubEvent::process_and_submit

c8f19ea

start() isn't particularly long and splitting it here isn't all that helpful. Signed-off-by: Samuel Just <sjust@redhat.com>

crimson/.../pg: convert submit_error_log to coroutine

fda23c9

Signed-off-by: Samuel Just <sjust@redhat.com>

crimson/.../pg: update debugging in PG::submit_error_log

5021e87

Signed-off-by: Samuel Just <sjust@redhat.com>

crimson: inline InternalClientRequest::do_process

9ed81f5

It's pretty short and this way all of the stage transitions are in one place. Signed-off-by: Samuel Just <sjust@redhat.com>

athanatos force-pushed the sjust/wip-crimson-io-3 branch from 0f0c606 to 3092e6c Compare December 11, 2024 02:02

cyx1231st added the crimson-perf label Dec 12, 2024

Matan-B reviewed Dec 12, 2024

github-actions bot added the documentation label Dec 13, 2024

cyx1231st reviewed Dec 13, 2024

athanatos added 15 commits December 13, 2024 12:32

crimson: add CommonOBCPipeline

29dedef

Signed-off-by: Samuel Just <sjust@redhat.com>

crimson/.../osd_operation*: add wait_pg_ready and get_obc

6421053

Signed-off-by: Samuel Just <sjust@redhat.com>

crimson: convert client_request, internal_client_request, snaptrim_ev…

f655f7f

…ent to use obc stages Signed-off-by: Samuel Just <sjust@redhat.com>

crimson/.../pg_backend: split clone into clone_for_write, set_metadata

86588d2

Signed-off-by: Samuel Just <sjust@redhat.com>

crimson/.../ops_executer: rename flush_changes_n_do_effects to flush_…

45cc9e9

…changes_and_submit Signed-off-by: Samuel Just <sjust@redhat.com>

crimson/.../ops_executer: just call submit_transaction in flush_chang…

fc02927

…es_n_do_ops_effects Templating MutFunc was pretty confusing, and flush_changes_n_do_ops_effects is already closely coupled to PG::submit_transaction. Signed-off-by: Samuel Just <sjust@redhat.com>

crimson/.../pg: more debugging

9e69d0e

Signed-off-by: Samuel Just <sjust@redhat.com>

crimson/osd/object_context_loader: print obc (with refcount) upon rel…

0c87de8

…ease Signed-off-by: Samuel Just <sjust@redhat.com>

crimson: remove now unused pipeline stages

4c46b01

Signed-off-by: Samuel Just <sjust@redhat.com>

athanatos force-pushed the sjust/wip-crimson-io-3 branch from 5496e89 to dbb129c Compare December 13, 2024 20:33

athanatos requested review from Matan-B and cyx1231st December 14, 2024 03:38

athanatos mentioned this pull request Dec 14, 2024

crimson: allow replica side write commits to pipeline #61086

Merged

Matan-B approved these changes Dec 15, 2024

Matan-B added TESTED ready-to-merge labels Dec 15, 2024

cyx1231st approved these changes Dec 16, 2024

athanatos merged commit 3725c74 into ceph:main Dec 16, 2024

markhpc added the performance label Dec 19, 2024

athanatos mentioned this pull request Mar 18, 2025

crimson/osd/pg: Logically ignore older repops replies #62277

Closed

14 tasks

4 participants
