Skip to content

crimson: rework request pipeline to allow for per-object processing#61005

Merged
athanatos merged 29 commits intoceph:mainfrom
athanatos:sjust/wip-crimson-io-3
Dec 16, 2024
Merged

crimson: rework request pipeline to allow for per-object processing#61005
athanatos merged 29 commits intoceph:mainfrom
athanatos:sjust/wip-crimson-io-3

Conversation

@athanatos
Copy link
Contributor

This PR reworks the client IO pipeline to allow one request per object to be processed concurrently. Writes still need to serialize at submission time.

For random reads with high concurrency, this seems to be good for a ~8-10% throughput increase on a single OSD with seastore in my environment. Further refinement may well improve things further, but this branch has gotten large as is.

Teuthology run looks good, all tests passed: https://pulpito.ceph.com/sjust-2024-12-06_21:24:24-crimson-rados-wip-sjust-crimson-testing-2024-12-06-distro-default-smithi/

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@github-actions
Copy link

github-actions bot commented Dec 9, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
@athanatos athanatos force-pushed the sjust/wip-crimson-io-3 branch from 1d577fe to 5188d65 Compare December 9, 2024 20:13
It's always possible for there to be an in-progress replica-read
when the replica processes a repop.  It's rare in our tests because
the read and write submitted by the test client would need to
overlap in time.  This makes the results non-deterministic and
thus a somewhat less sensitive test.  Note, the space of valid
results is well defined -- it would have to be state before or
after any of the outstanding writes.  Any other result or a torn
read would be wrong.  It's probably worth updating RadosModel
to add such a pattern, but we can do that later.

This branch makes this race much more likely and even observable
with the existing RadosModel implementation as it extends the
obc lifetime past the point of returning the result to the client
in order to ensure that it outlives the handle.

Fixes: https://tracker.ceph.com/issues/69013
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
We're going to need instance_handle to outlive exiting the
pipeline stage as it will later hold a reference to an
obc holding that stage.

Signed-off-by: Samuel Just <sjust@redhat.com>
… exited

The operation will hold a reference to the obc containing most of the
pipeline stages.

Signed-off-by: Samuel Just <sjust@redhat.com>
@athanatos athanatos force-pushed the sjust/wip-crimson-io-3 branch from 5188d65 to b931baf Compare December 10, 2024 15:33
Signed-off-by: Samuel Just <sjust@redhat.com>
start() isn't particularly long and splitting it here isn't
all that helpful.

Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
…ages

It's important to construct and submit log entries atomically because
the submission order needs to match the entry numbering.  Previously,
this occurred under the pg-wide exclusive process stage.  Shortly, each
obc will have its own process stage, so we need to ensure that atomicity
seperately from the pipeline stages.  Introduce a simple mutex.

Signed-off-by: Samuel Just <sjust@redhat.com>
It's pretty short and this way all of the stage transitions are
in one place.

Signed-off-by: Samuel Just <sjust@redhat.com>
@athanatos athanatos force-pushed the sjust/wip-crimson-io-3 branch from 0f0c606 to 3092e6c Compare December 11, 2024 02:02
@athanatos
Copy link
Contributor Author

jenkins test make check

@cyx1231st
Copy link
Member

I see ~10-12% throughput increase compared with "C. fine-grained-cache" (SeaStore, single OSD, 32c, RBD) from #60654.

}

/*
* Clears unreferenced elements from the lru set [from, to]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep this option in our interface? It might be useful to clear unreferenced without invalidating them (in some future use-case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not keep it without users. We can always readd it if we need it later.

const OpInfo &op_info,
std::vector<OSDOp>& ops);

seastar::shared_mutex submit_lock;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of scope: Should we introduce access limitations to any log entries submissions that would require taking submit_lock without delegating the responsibility to the next (any future) user?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, but it seems like it would probably just make the interface harder to use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main hole is that usage in snaptrim_event.cc. That should really get cleaned up in a subsequent refactor, but it can wait for now.

co_await manager.target_state.lock_to(lock_type);
} else {
manager.target_state.lock_excl_sync();
manager.target_state.obc->loading = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case we stay with loading, should we mark as false once we finish to load?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not as written, needs to stay set. I could change the name, but I'd rather leave it as is for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to loading_started, added a comment explaining the usage and invariant.

} wait_repop;
};

struct CommonOBCPipeline {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of scope: we should update dev/crimson/pipeline.rst in the future to reflect some of the new changes.

Copy link
Contributor Author

@athanatos athanatos Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot about that, adding a commit updating it.


DEBUGDPP("{}.{}: entering wait_pg_ready stage",
*pgref, *this, this_instance_id);
ihref.enter_stage_sync(client_pp(pg).wait_pg_ready, *this);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain why enter_stage won't work here? which order are we preserving?
Perhaps we can add a comment here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enter_stage would work, but it's strictly less efficient as the interface allows it to block (though it never will). We're using enter_stage_sync here because the prior stage is PerShardPipeline::create_or_wait_pg (OrderedExclusive) and CommonPGPipeline::WaitPGReady is OrderedConcurrent. Exiting an OrderedExclusive stage never blocks and entering an OrderedConcurrent stage never blocks. enter_stage_sync for that transition, therefore, does not need to allow for blocking and is therefore simpler and more efficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding a comment


if (existed) {
if (manager.head_state.is_empty()) {
ceph_assert(manager.target_state.is_empty());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this line be moved out of if (manager.head_state.is_empty()) because manager.set_state_obc(manager.target_state, manager.head_state.obc) below is called unconditionally ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, repushed.

- adds ObjectContext::obc_pipeline
- exposes ObjectContext::obc_pipeline via ObjectContextLoader::Orderer
- allows obcs to be in the registry without being loaded
- adds ObjectContext::loading bool to signal that loading has begun

Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
…ent to use obc stages

Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
…changes_and_submit

Signed-off-by: Samuel Just <sjust@redhat.com>
…es_n_do_ops_effects

Templating MutFunc was pretty confusing, and flush_changes_n_do_ops_effects
is already closely coupled to PG::submit_transaction.

Signed-off-by: Samuel Just <sjust@redhat.com>
That the log entry's verison matches the object_info on the actual
object is a pretty core invariant.  This commit moves creating the
log entry for head and populating the metadata into
OpsExecuter::prepare_head_update.

As a side effect, flush_clone_metadata and CloningCtx::apply_to
were removed and split between prepare_head_update (portions
related to the head's ssc) and flush_changes_and_submit.

Signed-off-by: Samuel Just <sjust@redhat.com>
We want to emplace and initialize osd_op_params upon first write,
but we don't want to fill at_version, pg_trim_to, pg_committed_to,
or last_complete until prepare_transaction because we don't want to
require a particular commit order any earlier than we have to.

Signed-off-by: Samuel Just <sjust@redhat.com>
…x,complete_cloning_ctx

We need to defer versioning the clone oi and log entry until
commit time while ensuring that the clone operation occurs
first in the transaction.

Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
…ease

Signed-off-by: Samuel Just <sjust@redhat.com>
Repops previously used PGPipeline::await_map.  This is actually
important as we need them to be processed in order.  However, using
await_map was confusing and using a single exclusive stage is decidedly
unoptimal as we could allow pipelineing on write commit.  For now, move
them over to their own pipeline stage so we can remove the PGPipeline
struct entirely.  Later, we'll improve replica write pipelining for
better replica-side write concurrency.

Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
This commit updates pipeline.rst to include some basic information about
how the pipeline stages now work.  I've removed the explicit listing of
the different stages as I'd rather readers refer to the actual
implementation for those details to avoid them getting out of date.

I also removed the comparison to classic as the approach has now diverged
quite a bit and I feel that the ordering part is more important to focus
on than the points at which processing might block.

Signed-off-by: Samuel Just <sjust@redhat.com>
@athanatos athanatos force-pushed the sjust/wip-crimson-io-3 branch from 5496e89 to dbb129c Compare December 13, 2024 20:33
@athanatos
Copy link
Contributor Author

jenkins test make check

@athanatos
Copy link
Contributor Author

jenkins test make check arm64

Copy link
Contributor

@Matan-B Matan-B left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the additional docs/comments!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Merged (Pre Tentacle Freeze)

Development

Successfully merging this pull request may close these issues.

4 participants