Skip to content

osd,crimson/osd: rework of replica read and related state#56677

Merged
athanatos merged 26 commits intoceph:mainfrom
athanatos:sjust/for-review/wip-replica-read
Nov 4, 2024
Merged

osd,crimson/osd: rework of replica read and related state#56677
athanatos merged 26 commits intoceph:mainfrom
athanatos:sjust/for-review/wip-replica-read

Conversation

@athanatos
Copy link
Contributor

This PR does a few related things:

  • rename last_update_ondisk to pg_committed_to and clarify semantics
  • update replica (ec and replicated) to maintain pg_committed_to rather than min_last_complete_ondisk
  • populate primary->replica (replicated and ec) messages to propagate pg_committed_to rather than min_last_complete_ondisk

The above result in addressing https://tracker.ceph.com/issues/65086 and https://tracker.ceph.com/issues/65085.

The commits are structured to break the above into steps. See commit messages and included code comments for further details.

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@athanatos
Copy link
Contributor Author

athanatos commented Apr 4, 2024

@athanatos athanatos force-pushed the sjust/for-review/wip-replica-read branch 3 times, most recently from b8e338f to 1e3d12b Compare April 9, 2024 20:56
@athanatos athanatos force-pushed the sjust/for-review/wip-replica-read branch from 1e3d12b to d50fb42 Compare April 10, 2024 23:26
@github-actions
Copy link

github-actions bot commented Jun 9, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@athanatos athanatos force-pushed the sjust/for-review/wip-replica-read branch from d50fb42 to 1035752 Compare June 12, 2024 02:41
@athanatos
Copy link
Contributor Author

New commits add mechanism for allowing replicas to do replica reads at last_update after a delay.

@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@athanatos athanatos force-pushed the sjust/for-review/wip-replica-read branch from 1035752 to 20c4e77 Compare June 17, 2024 18:19
@athanatos athanatos force-pushed the sjust/for-review/wip-replica-read branch from 20c4e77 to 7cb3c48 Compare June 17, 2024 20:09
@rzarzynski rzarzynski requested review from Matan-B and rzarzynski July 8, 2024 18:40
@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@athanatos athanatos force-pushed the sjust/for-review/wip-replica-read branch from 7cb3c48 to 072d57e Compare August 7, 2024 02:19
Signed-off-by: Samuel Just <sjust@redhat.com>
…_updated

Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
The name last_update_ondisk is misleading as it suggests a local
property like last_update_applied rather than a pg-global
property.  Clarify the name and add a much more specific comment.

Signed-off-by: Samuel Just <sjust@redhat.com>
Avoid maintaining pg_committed_to if pg is not active.  We can't
guarantee that last_update won't become divergent, so it doesn't
provide useful information.

Signed-off-by: Samuel Just <sjust@redhat.com>
The purpose of this rename is merely to clarify that the necessary
condition on ec roll-forward is that the pg has committed up to that
point.  Along with subsequent commits, this will clarify that both ec
and replicated pools propagate pg_committed_to for related if not
identical reasons.  Because EC::submit_transaction already did

  op->roll_forward_to = std::max(min_last_complete_ondisk, rmw_pipeline.committed_to);

there's no difference in behavior as rmw_pipeline.committed_to is
updated immediately after the notification to the PG that the write
completed.

Signed-off-by: Samuel Just <sjust@redhat.com>
…last_complete_ondisk

This commit updates the bulk of the interface pathways in crimson and
classic to refer to pg_committed_to rather than min_last_complete_ondisk
and changes the replica side to maintain pg_committed_to instead.

This commit shouldn't actually cause any behavior change -- we're still
passing min_last_complete_ondisk (which is a valid lower bound for
pg_committed_to!).

Signed-off-by: Samuel Just <sjust@redhat.com>
…nsaction

This commit actually changes the bound we're propagating.  This solves
two bugs:
- Using min_last_complete_ondisk caused replicas to be two update rounds
  behind rather than one
- Replicas don't actually have enough information to set
  min_last_complete_ondisk on activation, so we couldn't serve replica
  reads until the first write.  pg_committed_to, on the other hand, is
  fine as the activation last_update cannot become divergent.

Moreover, last_complete won't advance past missing objects causing
min_last_complete_ondisk to be blocked by any replica missing object.
Note that the replica read pathway seperately checks whether the target
is missing locally, so that property was not needed.

Fixes: https://tracker.ceph.com/issues/65086
Fixes: https://tracker.ceph.com/issues/65085
Signed-off-by: Samuel Just <sjust@redhat.com>
…Missing and related

Signed-off-by: Samuel Just <sjust@redhat.com>
…ries

This matches the behavior for normal IOs.

Signed-off-by: Samuel Just <sjust@redhat.com>
It wouldn't actually be wrong for the primary to trim the log right up
to the pg_committed_to bound it is propagating (though it generally
won't).

Signed-off-by: Samuel Just <sjust@redhat.com>
See comment for details. Modifies ECBackend::submit_transaction to
use the passed pg_committed_to unconditionally, adds a comment to
explain, and adds a comment to RMWPipeline::pg_committed_to to
clarify that it may lag PeeringState::pg_committed_to.

Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
…notify_t

Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
…remove

We don't support balanced reads on ec pools.  Additionally, the yaml
actually specifies 'balanced_reads' rather than 'balance_reads' and
therefore has no actual effect.

Signed-off-by: Samuel Just <sjust@redhat.com>
@athanatos athanatos force-pushed the sjust/for-review/wip-replica-read branch from e4515b9 to dda683b Compare October 22, 2024 04:16
@athanatos
Copy link
Contributor Author

jenkins test api

@athanatos
Copy link
Contributor Author

Previous run uncovered a bug in intrusive_timer with initialization order, fixed in new version.

New test run: https://pulpito.ceph.com/sjust-2024-10-22_15:57:45-rados-wip-sjust-testing-2024-10-21-distro-default-smithi/
Some env failures, did a rerun: https://pulpito.ceph.com/sjust-2024-10-25_19:15:47-rados-wip-sjust-testing-2024-10-21-distro-default-smithi/

Remaining failures appear to be a subset of main: https://pulpito.ceph.com/teuthology-2024-10-27_20:00:17-rados-main-distro-default-smithi/

I think this is ready to merge. I'll merge later this week if there are no objections or outstanding reviews to wait for. @Matan-B @rzarzynski @cyx1231st

@athanatos
Copy link
Contributor Author

jenkins test api

@athanatos athanatos merged commit 048ce81 into ceph:main Nov 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants