Skip to content

objecter: use configurable to set local and balanced reads#56180

Merged
yuriw merged 11 commits intoceph:mainfrom
yehudasa:wip-objecter-local-read
Jan 21, 2025
Merged

objecter: use configurable to set local and balanced reads#56180
yuriw merged 11 commits intoceph:mainfrom
yehudasa:wip-objecter-local-read

Conversation

@yehudasa
Copy link
Member

@yehudasa yehudasa commented Mar 13, 2024

This sets the relevant flags when a read op is created. The configurable allows for

  • 'localize' : localized reads
  • 'balance': balanced reads
  • 'default': the default behavior (read from primary)

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

Copy link
Member

@gregsfortytwo gregsfortytwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this is short and sweet. I've been leaning towards preferring per-component switches to allow more granular control of things, but the simplicity and universality here makes me think again.

bool keep_balanced_budget = false;
bool honor_pool_full = true;

std::atomic<int> extra_read_flags{0};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eww do we really need to make this atomic and stick a memory barrier in the middle of all those constructor calls?
All we really need here is for the memory update to not be torn and I don't think that can happen for these sizes?

Copy link
Contributor

@athanatos athanatos Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the atomic, read/write races are undefined behavior otherwise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want to avoid the atomic here, we might consider removing the support for runtime changes. i think we care more about performance here than the convenience around config changes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need sequentially consistent memory ordering (default for a C++ atomic), release-acquire ordering should be sufficient - https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_ordering .

On X86 with its strong memory model this results in no memory barriers in the generated code, just constraints on the compiler not to re-order code. On other architectures (e.g. ARM and POWER) release-acquire ordering adds synchronization instructions, but these are still have less overhead than the instructions added for sequential consistent memory order.

So using code like this when reading the flags would be more efficient: extra_read_flags.load(std::memory_order_acquire)

Given that the balanced and localized flags are just performance optimizations, and as such it doesn't really matter if a race hazard between threads mean a few I/Os use an out of date copy of flags, then a weaker memory ordering of memory_order_relaxed could be used

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a weaker memory ordering seems reasonable to me -- my main concern was simply that we use defined C++ memory ordering semantics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved by new commit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@yehudasa yehudasa changed the title [DNM] objecter: use configurable to set local and balanced reads objecter: use configurable to set local and balanced reads Mar 25, 2024
bool keep_balanced_budget = false;
bool honor_pool_full = true;

std::atomic<int> extra_read_flags{0};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a global_ops_flags which is also atomic and appears only to be use by the filer Client::set_filer_flags and Client::get_filer_flags to set the same LOCALIZE_READS flag.

Why do we need two separate sets of additional flags, why can't this code just set the global_ops_flags?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

global_ops_flags is used for all ops, extra_read_flags is used only for read ops

bool keep_balanced_budget = false;
bool honor_pool_full = true;

std::atomic<int> extra_read_flags{0};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved by new commit

Copy link
Contributor

@cbodley cbodley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like this approach very much. is there a natural place in the rados suite to the add test coverage necessary to validate it?

@yehudasa yehudasa requested review from a team as code owners May 9, 2024 16:23
@yehudasa yehudasa requested a review from a team as a code owner May 9, 2024 19:36
@github-actions github-actions bot added the core label May 9, 2024
@yehudasa yehudasa force-pushed the wip-objecter-local-read branch 2 times, most recently from 5e5ed24 to 369544b Compare May 9, 2024 20:25
@yehudasa
Copy link
Member Author

yehudasa commented May 9, 2024

@cbodley see

i like this approach very much. is there a natural place in the rados suite to the add test coverage necessary to validate it?

@cbodley see 369544b (untested yet)

@cbodley
Copy link
Contributor

cbodley commented May 9, 2024

@cbodley see 369544b (untested yet)

i see that the rados/verify and rados/basic subsuites both run all of the cls tests. one of those would probably be a good place for that read-affinity subdirectory. it should also include an empty default.yaml so the default configuration is still part of the matrix. i know the rados suite is already gigantic, so we might consider using read-affinity$ so each job picks one at random instead of tripling the number of jobs

cc @rzarzynski

osd:
crush location: 'root=default host=host-$id'
client:
rados replica read policy: balance
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@idryomov does the crush location required for balance read similar to local?

Do this feature supported in ec pool ? The name suggests it is replica feature

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looping @rzarzynski as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@idryomov does the crush location required for balance read similar to local?

No, specifying just rados replica read policy: balance should be enough.

Do this feature supported in ec pool ? The name suggests it is replica feature

It's not supported on EC pools.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @idryomov

bool keep_balanced_budget = false;
bool honor_pool_full = true;

std::atomic<int> extra_read_flags{0};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

bool keep_balanced_budget = false;
bool honor_pool_full = true;

std::atomic<int> extra_read_flags{0};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is there a particular reason for using signed type for the extra_read_flags?

mutate() takes flags as int, I see.

bufferlist *pbl,
int flags)
int flags,
int flags_mask)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: mixed indentation.

@mkogan1 mkogan1 self-requested a review June 24, 2024 10:42
Copy link
Contributor

@mkogan1 mkogan1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graphs are showing IO operations on each of the replica 3 OSDs while performing S3 GET operations of 1M 4KB objects.
On the left ./bin/radosgw ... --rados_replica_read_policy=default and on the right ./bin/radosgw ... --rados_replica_read_policy=balance where please note the IO operations are distributed almost evenly between the drives.
image

@cbodley
Copy link
Contributor

cbodley commented Jun 26, 2024

i see that the rados/verify and rados/basic subsuites both run all of the cls tests. one of those would probably be a good place for that read-affinity subdirectory. it should also include an empty default.yaml so the default configuration is still part of the matrix. i know the rados suite is already gigantic, so we might consider using read-affinity$ so each job picks one at random instead of tripling the number of jobs

cc @rzarzynski

@rzarzynski do you have any feedback on this part? i see that you've approved - do you think this is ready for qa and merge?

@mattbenjamin
Copy link
Contributor

@mkogan1 thanks for testing balanced reads! are there any other workloads you have setups to test? @vumrao will be testing local reads at scale

@yehudasa yehudasa force-pushed the wip-objecter-local-read branch from 369544b to 2bda7d4 Compare July 2, 2024 23:48
@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Oct 13, 2024
@cbodley cbodley removed the stale label Oct 14, 2024
@gregsfortytwo
Copy link
Member

Hi folks, what needs to happen for this to merge? I thought it was in already and I think we've pulled it in to our product so I'd like to see it get in. :o

@cbodley
Copy link
Contributor

cbodley commented Dec 4, 2024

i see that the rados/verify and rados/basic subsuites both run all of the cls tests. one of those would probably be a good place for that read-affinity subdirectory. it should also include an empty default.yaml so the default configuration is still part of the matrix. i know the rados suite is already gigantic, so we might consider using read-affinity$ so each job picks one at random instead of tripling the number of jobs
cc @rzarzynski

@rzarzynski do you have any feedback on this part? i see that you've approved - do you think this is ready for qa and merge?

ping. combinatorics aside, my main concern is that the addition of read-affinity/{balance.yaml, local.yaml} is removing test coverage of the default configuration

@rzarzynski
Copy link
Contributor

rzarzynski commented Dec 4, 2024

ping. combinatorics aside, my main concern is that the addition of read-affinity/{balance.yaml, local.yaml} is removing test coverage of the default configuration

I agree. Even if this was a saving measure on the size of suite, it would be better to test for default and e.g. localize.

Let's add the missing yaml, rebase and merge. Thanks for finding this, @cbodley!

jiffin and others added 9 commits December 12, 2024 17:01
Signed-off-by: Jiffin Tony Thottan <thottanjiffin@gmail.com>
Signed-off-by: Yehuda Sadeh <ysadehwe@ibm.com>
for localized and balanced reads.

Signed-off-by: Yehuda Sadeh <ysadehwe@ibm.com>
Signed-off-by: Yehuda Sadeh <ysadehwe@ibm.com>
Signed-off-by: Yehuda Sadeh <ysadehwe@ibm.com>
Many of the exec calls were using a read version of the API while they
were calling a write objclass method. This modifies the calls to use
the write API.

Fixes: https://tracker.ceph.com/issues/65889

Signed-off-by: Yehuda Sadeh <ysadehwe@ibm.com>
These calls are read calls but can and have been used as write calls,
which doesn't work with local (and balanced) affinity.

Signed-off-by: Yehuda Sadeh <ysadehwe@ibm.com>
Signed-off-by: Yehuda Sadeh <ysadehwe@ibm.com>
Signed-off-by: Yehuda Sadeh <ysadehwe@ibm.com>
@cbodley cbodley force-pushed the wip-objecter-local-read branch from fe19a13 to 24087ec Compare December 12, 2024 22:14
@cbodley
Copy link
Contributor

cbodley commented Dec 12, 2024

i pushed a rebase:

src/cls/rgw/cls_rgw_client.cc conflicts came from #58448 which had already converted cls_rgw_lc_put_head(), cls_rgw_lc_rm_entry(), cls_rgw_lc_set_entry() to Object*Operation for use with async_operate()

src/common/options/rgw.yaml.in had a temporary conflicts between the addition/removal of rgw_replica_read_policy and the new rgw_bucket_logging_obj_roll_time

with read-affinity/ at the top level of the rados suite, teuthology
would try to run it as a subsuite instead of mixing its yamls into an
existing subsuite. move into rados/verify instead

Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
@cbodley
Copy link
Contributor

cbodley commented Dec 12, 2024

I agree. Even if this was a saving measure on the size of suite, it would be better to test for default and e.g. localize.

Let's add the missing yaml, rebase and merge. Thanks for finding this, @cbodley!

thanks @rzarzynski. i added the empty default.yaml and an additional commit:

qa/rados: move read-affinity/ under rados/verify subsuite

with read-affinity/ at the top level of the rados suite, teuthology
would try to run it as a subsuite instead of mixing its yamls into an
existing subsuite. move into rados/verify instead

let me know if you'd prefer it somewhere other than rados/verify, or if you want that symlinked into multiple rados subsuites

Copy link
Contributor

@idryomov idryomov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack from the RBD perspective (at least one more instance of exec() misuse has sneaked in into RBD in the interim, but we can take care of it separately)

@yuriw yuriw merged commit 13883f5 into ceph:main Jan 21, 2025
@idryomov
Copy link
Contributor

at least one more instance of exec() misuse has sneaked in into RBD in the interim, but we can take care of it separately

#61471

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.