Skip to content

rgw: switch back to boost::asio for spawn() and yield_context#55592

Merged
cbodley merged 12 commits intoceph:mainfrom
cbodley:wip-boost-asio-spawn
May 21, 2024
Merged

rgw: switch back to boost::asio for spawn() and yield_context#55592
cbodley merged 12 commits intoceph:mainfrom
cbodley:wip-boost-asio-spawn

Conversation

@cbodley
Copy link
Contributor

@cbodley cbodley commented Feb 15, 2024

a fork of boost::asio::spawn() was introduced in 2020 with spawn::spawn() from #31580. this fork enabled rgw to customize how the coroutine stacks are allocated in order to avoid stack overflows in frontend request coroutines. this customization was based on a StackAllocator concept from the boost::context library

in boost 1.80, that same StackAllocator overload was added to boost::asio::spawn(), along with other improvements like per-op cancellation. now that boost has everything we need, switch back and drop the spawn submodule

this required switching a lot of async functions from async_completion<> to async_initiate<>. similar changes were necessary to enable the c++20 coroutine token boost::asio::use_awaitable

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@cbodley cbodley force-pushed the wip-boost-asio-spawn branch 3 times, most recently from e1b1277 to 593002f Compare February 15, 2024 14:47
@cbodley cbodley marked this pull request as ready for review February 15, 2024 15:43
@cbodley cbodley requested review from a team as code owners February 15, 2024 15:43
@cbodley cbodley requested a review from adamemerson February 15, 2024 15:43
@cbodley cbodley force-pushed the wip-boost-asio-spawn branch from 593002f to 6ce8b0b Compare February 15, 2024 15:51
Copy link
Contributor

@adamemerson adamemerson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@cbodley
Copy link
Contributor Author

cbodley commented Feb 21, 2024

i'll wait a while to merge this. it conflicts with things like #55214 and #55051 that we want on squid

@github-actions
Copy link

github-actions bot commented Mar 8, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@@ -122,7 +122,6 @@ namespace {
class Waiter {
using Signature = void(boost::system::error_code);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should probably make an attempt to get rid of this class entirely, but that's not urgent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently, it is only used to yield a coroutine.
so, for starter, we can remove the mutex+cond var code.

@cbodley
Copy link
Contributor Author

cbodley commented Apr 19, 2024

i tried rebasing but saw lots of errors building new d4n stuff. i'll have to work through that too

cbodley added 11 commits May 13, 2024 12:13
Signed-off-by: Casey Bodley <cbodley@redhat.com>
the qat async initiator functions were based on async_completion<> and
its completion_handler member, but the updated boost::asio::yield_context
doesn't provide a completion_handler. switch to the updated
async_initate() method which does work with boost::asio::yield_context

Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
a fork of boost::asio::spawn() was introduced in 2020 with spawn::spawn() from ceph#31580. this fork enabled rgw to customize how the coroutine stacks are allocated in order to avoid stack overflows in frontend request coroutines. this customization was based on a StackAllocator concept from the boost::context library

in boost 1.80, that same StackAllocator overload was added to boost::asio::spawn(), along with other improvements like per-op cancellation. now that boost has everything we need, switch back and drop the spawn submodule

this required switching a lot of async functions from async_completion<> to async_initiate<>. similar changes were necessary to enable the c++20 coroutine token boost::asio::use_awaitable

Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
@cbodley cbodley force-pushed the wip-boost-asio-spawn branch from d448be7 to 79a6459 Compare May 13, 2024 16:57
@cbodley
Copy link
Contributor Author

cbodley commented May 13, 2024

rebased over #56979 and #57083

@cbodley cbodley added needs-qa and removed TESTED labels May 13, 2024
@cbodley
Copy link
Contributor Author

cbodley commented May 21, 2024

jenkins test api

@cbodley
Copy link
Contributor Author

cbodley commented May 21, 2024

jenkins test make check

@cbodley cbodley merged commit b7328fe into ceph:main May 21, 2024
@cbodley cbodley deleted the wip-boost-asio-spawn branch May 21, 2024 20:24
@yuvalif
Copy link
Contributor

yuvalif commented May 22, 2024

passed qa in https://pulpito.ceph.com/cbodley-2024-05-14_15:58:02-rgw-wip-boost-asio-spawn-distro-default-smithi/ with rerun https://pulpito.ceph.com/cbodley-2024-05-15_12:34:15-rgw-wip-boost-asio-spawn-distro-default-smithi/

@cbodley the kafka test failure in the above is a regression with this PR (although it is hard to tell as a similar failure existed before).
the issue is reproduced locally consistently (you would need kafka to run onn localhost):

$ MON=1 OSD=1 MDS=0 MGR=0 RGW=1 ../src/vstart.sh -n -d
$ aws --endpoint-url http://localhost:8000 sns create-topic --name=fishtopic --attributes='{"push-endpoint":  kafka://localhost"}'
$ aws --endpoint-url http://localhost:8000 s3 mb s3://fish
$ aws --endpoint-url http://localhost:8000 s3api put-bucket-notification-configuration --bucket fish --notification-configuration='{"TopicConfigurations": [{"Id": "notif1", "TopicArn": "arn:aws:sns:default::fishtopic", "Events": []}]}'
$ head -c 512 </dev/urandom > myfile
$ aws --endpoint-url http://localhost:8000 s3 cp myfile s3://fish

last call will hang.
in case of persistent notification, the call will not hang (as it just pushed the notification to the queue). but the processing of the queue by the notification manager will hang.

Comment on lines +159 to -160
}, token, yield.get_executor());
l.unlock();
init.result.get();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuvalif i think the regression is probably due to this change. the unlock() should happen before suspend, so probably needs to be inside the lambda

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change was done here: #54697
and was backported to reef and quincy.
so, we probably need to make this chnage there as well.

somewhat related, we may have an race condition in the http client:
https://github.com/ceph/ceph/blob/main/src/rgw/rgw_http_client.cc#L73
we don't use the lock to protect the "done" flag. maybe this is the reason for: https://tracker.ceph.com/issues/66033 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting. done is atomic, but it does look like wait() could still race with the finish() on another thread. i wonder why we haven't seen this anywhere else

i pushed a commit cbodley@4eab016 to lock before the done check but test_notification_push_http still fails the same way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it could be that there is another issue. however, isn't it better to fix that regardless?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opened #57632

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants