rgw/async/notifications: use common async waiter in pubsub push by yuvalif · Pull Request #58765 · ceph/ceph

yuvalif · 2024-07-23T17:52:37Z

Fixes: https://tracker.ceph.com/issues/64184

Checklist

Tracker (select at least one)
- References tracker ticket
- Very recent bug; references commit where it was introduced
- New feature (ticket optional)
- Doc update (no ticket needed)
- Code cleanup (no ticket needed)
Component impact
- Affects Dashboard, opened tracker ticket
- Affects Orchestrator, opened tracker ticket
- No impact that needs to be tracked
Documentation (select at least one)
- Updates relevant documentation
- No doc update is appropriate
Tests (select at least one)
- Includes unit test(s)
- Includes integration test(s)
- Includes bug reproducer
- No tests

Show available Jenkins commands

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows
jenkins test rook e2e

cbodley

cool 👍

cbodley · 2024-07-23T18:41:38Z

src/test/common/test_async_yield_waiter.cc

+#include <thread>
 #include <boost/asio/io_context.hpp>
 #include <boost/asio/spawn.hpp>
+#include <boost/context/protected_fixedsize_stack.hpp>


nit: unused header

cbodley · 2024-07-23T18:41:43Z

src/test/common/test_async_yield_waiter.cc


+void invoke_callback(int expected_reply, std::function<void(int)> cb) {
+  auto t = std::thread([cb, expected_reply] {
+      std::this_thread::sleep_for(std::chrono::seconds(1));


i've really been trying to avoid sleeps in unit tests. short sleeps make for flaky tests, and long sleeps make them slow (that adds up in 'make check'). if the test cases are fast, you can run them with something like --gtest_repeat=1000 under valgrind or sanitizers to help catch any races

in this case, you just need to ensure that cb() is called after waiter.async_wait() suspends

boost::asio::defer() is a useful tool for this. it schedules a handler to run after the current one finishes. you can use that to avoid the sleep by spawning the thread in a deferred block:

- invoke_callback(expected_reply, [&waiter](int r) {waiter.complete(boost::system::error_code{}, r);}); + // trigger completion after async_wait() suspends + boost::asio::defer(yield.get_executor(), [&] { + invoke_callback(expected_reply, [&waiter](int r) {waiter.complete(boost::system::error_code{}, r);}); + }); reply = waiter.async_wait(yield);

cross_thread_cancel is a recent example that does something similar, although it spawns a separate thread for the execution context, waits on a latch for that thread to block in throttle.wait(), then triggers cancellation from the main thread: ac4c8ad

lack of sleeps there keep the runtime extremely short:

12 tests from co_throttle (3 ms total)

hmm. calling "defer" (with or without the timeout) fixes the tsan errors.
maybe this is the root cause for the problem?
will try to modify that in rgw_pubsub_push.cc

changed the notifications code to use defer

cbodley · 2024-07-24T16:37:29Z

src/rgw/driver/rados/rgw_pubsub_push.cc

-    if (y) {
-      boost::system::error_code ec;
-      auto yield = y.get_yield_context();
-      auto&& token = yield[ec];
-      boost::asio::async_initiate<boost::asio::yield_context, Signature>(
-          [this, &l] (auto handler, auto ex) {
-            completion = Completion::create(ex, std::move(handler));
-            l.unlock(); // unlock before suspend
-          }, token, yield.get_executor());
-      return -ec.value();
-    }
-    maybe_warn_about_blocking(dpp);
-
-    cond.wait(l, [this]{return (done==true);});


the existing Waiter handles both states of optional_yield. ceph::async::yield_waiter only accepts the yield_context, and calling y.get_yield_context() on a null_yield is undefined behavior

i think this Waiter should use ceph::async::yield_waiter internally, while preserving the mutex/condition_variable for the null_yield case

currently, these functions are called either from the frontend or from the notifications manager.
both are not using the null_yield. to avoid undefined behavior, i can add an assert

don't we also send notifications from background threads like lifecycle and multisite? those would pass null_yield

yes, you are right (forgot about these)

looks like i'm missing a non-persistent lifecycle notifications test (lifecycle still passing with persistent notifications).

not possible to test LC+non persistent. see: https://tracker.ceph.com/issues/67174

also, no coverage for multisite sync notifications: https://tracker.ceph.com/issues/67178

for testing, you can disable rgw_beast_enable_async so that frontend requests always see null_yield

kafka tests are passing locally with the flag set to "false" (vstart.sh -n -d -o "rgw_beast_enable_async=false")

can we set the flag from commandline in a test, so we can cover that in teuthology as well?

* use the "yield_waiter" and "waiter" from common/async insteasd of the "waiter" implemented inside the bucket notification code (this is so we don't need separate investigations for 2 implementations) * added a unit test that simulate how a separate thread (kafka or amqp) is resuming a coroutine which is created by either the frontend or the notification manager. before using "defer" the unit test is passing, however, when executed under thread sanitizer (using the WITH_TSAN cmake flag) the following errors are observed: https://0x0.st/Xp4P.txt after using "defer" the unit test passes under TSAN without errors. Fixes: https://tracker.ceph.com/issues/64184 Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>

yuvalif · 2024-07-25T16:28:50Z

src/rgw/driver/rados/rgw_pubsub_push.cc

+        });
+        return w.async_wait(yield);
+      }
+      ceph::async::waiter<int> w;


@adamemerson could you please have a look here?

yuvalif · 2024-07-26T11:55:08Z

teuthology is mostly passing: https://pulpito.ceph.com/yuvalif-2024-07-26_08:21:47-rgw:notifications-wip-yuval-64184-distro-default-smithi/
one http test is failing with a known failure (code change is just for kafka and amqp).

yuvalif · 2024-07-31T05:12:10Z

jenkins test api

yuvalif requested a review from a team as a code owner July 23, 2024 17:52

github-actions bot added rgw tests labels Jul 23, 2024

yuvalif requested review from adamemerson and cbodley July 23, 2024 17:52

cbodley reviewed Jul 23, 2024

View reviewed changes

yuvalif force-pushed the wip-yuval-64184 branch from 98e404b to 58edf62 Compare July 24, 2024 16:06

cbodley reviewed Jul 24, 2024

View reviewed changes

adamemerson removed their request for review July 24, 2024 19:30

yuvalif force-pushed the wip-yuval-64184 branch from 58edf62 to 2872c75 Compare July 25, 2024 16:24

yuvalif commented Jul 25, 2024

View reviewed changes

yuvalif requested a review from cbodley July 29, 2024 11:33

cbodley approved these changes Aug 1, 2024

View reviewed changes

ivancich merged commit 202258f into ceph:main Aug 1, 2024

yuvalif mentioned this pull request Mar 17, 2025

squid: rgw/async/notifications: use common async waiter in pubsub push #62337

Merged

Conversation

yuvalif commented Jul 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Checklist

Uh oh!

cbodley left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yuvalif commented Jul 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yuvalif commented Jul 31, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yuvalif commented Jul 23, 2024 •

edited

Loading

yuvalif commented Jul 26, 2024 •

edited

Loading