Skip to content

rgw/async/notifications: use common async waiter in pubsub push#58765

Merged
ivancich merged 1 commit intoceph:mainfrom
yuvalif:wip-yuval-64184
Aug 1, 2024
Merged

rgw/async/notifications: use common async waiter in pubsub push#58765
ivancich merged 1 commit intoceph:mainfrom
yuvalif:wip-yuval-64184

Conversation

@yuvalif
Copy link
Contributor

@yuvalif yuvalif commented Jul 23, 2024

Fixes: https://tracker.ceph.com/issues/64184

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@yuvalif yuvalif requested a review from a team as a code owner July 23, 2024 17:52
@yuvalif yuvalif requested review from adamemerson and cbodley July 23, 2024 17:52
Copy link
Contributor

@cbodley cbodley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool 👍

#include <thread>
#include <boost/asio/io_context.hpp>
#include <boost/asio/spawn.hpp>
#include <boost/context/protected_fixedsize_stack.hpp>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unused header


void invoke_callback(int expected_reply, std::function<void(int)> cb) {
auto t = std::thread([cb, expected_reply] {
std::this_thread::sleep_for(std::chrono::seconds(1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've really been trying to avoid sleeps in unit tests. short sleeps make for flaky tests, and long sleeps make them slow (that adds up in 'make check'). if the test cases are fast, you can run them with something like --gtest_repeat=1000 under valgrind or sanitizers to help catch any races

in this case, you just need to ensure that cb() is called after waiter.async_wait() suspends

boost::asio::defer() is a useful tool for this. it schedules a handler to run after the current one finishes. you can use that to avoid the sleep by spawning the thread in a deferred block:

-        invoke_callback(expected_reply, [&waiter](int r) {waiter.complete(boost::system::error_code{}, r);});
+        // trigger completion after async_wait() suspends
+        boost::asio::defer(yield.get_executor(), [&] {
+            invoke_callback(expected_reply, [&waiter](int r) {waiter.complete(boost::system::error_code{}, r);});
+          });
         reply = waiter.async_wait(yield);

cross_thread_cancel is a recent example that does something similar, although it spawns a separate thread for the execution context, waits on a latch for that thread to block in throttle.wait(), then triggers cancellation from the main thread: ac4c8ad

lack of sleeps there keep the runtime extremely short:

12 tests from co_throttle (3 ms total)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. calling "defer" (with or without the timeout) fixes the tsan errors.
maybe this is the root cause for the problem?
will try to modify that in rgw_pubsub_push.cc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed the notifications code to use defer

Comment on lines -154 to -167
if (y) {
boost::system::error_code ec;
auto yield = y.get_yield_context();
auto&& token = yield[ec];
boost::asio::async_initiate<boost::asio::yield_context, Signature>(
[this, &l] (auto handler, auto ex) {
completion = Completion::create(ex, std::move(handler));
l.unlock(); // unlock before suspend
}, token, yield.get_executor());
return -ec.value();
}
maybe_warn_about_blocking(dpp);

cond.wait(l, [this]{return (done==true);});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the existing Waiter handles both states of optional_yield. ceph::async::yield_waiter only accepts the yield_context, and calling y.get_yield_context() on a null_yield is undefined behavior

i think this Waiter should use ceph::async::yield_waiter internally, while preserving the mutex/condition_variable for the null_yield case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently, these functions are called either from the frontend or from the notifications manager.
both are not using the null_yield. to avoid undefined behavior, i can add an assert

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we also send notifications from background threads like lifecycle and multisite? those would pass null_yield

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you are right (forgot about these)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you are right (forgot about these)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like i'm missing a non-persistent lifecycle notifications test (lifecycle still passing with persistent notifications).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not possible to test LC+non persistent. see: https://tracker.ceph.com/issues/67174

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, no coverage for multisite sync notifications: https://tracker.ceph.com/issues/67178

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for testing, you can disable rgw_beast_enable_async so that frontend requests always see null_yield

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kafka tests are passing locally with the flag set to "false" (vstart.sh -n -d -o "rgw_beast_enable_async=false")

can we set the flag from commandline in a test, so we can cover that in teuthology as well?

@adamemerson adamemerson removed their request for review July 24, 2024 19:30
* use the "yield_waiter" and "waiter" from common/async insteasd of the "waiter"
  implemented inside the bucket notification code (this is so we don't
  need separate investigations for 2 implementations)
* added a unit test that simulate how a separate thread (kafka or amqp) is
resuming a coroutine which is created by either the frontend or the
notification manager.

before using "defer" the unit test is passing, however,
when executed under thread sanitizer (using the WITH_TSAN cmake flag)
the following errors are observed: https://0x0.st/Xp4P.txt
after using "defer" the unit test passes under TSAN without errors.

Fixes: https://tracker.ceph.com/issues/64184

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
});
return w.async_wait(yield);
}
ceph::async::waiter<int> w;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamemerson could you please have a look here?

@yuvalif
Copy link
Contributor Author

yuvalif commented Jul 26, 2024

teuthology is mostly passing: https://pulpito.ceph.com/yuvalif-2024-07-26_08:21:47-rgw:notifications-wip-yuval-64184-distro-default-smithi/
one http test is failing with a known failure (code change is just for kafka and amqp).

@yuvalif yuvalif requested a review from cbodley July 29, 2024 11:33
@yuvalif
Copy link
Contributor Author

yuvalif commented Jul 31, 2024

jenkins test api

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants