
rgw/kafka: do not destroy the connection on errors#56033

Merged

yuvalif merged 1 commit into ceph:main from yuvalif:wip-yuval-kafka-cleanup on May 25, 2024

Conversation

@yuvalif
Contributor

@yuvalif yuvalif commented Mar 7, 2024

fixes: https://tracker.ceph.com/issues/66017

as well as other simplifications:

  • do not store temporary configuration in the connection object; just use it as a local variable
  • do not create a connection without a producer

other improvements:

  • copy to a local list before publishing
  • convert internal error codes to errno
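As a rough sketch of the "convert internal error codes to errno" item: the status names and the particular mappings below are assumptions for illustration, not the actual Ceph code.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical internal status codes; the real names and values used by
// the RGW Kafka manager may differ. This only sketches the idea of
// translating manager-internal codes into standard errno values.
enum status_t {
  STATUS_OK = 0,
  STATUS_CONNECTION_CLOSED,
  STATUS_QUEUE_FULL,
  STATUS_MAX_INFLIGHT,
  STATUS_CONF_ALLOC_FAILED
};

// Map an internal status code to a negative errno value so that callers
// outside the Kafka manager see standard error codes.
inline int status_to_errno(status_t s) {
  switch (s) {
    case STATUS_OK:                return 0;
    case STATUS_CONNECTION_CLOSED: return -ECONNRESET;
    case STATUS_QUEUE_FULL:        return -EBUSY;
    case STATUS_MAX_INFLIGHT:      return -EAGAIN;
    case STATUS_CONF_ALLOC_FAILED: return -ENOMEM;
  }
  return -EINVAL;
}
```

Returning negative errno values follows the usual Ceph/RGW convention for internal return codes.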

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@yuvalif yuvalif requested a review from a team as a code owner March 7, 2024 11:57
@github-actions github-actions bot added the rgw label Mar 7, 2024
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@yuvalif
Contributor Author

yuvalif commented Mar 14, 2024

jenkins test make check

@cbodley
Contributor

cbodley commented Apr 11, 2024

i see that #55051 was closed. should https://tracker.ceph.com/issues/63915 point to this PR now?

@yuvalif
Contributor Author

yuvalif commented Apr 11, 2024

i see that #55051 was closed. should https://tracker.ceph.com/issues/63915 point to this PR now?

PR #55051 was an attempt to add new functionality: allowing the propagation of kafka errors back to the user.
As an initial step I did some refactoring/cleanup, which is the work in this PR.
However, the end goal of error propagation turned out to be more difficult than expected.
So I split out the cleanup work into this PR and closed the other one.

@yuvalif
Contributor Author

yuvalif commented Apr 11, 2024

@cbodley there is one bug fix here, for a bug causing a crash on close. should i open a tracker for it and link to this PR?

@cbodley
Contributor

cbodley commented Apr 11, 2024

@cbodley there is one bug fix here, for a bug causing a crash on close. should i open a tracker for it and link to this PR?

yes, since i assume we need a backport to squid at least

Before:
    auto reply_count = 0U;
    const auto send_count = messages.consume_all(std::bind(&Manager::publish_internal, this, std::placeholders::_1));
After:
    std::vector<message_wrapper_t*> local_messages;
    const auto send_count = messages.consume_all([&local_messages](auto message) { local_messages.push_back(message); });
Contributor

consume_all is a non-blocking function. prior to your changes it was invoking publish_internal, which ran async in a different thread where the conn was used, while the code below had a block that checks whether the conn is idle and, if so, deletes the connection.
so there was a chance of a race condition: before publish_internal got to update conn->time, the idle check could run and destroy the conn, causing a crash when publish_internal tried to access it.
if yes, i just have one doubt: consume_all would still run async, so local_messages would not be populated immediately, and then std::for_each would loop over a local_messages that had not been populated yet; wouldn't this result in a loss of messages being delivered?
with the current changes, you are consuming all the messages and then calling publish_internal in a std::for_each loop, ensuring the idle check is executed only after publish_internal has completed for all of the messages; this would avoid the race condition since destroy and publish_internal would never execute in parallel?

Contributor Author

we were running in the same thread (the kafka manager thread) even before the change.
nothing was running in parallel, and consume_all is completely synchronous, so i don't think there was a risk of a race condition.
the main reason for the change is that, before it, traversing the queue took longer (since we called publish_internal as we traversed the queue).
this means it was more likely for the thread that pushes into the queue to find it full.
however, i did not see any visible performance improvement with the change, so i'm also ok with reverting it.
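The "copy to a local list before publishing" pattern under discussion can be sketched in simplified form. Here std::queue stands in for the boost::lockfree::queue used by the real manager, and the publish callback stands in for Manager::publish_internal; all names are illustrative, not the actual Ceph code.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Illustrative message wrapper; the real rgw_kafka.cc type carries more state.
struct message_wrapper_t {
  std::string topic;
  std::string payload;
};

// Phase 1: drain the shared queue into a local vector (this mirrors the
// consume_all() call, keeping the time spent traversing the shared queue
// short). Phase 2: publish from the local vector at leisure.
template <typename PublishFn>
std::size_t drain_and_publish(std::queue<message_wrapper_t*>& q, PublishFn publish) {
  std::vector<message_wrapper_t*> local_messages;
  while (!q.empty()) {
    local_messages.push_back(q.front());
    q.pop();
  }
  for (auto* m : local_messages) {
    publish(m);
    delete m;  // free the wrapper once it has been handed off
  }
  return local_messages.size();
}
```

As noted in the thread, since everything runs on the one kafka manager thread and consume_all is synchronous, this is a throughput consideration (shorter traversal of the shared queue) rather than a race fix, and the change was ultimately reverted.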

Contributor

probably reverting the consume_all change would make sense then?

Contributor Author

ok. will revert that

Contributor Author

done

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@kchheda3
Contributor

@yuvalif do we not have a tracker for this?
we are hitting this crash all the time, and it's bringing down all the rgw daemons

@yuvalif
Contributor Author

yuvalif commented May 13, 2024

@yuvalif do we not have a tracker for this?
we are hitting this crash all the time, and it's bringing down all the rgw daemons

This is a crash on shutdown, so unlikely to be observed. This is mainly a code cleanup PR

@kchheda3
Contributor

kchheda3 commented May 13, 2024

@yuvalif do we not have a tracker for this?
we are hitting this crash all the time, and it's bringing down all the rgw daemons

This is a crash on shutdown, so unlikely to be observed. This is mainly a code cleanup PR

nope, we are able to repro this all the time.
if rd_kafka_produce hits a fatal error it returns -1, and the code then destroys the connection & rd_kafka_topic_t but does not remove it from the vector; the next code then uses the deleted topic from the vector, since it was not removed, and crashes.
to make rd_kafka_produce return a fatal error, we can do it by just revoking the credentials of the kafka user.
looking at rd_kafka_produce, there are multiple cases where it can return a fatal error (-1), so it's not only about shutdown
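The fix the PR title describes, not destroying the connection on errors, can be sketched like this. Everything here is illustrative (simplified away from librdkafka and the real rgw_kafka.cc): on a fatal produce error, surface a negative errno and leave the connection and its topics intact, so nothing left in the topics vector dangles.

```cpp
#include <cassert>
#include <cerrno>
#include <memory>
#include <string>
#include <vector>

// Illustrative stand-in for rd_kafka_topic_t.
struct Topic {
  std::string name;
};

struct Connection {
  std::vector<std::unique_ptr<Topic>> topics;  // owned; never destroyed on produce error
  bool fail_produce = false;                   // simulate a fatal produce error

  // Stand-in for the rd_kafka_produce() call site. Returns 0 on success or
  // a negative errno on failure. Crucially, on failure it only reports the
  // error; the connection and topic remain valid, avoiding the
  // use-after-free described above.
  int produce(const Topic& topic, const std::string& /*payload*/) {
    (void)topic;
    if (fail_produce) {
      return -ECONNRESET;  // report the error; do NOT destroy connection state
    }
    return 0;
  }
};
```

The buggy pattern was the inverse: destroy the connection and rd_kafka_topic_t on the -1 return while the pointer stayed in the vector, so the next lookup dereferenced freed memory.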

@yuvalif
Contributor Author

yuvalif commented May 14, 2024

@yuvalif do we not have a tracker for this?
we are hitting this crash all the time, and it's bringing down all the rgw daemons

This is a crash on shutdown, so unlikely to be observed. This is mainly a code cleanup PR

nope, we are able to repro this all the time. if rd_kafka_produce hits a fatal error it returns -1, and the code then destroys the connection & rd_kafka_topic_t but does not remove it from the vector; the next code then uses the deleted topic from the vector and crashes. to make rd_kafka_produce return a fatal error, we can do it by just revoking the credentials of the kafka user. looking at rd_kafka_produce, there are multiple cases where it can return a fatal error (-1), so it's not only about shutdown

i never saw this issue, but it just makes this fix even more critical.
can you please open a tracker with the above scenario, so we can backport the fix, at least to squid?

@kchheda3
Contributor

@yuvalif do we not have a tracker for this?
we are hitting this crash all the time, and it's bringing down all the rgw daemons

This is a crash on shutdown, so unlikely to be observed. This is mainly a code cleanup PR

nope, we are able to repro this all the time. if rd_kafka_produce hits a fatal error it returns -1, and the code then destroys the connection & rd_kafka_topic_t but does not remove it from the vector; the next code then uses the deleted topic from the vector and crashes. to make rd_kafka_produce return a fatal error, we can do it by just revoking the credentials of the kafka user. looking at rd_kafka_produce, there are multiple cases where it can return a fatal error (-1), so it's not only about shutdown

i never saw this issue, but it just makes this fix even more critical. can you please open a tracker with the above scenario, so we can backport the fix, at least to squid?

yeah, 100%, we need to backport it to squid for sure.
i will open the tracker; i was planning to ping you about this issue anyway.
i had a few questions on maintaining that vector, which i already commented on.

@yuvalif yuvalif force-pushed the wip-yuval-kafka-cleanup branch from 7dee1d1 to 3a4749e Compare May 14, 2024 19:12
@yuvalif
Contributor Author

yuvalif commented May 15, 2024

jenkins test api

@yuvalif yuvalif force-pushed the wip-yuval-kafka-cleanup branch from 3a4749e to a1edb0d Compare May 22, 2024 10:26
as well as other simplifications:
* do not store temporary configuration in the connection object. just
  use as a local variable
* do not create a connection without a producer

other improvements:
* copy to a local list before publishing
* convert internal error codes to errno

Fixes: https://tracker.ceph.com/issues/66017

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
@yuvalif yuvalif force-pushed the wip-yuval-kafka-cleanup branch from a1edb0d to 1b6e850 Compare May 22, 2024 18:29
@yuvalif yuvalif requested a review from kchheda3 May 22, 2024 18:31
@yuvalif
Contributor Author

yuvalif commented May 23, 2024

jenkins test make check

@yuvalif
Contributor Author

yuvalif commented May 23, 2024

jenkins test docs

@yuvalif
Contributor Author

yuvalif commented May 23, 2024

jenkins render docs

@yuvalif
Contributor Author

yuvalif commented May 23, 2024

jenkins test windows

@yuvalif yuvalif added TESTED and removed needs-qa labels May 23, 2024
rd_kafka_err2str(result) << dendl;
} else {
ldout(conn->cct, 1) << "Kafka run: nack received with result=" <<
rd_kafka_err2str(result) << dendl;
Contributor

nit: could you add the broker and topic name here, and also for the error after calling rd_kafka_produce at line 428?
"topic: " << rd_kafka_name(rk) << " broker: " << conn->broker << dendl

Contributor Author

will do a round of debug log fixes (in a different PR), and add that there.

  • use DoutPrefixProvider
  • unify what log messages contain (broker+topic)

Contributor

@kchheda3 kchheda3 left a comment

Thanks @yuvalif



3 participants