
rgw/kafka: do not destroy the connection on errors#56033

Merged

yuvalif merged 1 commit into ceph:main from yuvalif:wip-yuval-kafka-cleanup on May 25, 2024

Conversation

@yuvalif
Contributor

@yuvalif yuvalif commented Mar 7, 2024

fixes: https://tracker.ceph.com/issues/66017

as well as other simplifications:

  • do not store temporary configuration in the connection object; just use it as a local variable
  • do not create a connection without a producer

other improvements:

  • copy to a local list before publishing
  • convert internal error codes to errno
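As a rough sketch of the "convert internal error codes to errno" item: the status names and the particular mappings below are assumptions for illustration, not the actual Ceph code.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical internal status codes; the real names and values used by
// the RGW Kafka manager may differ. This only sketches the idea of
// translating manager-internal codes into standard errno values.
enum status_t {
  STATUS_OK = 0,
  STATUS_CONNECTION_CLOSED,
  STATUS_QUEUE_FULL,
  STATUS_MAX_INFLIGHT,
  STATUS_CONF_ALLOC_FAILED
};

// Map an internal status code to a negative errno value so that callers
// outside the Kafka manager see standard error codes.
inline int status_to_errno(status_t s) {
  switch (s) {
    case STATUS_OK:                return 0;
    case STATUS_CONNECTION_CLOSED: return -ECONNRESET;
    case STATUS_QUEUE_FULL:        return -EBUSY;
    case STATUS_MAX_INFLIGHT:      return -EAGAIN;
    case STATUS_CONF_ALLOC_FAILED: return -ENOMEM;
  }
  return -EINVAL;
}
```

Returning negative errno values follows the usual Ceph/RGW convention for internal return codes.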

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@yuvalif yuvalif requested a review from a team as a code owner March 7, 2024 11:57
@github-actions github-actions bot added the rgw label Mar 7, 2024
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@yuvalif
Contributor Author

yuvalif commented Mar 14, 2024

jenkins test make check

@cbodley
Contributor

cbodley commented Apr 11, 2024

i see that #55051 was closed. should https://tracker.ceph.com/issues/63915 point to this PR now?

@yuvalif
Contributor Author

yuvalif commented Apr 11, 2024

i see that #55051 was closed. should https://tracker.ceph.com/issues/63915 point to this PR now?

PR #55051 was an attempt to add new functionality: allowing the propagation of kafka errors back to the user.
As an initial step I did some refactoring/cleanup, which is the work in this PR.
However, the end goal of error propagation turned out to be more difficult than expected.
So I split out the cleanup work into this PR and closed the other one.

@yuvalif
Contributor Author

yuvalif commented Apr 11, 2024

@cbodley there is one bug fix here, for a bug causing a crash on close. should i open a tracker for it and link to this PR?

@cbodley
Contributor

cbodley commented Apr 11, 2024

@cbodley there is one bug fix here, for a bug causing a crash on close. should i open a tracker for it and link to this PR?

yes, since i assume we need a backport to squid at least

Before:
    auto reply_count = 0U;
    const auto send_count = messages.consume_all(std::bind(&Manager::publish_internal, this, std::placeholders::_1));
After:
    std::vector<message_wrapper_t*> local_messages;
    const auto send_count = messages.consume_all([&local_messages](auto message) { local_messages.push_back(message); });
Contributor

consume_all is a non-blocking function. prior to your changes it was invoking publish_internal, which ran async in a different thread where the conn was used, while the code below had a block that checks whether the conn is idle and, if so, deletes the connection.
so there was a chance of a race condition: before publish_internal got to update conn->time, the idle check could run and destroy the conn, causing a crash when publish_internal tried to access it.
if yes, i just have one doubt: consume_all would still run async, so local_messages would not be populated immediately, and then std::for_each would loop over a local_messages that had not been populated yet; wouldn't this result in a loss of messages being delivered?
with the current changes, you are consuming all the messages and then calling publish_internal in a std::for_each loop, ensuring the idle check is executed only after publish_internal has completed for all of the messages; this would avoid the race condition since destroy and publish_internal would never execute in parallel?

Contributor Author

we were running in the same thread (the kafka manager thread) even before the change.
nothing was running in parallel, and consume_all is completely synchronous, so i don't think there was a risk of a race condition.
the main reason for the change is that, before it, traversing the queue took longer (since we called publish_internal as we traversed the queue).
this means it was more likely for the thread that pushes into the queue to find it full.
however, i did not see any visible performance improvement with the change, so i'm also ok with reverting it.
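The "copy to a local list before publishing" pattern under discussion can be sketched in simplified form. Here std::queue stands in for the boost::lockfree::queue used by the real manager, and the publish callback stands in for Manager::publish_internal; all names are illustrative, not the actual Ceph code.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Illustrative message wrapper; the real rgw_kafka.cc type carries more state.
struct message_wrapper_t {
  std::string topic;
  std::string payload;
};

// Phase 1: drain the shared queue into a local vector (this mirrors the
// consume_all() call, keeping the time spent traversing the shared queue
// short). Phase 2: publish from the local vector at leisure.
template <typename PublishFn>
std::size_t drain_and_publish(std::queue<message_wrapper_t*>& q, PublishFn publish) {
  std::vector<message_wrapper_t*> local_messages;
  while (!q.empty()) {
    local_messages.push_back(q.front());
    q.pop();
  }
  for (auto* m : local_messages) {
    publish(m);
    delete m;  // free the wrapper once it has been handed off
  }
  return local_messages.size();
}
```

As noted in the thread, since everything runs on the one kafka manager thread and consume_all is synchronous, this is a throughput consideration (shorter traversal of the shared queue) rather than a race fix, and the change was ultimately reverted.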

Contributor

probably reverting the consume_all change would make sense then?

Contributor Author

ok. will revert that

Contributor Author

done

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@kchheda3
Contributor

@yuvalif do we not have a tracker for this?
we are hitting this crash all the time, and it's bringing down all the rgw daemons

@yuvalif
Contributor Author

yuvalif commented May 13, 2024

@yuvalif do we not have a tracker for this?
we are hitting this crash all the time, and it's bringing down all the rgw daemons

This is a crash on shutdown, so unlikely to be observed. This is mainly a code cleanup PR

@kchheda3
Contributor

kchheda3 commented May 13, 2024

@yuvalif do we not have a tracker for this?
we are hitting this crash all the time, and it's bringing down all the rgw daemons

This is a crash on shutdown, so unlikely to be observed. This is mainly a code cleanup PR

nope, we are able to repro this all the time.
if rd_kafka_produce hits a fatal error it returns -1, and the code then destroys the connection & rd_kafka_topic_t but does not remove it from the vector; the next code then uses the deleted topic from the vector, since it was not removed, and crashes.
to make rd_kafka_produce return a fatal error, we can do it by just revoking the credentials of the kafka user.
looking at rd_kafka_produce, there are multiple cases where it can return a fatal error (-1), so it's not only about shutdown
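The fix the PR title describes, not destroying the connection on errors, can be sketched like this. Everything here is illustrative (simplified away from librdkafka and the real rgw_kafka.cc): on a fatal produce error, surface a negative errno and leave the connection and its topics intact, so nothing left in the topics vector dangles.

```cpp
#include <cassert>
#include <cerrno>
#include <memory>
#include <string>
#include <vector>

// Illustrative stand-in for rd_kafka_topic_t.
struct Topic {
  std::string name;
};

struct Connection {
  std::vector<std::unique_ptr<Topic>> topics;  // owned; never destroyed on produce error
  bool fail_produce = false;                   // simulate a fatal produce error

  // Stand-in for the rd_kafka_produce() call site. Returns 0 on success or
  // a negative errno on failure. Crucially, on failure it only reports the
  // error; the connection and topic remain valid, avoiding the
  // use-after-free described above.
  int produce(const Topic& topic, const std::string& /*payload*/) {
    (void)topic;
    if (fail_produce) {
      return -ECONNRESET;  // report the error; do NOT destroy connection state
    }
    return 0;
  }
};
```

The buggy pattern was the inverse: destroy the connection and rd_kafka_topic_t on the -1 return while the pointer stayed in the vector, so the next lookup dereferenced freed memory.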

@yuvalif
Contributor Author

yuvalif commented May 14, 2024

@yuvalif do we not have a tracker for this?
we are hitting this crash all the time, and it's bringing down all the rgw daemons

This is a crash on shutdown, so unlikely to be observed. This is mainly a code cleanup PR

nope, we are able to repro this all the time. if rd_kafka_produce hits a fatal error it returns -1, and the code then destroys the connection & rd_kafka_topic_t but does not remove it from the vector; the next code then uses the deleted topic from the vector and crashes. to make rd_kafka_produce return a fatal error, we can do it by just revoking the credentials of the kafka user. looking at rd_kafka_produce, there are multiple cases where it can return a fatal error (-1), so it's not only about shutdown

i never saw this issue, but it just makes this fix even more critical.
can you please open a tracker with the above scenario, so we can backport the fix, at least to squid?

@kchheda3
Contributor

@yuvalif do we not have a tracker for this?
we are hitting this crash all the time, and it's bringing down all the rgw daemons

This is a crash on shutdown, so unlikely to be observed. This is mainly a code cleanup PR

nope, we are able to repro this all the time. if rd_kafka_produce hits a fatal error it returns -1, and the code then destroys the connection & rd_kafka_topic_t but does not remove it from the vector; the next code then uses the deleted topic from the vector and crashes. to make rd_kafka_produce return a fatal error, we can do it by just revoking the credentials of the kafka user. looking at rd_kafka_produce, there are multiple cases where it can return a fatal error (-1), so it's not only about shutdown

i never saw this issue, but it just makes this fix even more critical. can you please open a tracker with the above scenario, so we can backport the fix, at least to squid?

yeah, 100%, we need to backport it to squid for sure.
i will open the tracker; i was planning to ping you about this issue anyway.
i had a few questions on maintaining that vector, which i already commented on.

@yuvalif yuvalif force-pushed the wip-yuval-kafka-cleanup branch from 7dee1d1 to 3a4749e Compare May 14, 2024 19:12
@yuvalif
Contributor Author

yuvalif commented May 15, 2024

jenkins test api

@yuvalif yuvalif force-pushed the wip-yuval-kafka-cleanup branch from 3a4749e to a1edb0d Compare May 22, 2024 10:26
as well as other simplifications:
* do not store temporary configuration in the connection object. just
  use as a local variable
* do not create a connection without a producer

other improvements:
* copy to a local list before publishing
* convert internal error codes to errno

Fixes: https://tracker.ceph.com/issues/66017

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
@yuvalif yuvalif force-pushed the wip-yuval-kafka-cleanup branch from a1edb0d to 1b6e850 Compare May 22, 2024 18:29
@yuvalif yuvalif requested a review from kchheda3 May 22, 2024 18:31
@yuvalif
Contributor Author

yuvalif commented May 23, 2024

jenkins test make check

@yuvalif
Contributor Author

yuvalif commented May 23, 2024

jenkins test docs

@yuvalif
Contributor Author

yuvalif commented May 23, 2024

jenkins render docs

@yuvalif
Contributor Author

yuvalif commented May 23, 2024

jenkins test windows

@yuvalif yuvalif added TESTED and removed needs-qa labels May 23, 2024
rd_kafka_err2str(result) << dendl;
} else {
ldout(conn->cct, 1) << "Kafka run: nack received with result=" <<
rd_kafka_err2str(result) << dendl;
Contributor

nit: could you add the broker and topic name here, and also for the error after calling rd_kafka_produce at line 428?
"topic: " << rd_kafka_name(rk) << " broker: " << conn->broker << dendl

Contributor Author

will do a round of debug log fixes (in a different PR), and add that there.

  • use DoutPrefixProvider
  • unify what log messages contain (broker+topic)

Contributor

@kchheda3 kchheda3 left a comment

Thanks @yuvalif



3 participants