Skip to content

rgw/notifications: handle migration state between v1 and v2#55527

Closed
yuvalif wants to merge 22 commits intoceph:wip-rgw-meta-topicfrom
yuvalif:wip-yuval-handle-migration
Closed

rgw/notifications: handle migration state between v1 and v2#55527
yuvalif wants to merge 22 commits intoceph:wip-rgw-meta-topicfrom
yuvalif:wip-yuval-handle-migration

Conversation

@yuvalif
Copy link
Contributor

@yuvalif yuvalif commented Feb 10, 2024

test instructions:
https://gist.github.com/yuvalif/21449e301732b719cd1ed97c3eeeabb2

  • during migration all topic and notification operations must fail with HTTP error code 500
  • read operations should return the values of the v1 topics and notifications
  • sending notifications should continue based on v1 values

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@yuvalif yuvalif requested a review from a team as a code owner February 10, 2024 21:00
@github-actions github-actions bot added the rgw label Feb 10, 2024
@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@yuvalif yuvalif force-pushed the wip-yuval-handle-migration branch from 44c9bf9 to 0bea9ba Compare February 13, 2024 14:53
@cbodley
Copy link
Contributor

cbodley commented Feb 13, 2024

@yuvalif regarding the get_url() commit, can you help me understand how that related to your forwarding issues? i thought the master zone was responding with 503s, which rgw_http_error_to_errno() should be mapping to -EBUSY errors. the retry logic based on get_url() only triggers on -EIO errors

we really don't want to run this retry logic for normal http errors like 503, because set_url_unconnectable() marks that endpoint down for future requests. only connection errors should do that

cc @jzhu116-bloomberg who added this in #53320

Comment on lines +1121 to +1122
int RadosStore::stat_topics_v1(const std::string& tenant, optional_yield y, const DoutPrefixProvider *dpp) {
return rgw_stat_system_obj(dpp, svc()->sysobj, svc()->zone->get_zone_params().log_pool, topics_oid(tenant), nullptr, nullptr, y, nullptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it may actually be better to send a read request here instead of a stat so we can initialize the metadata cache. if we stat here and go on to read it later, that requires two separate round trips to the osd. what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is only during migration that we expect this to return sucess.
if we are in v2 after migration we expect -ENOENT, and hopefuly just get a negative chache entry.
if we are in v2, we should not call this at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is only during migration that we expect this to return sucess.

yeah, i thought that was reason enough

if we are in v2, we should not call this at all

unrelated to read vs. stat, but v2 will always have to call this to determine whether we're in the migrating state - at least until a later release like U removes the v1 format and migration logic entirely

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorrry, was a typo. i meant in v1 (where we expect to get actual values) we shhoulod not call this at all

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np. i just wanted to point out the interaction with stat and metadata caching

@yuvalif yuvalif force-pushed the wip-yuval-handle-migration branch 2 times, most recently from 9b000d1 to 714fbb6 Compare February 13, 2024 16:54
@yuvalif
Copy link
Contributor Author

yuvalif commented Feb 13, 2024

@yuvalif regarding the get_url() commit, can you help me understand how that related to your forwarding issues? i thought the master zone was responding with 503s, which rgw_http_error_to_errno() should be mapping to -EBUSY errors. the retry logic based on get_url() only triggers on -EIO errors

we really don't want to run this retry logic for normal http errors like 503, because set_url_unconnectable() marks that endpoint down for future requests. only connection errors should do that

cc @jzhu116-bloomberg who added this in #53320

there could be some other issue, but without that commit i have the following issue.
on the primary RGW:

2024-02-13T19:07:57.792+0000 7fff3edfa700  1 req 15018604344975567796 0.001000000s sns:pubsub_topic_create WARNING: topic migration in process. please try again later
2024-02-13T19:07:57.792+0000 7fff3edfa700  1 req 15018604344975567796 0.001000000s sns:pubsub_topic_create failed to create topic 'fishtopic', ret=-11
2024-02-13T19:07:57.792+0000 7fff3edfa700  2 req 15018604344975567796 0.001000000s sns:pubsub_topic_create completing
2024-02-13T19:07:57.792+0000 7fff3edfa700  2 req 15018604344975567796 0.001000000s sns:pubsub_topic_create op status=-11
2024-02-13T19:07:57.792+0000 7fff3edfa700  2 req 15018604344975567796 0.001000000s sns:pubsub_topic_create http status=500

while on the 2ndary:

2024-02-13T19:07:57.792+0000 7fff135a3700 20 req 9097203288537776269 0.004000000s sns:pubsub_topic_create forward(): failed to forward request. retries=0
2024-02-13T19:07:57.793+0000 7fff135a3700  1 req 9097203288537776269 0.005000000s sns:pubsub_topic_create CreateTopic forward_request_to_master returned ret = -22
2024-02-13T19:07:57.793+0000 7fff135a3700  2 req 9097203288537776269 0.005000000s sns:pubsub_topic_create completing
2024-02-13T19:07:57.793+0000 7fff135a3700  2 req 9097203288537776269 0.005000000s sns:pubsub_topic_create op status=-22
2024-02-13T19:07:57.793+0000 7fff135a3700  2 req 9097203288537776269 0.005000000s sns:pubsub_topic_create http status=400

with this commit the 2ndary also return 500.

@yuvalif yuvalif force-pushed the wip-yuval-handle-migration branch from 714fbb6 to da48af8 Compare February 14, 2024 10:38
@yuvalif
Copy link
Contributor Author

yuvalif commented Feb 14, 2024

@yuvalif regarding the get_url() commit, can you help me understand how that related to your forwarding issues? i thought the master zone was responding with 503s, which rgw_http_error_to_errno() should be mapping to -EBUSY errors. the retry logic based on get_url() only triggers on -EIO errors

we really don't want to run this retry logic for normal http errors like 503, because set_url_unconnectable() marks that endpoint down for future requests. only connection errors should do that

cc @jzhu116-bloomberg who added this in #53320

removed the "get_url()" commit, and added 500 to the case/switch in rgw_http_error_to_errno() - converting to EAGAIN

#pragma once

#include "rgw_common.h"
#include <asm-generic/errno-base.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this really required? why not <errno.h>?

case 409:
return -ENOTEMPTY;
case 500:
return -EAGAIN;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why EAGAIN instead of ERR_INTERNAL_ERROR? set_req_state_err() doesn't have a mapping for EAGAIN, so it relies on this:

  dout(0) << "WARNING: set_req_state_err err_no=" << err_no
        << " resorting to 500" << dendl;

  err.http_ret = 500;
  err.err_code = "UnknownError";

in contrast, rgw_http_s3_errors has a more specific mapping for { ERR_INTERNAL_ERROR, {500, "InternalError" }},

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to ERR_SERVICE_UNAVAILABLE and 503

kchheda3 and others added 10 commits February 22, 2024 15:04
…n multisite config

Signed-off-by: kchheda3 <kchheda3@bloomberg.net>
…ions in multisite config.

Signed-off-by: kchheda3 <kchheda3@bloomberg.net>
…een bucket and topics.

Signed-off-by: kchheda3 <kchheda3@bloomberg.net>
…ions in multisite config.

Signed-off-by: kchheda3 <kchheda3@bloomberg.net>
both locally and in teuthology

Signed-off-by: Yuval Lifshitz <ylifshit@redhat.com>
v1 could be enabled only in local tests. teuthology tests would run with v2

Signed-off-by: Yuval Lifshitz <ylifshit@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
make the SiteConfig available to all of RGWRados via svc.site instead
of storing it in sal::RadosStore

Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
RGWPubSub constructor takes SiteConfig instead of zonegroup map

replace do_all_zonegroups_support_notification_v2() with a generic
function rgw::all_zonegroups_support() that handles non-realm
configurations too

remove unused sal::ZoneGroup::supports_feature()

Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
metadata sync ignores entries other than MDLOG_STATUS_COMPLETE, so we
don't need to write separate prepare/complete entries. metadata
mutations can just call complete_entry() on success

Signed-off-by: Casey Bodley <cbodley@redhat.com>
rename read_topics()/write_topics() to 'v1' and only call them from
internal v1 call paths

public get_topics() now calls read_topics_v1() for the v1 case, and does
the paginated listing with driver->meta_list_keys_next() for v2

RGWPSListTopicsOp now uses the NextToken request/response params with
the paginated get_topics(), limiting responses to 100 entries like AWS

'radosgw-admin topic list' also paginates the listing according to
--max-entries to avoid reading everything into memory at once

Signed-off-by: Casey Bodley <cbodley@redhat.com>
the format of topic metadata keys is agnostic to the backend, so the
parsing/formatting functions should be in rgw_pubsub.h

Signed-off-by: Casey Bodley <cbodley@redhat.com>
add a new interface for topic metadata that doesn't depend on metadata
backends. this low-level interface is used by both RadosStore and the
topic metadata handler

remove Driver::delete_bucket_topic_mapping() from sal because the omap
object is deleted internally by rgwrados::topic::remove()

remove the RGWRados::topics_pool_ctx member

Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

cbodley and others added 3 commits February 23, 2024 11:16
test instructions:
https://gist.github.com/yuvalif/21449e301732b719cd1ed97c3eeeabb2

* during migration all topic and notification operations must fail with HTTP error code 503
* read operations should return the values of the v1 topics and notifications
* sending notifications should continue based on v1 values

Signed-off-by: Yuval Lifshitz <ylifshit@redhat.com>
Signed-off-by: Yuval Lifshitz <ylifshit@redhat.com>
@yuvalif yuvalif force-pushed the wip-yuval-handle-migration branch from 310edbd to 7f92f74 Compare February 26, 2024 13:26
@cbodley cbodley force-pushed the wip-rgw-meta-topic branch from 4620245 to 666e79f Compare March 5, 2024 17:56
@cbodley
Copy link
Contributor

cbodley commented Mar 5, 2024

merged into #55657

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants