Skip to content

Stretched cluster#66624

Merged
yuriw merged 4 commits intoceph:mainfrom
leonidc:stretched-cluster
Mar 20, 2026
Merged

Stretched cluster#66624
yuriw merged 4 commits intoceph:mainfrom
leonidc:stretched-cluster

Conversation

@leonidc
Copy link
Contributor

@leonidc leonidc commented Dec 14, 2025

The NVMe-oF GWs are currently unaware of the data center location. When all of the GWs in the group are collocated in the same location, there is no issue. But if GWs (within the same GW group) are located in different locations within the same Ceph cluster, the failover target might be wrong. Choosing the wrong failover target will bring into a situation where the compute is on one site but storage is on another site. This is causing a significant impact to the I/O performance.
Also when disaster cluster recovered failback of ANA groups to it should be done only by monitor command.

Last update
Introduce explicit commands to set and clear a disaster state for a location:
nvme-gw pool group disaster Set
nvme-gw pool group disaster Clear
These commands explicitly inform the monitor of the disaster status of a given location.

remove the command:
nvme-gw pool group start-failback

Expected Behavior
With this mechanism in place:

  1. Controlled failbacks
    Failbacks from other locations are allowed for all gateways except those belonging to a location marked as being in a disaster state.
  2. Stable ANA ownership during disasters
    Gateways in a disaster location, even if partially recovered, will not acquire ANA group ownership during failovers caused by gateway failures in other locations.
  3. Recovery handling
    When a location is marked as recovered using the “disaster Clear” command, the system triggers the same workflow previously initiated by the start-failback command, forcing failbacks to the gateways in the previously affected location.

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands

You must only issue one Jenkins command per-comment. Jenkins does not understand
comments with more than one command.

@caroav
Copy link
Contributor

caroav commented Jan 7, 2026

@athanatos can you review it? thx.

@leonidc leonidc force-pushed the stretched-cluster branch from 662ae2d to 42e81b1 Compare January 7, 2026 08:26
@athanatos
Copy link
Contributor

I will try to get to this this week.

@athanatos
Copy link
Contributor

Not going to get to it today, I'll review it Monday morning.

Copy link
Contributor

@athanatos athanatos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented on some things, but I don't really feel that I understand the intent here. Perhaps you could explain in a bit more detail how the additions in this patch work?


bool NVMeofGwMap::validate_number_locations(int num_gws, int num_locations)
{
return true; // This function may have not empty body in the nex commit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no next commit. Is this for a future PR? Perhaps something like // TODO: add validation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks


int NVMeofGwMap::cfg_set_location(const NvmeGwId &gw_id,
const NvmeGroupKey& group_key,
std::string &location, bool &propose_pending, bool test) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test is not used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed, thanks

*/
#define MIN_NUM_ANA_GROUPS 0xFFF
uint32_t min_num_ana_groups_in_gw = MIN_NUM_ANA_GROUPS;
NvmeGwId min_gw_id = "empty";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the intent here -- is empty not a valid gw id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@athanatos thanks, agree , will remove this assignment

for (auto& found_gw_state: gws_states) {
auto st = found_gw_state.second;
if (st.location == location) {
if(st.availability == gw_availability_t::GW_AVAILABLE &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, thanks

* for ana-grp in list make relocation.
* if all ana-grps in location active remove location from the map failbacks_in_progress
*/
FailbackLocation location = "";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why declare this here with a placeholder value? Seems like you could just declare it when set below at

    location = failbacks_in_progress[group_key];

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, thanks

* function tries to found the minimum loaded gw in location
* if found, relocate ana grp to it using regular failback FSM
*/
#define MIN_NUM_ANA_GROUPS 0xFFF
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if any particular gw has more than MIN_NUM_ANA_GROUPS you won't migrate to it? Seems like a bad name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@athanatos , the problem is name of value ? what do you propose?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it seems like a max? MAX_NUM_ANA_GROUPS_FOR_RELOCATE? Also, don't use #defines for things like this, use a constexpr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok thanks

bool do_failback = false;
dout(10) << "Find failback GW for GW " << gw_id << dendl;
bool force_inter_location = false;
FailbackLocation location = "";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once again, why initialize this with a dummy value? It only seems to be used in the if block immediately below, just declare it when assigned to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

for (auto& found_gw_state: gws_states) {
auto st = found_gw_state.second;
if (st.availability == gw_availability_t::GW_AVAILABLE)
if (location == "" || st.location == location) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right , this is a bad style because "" is also valid location, thought the current code is also working,
will add parameter to the function "ignore_location"

@leonidc leonidc force-pushed the stretched-cluster branch 2 times, most recently from ea9fb24 to b561d27 Compare January 13, 2026 12:44
@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@athanatos
Copy link
Contributor

I'm still basically just not sure how these commands are supposed to work. Is there documentation to tell users when and how to actually make use of them?

@leonidc
Copy link
Contributor Author

leonidc commented Jan 25, 2026

@athanatos , I added a description to the PR and motivations for the update.
If it is not sufficient will add more info

@caroav
Copy link
Contributor

caroav commented Feb 8, 2026

@athanatos will you be able to continue the review soon? thx.

Copy link
Contributor

@athanatos athanatos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any serious problems. I still don't entirely follow the nvme specific logic -- perhaps someone from the nvme team could do a review?

}

bool NVMeofGwMap::is_location_in_disaster(const NvmeGroupKey& group_key,
NvmeLocation& location, bool &cleanup_in_process) {
Copy link
Contributor

@athanatos athanatos Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the goal isn't to be able to call this multiple times with the same cleanup_in_progress param, it would be a lot better to set it in every exit path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree , will change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@athanatos @baum reviewed and approved. Will you be able to approve.

@leonidc leonidc requested a review from baum February 11, 2026 13:51
Copy link
Contributor

@baum baum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

  GW commands added : set location and set admin state enable/disable
  added start-failback <location> command.

  failover logic is impacted by GW location
  implement GW admin commands  enable/disable
  added map  for location-failback-in-progress
  failback between locations happens only by monitor command
  implemented new ana-group relocation process used
  when inter-location failback command sent
  added upgrade rules

fixes: https://tracker.ceph.com/issues/74210

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
          changes in  behavior of failover/failback, additional
          validations added

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
@athanatos athanatos self-requested a review February 23, 2026 17:03
Copy link
Contributor

@athanatos athanatos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go through rados teuthology prior to merge.

 remove invalid code from NVMeofGwSerialize.h
 put encode/decode at the file NVmeofGwMap.h

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
@yuriw
Copy link
Contributor

yuriw commented Mar 9, 2026

jenkins test this please

batrick added a commit to batrick/ceph that referenced this pull request Mar 12, 2026
This reverts commit 6dddf54, reversing
changes made to 07ec509.

Backport 6dddf54 introduced a new connection feature bit
NVMEOF_BEACON_DIFF but there are plans (ceph#66624) to make further
enhancements on that feature bit. This would cause the mons to crash
during upgrades.

However, this connection feature bit should not have been added to
begin with. The correct way to do this is extend
e55ad7b by @athanatos to require
`CEPH_MON_FEATURE_INCOMPAT_NVMEOF_BEACON_DIFF` if all mons support it.
This should be done by having mons add/update their supported features
the MonMap via an update from `MMonJoin` (see for instance `crush_loc`
which was recently added to `mon_info_t`). Once the supported features
indicated for each mon in the `MonMap` show they understand the new
NVMEOF_BEACON_DIFF, then it should be turned on globally in the
`MonMap` as a required feature (added to the incompat set).

Conflicts:
	src/mon/NVMeofGwMon.h: conflicts with header change from 19c9be2

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
batrick added a commit to batrick/ceph that referenced this pull request Mar 15, 2026
This reverts commit 6dddf54, reversing
changes made to 07ec509.

Backport 6dddf54 introduced a new connection feature bit
NVMEOF_BEACON_DIFF but there are plans (ceph#66624) to make further
enhancements on that feature bit. This would cause the mons to crash
during upgrades.

However, this connection feature bit should not have been added to
begin with. The correct way to do this is extend
e55ad7b by @athanatos to require
`CEPH_MON_FEATURE_INCOMPAT_NVMEOF_BEACON_DIFF` if all mons support it.
This should be done by having mons add/update their supported features
the MonMap via an update from `MMonJoin` (see for instance `crush_loc`
which was recently added to `mon_info_t`). Once the supported features
indicated for each mon in the `MonMap` show they understand the new
NVMEOF_BEACON_DIFF, then it should be turned on globally in the
`MonMap` as a required feature (added to the incompat set).

Conflicts:
	src/mon/NVMeofGwMon.h: conflicts with header change from 19c9be2
                               fix missing header change in ceph#66584

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
@JonBailey1993
Copy link
Contributor

@yuriw yuriw merged commit c47f5b4 into ceph:main Mar 20, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants