mon: add NVMe-oF gateway monitor and HA by baum · Pull Request #54671 · ceph/ceph

baum · 2023-11-27T09:03:01Z

mon: add NVMe-oF gateway monitor and HA

This PR adds high availability (HA) support for the nvmeof Ceph service. High availability means that if a GW goes down, the initiator still has another available path and can continue its IO through another GW. It is achieved by running an nvmeof service that consists of at least 2 nvmeof GWs in the Ceph cluster. The host (initiator) sees every GW as a separate path to the nvme namespaces (volumes).

The implementation consists of the following main modules:

- NVMeofGwMon: a PaxosService. It is a monitor that tracks the status of the running nvmeof services and takes action when services fail and when they are restored.
- NVMeofGwMonitorClient: an intermediary daemon intended to run alongside the NVMeOF gateway daemon (1:1 mapping: there is a separate instance for each gateway daemon). NVMeofGwMonitorClient speaks:
  - RADOS to the monitor (sends NVMeofGw beacons and receives NVMeofGw maps). The beacon signals that the gateway daemon is alive and also includes information about the overall state of the NVMeOF service, which the monitor then uses to make decisions and perform [XXX: some operations -- please describe which operations].
  - gRPC to the gateway daemon (gets the NVMeOF subsystems so that this information can be included in the beacon, and sets NVMeOF ANA states based on the information in the map to tell a particular gateway daemon to go live or back off). This is the reason a dependency on gRPC is added to Ceph.
- MNVMeofGwBeacon: the structure used by the client and the monitor to send and receive the beacons.
- MNVMeofGwMap: the map tracks the status of the nvmeof GWs and defines the new role of every GW. When GWs go down or are restored, the map reflects the new role that each GW takes on as a result. The map is distributed to the NVMeofGwMonitorClient on each GW, which updates the GW with the required changes.
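The beacon and map exchange described above can be pictured with a small toy model. This is only a sketch under stated assumptions, not code from this PR: the class `ToyGwMonitor`, the `BEACON_TIMEOUT` value, and the "fewest groups wins" failover rule are all hypothetical and chosen for illustration. It shows a monitor recording beacons, treating a gateway as down once its beacons stop, and producing a map in which a live gateway takes over the ANA group of a failed one and hands it back on recovery.

```python
from dataclasses import dataclass, field

BEACON_TIMEOUT = 10.0  # seconds; hypothetical value, not a Ceph default


@dataclass
class ToyGwMonitor:
    """Toy model of a monitor tracking gateway beacons (not Ceph code)."""
    last_beacon: dict = field(default_factory=dict)  # gw_id -> time of last beacon
    owner: dict = field(default_factory=dict)        # ana_group -> gw_id that owns it
    epoch: int = 0

    def add_gateway(self, gw_id, ana_group):
        # A newly registered gateway owns one ANA group but is not alive
        # until its first beacon arrives.
        self.owner[ana_group] = gw_id
        self.last_beacon[gw_id] = None

    def handle_beacon(self, gw_id, now):
        self.last_beacon[gw_id] = now

    def is_alive(self, gw_id, now):
        seen = self.last_beacon.get(gw_id)
        return seen is not None and now - seen <= BEACON_TIMEOUT

    def build_map(self, now):
        """Return {gw_id: [ANA groups the gateway should serve]} for live gateways."""
        alive = sorted(g for g in self.last_beacon if self.is_alive(g, now))
        roles = {g: [] for g in alive}
        for group, gw in sorted(self.owner.items()):
            if gw in roles:
                roles[gw].append(group)  # owner is up: it serves its own group
            elif alive:
                # Owner is down: fail over to the live gateway with the fewest groups.
                target = min(alive, key=lambda g: (len(roles[g]), g))
                roles[target].append(group)
        self.epoch += 1  # every published map gets a new epoch
        return roles


if __name__ == "__main__":
    mon = ToyGwMonitor()
    mon.add_gateway("gw-a", 1)
    mon.add_gateway("gw-b", 2)
    mon.handle_beacon("gw-a", 0.0)
    mon.handle_beacon("gw-b", 0.0)
    print(mon.build_map(5.0))   # both gateways alive
    mon.handle_beacon("gw-b", 12.0)
    print(mon.build_map(15.0))  # gw-a has stopped sending beacons
```

In the real design the client side would translate such a map into ANA state changes over gRPC; that part is left out of the sketch.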

It also adds 3 new mon commands:

nvme-gw create
nvme-gw delete
nvme-gw show

The commands are used by cephadm to inform the monitor that a new GW has been deployed. The monitor updates the map accordingly and tracks this GW until it is deleted.
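The lifecycle implied by the three commands (a GW is tracked from create until delete, and show reports what is tracked) can be sketched with an in-memory model. The sketch assumes nothing about the real command arguments or output format: `ToyGwRegistry`, its method names, and the returned dictionary layout are hypothetical.

```python
class ToyGwRegistry:
    """Toy model of create/delete/show bookkeeping (not the Ceph implementation)."""

    def __init__(self):
        self.epoch = 0      # bumped on every change to the tracked set
        self.gateways = {}  # gw_id -> state string

    def create(self, gw_id):
        """Start tracking a gateway; return False if it is already tracked."""
        if gw_id in self.gateways:
            return False
        self.gateways[gw_id] = "created"
        self.epoch += 1
        return True

    def delete(self, gw_id):
        """Stop tracking a gateway; return False if it was not tracked."""
        if gw_id not in self.gateways:
            return False
        del self.gateways[gw_id]
        self.epoch += 1
        return True

    def show(self):
        """Return a snapshot of the tracked gateways and the current epoch."""
        return {"epoch": self.epoch, "gateways": dict(sorted(self.gateways.items()))}


if __name__ == "__main__":
    reg = ToyGwRegistry()
    reg.create("gw-a")
    reg.create("gw-b")
    reg.delete("gw-a")
    print(reg.show())
```

Bumping the epoch only on an actual change mirrors the idea that a new map is published only when the tracked set changes; whether the monitor behaves exactly this way is not stated in the description.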

The design of the HA is documented here

src/nvmeof/NVMeofGw.cc

src/mon/NVMeofGwMap.cc

src/msg/Message.h

- per ceph#54671 (comment) - update nvmeof gateway revision Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>

rahullepakshi · 2023-12-19T15:06:09Z

@baum can you please add a high-level description for this pull request, if that's okay?

caroav · 2023-12-24T14:13:32Z

@baum can you please add a high-level description for this pull request, if that's okay?

@rahullepakshi see the description. It is not entirely complete, but I think it covers most of the idea.

github-actions · 2024-02-08T22:26:23Z

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

ceph.spec.in

doc/nvmeof/beacon.puml

doc/nvmeof/gateway-state.puml

src/common/options/mon.yaml.in

jdurgin · 2024-02-13T06:37:37Z

The design of the HA is documented here - https://docs.google.com/document/d/1Om2_snZ55tV6KJaRdpRGaxu7X3E-F-e4VUuQDE_Vicc/edit. If anyone wants to review it, and requires access please add a comment.

Please add this to the tree e.g. in doc/dev or to the gateway docs

jdurgin

A few mostly minor things, still need to take a closer look at the HA protocol, once doc is available tomorrow

.gitmodules

src/ceph_nvmeof_monitor_client.cc

src/nvmeof/NVMeofGwMonitorClient.cc

src/pybind/mgr/cephadm/services/nvmeof.py

src/nvmeof/NVMeofGwMonitorClient.cc

src/common/options/mon.yaml.in

baum · 2024-07-12T11:30:33Z

@oritwas @baum Am I missing something, or does NVMeofGwMon not trim its maps?

@athanatos, thank you for spotting this 🖖, added a get_trim_to() implementation, see diff

athanatos · 2024-07-12T15:17:21Z

~~Why 500 maps? Does the gw ever need anything other than the most recent one? Is the value of keeping more debugging?~~ Ah, the comment clarifies that. Ok, seems reasonable.
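The trimming rule under discussion can be sketched as follows. This is an illustration only, not the code from the diff: the function borrows the name `get_trim_to()` from the comment above, the figure of 500 comes from this thread, and the convention that a return value of 0 means "nothing to trim" is an assumption made for the sketch.

```python
MAX_MAPS_KEPT = 500  # number of old maps mentioned in the thread


def get_trim_to(last_committed, keep=MAX_MAPS_KEPT):
    """Return the version to trim up to, or 0 when nothing should be trimmed.

    Assumes versions start at 1 and that every version below the returned
    value may be discarded.
    """
    if last_committed > keep:
        return last_committed - keep
    return 0


if __name__ == "__main__":
    for version in (100, 500, 501, 1200):
        print(version, get_trim_to(version))
```

Keeping a bounded window rather than only the latest map costs little and leaves some history available for debugging.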

src/mon/Monitor.cc

src/mon/MonCommands.h

swariri · 2024-07-17T08:49:22Z

When can we expect this to be merged into a Ceph release?
Eagerly looking forward to the HA functionality.

caroav · 2024-07-17T09:43:47Z

When can we expect this to be merged into a Ceph release? Eagerly looking forward to the HA functionality.

Hopefully very soon. We hope it will be merged within the next week.

yuriw · 2024-07-24T17:26:17Z

@baum @caroav FYI
ref: https://tracker.ceph.com/issues/66550

rzarzynski

The mon part LGTM.

src/mon/Monitor.cc

caroav · 2024-07-26T07:51:25Z

jenkins test api

yuriw · 2024-07-27T14:25:01Z

@baum see generator didn't yield in https://pulpito.ceph.com/yuriw-2024-07-26_21:23:43-rados-wip-yuri-testing-2024-07-26-0628-distro-default-smithi/
ref: https://tracker.ceph.com/issues/67210

NitzanMordhai · 2024-07-31T05:46:09Z

Rados approved https://tracker.ceph.com/projects/rados/wiki/MAIN#httpstrackercephcomissues67210

- gateway submodule

Fixes: https://tracker.ceph.com/issues/64777

This PR adds high availability support for the nvmeof Ceph service. High availability means that if a GW goes down, the initiator still has another available path and can continue its IO through another GW. It is achieved by running an nvmeof service that consists of at least 2 nvmeof GWs in the Ceph cluster. The host (initiator) sees every GW as a separate path to the nvme namespaces (volumes).

The implementation consists of the following main modules:

- NVMeofGWMon: a PaxosService. It is a monitor that tracks the status of the running nvmeof services and takes action when services fail and when they are restored.
- NVMeofGwMonitorClient: an agent that runs as a part of each nvmeof GW. It sends beacons to the monitor to signal that the GW is alive. As a part of the beacon, the client also sends information about the service. The monitor uses this information to make decisions and perform some operations.
- MNVMeofGwBeacon: the structure used by the client and the monitor to send and receive the beacons.
- MNVMeofGwMap: the map tracks the status of the nvmeof GWs and defines the new role of every GW. When GWs go down or are restored, the map reflects the new role that each GW takes on as a result. The map is distributed to the NVMeofGwMonitorClient on each GW, which updates the GW with the required changes.

It also adds 3 new mon commands:

- nvme-gw create
- nvme-gw delete
- nvme-gw show

The commands are used by cephadm to inform the monitor that a new GW has been deployed. The monitor updates the map accordingly and tracks this GW until it is deleted.

Signed-off-by: Leonid Chernin <lechernin@gmail.com>
Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>

1. qa/tasks/nvmeof.py:
   1.1. create multiple rbd images for all subsystems
   1.2. add NvmeofThrasher and ThrashTest
2. qa/tasks/mon_thrash.py: add 'switch_thrashers' option
3. nvmeof_setup_subsystem.sh: create multiple subsystems and enable HA
4. Restructure qa/suites/rbd/nvmeof: Create two sub-suites
   - "basic" (nvmeof_initiator job)
   - "thrash" (new: nvmeof_mon_thrash and nvmeof_thrash jobs)

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>

yuriw · 2024-07-31T15:10:20Z

@baum @caroav this was tested, approved and ready for merge
ref: https://tracker.ceph.com/issues/67210

cbodley · 2024-07-31T20:02:07Z

.gitmodules

+[submodule "src/boost_redis"]
+	path = src/boost_redis
+	url = https://github.com/boostorg/redis.git


a bad rebase added this back after it was removed in #58428. i've opened #58971 to remove it again

cbodley · 2024-07-31T20:04:06Z

.gitmodules

+	path = src/boost_redis
+	url = https://github.com/boostorg/redis.git
+[submodule "src/nvmeof/gateway"]
+	path = src/nvmeof/gateway


this repository has a spdk submodule that duplicates ceph's spdk submodule. maybe we can remove the original one?

idryomov

maybe we can remove the original one?

+1. I could be wrong, but I think it was added for some experimental work in Bluestore that never panned out. We obviously need the SPDK fork in the organization, but we shouldn't need the submodule in ceph.git.

cbodley · 2024-08-02T14:29:42Z

src/mon/NVMeofGwMap.cc

+              mon->osdmon()->wait_for_writeable_ctx( new CMonRequestProposal(this, addr_vect, expires ));// return false;
+            }
+            else{
+               mon->nvmegwmon()->request_proposal(mon->osdmon());


seeing compiler warnings here: https://tracker.ceph.com/issues/67320

warning: ‘this’ pointer is null [-Wnonnull]

rkhudov · 2024-08-06T14:46:16Z

ceph.spec.in

 BuildRequires:  cmake > 3.5
 BuildRequires:	fuse-devel
 BuildRequires:	git
+BuildRequires:	grpc-devel


@baum should it be optional? Because I can set WITH_NVMEOF_GATEWAY_MONITOR_CLIENT to OFF and I don't need to have grpc-devel for rpm. Right?

swariri · 2024-08-27T10:26:13Z

Is this feature production-ready?

caroav · 2024-08-27T12:03:33Z

Currently, the nvmeof paxos service is disabled by default at compile time. Soon it will be enabled by default. If you want to use it now, you need to remove this commit or include the IFDEF in the build.

swariri · 2024-12-02T13:14:16Z

Is it prod ready?

baum requested a review from a team as a code owner November 27, 2023 09:03

github-actions bot added build/ops common core mon labels Nov 27, 2023

ronen-fr changed the title ~~DRAFT: Ceph nvmeof mononitor~~ DRAFT: Ceph nvmeof monitor Nov 27, 2023

ronen-fr reviewed Nov 27, 2023

src/nvmeof/NVMeofGw.cc (outdated)

ronen-fr reviewed Nov 27, 2023

src/mon/NVMeofGwMap.cc (outdated)

ronen-fr reviewed Nov 27, 2023

src/mon/NVMeofGwMap.cc (outdated)

baum requested a review from batrick November 27, 2023 10:23

batrick requested changes Dec 5, 2023

src/msg/Message.h (outdated)

baum requested a review from a team as a code owner December 18, 2023 17:50

github-actions bot added cephadm pybind labels Dec 18, 2023

baum pushed a commit to baum/ceph that referenced this pull request Dec 18, 2023

Use 0x800 as a bit mask for nvmeofgw messages.

4029390

- per ceph#54671 (comment) - update nvmeof gateway revision Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>

baum pushed a commit to ceph/ceph-ci that referenced this pull request Dec 18, 2023

Use 0x800 as a bit mask for nvmeofgw messages.

696893d

- per ceph/ceph#54671 (comment) - update nvmeof gateway revision Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>

baum force-pushed the ceph-nvmeof-mon branch from 4029390 to 696893d Compare December 18, 2023 19:49

baum pushed a commit to ceph/ceph-ci that referenced this pull request Feb 4, 2024

Use 0x800 as a bit mask for nvmeofgw messages.

12ee0b1

- per ceph/ceph#54671 (comment) - update nvmeof gateway revision Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>

github-actions bot added the needs-rebase label Feb 8, 2024

baum force-pushed the ceph-nvmeof-mon branch from 696893d to e7a25f2 Compare February 11, 2024 17:32

baum requested a review from a team as a code owner February 11, 2024 17:32

github-actions bot added documentation tests labels Feb 11, 2024

baum changed the title ~~DRAFT: Ceph nvmeof monitor~~ Ceph nvmeof monitor Feb 11, 2024

anthonyeleven reviewed Feb 11, 2024

ceph.spec.in (outdated)

ceph.spec.in (outdated)

doc/nvmeof/beacon.puml (outdated)

doc/nvmeof/gateway-state.puml (outdated)

src/common/options/mon.yaml.in (outdated)

jdurgin requested changes Feb 13, 2024

rzarzynski reviewed Jul 12, 2024

src/mon/Monitor.cc

src/mon/MonCommands.h

adk3798 mentioned this pull request Jul 25, 2024

mgr/cephadm: require "group" parameter in nvmeof specs #58860

Merged

rzarzynski approved these changes Jul 25, 2024

src/mon/Monitor.cc

caroav mentioned this pull request Jul 30, 2024

nvmeof Gateway fails to start up in brand new cluster ceph/ceph-nvmeof#669

Open

leonidc and others added 5 commits July 31, 2024 08:50

mon: add NVMe-oF gateway monitor and HA doc

bb75dde

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>

mgr/cephadm: ceph nvmeof monitor support

2946b19

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>

nvmeof gw monitor: disable by default

6911df2

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>

cbodley reviewed Jul 31, 2024

cbodley reviewed Aug 2, 2024

rkhudov reviewed Aug 6, 2024

cbodley mentioned this pull request Aug 29, 2024

erasure-code: Build the avx512 versions of functions in ISA-L #59494

Closed

nizamial09 mentioned this pull request Nov 4, 2024

squid: mgr/dashboard: rm nvmeof conf based on its daemon name #60604

Merged

baum mentioned this pull request Dec 20, 2024

Backport to squid of ceph-nvmeof-mon #61154

Closed

baum mentioned this pull request Jul 8, 2025

Provide a Go module for interacting with the gateway ceph/ceph-nvmeof#1360

Closed

caroav mentioned this pull request Feb 16, 2026

vSphere 7and 8 not detecting namespace and device to be used as vmfs datastores (resolved ceph version 19.2.3 squid (stable) ) ceph/ceph-nvmeof#1753

Open
