mon: add NVMe-oF gateway monitor and HA #54671

Merged
yuriw merged 5 commits into ceph:main from baum:ceph-nvmeof-mon on Jul 31, 2024

baum (Contributor) commented Nov 27, 2023

mon: add NVMe-oF gateway monitor and HA

Fixes: 64777

This PR adds high availability (HA) support for the nvmeof Ceph service. With HA, if a GW goes down, the initiator still has an available path and can continue IO through another GW. HA is achieved by running an nvmeof service that consists of at least 2 nvmeof GWs in the Ceph cluster. The host (initiator) sees every GW as a separate path to the NVMe namespaces (volumes).

The implementation consists of the following main modules:

  • NVMeofGwMon - a PaxosService. It is a monitor that tracks the status of the running nvmeof services and takes action when services fail and when they are restored.
  • NVMeofGwMonitorClient - an intermediary daemon intended to run alongside the NVMe-oF gateway daemon (1:1 mapping: there is a separate instance for each gateway daemon). NVMeofGwMonitorClient speaks:
    • RADOS to the monitor (sends NVMeofGw beacons and receives NVMeofGw maps). The beacon signals that the gateway daemon is alive and also includes information about the overall state of the NVMe-oF service, which the monitor then uses to make decisions and perform [XXX: some operations -- please describe which operations].
    • gRPC to the gateway daemon (gets the NVMe-oF subsystems to include in the beacon, and sets NVMe-oF ANA states based on the information in the map to tell a particular gateway daemon to go live or back off). This is the reason a dependency on gRPC is added to Ceph.
  • MNVMeofGwBeacon - the structure used by the client and the monitor to send and receive beacons.
  • MNVMeofGwMap - the map tracks the status of the nvmeof GWs and defines the new role of every GW. When GWs go down or are restored, the map reflects the resulting role of each GW. The map is distributed to the NVMeofGwMonitorClient on each GW, which updates the GW with the required changes.
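
To make the beacon and map flow described above more concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not the code in this PR: `ToyGwMonitor`, `GwEntry`, `BEACON_TIMEOUT`, the idea of one "home" ANA group per gateway, and the least-loaded failover policy are all hypothetical names and choices made for illustration. The real monitor is a C++ PaxosService and its reassignment rules may differ.

```python
from dataclasses import dataclass

BEACON_TIMEOUT = 10.0  # seconds; hypothetical value chosen for this sketch


@dataclass
class GwEntry:
    home_group: int      # ANA group this gateway serves when it is healthy (assumption)
    last_beacon: float   # time the most recent beacon arrived
    available: bool = True


class ToyGwMonitor:
    """Toy stand-in for the monitor side: tracks beacons and derives a map."""

    def __init__(self):
        self.epoch = 0   # bumped whenever the published map would change
        self.gws = {}    # gw_id -> GwEntry

    def create(self, gw_id, home_group, now):
        # Start tracking a newly deployed gateway.
        self.gws[gw_id] = GwEntry(home_group, now)
        self.epoch += 1

    def beacon(self, gw_id, now):
        # A beacon proves the gateway is alive; a returning gateway changes the map.
        gw = self.gws[gw_id]
        gw.last_beacon = now
        if not gw.available:
            gw.available = True
            self.epoch += 1

    def tick(self, now):
        # Mark gateways whose beacons stopped arriving as unavailable.
        for gw in self.gws.values():
            if gw.available and now - gw.last_beacon > BEACON_TIMEOUT:
                gw.available = False
                self.epoch += 1

    def gw_map(self):
        """Return {ana_group: gw_id or None}: which gateway should serve each group."""
        up = sorted(i for i, gw in self.gws.items() if gw.available)
        # Available gateways keep their own group.
        owners = {self.gws[i].home_group: i for i in up}
        load = {i: 1 for i in up}
        # Groups of failed gateways move to the least-loaded available gateway.
        for gw_id in sorted(self.gws):
            gw = self.gws[gw_id]
            if gw.available:
                continue
            target = min(up, key=lambda i: (load[i], i)) if up else None
            owners[gw.home_group] = target
            if target is not None:
                load[target] += 1
        return owners
```

Because the map in this sketch is derived purely from which gateways are currently available, failback falls out for free: once a failed gateway beacons again, the next map hands its group back to it.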

It also adds 3 new mon commands:

  • nvme-gw create
  • nvme-gw delete
  • nvme-gw show

cephadm uses these commands to tell the monitor that a new GW has been deployed. The monitor updates the map accordingly and tracks this GW until it is deleted.

The design of the HA is documented here

baum requested a review from a team as a code owner November 27, 2023 09:03
ronen-fr changed the title from "DRAFT: Ceph nvmeof mononitor" to "DRAFT: Ceph nvmeof monitor" Nov 27, 2023
baum requested a review from batrick November 27, 2023 10:23
baum requested a review from a team as a code owner December 18, 2023 17:50
baum pushed a commit to baum/ceph that referenced this pull request Dec 18, 2023
- per ceph#54671 (comment)
- update nvmeof gateway revision

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
baum pushed a commit to ceph/ceph-ci that referenced this pull request Dec 18, 2023
- per ceph/ceph#54671 (comment)
- update nvmeof gateway revision

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
rahullepakshi commented

@baum can you please add a high-level description for this pull request, if that's okay?

caroav (Contributor) commented Dec 24, 2023

> @baum can you please add a high-level description for this pull request, if that's okay?

@rahullepakshi see description. It is not entirely complete but I think that it covers most of the idea.

baum pushed a commit to ceph/ceph-ci that referenced this pull request Feb 4, 2024
- per ceph/ceph#54671 (comment)
- update nvmeof gateway revision

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
github-actions bot commented Feb 8, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.

baum requested a review from a team as a code owner February 11, 2024 17:32
baum changed the title from "DRAFT: Ceph nvmeof monitor" to "Ceph nvmeof monitor" Feb 11, 2024
jdurgin (Member) commented Feb 13, 2024

> The design of the HA is documented here - https://docs.google.com/document/d/1Om2_snZ55tV6KJaRdpRGaxu7X3E-F-e4VUuQDE_Vicc/edit. If anyone wants to review it, and requires access please add a comment.

Please add this to the tree, e.g. in doc/dev or in the gateway docs.

jdurgin (Member) left a comment

A few mostly minor things. I still need to take a closer look at the HA protocol once the doc is available tomorrow.

baum (Contributor, Author) commented Jul 12, 2024

> @oritwas @baum Am I missing something, or does NVMeofGwMon not trim its maps?

@athanatos, thank you for spotting this 🖖. I added a get_trim_to() implementation; see the diff.

athanatos (Contributor) commented Jul 12, 2024

Why 500 maps? Does the gw ever need anything other than the most recent one? Is the value of keeping more debugging? Ah, the comment clarifies that. Ok, seems reasonable.
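
For readers who have not seen map trimming in a Paxos-backed service, the arithmetic under discussion can be sketched as follows. This is a Python model under assumptions, not the get_trim_to() added in the diff: the constant name `MAPS_TO_KEEP`, the function signature, and the convention that 0 means "nothing to trim" are illustrative. Only the figure of 500 retained maps comes from this thread.

```python
MAPS_TO_KEEP = 500  # figure quoted in the review; the real constant name is not shown here


def get_trim_to(first_committed: int, last_committed: int,
                keep: int = MAPS_TO_KEEP) -> int:
    """Return the version up to which old maps may be trimmed.

    Returns 0 when there are not yet more than `keep` versions between the
    oldest and newest committed map (assumed "nothing to trim" convention).
    """
    if last_committed - first_committed > keep:
        return last_committed - keep
    return 0
```

With these assumptions, a service holding versions 1 through 1000 would trim up to version 500, while one holding versions 1 through 400 would trim nothing.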

swariri commented Jul 17, 2024

When can we expect this to be merged into a Ceph release?
Eagerly looking forward to the HA functionality.

caroav (Contributor) commented Jul 17, 2024

> When can we expect this to be merged into a Ceph release? Eagerly looking forward to the HA functionality.

Hopefully very soon; we hope within the next week.

yuriw (Contributor) commented Jul 24, 2024

rzarzynski (Contributor) left a comment

The mon part LGTM.

caroav (Contributor) commented Jul 26, 2024

jenkins test api

yuriw (Contributor) commented Jul 27, 2024

NitzanMordhai (Contributor) commented

leonidc and others added 5 commits July 31, 2024 08:50
- gateway submodule

Fixes: https://tracker.ceph.com/issues/64777

This PR adds high availability (HA) support for the nvmeof Ceph service. With HA, if a GW goes down, the initiator still has an available path and can continue IO through another GW. HA is achieved by running an nvmeof service that consists of at least 2 nvmeof GWs in the Ceph cluster. The host (initiator) sees every GW as a separate path to the NVMe namespaces (volumes).

The implementation consists of the following main modules:

- NVMeofGwMon - a PaxosService. It is a monitor that tracks the status of the running nvmeof services and takes action when services fail and when they are restored.
- NVMeofGwMonitorClient - an agent that runs as part of each nvmeof GW. It sends beacons to the monitor to signal that the GW is alive. As part of the beacon, the client also sends information about the service, which the monitor uses to make decisions and perform some operations.
- MNVMeofGwBeacon - the structure used by the client and the monitor to send and receive beacons.
- MNVMeofGwMap - the map tracks the status of the nvmeof GWs and defines the new role of every GW. When GWs go down or are restored, the map reflects the resulting role of each GW. The map is distributed to the NVMeofGwMonitorClient on each GW, which updates the GW with the required changes.

It also adds 3 new mon commands:
- nvme-gw create
- nvme-gw delete
- nvme-gw show

cephadm uses these commands to tell the monitor that a new GW has been deployed. The monitor updates the map accordingly and tracks this GW until it is deleted.

Signed-off-by: Leonid Chernin <lechernin@gmail.com>
Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
1. qa/tasks/nvmeof.py:
    1.1. create multiple rbd images for all subsystems
    1.2. add NvmeofThrasher and ThrashTest
2. qa/tasks/mon_thrash.py: add 'switch_thrashers' option
3. nvmeof_setup_subsystem.sh: create multiple subsystems and enable HA
4. Restructure qa/suites/rbd/nvmeof: Create two sub-suites
   - "basic" (nvmeof_initiator job)
   - "thrash" (new: nvmeof_mon_thrash and nvmeof_thrash jobs)

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
yuriw (Contributor) commented Jul 31, 2024

@baum @caroav this was tested and approved, and is ready for merge
ref: https://tracker.ceph.com/issues/67210

Comment on lines +81 to +83
[submodule "src/boost_redis"]
path = src/boost_redis
url = https://github.com/boostorg/redis.git
Contributor commented:

a bad rebase added this back after it was removed in #58428. i've opened #58971 to remove it again

path = src/boost_redis
url = https://github.com/boostorg/redis.git
[submodule "src/nvmeof/gateway"]
path = src/nvmeof/gateway
Contributor commented:

this repository has a spdk submodule that duplicates ceph's spdk submodule. maybe we can remove the original one?

Contributor commented:

> maybe we can remove the original one?

+1. I could be wrong, but I think it was added for some experimental work in Bluestore that never panned out. We obviously need the SPDK fork in the organization, but we shouldn't need the submodule in ceph.git.

mon->osdmon()->wait_for_writeable_ctx( new CMonRequestProposal(this, addr_vect, expires ));// return false;
}
else{
mon->nvmegwmon()->request_proposal(mon->osdmon());
Contributor commented:

seeing compiler warnings here: https://tracker.ceph.com/issues/67320

warning: ‘this’ pointer is null [-Wnonnull]

BuildRequires: cmake > 3.5
BuildRequires: fuse-devel
BuildRequires: git
BuildRequires: grpc-devel
Contributor commented:

@baum should it be optional? Because I can set WITH_NVMEOF_GATEWAY_MONITOR_CLIENT to OFF and I don't need to have grpc-devel for rpm. Right?

swariri commented Aug 27, 2024

Is this feature production-ready?

caroav (Contributor) commented Aug 27, 2024

Currently, the nvmeof paxos service is disabled by default at compile time. Soon it will be enabled by default. If you want to use it now, you need to remove this commit, or include the IFDEF in the build.

swariri commented Dec 2, 2024

Is it production-ready?
