mon: add NVMe-oF gateway monitor and HA (#54671)
Conversation
- per ceph#54671 (comment) - update nvmeof gateway revision Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
@baum can you please add a high-level description for this pull request, if that's okay?
@rahullepakshi see description. It is not entirely complete, but I think it covers most of the idea.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
Please add this to the tree, e.g. in doc/dev or in the gateway docs.
jdurgin left a comment:
A few mostly minor things; I still need to take a closer look at the HA protocol once the doc is available tomorrow.
@athanatos, thank you for spotting this 🖖, added.
When can we expect this to be merged into a Ceph release?
Hopefully very soon; we hope within the next week.
jenkins test api
- gateway submodule

Fixes: https://tracker.ceph.com/issues/64777

This PR adds high availability (HA) support for the nvmeof Ceph service. High availability means that even if a given GW is down, the initiator has another available path and can continue IO through another GW. It is achieved by running an nvmeof service consisting of at least 2 nvmeof GWs in the Ceph cluster. Every GW is seen by the host (initiator) as a separate path to the nvme namespaces (volumes).

The implementation consists of the following main modules:
- NVMeofGWMon - a PaxosService. It is a monitor that tracks the status of the running nvmeof services and takes action when services fail or are restored.
- NVMeofGwMonitorClient - an agent that runs as part of each nvmeof GW. It sends beacons to the monitor to signal that the GW is alive. As part of the beacon, the client also sends information about the service, which the monitor uses to make decisions and perform some operations.
- MNVMeofGwBeacon - a structure used by the client and the monitor to send/receive the beacons.
- MNVMeofGwMap - the map tracking the status of the nvmeof GWs. It also defines the new role of every GW, so when GWs go down or are restored, the map reflects the new role of each GW resulting from these events. The map is distributed to the NVMeofGwMonitorClient on each GW, which applies the required changes to the GW.

It also adds 3 new mon commands:
- nvme-gw create
- nvme-gw delete
- nvme-gw show

The commands are used by cephadm to inform the monitor that a new GW is deployed. The monitor updates the map accordingly and tracks this GW until it is deleted.

Signed-off-by: Leonid Chernin <lechernin@gmail.com>
Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
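At a high level, the beacon/failover flow described above can be sketched as follows. This is a minimal, hypothetical Python model: the class names (`GwMap`, `GwState`), the timeout value, and the single-surviving-GW failover policy are assumptions for illustration, not the actual NVMeofGWMon/MNVMeofGwMap code.

```python
# Hypothetical sketch of the monitor-side beacon tracking and failover.
# Not the actual Ceph classes; names, timeout, and policy are illustrative.
from dataclasses import dataclass, field

BEACON_TIMEOUT = 10.0  # seconds; illustrative value, not Ceph's default


@dataclass
class GwState:
    last_beacon: float = 0.0
    available: bool = False
    ana_groups: set = field(default_factory=set)  # ANA groups (paths) this GW serves


class GwMap:
    """Toy model of the monitor-side map of nvmeof gateways."""

    def __init__(self):
        self.gws = {}

    def create_gw(self, gw_id, ana_group):
        # corresponds loosely to the `nvme-gw create` mon command
        self.gws[gw_id] = GwState(ana_groups={ana_group})

    def on_beacon(self, gw_id, now):
        # the client side sends this periodically (cf. NVMeofGwMonitorClient)
        gw = self.gws[gw_id]
        gw.last_beacon = now
        gw.available = True

    def tick(self, now):
        # monitor-side scan: detect GWs whose beacons expired and fail over
        for gw_id, gw in self.gws.items():
            if gw.available and now - gw.last_beacon > BEACON_TIMEOUT:
                gw.available = False
                self._failover(gw_id)

    def _failover(self, failed_id):
        # hand the failed GW's ANA groups to a surviving GW so the
        # initiator keeps an available path to its namespaces
        target = next((g for gid, g in self.gws.items()
                       if gid != failed_id and g.available), None)
        if target is None:
            return  # no surviving GW: nothing to fail over to
        failed = self.gws[failed_id]
        target.ana_groups |= failed.ana_groups
        failed.ana_groups = set()
```

In this sketch, when gw1's beacon expires, gw2 takes over gw1's ANA group, so the initiator still has a path to every namespace through gw2.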
1. qa/tasks/nvmeof.py:
1.1. create multiple rbd images for all subsystems
1.2. add NvmeofThrasher and ThrashTest
2. qa/tasks/mon_thrash.py: add 'switch_thrashers' option
3. nvmeof_setup_subsystem.sh: create multiple subsystems and enable HA
4. Restructure qa/suites/rbd/nvmeof: Create two sub-suites
- "basic" (nvmeof_initiator job)
- "thrash" (new: nvmeof_mon_thrash and nvmeof_thrash jobs)
Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
@baum @caroav this was tested and approved, and is ready for merge.
[submodule "src/boost_redis"]
	path = src/boost_redis
	url = https://github.com/boostorg/redis.git
[submodule "src/nvmeof/gateway"]
	path = src/nvmeof/gateway
This repository has an spdk submodule that duplicates ceph's spdk submodule. Maybe we can remove the original one?
> maybe we can remove the original one?
+1. I could be wrong, but I think it was added for some experimental work in Bluestore that never panned out. We obviously need the SPDK fork in the organization, but we shouldn't need the submodule in ceph.git.
    mon->osdmon()->wait_for_writeable_ctx( new CMonRequestProposal(this, addr_vect, expires) ); // return false;
  } else {
    mon->nvmegwmon()->request_proposal(mon->osdmon());
seeing compiler warnings here: https://tracker.ceph.com/issues/67320
warning: ‘this’ pointer is null [-Wnonnull]
BuildRequires: cmake > 3.5
BuildRequires: fuse-devel
BuildRequires: git
BuildRequires: grpc-devel
@baum should it be optional? If I set WITH_NVMEOF_GATEWAY_MONITOR_CLIENT to OFF, I don't need grpc-devel for the rpm build, right?
Is this feature production ready?
Currently, the nvmeof paxos service is disabled by default at compile time. Soon it will be enabled by default. If you want to use it for now, you need to remove this commit, or enable the IFDEF in the build.
Is it production ready?
The design of the HA is documented here.