
rbd-nbd: add ability to restore nbd devices#40809

Closed
pkalever wants to merge 6 commits into ceph:master from pkalever:rbd-nbd

Conversation

@pkalever

Problem:

Currently, with the rbd-nbd tool, we can map nbd devices and scale
them as needed. But if the node is rebooted, there is no way to restore
the nbd device mappings, and one has to manually remap the big list of
devices. This is a serious problem in container environments: if the
container hosting the rbd-nbd processes is restarted, all the mappings
are gone and the application starts seeing IO errors.

Solution:

With these changes, on every successful map operation we save the
details of the nbd device mapping to a persistent file (hereafter called the
saveconfig file) in JSON format, which is then used for restore. Similarly,
on unmap we remove the JSON object for that device.

The restore command leverages the existing attach command to bring the
rbd-nbd processes back to life.

The default path of the JSON file is /etc/ceph/saveconfig.json
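The exact schema of the saveconfig file is not spelled out in this description. As a rough illustration only (the function names, keys, and entry layout below are assumptions for this sketch, not the actual rbd-nbd implementation), the map/unmap bookkeeping could look like:

```python
import json
import os

def save_mapping(cfg_path, device, image_spec, options):
    # On a successful map, record the device -> image mapping together
    # with the options needed to recreate it on restore.
    cfg = {}
    if os.path.exists(cfg_path):
        with open(cfg_path) as f:
            cfg = json.load(f)
    cfg[device] = {"image_spec": image_spec, "options": options}
    with open(cfg_path, "w") as f:
        json.dump(cfg, f, indent=2)

def drop_mapping(cfg_path, device):
    # On unmap, remove the JSON object for that device.
    if not os.path.exists(cfg_path):
        return
    with open(cfg_path) as f:
        cfg = json.load(f)
    cfg.pop(device, None)
    with open(cfg_path, "w") as f:
        json.dump(cfg, f, indent=2)
```

Restore would then iterate the surviving entries and reattach each device.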

Example usage:

$ rbd-nbd list-mapped                                                          
id      pool           namespace  image    snap  device                        
617318  rbd-pool                  image1   -     /dev/nbd0                     
617356  rbd-pool                  image2   -     /dev/nbd1                     
617394  rbd-pool                  image3   -     /dev/nbd2                     
617433  rbd-pool                  image4   -     /dev/nbd3                     
617472  rbd-pool                  image5   -     /dev/nbd4                     
                                                                               
$ pkill -9 rbd-nbd                                                             

Restore all from default location:

$ rbd-nbd restore                                                              
=> Attempting to attach: /dev/nbd0 -> rbd-pool/image1                          
SUCCESS                                                                        
=> Attempting to attach: /dev/nbd1 -> rbd-pool/image2                          
SUCCESS                                                                        
=> Attempting to attach: /dev/nbd2 -> rbd-pool/image3                          
SUCCESS                                                                        
=> Attempting to attach: /dev/nbd3 -> rbd-pool/image4                          
SUCCESS                                                                        
=> Attempting to attach: /dev/nbd4 -> rbd-pool/image5                          
SUCCESS                                                                        

Restore partial list from default location:

$ rbd-nbd restore --attach-list="rbd-pool/image1,/dev/nbd4"
=> Attempting to attach: /dev/nbd0 -> rbd-pool/image1      
SUCCESS                                                    
=> Attempting to attach: /dev/nbd4 -> rbd-pool/image5      
SUCCESS                                                    
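As the transcript above shows, an --attach-list token can name either an image spec (rbd-pool/image1 picks /dev/nbd0) or a device path (/dev/nbd4 picks image5). A minimal sketch of that filtering, with the saveconfig entry layout assumed rather than taken from the PR:

```python
def filter_mappings(mappings, attach_list):
    # mappings: dict of device -> image_spec, as recorded in the
    # saveconfig file.  attach_list: comma-separated tokens, each either
    # an image spec ("pool/image") or a device path ("/dev/nbdN").
    wanted = set(t.strip() for t in attach_list.split(","))
    return {dev: img for dev, img in mappings.items()
            if dev in wanted or img in wanted}
```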

Restore partial list from default location with id, keyfile and mon_host:

$ rbd-nbd --id="admin" --keyfile=/tmp/keyfile -m "10.70.1.100:6789" restore --attach-list="rbd-pool/image2,/dev/nbd3"
=> Attempting to attach: /dev/nbd1 -> rbd-pool/image2
SUCCESS
=> Attempting to attach: /dev/nbd3 -> rbd-pool/image4
SUCCESS

Restore from configurable saveconfig file location:

$ ./bin/rbd-nbd map --saveconfig-file /tmp/rbd-nbd-cfg.json
$ ./bin/rbd-nbd restore --saveconfig-file /tmp/rbd-nbd-cfg.json
$ ./bin/rbd-nbd unmap

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


Prasanna Kumar Kalever added 4 commits April 12, 2021 15:06
define defaults, will be used in later set of patches.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Problem:
-------
Currently, with the rbd-nbd tool, we can map nbd devices and scale
them as needed. But if the node is rebooted, there is no way to restore
the nbd device mappings, and one has to manually remap the big list of
devices. This is a serious problem in container environments: if the
container hosting the rbd-nbd processes is restarted, all the mappings
are gone and the application starts seeing IO errors.

Solution:
--------
With these changes, on every successful map operation we save the
details of the nbd device mapping to a persistent file (hereafter called the
saveconfig file) in JSON format, which is then used for restore. Similarly,
on unmap we remove the JSON object for that device.

The restore command leverages the existing attach command to bring the
rbd-nbd processes back to life.

The default path of the JSON file is /etc/ceph/saveconfig.json

Example usage:
-------------
$ rbd-nbd list-mapped
id      pool           namespace  image    snap  device
617318  rbd-pool                  image1   -     /dev/nbd0
617356  rbd-pool                  image2   -     /dev/nbd1
617394  rbd-pool                  image3   -     /dev/nbd2
617433  rbd-pool                  image4   -     /dev/nbd3
617472  rbd-pool                  image5   -     /dev/nbd4

$ pkill -9 rbd-nbd

$ rbd-nbd restore
=> Attempting to attach: /dev/nbd0 -> rbd-pool/image1
SUCCESS
=> Attempting to attach: /dev/nbd1 -> rbd-pool/image2
SUCCESS
=> Attempting to attach: /dev/nbd2 -> rbd-pool/image3
SUCCESS
=> Attempting to attach: /dev/nbd3 -> rbd-pool/image4
SUCCESS
=> Attempting to attach: /dev/nbd4 -> rbd-pool/image5
SUCCESS

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
These changes add the ability to restore a single rbd-nbd device or a
partial list of devices.

Example usage:
-------------
$ rbd-nbd restore --attach-list="rbd-pool/image1,/dev/nbd4"
=> Attempting to attach: /dev/nbd0 -> rbd-pool/image1
SUCCESS
=> Attempting to attach: /dev/nbd4 -> rbd-pool/image5
SUCCESS

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
restore should also honor a few common options like keyfile, keyring,
id/user, and mon_host, and pass them on to all the attach commands that
restore will invoke.

Example usage:
-------------
$ rbd-nbd --id="admin" --keyfile=/tmp/keyfile -m "10.70.1.100:6789" restore --attach-list="rbd-pool/image2,/dev/nbd3"
=> Attempting to attach: /dev/nbd1 -> rbd-pool/image2
SUCCESS
=> Attempting to attach: /dev/nbd3 -> rbd-pool/image4
SUCCESS

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
@pkalever
Author

cc: @lxbsz @nixpanic @Madhu-1 @oritwas

@tchaikov tchaikov requested a review from trociny April 12, 2021 11:04
Prasanna Kumar Kalever added 2 commits April 12, 2021 19:57
Environments like containers, which run rbd-nbd inside pods, need to point
the saveconfig path at a persistent host-path (bind-mounted) location.
Not just that: CSI drivers, for example, need a saveconfig file per rbd-nbd
device.

To satisfy the above requirements, the changes here introduce a
--saveconfig-file option for the map/restore/unmap commands, which allows
overriding the default location of the saveconfig file.

Please note it is the user's duty to provide the same saveconfig file path
at unmap time if the rbd-nbd process was attached manually instead of being
invoked via the restore command, especially if the --saveconfig-file option
was not used with the attach command.

Example usage:
-------------
$ ./bin/rbd-nbd map --saveconfig-file /tmp/rbd-nbd-cfg.json
$ ./bin/rbd-nbd restore --saveconfig-file /tmp/rbd-nbd-cfg.json
$ ./bin/rbd-nbd unmap

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
This change updates the rbd-nbd man page about the new restore command
and its options.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
@trociny
Contributor

trociny commented Apr 12, 2021

@pkalever Was it discussed earlier in Ceph community before you started working on this?

Because, although the feature looks useful to me and you did a great job here, I hate to say, I am not sure it is the right approach.

It seems like it could be done on a higher layer, keeping rbd-nbd simple. Moreover, it could be more generic, so it could also handle krbd and FreeBSD's rbd-ggate.

We tried to make these drivers provide similar interface so you could use rbd device map|unmap|list -t <device-type> and it would work in similar way for kernel, nbd, and ggate types.

The functionality provided here resembles rbdmap [1], which we have for krbd. And we actually have a feature request to make it generic so it works not only for krbd [2, 3], and @lxbsz seemed to start working on it, although I have no idea if there has been any progress so far. I don't say we have to stick to rbdmap (I have never used it personally), but it would be great if we came up with something more generic. Probably you could just move your code from rbd-nbd to rbd?

Note, it is just my humble opinion though. I would love to see what other people think about it. That is why I asked if there were some discussion about this.

[1] https://docs.ceph.com/en/latest/man/8/rbdmap/
[2] https://tracker.ceph.com/issues/20762
[3] https://tracker.ceph.com/issues/48257

@trociny
Contributor

trociny commented Apr 12, 2021

@idryomov Do you have any opinion about this?

@lxbsz
Member

lxbsz commented Apr 13, 2021

@pkalever Was it discussed earlier in Ceph community before you started working on this?

Because, although the feature looks useful to me and you did a great job here, I hate to say, I am not sure it is the right approach.

It seems like it could be done on a higher layer, keeping rbd-nbd simple. Moreover, it could be more generic, so it could also handle krbd and FreeBSD's rbd-ggate.

We tried to make these drivers provide similar interface so you could use rbd device map|unmap|list -t <device-type> and it would work in similar way for kernel, nbd, and ggate types.

The functionality provided here resembles rbdmap [1], which we have for krbd. And we actually have a feature request to make it generic so it works not only for krbd [2, 3], and @lxbsz seemed to start working on it, although I have no idea if there has been any progress so far. I don't say we have to stick to rbdmap (I have never used it personally), but it would be great if we came up with something more generic. Probably you could just move your code from rbd-nbd to rbd?

This looks good to me. I haven't started on that yet, because I am not sure whether it could work for the containers with rbd-nbd.

For now though, this approach may be a very good start for this feature IMHO; let's make it more generic later if possible.

BRs

@trociny
Contributor

trociny commented Apr 13, 2021

For now though, this approach may be a very good start for this feature IMHO; let's make it more generic later if possible.

When added to rbd-nbd it will be difficult to remove later as it may already be used so we will have to maintain both.

My actual proposal is to move the current implementation from rbd-nbd to rbd (tools/rbd/action/Device.cc). So, instead of calling rbd-nbd restore you could call rbd device restore --device-type nbd. And if it is not trivial to add support for other device types, you can just make the commands return EOPNOTSUPP for now and make it work only for rbd-nbd. @pkalever do you see problems with this?

@idryomov
Contributor

Honestly, I'm confused by the description of the problem:

But if the node is rebooted, there is no way to restore the nbd device mappings, and one has to manually remap the big list of devices. This is a serious problem in container environments: if the container hosting the rbd-nbd processes is restarted, all the mappings are gone and the application starts seeing IO errors.

If the node is rebooted, there won't be any applications there. When the node is back, a static list of mappings in rbdmap-like format should be enough. Extending rbdmap script to handle nbd is trivial, no rbd or rbd-nbd changes needed.
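The rbdmap approach keeps a static file of image specs plus per-image map options that a boot-time script replays. A toy parser for such a file follows; the line format here is a simplified stand-in for illustration, not the exact /etc/ceph/rbdmap syntax, and the generated argv assumes the `rbd device map -t <type>` interface mentioned later in this thread:

```python
def rbdmap_commands(text, device_type="nbd"):
    # Turn "pool/image  key=value,key=value" lines into map command argv
    # lists; blank lines and comments are skipped.
    cmds = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 1)
        argv = ["rbd", "device", "map", "-t", device_type, fields[0]]
        if len(fields) > 1:
            for kv in fields[1].split(","):
                key, _, val = kv.partition("=")
                argv += ["--" + key, val]
        cmds.append(argv)
    return cmds
```

A boot script would simply execute each argv in order.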

If the container hosting rbd-nbd processes is restarted, but the node is not rebooted, that's an entirely different use case. Which one is being targeted here?

@pkalever
Author

Hello @trociny,

It was a holiday for me yesterday, sorry for my delayed response.

Reply on comment 1:

@pkalever Was it discussed earlier in Ceph community before you started working on this?

This is my first contribution to ceph, I'm unaware of the pattern, so excuse me if this feature appears as a surprise.
And, thanks for the info, I will make sure to bring it up in the community meetings if there are any major changes from me in the future.

Because, although the feature looks useful to me and you did a great job here, I hate to say, I am not sure it is the right approach.

It seems like it could be done on a higher layer, keeping rbd-nbd simple. Moreover, it could be more generic, so it could also handle krbd and FreeBSD's rbd-ggate.

We tried to make these drivers provide similar interface so you could use rbd device map|unmap|list -t <device-type> and it would work in similar way for kernel, nbd, and ggate types.

This certainly sounds cool, and it's very much understandable that one simple interface is what we expect.

The functionality provided here resembles rbdmap [1], which we have for krbd. And we actually have a feature request to make it generic so it works not only for krbd [2, 3], and @lxbsz seemed to start working on it, although I have no idea if there has been any progress so far. I don't say we have to stick to rbdmap (I have never used it personally), but it would be great if we came up with something more generic. Probably you could just move your code from rbd-nbd to rbd?

I hadn't heard of the rbdmap tool before; at first glance I see it's a script, and IMHO that will not be enough in the long run.
See my response to the second part of the question below.

Note, it is just my humble opinion though. I would love to see what other people think about it. That is why I asked if there were some discussion about this.

Reply on comment 2:

When added to rbd-nbd it will be difficult to remove later as it may already be used so we will have to maintain both.

My actual proposal is to move the current implementation from rbd-nbd to rbd (tools/rbd/action/Device.cc). So, instead of calling rbd-nbd restore you could call rbd device restore --device-type nbd. And if it is not trivial to add support for other device types, you can just make the commands return EOPNOTSUPP for now and make it work only for rbd-nbd. @pkalever do you see problems with this?

Thanks for welcoming the changes.

Also, thanks for the hint, I looked at tools/rbd/action/Device.cc, and here is what I understand at my first glance:
Device.cc is like a top-level generic CLI parser which accepts the CLI commands for the various backend device types (kernel, nbd, ggate, etc.), but the device-specific additional options (which are uncommon) are still parsed in the respective src/tools/rbd/action/Nbd.cc, src/tools/rbd/action/Ggate.cc, etc. I also noticed that the generic rbd CLI interface forks a subprocess to call the specific backend device-level CLI (at least for nbd and ggate), and hence goes through another round of parsing in the respective individual CLIs.

Coming back to the changes in this PR, which restore mappings on restart of a node (or of a pod in a container environment): I think we are dealing with saving various per-device options that are specific to a given device type. If that is the case, I think we should leave the logic of what a device type wants to save per device in saveconfig.json to that device-type-specific code, and for simplicity integrate the new CLI-level restore option into the generic rbd CLI interface.

That said, at some point we should deprecate the individual CLIs (rbd-nbd, rbd-ggate, etc.), and for that we need to start preparing our users for the upcoming integrated CLI changes. This way we stop duplicating the CLI parsing logic between the individual CLIs and the generic CLI.

One example, not to point fingers at anyone but just to illustrate: the recent attach command for nbd was a great value-added feature, yet its integration is missing from the generic CLI interface. This is the problem we end up with by maintaining both the generic CLI and the individual CLIs.

@trociny, great comments!! I'm happy to get corrected and/or hear your further thoughts.

@pkalever
Author

Hello @idryomov

Honestly, I'm confused by the description of the problem:

But if the node is rebooted, there is no way to restore the nbd device mappings, and one has to manually remap the big list of devices. This is a serious problem in container environments: if the container hosting the rbd-nbd processes is restarted, all the mappings are gone and the application starts seeing IO errors.

If the node is rebooted, there won't be any applications there. When the node is back, a static list of mappings in rbdmap-like format should be enough. Extending rbdmap script to handle nbd is trivial, no rbd or rbd-nbd changes needed.

If the container hosting rbd-nbd processes is restarted, but the node is not rebooted, that's an entirely different use case. Which one is being targeted here?

To be honest, when I started working on this, I was trying to address both cases you listed above with this approach [although my primary goal was to fix the containers one].

But thanks for highlighting this. I just realized that I forgot to add a force option, upon which rbd-nbd restore should attempt a fresh map (ignoring the recorded device number) instead of an attach. I will fix this part in the next respin. Also, since you found it confusing, I shall try to reframe the problem statement in the next respin.

Cheers!

Contributor

@idryomov idryomov left a comment


Before you respin, I think we need to discuss whether adding this functionality to rbd-nbd is a good idea. So far, I don't see any reason this couldn't be done externally by (an equivalent of) a simple shell script. Since you said that the container restart case was the primary goal, let's focus on it. The reboot case should be taken care of by #40844. Some questions to kick off the discussion:

  • What is achieved by storing the mapping details at map time? Could you just grab rbd device list output before restarting the container? If not, what is missing from rbd device list output?

  • What is the use case for partial restoring (i.e. filtering by device node and image spec)?

  • What is achieved by having rbd-nbd do the restore (i.e. fork itself for each device, etc)? Could whoever grabbed rbd device list output just iterate through it and call rbd-nbd attach for each entry?
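The external-script alternative sketched in the last bullet would boil down to replaying one rbd-nbd attach per saved entry. A dry-run illustration that only builds the command lines (the exact attach argument layout is assumed from the examples in this thread, not verified against rbd-nbd's CLI):

```python
def build_attach_cmds(mappings):
    # mappings: list of (device, image_spec) pairs captured earlier,
    # e.g. from `rbd device list` output before a graceful container
    # restart.  A real script would run each argv with subprocess.run().
    return [["rbd-nbd", "attach", "--device", dev, img]
            for dev, img in mappings]
```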

@pkalever
Author

Before you respin, I think we need to discuss whether adding this functionality to rbd-nbd is a good idea. So far, I don't see any reason this couldn't be done externally by (an equivalent of) a simple shell script. Since you said that the container restart case was the primary goal, let's focus on it. The reboot case should be taken care of by #40844. Some questions to kick off the discussion:

Sure, I will wait until I see an ack on the approach from you, @trociny, or @lxbsz.

  • What is achieved by storing the mapping details at map time? Could you just grab rbd device list output before restarting the container? If not, what is missing from rbd device list output?

A more meaningful understanding can be achieved if we visualize the problem from container setups. Let me take you through it:

  • In container environments, the rbd-nbd processes run inside the container (not on the node), so if the container hosting the rbd-nbd processes is restarted, all the rbd-nbd processes are terminated. After the container restart, rbd-nbd list comes back empty!
  • As we cannot always predict when the container will be restarted, it's not possible to capture the list output ahead of time.
  • Also, the user may pass various options to rbd-nbd at map time, and not all of them are shown by list.
  • What is the use case for partial restoring (i.e. filtering by device node and image spec)?
  • In simple terms, the restore/map request in Ceph-CSI is per device; it is not a one-shot restore per node.
    In detail:
    With Ceph-CSI, upon restart of the node plugin pod (the pod which runs rbd-nbd and is also responsible for attaching/mapping and mounting the device), the sidecar which talks to Kubernetes gets the list of VolumeAttachments (mounts/devices) that need to be reattached on that node. For each RBD device, a new NodeStageVolume call is sent to the Ceph-CSI container.
    The main Ceph-CSI container on the node then handles the NodeStageVolume call by calling rbd-nbd restore with the respective image spec and all the other per-device options.
    It's also important to note that this is how Kubernetes expects CSI calls to be handled; it is nothing specific to Ceph-CSI.

  • Also, we cannot restore every device because, in the meantime, some application pods (nbd mounts) might have been migrated to a new node, which means we must not restore/reattach them.

  • What is achieved by having rbd-nbd do the restore (i.e. fork itself for each device, etc)? Could whoever grabbed rbd device list output just iterate through it and call rbd-nbd attach for each entry?

Right, if we want someone to read saveconfig.json and execute the attach commands, then that becomes a new tool again, and we have to take on the additional responsibility of maintaining compatibility between the new tool and rbd-nbd every time we make changes to the rbd-nbd options.

Thanks!

@pkalever pkalever requested a review from idryomov April 14, 2021 14:19
@wjwithagen
Contributor

I'm just a simple soul that likes KISS.
And, not understanding the full details of the containerization, I feel that the lower-level rbd mounters should not be bothered with the administrative tools.

This sound horribly to a sort of fstab for rbd images, and when the Kubernites process maintains this file, I think does exactly what is meant for. Where I would use the rbd tool like mount to process the rbdmap file.

@idryomov
Contributor

idryomov commented Apr 14, 2021

As we cannot always predict when the container is restarted, it's not possible to capture the list output prior in time.

Is that really the case? I might be missing some k8s / CSI spec detail here, but I would argue that if the container gets killed (as opposed to being gracefully restarted for an upgrade), we shouldn't be restoring anything. I mean, rbd-plugin pod is special, it can't be treated like a throwaway container that can just be killed at any time... And on graceful restart, it should be possible to capture rbd device list output or similar.

Also, the user may pass various options to rbd-nbd at map time, and not all of them are shown by list.

This can be addressed fairly easily.

Also, we cannot restore every device because, in the meantime, some application pods (nbd mounts) might have been migrated to a new node, which means we must not restore/reattach them.

I see, thanks for the explanation.

Right, if we want someone to read saveconfig.json and execute the attach commands, then that becomes a new tool again, and we have to take on the additional responsibility of maintaining compatibility between the new tool and rbd-nbd every time we make changes to the rbd-nbd options.

I don't see anything wrong with a new tool. This tool would be using exactly one command, rbd-nbd attach, which should be stable. The equivalent rbd map and related commands have remained largely the same and backwards compatible for about ten years.

Also, I'm not saying that this tool (a script, really) must be maintained outside of Ceph. It can very well be part of Ceph, just not embedded in rbd or rbd-nbd.

@idryomov
Contributor

This sound horribly to a sort of fstab for rbd images, and when the Kubernites process maintains this file, I think does exactly what is meant for. Where I would use the rbd tool like mount to process the rbdmap file.

I'm having trouble parsing this paragraph ;)

@wjwithagen
Contributor

This sound horribly to a sort of fstab for rbd images, and when the Kubernites process maintains this file, I think does exactly what is meant for. Where I would use the rbd tool like mount to process the rbdmap file.

I'm having trouble parsing this paragraph ;)

Sorry about that....

The file that is maintained is actually a recipe for connecting all the rbd images, just as fstab tells you where to mount the storage devices. mount is the all-purpose entry point used to execute the rules in /etc/fstab, and it handles all kinds of devices with different per-device options. Now, you can use mount_nfs <params>, but normally you'd call it like mount -t nfs <params>.

And what we are talking about is just the same, but now it's about RBD images and mapping them to the device they were previously mapped to. It could be a file structure similar to fstab.

So why not use rbd as the frontend for all this: rbd -f /etc/rbdmap mount.

@pkalever
Author

Also, the user may pass various options to rbd-nbd at map time, and not all of them are shown by list.

This can be addressed fairly easily.

Just out of my curiosity,
Do you mean to capture various map options per device within the output of rbd-nbd list?

like:

  --device <device path>        Specify nbd device path (/dev/nbd{num})
  --encryption-format           Image encryption format
                                (possible values: luks1, luks2)
  --encryption-passphrase-file  Path of file containing passphrase for unlocking image encryption
  --exclusive                   Forbid writes by other clients
  --io-timeout <sec>            Set nbd IO timeout
  --max_part <limit>            Override for module param max_part
  --nbds_max <limit>            Override for module param nbds_max
  --quiesce                     Use quiesce callbacks
  --quiesce-hook <path>         Specify quiesce hook path
                                (default: libexec/rbd-nbd/rbd-nbd_quiesce)
  --read-only                   Map read-only
  --reattach-timeout <sec>      Set nbd re-attach timeout
                                (default: 30)
  --try-netlink                 Use the nbd netlink interface
---
  --conf/-c FILE    read configuration from the given configuration file
  --id/-i ID        set ID portion of my name
  --name/-n TYPE.ID set name
  --cluster NAME    set cluster name (default: ceph)
  --setuser USER    set uid to user or uid (and gid to user's gid)
  --setgroup GROUP  set gid to group or gid
  --version         show version and quit

  -d                run in foreground, log to stderr
  -f                run in foreground, log to usual location

Also, we cannot restore every device because, in the meantime, some application pods (nbd mounts) might have been migrated to a new node, which means we must not restore/reattach them.

I see, thanks for the explanation.

Right, if we want someone to read saveconfig.json and execute the attach commands, then that becomes a new tool again, and we have to take on the additional responsibility of maintaining compatibility between the new tool and rbd-nbd every time we make changes to the rbd-nbd options.

I don't see anything wrong with a new tool. This tool would be using exactly one command, rbd-nbd attach, which should be stable. The equivalent rbd map and related commands have remained largely the same and backwards compatible for about ten years.

I mean to say: if we have to introduce a new tool to read the saveconfig.json (the data we store at map time) captured by rbd-nbd map, won't that be additional maintenance? Rather, I don't see anything wrong with rbd-nbd or rbd reading a file that it created itself.

Why should this go into a script when we have a lot of control within rbd/rbd-nbd itself?

Also, I'm not saying that this tool (a script, really) must be maintained outside of Ceph. It can very well be part of Ceph, just not part of rbd or rbd-nbd.

If we are stressing on a more generic way for organizing this mapping, the go-to way is:

  • The top-level rbd interface (maybe Device.cc) implements the restore command.
  • src/tools/rbd/action/Nbd.cc, src/tools/rbd/action/Ggate.cc, src/tools/rbd/action/Kernel.cc, etc.: let the individual device types decide the logic of what to store/retrieve from the saveconfig.json per device.
  • And on rbd device restore -t <type>, let rbd itself parse the various JSON objects, format the device-level commands, and fork (much like what is done for rbd device map -t <type> now).

Thanks!

@trociny
Contributor

trociny commented Apr 15, 2021

  • src/tools/rbd/action/Nbd.cc, src/tools/rbd/action/Ggate.cc, src/tools/rbd/action/Kernel.cc, etc.: let the individual device types decide the logic of what to store/retrieve from the saveconfig.json per device.

I don't quite understand why you need this special handling per device type. Why can't you just save everything that was passed as arguments when map was called and succeeded, and then restore everything that was saved? I guess there is some complication, but I would like to know; probably it could be solved some other way.

@trociny
Contributor

trociny commented Apr 15, 2021

This is my first contribution to ceph, I'm unaware of the pattern, so excuse me if this feature appears as a surprise.
And, thanks for the info, I will make sure to bring it up in the community meetings if there are any major changes from me in the future.

You are completely free to do it the way you want. It would just be sad if, after the discussion, it turned out you had to make many changes or even choose another approach. Discussing this in advance helps avoid that.

@pkalever
Author

BTW, at some point I also want to open a discussion about daemonizing rbd-nbd; maybe I will open a new tracker for this later, but I have had a plan for it. This should help us save resources: the number of processes used, a performance benefit from reusing shared buffers, reusing the RADOS connection objects, etc. Has anyone thought along these lines before?

It has come up a couple of times, but nothing concrete. I don't think it would change anything with respect to the fact that attach in its current form is fundamentally unsafe for general use -- the same concerns would apply to a single daemon working on multiple images. Absent kernel changes, the only way to make attach safe seems to be to exploit and build on the nature of ceph-csi: tightly controlled environment, uniquely named images, etc.

Sorry for clubbing it here. Yes, demonizing is a different topic, I asked about it, to get an overall idea w.r.t the plans.

I spend some time to patch nbd kernel module to avoid the dangerous operation that we discussed before, here is the link [1]

For anyone who is interested in testing kernel nbd changes [1], needs userspace rbd-nbd changes [2]

Assuming the kernel patch gets in, and given that the dangerous operation is now avoided, can I get an ack on moving this code to the rbd integrated CLI?

As far as I can see, @trociny and @lxbsz are in agreement with moving the code to Device.cc; I'm counting on @idryomov.

[1] https://lkml.org/lkml/2021/4/29/274
[2] 0001-rbd-nbd-defend-on-attach-for-image-and-device-mismat.txt

Thanks!

@github-actions

github-actions bot commented May 3, 2021

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@idryomov
Contributor

idryomov commented May 3, 2021

Anyone who is interested in testing the kernel nbd changes [1] will also need the userspace rbd-nbd changes [2].

The kernel patch seems fine, at least at a high level. However @trociny mentioned that we might want to have a way to override the safety check:

And still we might want to make it optional, because there may be cases when reattaching with different image spec is useful, e.g. for live migration.

I don't have a strong opinion here, just bumping so it doesn't get lost.

Assuming the kernel patch gets in, and given that the dangerous operation is now avoided, can I get an ack on moving this code to the rbd integrated CLI?

The attached rbd-nbd patch is not sufficient for promoting to rbd CLI because the interface would need to be restricted to new kernels -- older unsafe kernels must be refused. Also, I don't think an image spec alone is reliable because one can trash the image and create a new one with the same name in less than a second. I think we need some sort of a cookie -- a u64 or even a UUID for identifying the mapping.

@idryomov
Contributor

idryomov commented May 3, 2021

From an ack perspective, I see three separate pieces:

a) making rbd-nbd attach safe. Definitely useful!

b) promoting rbd-nbd attach and rbd-nbd detach actions to rbd CLI. Here, since krbd doesn't have a concept of attach/detach, I would like to understand whether ggate can make use of them and if so, how similar the semantics are going to be. Things like the safety, the difference between detach and unmap (for nbd, rbd-nbd detach is a literal SIGTERM), etc. If the semantics are close enough, let's do it. Otherwise, if it's going to be just nbd (and just on new kernels) it probably shouldn't be promoted.

c) the whole saveconfig functionality. Here, I'll repeat my previous comment: it seems really specific to the combination of CSI and nbd. It hasn't come up before and no concrete use case outside of ceph-csi has been presented. So far I don't see a value in implementing it in rbd.
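For context, the saveconfig functionality discussed in (c) persists per-device mapping details as JSON on map, drops them on unmap, and replays them via `rbd-nbd attach` on restore. A minimal sketch of that bookkeeping follows; the JSON schema (device path mapped to an object with an `image-spec` key) is an assumption for illustration, not the PR's actual on-disk format:

```python
import json

def load_saveconfig(text):
    """Parse the persisted saveconfig JSON into a dict of
    device path -> mapping arguments (assumed schema)."""
    return json.loads(text)

def remove_mapping(cfg, dev):
    """On unmap, drop the entry for the given device."""
    return {d: args for d, args in cfg.items() if d != dev}

def restore_commands(cfg):
    """On restore, build one `rbd-nbd attach` invocation per
    saved mapping, pinning each image to its old device."""
    return [["rbd-nbd", "attach", "--device", dev, args["image-spec"]]
            for dev, args in sorted(cfg.items())]
```

The helper names and schema are hypothetical; the `attach --device` form matches the usage shown elsewhere in this discussion.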

@trociny
Contributor

trociny commented May 3, 2021

@pkalever The kernel patch looks very useful to me. Thanks!

The necessary changes to rbd-nbd to support the new kernel feature are welcome! And if it were some sort of a cookie instead of the image spec for identification I would withdraw my requirement to make it optional. I would need it optional if the image spec were used as an identifier, because an image migration (e.g. to some other pool) is one of the cases where detach/attach may be useful (to make it "live" for the client), and the image spec is changed in this case.

As for "saveconfig" functionality, initially I said that although I did not like it much I would prefer it to be in rbd instead of rbd-nbd if we really needed it. But after our discussion here I like the idea less and less, because it really looks like the thing that should be implemented in ceph-csi.

@wjwithagen
Contributor

a) making rbd-nbd attach safe. Definitely useful!

b) promoting rbd-nbd attach and rbd-nbd detach actions to rbd CLI. Here, since krbd doesn't have a concept of attach/detach, I would like to understand whether ggate can make use of them and if so, how similar the semantics are going to be. Things like the safety, the difference between detach and unmap (for nbd, rbd-nbd detach is a literal SIGTERM), etc. If the semantics are close enough, let's do it. Otherwise, if it's going to be just nbd (and just on new kernels) it probably shouldn't be promoted.

From a FreeBSD/rbd-ggate perspective:
I too am really trying to understand the difference between unmap and detach, but not being very familiar with the Linux side, I guess that unmap destroys the device while detach keeps it? And when detached, access to the device is stalled until something is attached again?

As said, rbd-nbd detach just sends SIGTERM. With rbd-ggate that will destroy the device in /dev/, leaving the entry up for grabs. On the other hand, when /dev/ggate0 is used by a filesystem mount, trying to bluntly reconnect gives:

# bin/rbd-ggate --device /dev/ggate0 map testrbdggate47177/test
rbd::ggate: failed to create ggate device: (17) File exists

Trying to access the mount point after rbd-ggate has crashed leaves hard-to-kill programs suspended on that mount.
On the other hand, if the underlying rbd image gets blocked because there are not enough backing OSDs, work continues once enough OSDs are up to safely continue.

Now if I look at the options of geom_gate and other tools, they specifically state (from man ggatel):

rescue       Take over a previously created provider and handle pending
                  and future requests.  This is useful if the initial ggatel
                  process died.  To prevent data loss, the given path must
                  lead to the regular file or device that was used to create
                  the provider.

So it is possible to pick up a previous connection to a created device.

And thus I think it should be possible to offer similar semantics in rbd-ggate.
But it is currently not implemented.

@pkalever
Author

pkalever commented May 5, 2021

The attached rbd-nbd patch is not sufficient for promoting to rbd CLI because the interface would need to be restricted to new kernels -- older unsafe kernels must be refused. Also, I don't think an image spec alone is reliable because one can trash the image and create a new one with the same name in less than a second. I think we need some sort of a cookie -- a u64 or even a UUID for identifying the mapping.

I agree with this, and yes, a cookie sounds good. If we need to generate a UUID per mapping, then it must be taken care of by rbd-nbd itself. Can you share some ideas on how you think this should be implemented and maintained in rbd-nbd? I'm worried and do not want to move forward on assumptions.

From an ack perspective, I see three separate pieces:

a) making rbd-nbd attach safe. Definitely useful!

Yeah, thanks!

b) promoting rbd-nbd attach and rbd-nbd detach actions to rbd CLI. Here, since krbd doesn't have a concept of attach/detach, I would like to understand whether ggate can make use of them and if so, how similar the semantics are going to be. Things like the safety, the difference between detach and unmap (for nbd, rbd-nbd detach is a literal SIGTERM), etc. If the semantics are close enough, let's do it. Otherwise, if it's going to be just nbd (and just on new kernels) it probably shouldn't be promoted.

I have not used ggate myself, so I will let others speak to it. I'm happy to contribute the code changes for this as promised earlier, and will keep an eye on and follow the final decision about this topic.

c) the whole saveconfig functionality. Here, I'll repeat my previous comment: it seems really specific to the combination of CSI and nbd. It hasn't come up before and no concrete use case outside of ceph-csi has been presented. So far I don't see a value in implementing it in rbd.

Sure, thanks for all your thoughts on this, Ilya; I will move this logic to ceph-csi.

BRs

@pkalever
Author

pkalever commented May 5, 2021

@pkalever The kernel patch looks very useful to me. Thanks!

@trociny Glad to hear that it is helping you.

The necessary changes to rbd-nbd to support the new kernel feature are welcome! And if it were some sort of a cookie instead of the image spec for identification I would withdraw my requirement to make it optional. I would need it optional if the image spec were used as an identifier, because an image migration (e.g. to some other pool) is one of the cases where detach/attach may be useful (to make it "live" for the client), and the image spec is changed in this case.

Sure, I can work on it. I have asked Ilya for his thoughts on the implementation side, just to avoid assumptions, and will start once we all agree on it.

As for "saveconfig" functionality, initially I said that although I did not like it much I would prefer it to be in rbd instead of rbd-nbd if we really needed it. But after our discussion here I like the idea less and less, because it really looks like the thing that should be implemented in ceph-csi.

Sure, I will move it to ceph-csi for now. We can always bring it back here if there are more use cases in the future.

Thanks for your thoughts!

BRs

@pkalever
Author

pkalever commented May 5, 2021

One more thought (on a similar note):
@idryomov ceph-csi uses the rbd interface only, so it would be nice if the rbd CLI promoted the rbd-nbd attach and detach actions.
For now, rbd could promote attach/detach for nbd only, and the others could return EOPNOTSUPP.

Thoughts?

@idryomov
Contributor

idryomov commented May 7, 2021

To prevent data loss, the given path must
lead to the regular file or device that was used to create
the provider.

@wjwithagen So I take it that it would be as easy to shoot yourself in the foot as with the current nbd?

@wjwithagen
Contributor

To prevent data loss, the given path must
lead to the regular file or device that was used to create
the provider.

@wjwithagen So I take it that it would be as easy to shoot yourself in the foot as with the current nbd?

Yes, once the underlying session is gone, there is currently no way to get to the device.
@trociny was the original author of rbd-ggate, and ATM I'm only patching things up when it does not build or run.

But I guess it could do with the detach/attach options as well, to prevent this foot-shooting.

@idryomov
Contributor

idryomov commented May 7, 2021

I had asked Ilya about his thoughts on the implementation side of it, just to avoid assumptions, will start once we all agree on it.

We could generate a UUID and store it in omap in the header on image creation. The problem is that existing images won't have that property so we may need to add a new command just for that or make rbd device map optionally generate a UUID if it's not already there.

Another approach might be to stick with the spec but encode the create timestamp in addition to the spec. Less reliable, but image timestamps were added a while ago, so relatively few images would be missing them, and we could just skip providing the info to the kernel for those (therefore disallowing rbd-nbd attach for them).

I'd prefer the former, since it's more reliable and a per-image UUID will come in useful for other things in the future.

@idryomov
Contributor

idryomov commented May 7, 2021

@idryomov ceph-csi uses the rbd interface only, so it would be nice if the rbd CLI promoted the rbd-nbd attach and detach actions.
For now, rbd could promote attach/detach for nbd only, and the others could return EOPNOTSUPP.

Promoting attach/detach before attach is made safe is just asking for data corruption tickets from frustrated users.

Is there a problem with using rbd-nbd in ceph-csi? The binary is there, in the same container.

@trociny
Contributor

trociny commented May 7, 2021

We could generate a UUID and store it in omap in the header on image creation. The problem is that existing images won't have that property so we may need to add a new command just for that or make rbd device map optionally generate a UUID if it's not already there.

I was thinking that the UUID could be generated on the rbd-nbd map command and returned to the user, who could remember it and use it on reattach. Do you see issues with such an approach?

The problem would be how to return the uuid without breaking compatibility. E.g. on map, the "{dev} {uuid}" string could be returned instead of just "{dev}" as it is currently, if an option like --return-uuid is specified (to preserve compatibility).
Or it could be shown in the rbd-nbd list-mapped output (though that would break compatibility). Or yet another command could be added to rbd-nbd, like rbd-nbd status <dev>, that would return the uuid.

@trociny
Contributor

trociny commented May 7, 2021

Another thing to consider is that we may map an image snapshot. So we would need to store a uuid for every snapshot if we want to store it permanently in the image metadata.

@idryomov
Contributor

idryomov commented May 9, 2021

I was thinking that the UUID could be generated on the rbd-nbd map command and returned to the user, who could remember it and use it on reattach. Do you see issues with such an approach?

Ah, I see. Now I realize why you said that you would withdraw your requirement to make the kernel enforcement optional -- you meant a per-mapping cookie, not a per-image one.

It sounds good to me and easier to implement too. I was leaning towards a persistent per-image UUID because I suspect we will end up needing one in the future anyway, for something entirely unrelated.

We never promised plain unstructured output compatibility. For json and xml, it should be compatible -- just a new field. I think having list show it is enough for the initial cut, something like --device-cookie can always be added later.

@pkalever
Author

@idryomov ceph-csi uses the rbd interface only, so it would be nice if the rbd CLI promoted the rbd-nbd attach and detach actions.
For now, rbd could promote attach/detach for nbd only, and the others could return EOPNOTSUPP.

Promoting attach/detach before attach is made safe is just asking for data corruption tickets from frustrated users.

Honestly, I'm seeing these updates only now, after pushing #41279.
Excuse me for jumping the gun; for some reason I didn't see the notifications in my inbox, they might have been filtered or lost in my backlog. Sure, if you think so, we can hold it until the kernel patch and the userspace changes are available.

Is there a problem with using rbd-nbd in ceph-csi? The binary is there, in the same container.

The existing model uses the rbd integrated CLI, and I do not want to abuse it by introducing parsing and formatting logic for multiple CLIs, which is additional maintenance in ceph-csi. I don't see a reason why attach/detach shouldn't be promoted to the rbd CLI, though.

@pkalever
Author

pkalever commented May 11, 2021

I was thinking that the UUID could be generated on the rbd-nbd map command and returned to the user, who could remember it and use it on reattach. Do you see issues with such an approach?

Ah, I see. Now I realize why you said that you would withdraw your requirement to make the kernel enforcement optional -- you meant a per-mapping cookie, not a per-image one.

It sounds good to me and easier to implement too. I was leaning towards a persistent per-image UUID because I suspect we will end up needing one in the future anyway, for something entirely unrelated.

We never promised plain unstructured output compatibility. For json and xml, it should be compatible -- just a new field. I think having list show it is enough for the initial cut, something like --device-cookie can always be added later.

I'm somewhat on the same page with @trociny's idea here.

Repeating the flow, to cross-check for anything missing before implementing it:

  • On each map
    • generate a new UUID
    • request kernel to store the same UUID in /sys/block/nbdX/backend
    • return the {dev, UUID} to user, as part of map command output
    • we do not save this UUID anywhere (like in omap or file or elsewhere)
  • list-mapped should read UUID from /sys/block/nbdX/backend and print it.
  • On every attach request, the user is supposed to pass the required UUID; if the UUID doesn't match /sys/block/nbdX/backend, we fail.

Overall the flow should look like:

$ rbd-nbd map rbd-pool/image1 --try-netlink
/dev/nbd0, 81adc51d-2dd1-466b-b360-a4a2ad8b2d5c

$ cat /sys/block/nbd0/backend
81adc51d-2dd1-466b-b360-a4a2ad8b2d5c

$ rbd-nbd list-mapped
id     pool           namespace  image   snap  device     cookie
16411  cephfs.a.data             image1  -     /dev/nbd0  81adc51d-2dd1-466b-b360-a4a2ad8b2d5c

$ rbd-nbd attach --device=/dev/nbd0 rbd-pool/image1 --cookie 81adc51d-2dd1-466b-b360-a4a2ad8b2d5c
/dev/nbd0

Hope I didn't miss anything.
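The attach-time check in the flow above can be sketched as follows. This is an editorial sketch with hypothetical helper names, assuming the kernel exposes the cookie via /sys/block/nbdX/backend as described (rbd-nbd itself is C++; Python is used here only for brevity):

```python
import os

def backend_cookie(dev, sysfs_root="/sys/block"):
    """Read the cookie the kernel stored for an nbd device,
    e.g. /dev/nbd0 -> /sys/block/nbd0/backend.
    Returns None if the attribute is missing (older kernels)."""
    name = os.path.basename(dev)  # "/dev/nbd0" -> "nbd0"
    path = os.path.join(sysfs_root, name, "backend")
    try:
        with open(path) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None

def attach_allowed(dev, user_cookie, sysfs_root="/sys/block"):
    """Refuse attach when no cookie is stored (unsafe kernel)
    or when the stored cookie does not match the user's."""
    stored = backend_cookie(dev, sysfs_root)
    return stored is not None and stored == user_cookie
```

The `sysfs_root` parameter only exists to make the sketch testable; a real implementation would read sysfs directly.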

@trociny
Contributor

trociny commented May 11, 2021

$ rbd-nbd map rbd-pool/image1 --try-netlink
/dev/nbd0, 81adc51d-2dd1-466b-b360-a4a2ad8b2d5c

If we return the uuid in list-mapped output, it is not necessary to return it here -- the caller may find it out by running list-mapped.

I am just not very happy about the idea of changing the rbd-nbd map output, because users may expect it to return the device only and use this in their scripts, like our test script:

DEV=`rbd-nbd map ...`

$ cat /sys/block/nbd0/backend
81adc51d-2dd1-466b-b360-a4a2ad8b2d5c

And now, given that the user may get the uuid from /sys/block/nbd0/backend, I wonder whether we really need to modify list-mapped to provide it?

@pkalever
Author

If we return the uuid in list-mapped output, it is not necessary to return it here -- the caller may find it out by running list-mapped.

I am just not very happy about the idea of changing the rbd-nbd map output, because users may expect it to return the device only and use this in their scripts, like our test script:

DEV=`rbd-nbd map ...`

Sounds good, we can always get it like

DEV=`rbd-nbd map ...`
UUID=`cat /sys/block/$DEV/backend`

And now, given that the user may get the uuid from /sys/block/nbd0/backend, I wonder whether we really need to modify list-mapped to provide it?

Not everyone is aware of /sys/block/nbdX/backend, so I think it's good to have it printed by list-mapped, provided we do not want to return it with map.

@idryomov
Contributor

Not everyone is aware of /sys/block/nbdX/backend, so I think it's good to have it printed by list-mapped, provided we do not want to return it with map.

Yes, users should not need to poke in sysfs. I see three options:

  1. rbd device map returning e.g. /dev/nbd0 81adc51d-2dd1-466b-b360-a4a2ad8b2d5c when --device-cookie is specified (and, as before, /dev/nbd0 if --device-cookie isn't specified)
  2. rbd device list getting a new (perhaps hidden by default and requiring --all or similar) column and json/xml field
  3. both of the above

I guess it boils down to whether we want to make it easy to discover the cookie after rbd device map runs. If not, (1) is sufficient and very easy to implement.
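A consumer-side sketch of option (1): a script wrapping `rbd device map` would need to handle both output shapes, with and without the cookie. The helper below is hypothetical and assumes the cookie, when requested, is appended after the device name:

```python
def parse_map_output(line):
    """Split `rbd device map` output into (device, cookie).

    With --device-cookie the line would be "/dev/nbd0 <uuid>";
    without it, only the device is printed and cookie is None."""
    parts = line.split()
    if len(parts) > 1:
        return parts[0], parts[1]
    return parts[0], None
```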

@trociny
Contributor

trociny commented May 11, 2021

I see three options:

I am fine with any.

@pkalever
Author

Yes, users should not need to poke in sysfs. I see three options:

  1. rbd device map returning e.g. /dev/nbd0 81adc51d-2dd1-466b-b360-a4a2ad8b2d5c when --device-cookie is specified (and, as before, /dev/nbd0 if --device-cookie isn't specified)
  2. rbd device list getting a new (perhaps hidden by default and requiring --all or similar) column and json/xml field
  3. both of the above

I guess it boils down to whether we want to make it easy to discover the cookie after rbd device map runs. If not, (1) is sufficient and very easy to implement.

As per this discussion, I have opened #41323, which implements both (1) and (2).

Thanks!

@stale

stale bot commented Jul 21, 2021

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Jul 21, 2021
@pkalever
Author

pkalever commented Dec 2, 2021

We have introduced a volume healer in ceph-csi to work around this problem. I'm closing this; in case anyone needs a generic restore of rbd-nbd devices, this should serve as a guide.

Thanks!

@stale stale bot removed the stale label Dec 2, 2021
@pkalever pkalever closed this Dec 2, 2021