rbd-nbd: add ability to restore nbd devices #40809
pkalever wants to merge 6 commits into ceph:master from [...]
Conversation
Define defaults; these will be used in a later set of patches.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Problem:
--------
Currently, with the rbd-nbd tool, we can map nbd devices and scale them as needed. But if the node is rebooted, there is no way to restore the nbd device mappings, and one has to remap a long list of devices manually. This is a serious problem in container environments: if the container hosting the rbd-nbd processes is restarted, all the mappings are gone and the application starts seeing I/O errors.

Solution:
---------
With these changes, on every successful map operation we save the details of the nbd device mapping to a persistent file (hereafter called the saveconfig file) in JSON format, which is then used for restore. Similarly, on unmap we remove the JSON object for that device. The restore command leverages the attach command to bring the rbd-nbd processes back to life. The default path of the JSON file is /etc/ceph/saveconfig.json.

Example usage:
--------------
$ rbd-nbd list-mapped
id      pool      namespace  image   snap  device
617318  rbd-pool             image1  -     /dev/nbd0
617356  rbd-pool             image2  -     /dev/nbd1
617394  rbd-pool             image3  -     /dev/nbd2
617433  rbd-pool             image4  -     /dev/nbd3
617472  rbd-pool             image5  -     /dev/nbd4
$ pkill -9 rbd-nbd
$ rbd-nbd restore
=> Attempting to attach: /dev/nbd0 -> rbd-pool/image1 SUCCESS
=> Attempting to attach: /dev/nbd1 -> rbd-pool/image2 SUCCESS
=> Attempting to attach: /dev/nbd2 -> rbd-pool/image3 SUCCESS
=> Attempting to attach: /dev/nbd3 -> rbd-pool/image4 SUCCESS
=> Attempting to attach: /dev/nbd4 -> rbd-pool/image5 SUCCESS

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
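The commit message above does not spell out the saveconfig schema, but the map/unmap bookkeeping it describes can be sketched roughly as follows. This is an illustrative sketch only: the field names, file layout, and function names are assumptions, not the actual format rbd-nbd writes.

```python
import json
import os

# Default location named in the commit message; illustrative constant.
SAVECONFIG = "/etc/ceph/saveconfig.json"

def save_mapping(path, device, pool, image, snap=None):
    """Record a successful map so a later `restore` can replay it."""
    cfg = {}
    if os.path.exists(path):
        with open(path) as f:
            cfg = json.load(f)
    cfg[device] = {"pool": pool, "image": image, "snap": snap}
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)

def drop_mapping(path, device):
    """Remove the device's entry on unmap."""
    with open(path) as f:
        cfg = json.load(f)
    cfg.pop(device, None)
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
```

On restore, each recorded entry would be handed to the attach command to reconnect the device.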
These changes add the ability to restore a single device or a partial list of rbd-nbd devices.

Example usage:
--------------
$ rbd-nbd restore --attach-list="rbd-pool/image1,/dev/nbd4"
=> Attempting to attach: /dev/nbd0 -> rbd-pool/image1 SUCCESS
=> Attempting to attach: /dev/nbd4 -> rbd-pool/image5 SUCCESS

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
restore should also honor a few common options like keyfile, keyring, id/user, and mon_host, and pass them to all the attach commands that restore invokes.

Example usage:
--------------
$ rbd-nbd --id="admin" --keyfile=/tmp/keyfile -m "10.70.1.100:6789" restore --attach-list="rbd-pool/image2,/dev/nbd3"
=> Attempting to attach: /dev/nbd1 -> rbd-pool/image2 SUCCESS
=> Attempting to attach: /dev/nbd3 -> rbd-pool/image4 SUCCESS

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Environments like containers running rbd-nbd in pods need to tune the saveconfig path to a persistent host-path (bind-mounted) location. Moreover, CSI drivers, for example, need a saveconfig file per rbd-nbd device. To satisfy these requirements, the changes here introduce a --saveconfig-file option for the map/restore/unmap commands, which allows tuning the default location of the saveconfig file. Please note it is the user's duty to provide the same saveconfig file path at unmap time if the rbd-nbd process was attached manually instead of via the restore command, especially if the --saveconfig-file option was not used with the attach command.

Example usage:
--------------
$ ./bin/rbd-nbd map --saveconfig-file /tmp/rbd-nbd-cfg.json
$ ./bin/rbd-nbd restore --saveconfig-file /tmp/rbd-nbd-cfg.json
$ ./bin/rbd-nbd unmap

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
This change updates the rbd-nbd man page with the new restore command and its options.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
@pkalever Was this discussed earlier in the Ceph community before you started working on it? Because, although the feature looks useful to me and you did a great job here, I hate to say I am not sure it is the right approach. It seems like it could be done at a higher layer, keeping rbd-nbd simple. Moreover, it could be more generic, so it could also handle krbd and FreeBSD's rbd-ggate. We tried to make these drivers provide a similar interface so you could use them uniformly. The functionality provided here resembles the rbdmap tool [1]. Note, it is just my humble opinion though. I would love to see what other people think about it. That is why I asked if there was some discussion about this.

[1] https://docs.ceph.com/en/latest/man/8/rbdmap/
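For context on the comparison: the rbdmap mechanism documented in [1] works off a static config file (/etc/ceph/rbdmap) with one image spec and its map options per line, applied at boot by a systemd unit. A rough sketch of such a file (entry names and paths here are illustrative, not taken from this PR):

```
# <pool>/<image>        <map options>
rbd-pool/image1         id=admin,keyring=/etc/ceph/ceph.client.admin.keyring
rbd-pool/image2         id=admin,keyring=/etc/ceph/ceph.client.admin.keyring
```

This is the higher-layer, driver-agnostic shape being contrasted with baking restore logic into rbd-nbd itself.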
@idryomov Do you have any opinion about this?
This looks good to me. I haven't started on that yet, because I am not sure whether this could work for containers with rbd-nbd. For now, this approach may be a very good start for this feature IMHO; let's make it more generic later if possible. BRs
When added to rbd-nbd it will be difficult to remove later, as it may already be in use, so we will have to maintain both. My actual proposal is to move the current implementation from rbd-nbd to the rbd CLI.
Honestly, I'm confused by the description of the problem:
If the node is rebooted, there won't be any applications there. When the node is back, a static list of mappings in [...] If the container hosting [...]
Hello @trociny, it was a holiday for me yesterday; sorry for the delayed response. Reply to comment 1:
This is my first contribution to Ceph; I'm unaware of the pattern, so excuse me if this feature appears as a surprise.
This certainly sounds cool, and it's very much understandable that one simple interface is what we expect.
I hadn't heard of the rbdmap tool before. At first glance I see it's a script, and IMHO that will not be enough in the long run.
Reply to comment 2:
Thanks for welcoming the changes. Also, thanks for the hint; I looked at tools/rbd/action/Device.cc, and here is what I understand at first glance: [...]

Coming back to the changes w.r.t. this PR, which is the [...]

That said, at some point in time we should deprecate the individual CLIs (rbd-nbd, rbd-ggate, etc.), and for that we need to start preparing our users to adapt to the upcoming integrated CLI changes. This way we will stop duplicating the CLI parsing logic between the individual CLIs and the generic CLI. One example, not to point out or hurt anyone, rather just to give an example: the [...]

@trociny, great comments! I'm happy to get corrected and/or hear your further thoughts.
Hello @idryomov
To be honest, when I started working on this, with this approach I was trying to address both cases you listed above (although my primary goal was to fix the containers one). But, thanks for highlighting this, I just realized that I forgot to add a force option, upon which the [...] Cheers!
idryomov left a comment
Before you respin, I think we need to discuss whether adding this functionality to rbd-nbd is a good idea. So far, I don't see any reason this couldn't be done externally by (an equivalent of) a simple shell script. Since you said that the container restart case was the primary goal, let's focus on it. The reboot case should be taken care of by #40844. Some questions to kick off the discussion:
1. What is achieved by storing the mapping details at map time? Could you just grab `rbd device list` output before restarting the container? If not, what is missing from `rbd device list` output?

2. What is the use case for partial restoring (i.e. filtering by device node and image spec)?

3. What is achieved by having `rbd-nbd` do the restore (i.e. fork itself for each device, etc.)? Could whoever grabbed `rbd device list` output just iterate through it and call `rbd-nbd attach` for each entry?
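For illustration, the script-based alternative raised in question 3 could be as small as the sketch below: parse `rbd device list` output captured before the restart and emit one `rbd-nbd attach` invocation per row. The column layout is assumed from the `list-mapped` example earlier in this thread, and the exact attach flags are an assumption; a real script would execute the commands rather than return them.

```python
def attach_commands(device_list_output):
    """Turn saved `rbd device list`-style output into attach invocations.

    Assumes whitespace-separated columns as in the list-mapped example:
    id, pool, [namespace], image, snap, device (namespace may be blank).
    """
    cmds = []
    for line in device_list_output.strip().splitlines()[1:]:  # skip header
        fields = line.split()
        if len(fields) == 5:                  # blank namespace column
            _, pool, image, snap, dev = fields
            spec = f"{pool}/{image}"
        else:
            _, pool, ns, image, snap, dev = fields
            spec = f"{pool}/{ns}/{image}"
        if snap != "-":
            spec += f"@{snap}"
        # '--device' flag is assumed for illustration.
        cmds.append(f"rbd-nbd attach --device {dev} {spec}")
    return cmds
```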
Sure, I will wait until I see an ack on the approach from you, @trociny, or @lxbsz.
A more meaningful understanding can be achieved if we visualize the problem from container setups. Let me take you through it:
Right, if we want someone to read the saveconfig.json and execute the [...] Thanks!
I'm just a simple soul that likes KISS. This sounds horribly like a sort of [...]
Is that really the case? I might be missing some k8s / CSI spec detail here, but I would argue that if the container gets killed (as opposed to being gracefully restarted for an upgrade), we shouldn't be restoring anything. I mean, [...]
This can be addressed fairly easily.
I see, thanks for the explanation.
I don't see anything wrong with a new tool. This tool would be using exactly one command, `rbd-nbd attach`. Also, I'm not saying that this tool (a script, really) must be maintained outside of Ceph. It can very well be part of Ceph, just not embedded in rbd-nbd.
I'm having trouble parsing this paragraph ;)
Sorry about that... The file that is maintained is actually a recipe for connecting to all rbd images, just as fstab tells you where to mount the storage devices. mount is the all-purpose tool used to execute the rules in /etc/fstab, and it handles all kinds of devices with different per-device options. Now you can use [...] And what we are talking about is just the same, but now it is about RBD images and mapping each to the device it was previously mapped to. It could be a file structure similar to fstab. So why not use [...]
Just out of my curiosity, like:
I mean to say, if we have to introduce a new tool to read the saveconfig.json (data that we are storing at the time of map) captured by [...] Why should this go into a script when we have a lot of control with the [...]
If we are stressing on a more generic way for organizing this mapping, the go-to way is:
Thanks! |
I don't quite understand why you need this special handling per device type. Why can't you just save everything that was passed as arguments when map was called and succeeded, and then restore everything that was saved? I guess there is some complication, but I would like to know; probably it could be solved some other way.
You are completely free to do it the way you want. It would just be sad if, after the discussion, it turned out you had to make many changes or even choose another approach. Discussing this in advance helps avoid that.
Sorry for clubbing it here. Yes, daemonizing is a different topic; I asked about it to get an overall idea of the plans.

I spent some time patching the nbd kernel module to avoid the dangerous operation that we discussed before; here is the link [1]. Anyone interested in testing the kernel nbd changes [1] needs the userspace rbd-nbd changes [2].

Assuming the kernel patch will get in, and given the dangerous operation is avoided now, can I get an ack on this for moving this code to the rbd integrated CLI? As far as I can see, @trociny and @lxbsz are in agreement with moving the code to Device.cc; I'm counting on @idryomov.

[1] https://lkml.org/lkml/2021/4/29/274

Thanks!
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
The kernel patch seems fine, at least at a high level. However @trociny mentioned that we might want to have a way to override the safety check:
I don't have a strong opinion here, just bumping so it doesn't get lost.
The attached [...]
From an ack perspective, I see three separate pieces: a) making [...]; b) promoting [...]; c) the whole [...]
@pkalever The kernel patch looks very useful to me. Thanks! The necessary changes to rbd-nbd to support the new kernel feature are welcome! And if it were some sort of cookie instead of the image spec used for identification, I would withdraw my requirement to make it optional. I would need it to be optional if the image spec were used as an identifier, because an image migration (e.g. to some other pool) is one of the cases where detach/attach may be useful (to make it "live" for the client), and the image spec changes in that case. As for the "saveconfig" functionality, initially I said that although I did not like it much, I would prefer it to be in rbd instead of rbd-nbd if we really needed it. But after our discussion here I like the idea less and less, because it really looks like something that should be implemented in ceph-csi.
From a FreeBSD/rbd-ggate perspective: [...] As said, [...] Trying to access the mount point after [...] Now if I look at the options with [...] So it is possible to pick up a previous connection to a created device. And thus I think it should be possible to offer similar semantics in [...]
I agree with this, and yes, a cookie sounds good. If we need to generate a UUID per mapping, then it must be taken care of by rbd-nbd itself. Can you add some ideas on how you think this should be implemented and maintained in rbd-nbd? I'm worried and do not want to move forward with assumptions.
Yeah, thanks!
I have not used ggate myself; I will let others speak to it. I'm happy to contribute the code changes for this as promised earlier, and will keep an eye on and follow the final decision on this topic.
Sure, thanks for all your thoughts on this, Ilya. I will move this logic to ceph-csi. BRs
@trociny Glad to hear that is helping you.
Sure, I can work on it. I had asked Ilya about his thoughts on the implementation side of it, just to avoid assumptions; I will start once we all agree on it.
Sure, I will move it to ceph-csi for now. We can always bring it here if there are more use cases in the future. Thanks for your thoughts! BRs
One more thought (on a similar note): [...] Thoughts?
@wjwithagen So I take it it would be as easy to shoot yourself in the foot as with current nbd? |
Yes, once the underlying session is gone, there is currently no way to get to the device. But I guess it can do with the detach/attach options as well to prevent this foot-shooting. |
We could generate a UUID and store it in omap in the header on image creation. The problem is that existing images won't have that property, so we may need to add a new command just for that, or make [...]

Another approach might be to stick with the spec but encode the create timestamp in addition to the spec. Less reliable, but image timestamps were added a while ago, so relatively few images would be missing them, and we could just skip providing the info to the kernel (therefore disallowing [...]).

I'd prefer the former, since it's more reliable and a per-image UUID will come in useful for other things in the future.
Promoting attach/detach before attach is made safe is just asking for data corruption tickets from frustrated users. Is there a problem with using [...]?
I was thinking that the UUID could be generated on map. The problem would be how to return the uuid without breaking compatibility. E.g. on map the "{dev} {uuid}" string could be returned instead of just "{dev}" as it is currently, if an option like [...] were provided.
Another thing to consider is that we may map an image snapshot. So we would need to store a uuid for every snapshot if we want to permanently store it in the image metadata.
Ah, I see. Now I realize why you said that you would withdraw your requirement to make the kernel enforcement optional -- you meant a per-mapping cookie, not a per-image one. It sounds good to me, and easier to implement too. I was leaning towards a persistent per-image UUID because I suspect we will end up needing one in the future anyway, for something entirely unrelated. We never promised plain unstructured output compatibility. For [...]
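The per-mapping cookie being discussed (generated at map time, returned alongside the device, and checked again at attach) can be sketched minimally as below. Everything here is illustrative: the function names and the "{dev} {cookie}" output format come from the proposal in this thread, not from shipped code.

```python
import uuid

def new_mapping_cookie():
    """Generate a one-off cookie for a single mapping. Because it is not
    derived from the image spec, migrating the image (which changes its
    spec) would not invalidate the cookie."""
    return str(uuid.uuid4())

def parse_map_output(line):
    """Parse the proposed '{dev} {cookie}' map output; falls back to the
    current plain '{dev}' form, in which case cookie is None."""
    parts = line.split()
    dev = parts[0]
    cookie = parts[1] if len(parts) > 1 else None
    return dev, cookie
```

At attach time, the caller would pass the saved cookie back so the kernel (or rbd-nbd) can refuse to reattach the wrong device.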
Honestly, I'm seeing these updates now after pushing #41279
The existing model uses the rbd integrated CLI, and I do not want to abuse it by introducing multiple CLI parsing and formatting logics, which is additional maintenance in ceph-csi. I don't see a reason why attach/detach shouldn't be promoted to the rbd CLI though.
I'm somewhat on the same page with @trociny's idea here. Repeating the flow to cross-check for anything missing, before implementing it:
Overall the flow should look like: [...] Hope I didn't miss anything.
If we return the uuid in [...] I am just not very happy about the idea of changing the [...]
And now, given that the user may get the uuid from [...]
Sounds good, we can always get it like [...]
Not everyone is aware of sysfs.
Yes, users should not need to poke in sysfs. I see three options:
I guess it boils down to whether we want to make it easy to discover the cookie after [...]
I am fine with any. |
As per this discussion, I have opened #41323, which implements both (1) and (2). Thanks!
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
We have introduced a volume healer in ceph-csi to work around this problem. I'm closing this; in case anyone needs generic restore of rbd-nbd devices, this PR should serve as a guide. Thanks!
Example usage:
Restore all from the default location:
$ rbd-nbd restore

Restore a partial list from the default location:
$ rbd-nbd restore --attach-list="rbd-pool/image1,/dev/nbd4"

Restore a partial list from the default location with id, keyfile and mon_host:
$ rbd-nbd --id="admin" --keyfile=/tmp/keyfile -m "10.70.1.100:6789" restore --attach-list="rbd-pool/image2,/dev/nbd3"

Restore from a configurable saveconfig file location:
$ rbd-nbd restore --saveconfig-file /tmp/rbd-nbd-cfg.json