
graphdriver: add support for ceph graph driver#9146

Closed
simon3z wants to merge 1 commit into moby:master from simon3z:master

Conversation

@simon3z

simon3z commented Nov 13, 2014

This is an initial proof of concept for adding a ceph graph driver. It's not in great shape yet, but it works pretty well at the moment.

Please leave feedback; I am willing to finish this up and get it merged if you think it's interesting.

@cpuguy83
Member

At first glance, it seems like all your functions ought to be private, except the stuff in driver.go (since that's implementing the public interface).

@simon3z
Author

simon3z commented Nov 13, 2014

Thanks @cpuguy83 I'll fix that in the next revision.

@cpuguy83
Member

What's needed to make this work? After installing librados and librbd I was at least able to get it to compile, albeit with an error from librados.
The built binary segfaulted.

@simon3z
Author

simon3z commented Nov 13, 2014

@cpuguy83 you need your host to be configured to use your ceph cluster. For example this should work:

ceph osd lspools

0 data,1 metadata,2 rbd

Then the only thing you need to do is:

ceph osd pool create docker-data 100

ceph osd lspools

0 data,1 metadata,2 rbd,3 docker-data,

and you should be good to go.

BTW, can you paste the librados error that you got?

@cpuguy83
Member

---> Making bundle: binary (in bundles/1.3.1-dev/binary)
# github.com/docker/docker/docker
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/librados.a(addr_parsing.o): In function `resolve_addrs':
(.text+0x20e): warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
Created binary: /go/src/github.com/docker/docker/bundles/1.3.1-dev/binary/docker-1.3.1-dev

It would be super nice for the driver to handle creating this pool... and potentially support a --storage-opt option on the daemon to use a custom pool.
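Daemon storage options like the one suggested here arrive as "key=value" strings; a minimal Go sketch of how a driver could parse them (the `parseStorageOpts` helper is illustrative, not code from the patch, though the option names are the ones proposed in this thread):

```go
package main

import (
	"fmt"
	"strings"
)

// parseStorageOpts splits --storage-opt values of the form "key=value"
// into a map the driver can consult (e.g. "ceph.datapool=docker-data").
func parseStorageOpts(opts []string) (map[string]string, error) {
	parsed := make(map[string]string)
	for _, opt := range opts {
		kv := strings.SplitN(opt, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("invalid storage option: %q", opt)
		}
		parsed[strings.TrimSpace(kv[0])] = strings.TrimSpace(kv[1])
	}
	return parsed, nil
}

func main() {
	opts, err := parseStorageOpts([]string{"ceph.datapool=docker-data"})
	if err != nil {
		panic(err)
	}
	fmt.Println(opts["ceph.datapool"]) // docker-data
}
```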

@simon3z
Author

simon3z commented Nov 13, 2014

@cpuguy83 I just updated the patch with your suggestions: private functions and --storage-opt ceph.datapool to select the pool. I also added a TODO (that I'll handle later) to create the pool automatically. Thanks for the feedback, let me know if it works for you.
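The automatic pool creation left as a TODO could be handled by shelling out to the same CLI step shown earlier in the thread. A hypothetical sketch (not code from the patch), assuming the `ceph` binary and a configured cluster:

```go
package main

import (
	"fmt"
	"os/exec"
)

// createPoolCmd builds the command the driver could run to create its
// dedicated pool; the pg count of 100 matches the manual step above.
func createPoolCmd(pool string, pgNum int) *exec.Cmd {
	return exec.Command("ceph", "osd", "pool", "create", pool, fmt.Sprintf("%d", pgNum))
}

func main() {
	cmd := createPoolCmd("docker-data", 100)
	fmt.Println(cmd.Args) // the argv that would be executed
	// cmd.Run() would actually create the pool, but requires a host
	// configured against a ceph cluster.
}
```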

simon3z force-pushed the master branch 2 times, most recently from dd8d2a6 to 62ec005 on November 13, 2014 at 22:37
@simon3z
Author

simon3z commented Nov 13, 2014

@cpuguy83 do you know if there's anything I can do to let the patch pass drone.io? It seems that the ceph dev packages are missing. Is it something I can fix or is it a private setting of the drone.io environment?

@cpuguy83
Member

@simon3z You'd need to add librados-dev and librbd-dev to the Dockerfile

@simon3z
Author

simon3z commented Nov 14, 2014

@cpuguy83 yes, that's what I've done in the last revision, isn't it?

@cpuguy83
Member

Maybe an issue with drone...

I personally can't get it to not segfault upon start.

@simon3z
Author

simon3z commented Nov 14, 2014

@cpuguy83 I need more information.

What distro are you using? (ubuntu 14.04?)

What is the output of "ceph osd lspools"?

Can you run with gdb and check where it segfaults?

Additionally, you can contact me on IRC (fsimonce on Freenode) so that we can try to fix this ASAP. Thanks.

@yadutaf
Contributor

yadutaf commented Nov 14, 2014

I gave this PR a try and also hit a segfault. It seems to occur in a global constructor, librados::ObjectOperation::size(), before the Go runtime kicks in.

ceph osd lspools and rbd commands all work as expected.

gdb output:

(gdb) run -d
Starting program: /home/admin/docker/bundles/1.3.1-dev/binary/docker -d
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000000000408010 in _GLOBAL__sub_I__ZN8librados15ObjectOperation4sizeEv ()
#2  0x0000000000c11677 in __libc_csu_init ()
#3  0x0000000000c11109 in __libc_start_main ()
#4  0x0000000000409867 in _start ()

strace output:

execve("./bundles/1.3.1-dev/binary/docker", ["./bundles/1.3.1-dev/binary/docke"..., "-d"], [/* 14 vars */]) = 0 
uname({sys="Linux", node="dev-jt-01", ...}) = 0 
brk(0)                                  = 0x1df74f0
brk(0x1df86b0)                          = 0x1df86b0
arch_prctl(ARCH_SET_FS, 0x1df7d80)      = 0 
set_tid_address(0x1df8050)              = 16507
set_robust_list(0x1df8060, 24)          = 0 
futex(0x7fffb27735f0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, 1df7d80) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigaction(SIGRTMIN, {0xbfc940, [], SA_RESTORER|SA_SIGINFO, 0xbfd000}, NULL, 8) = 0 
rt_sigaction(SIGRT_1, {0xbfc9d0, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0xbfd000}, NULL, 8) = 0 
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0 
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0 
readlink("/proc/self/exe", "/home/admin/docker/bundles/1.3.1"..., 4096) = 60
brk(0x1e196b0)                          = 0x1e196b0
brk(0x1e1a000)                          = 0x1e1a000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/proc/self/cmdline", O_RDONLY)    = 3 
read(3, "./bundles/1.3.1-dev/binary/docke"..., 256) = 37
read(3, "", 475)                        = 0 
close(3)                                = 0 
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0} ---

@simon3z
Author

simon3z commented Nov 17, 2014

@cpuguy83 @yadutaf it seems that static linking is not possible at the moment, and it results in the SIGSEGV you are experiencing. The main problem is that at least one library used by ceph (nss3) is not available for static linking.

I tested this on ubuntu 14.04.1 with the following packages:

golang 2:1.2.1-2ubuntu1
gcc 4:4.8.2-1ubuntu6
ceph 0.80.5-0ubuntu0.14.04.1

building with:

$ AUTO_GOPATH=1 DOCKER_BUILDTAGS="exclude_graphdriver_btrfs exclude_graphdriver_devicemapper" ./project/make.sh dynbinary

And the ceph graph driver works as expected.

You could be hitting one of these issues (other sigsegv):

ceph/ceph#2937
http://tracker.ceph.com/issues/8912

The mitigation for both at the moment is to remove "rbd cache = true" from the ceph.conf.

@yadutaf
Contributor

yadutaf commented Nov 17, 2014

The dynbinary works fine. I had trouble getting past the initialization phase though, as it authenticates as "client.admin"; hard-coding my client name fixed the issue. Can you add a parameter like --storage-opt ceph.client=client.<dedicated_user> to capture this use case?

On start, it attempts to remove base-image if it was only partially created. This is helpful to recover from crashes in the init phase, but it does not attempt to unmap the image prior to deleting it, in case it was mapped on the current host. This is a case I hit when troubleshooting the authentication; manually unmapping fixed the boot. The error message was "FATA[0000] Cannot remove RBD image: base-image".

docker pull went into the "D" process state on all attempts. It eventually unblocks and moves on to the next layer after some (long) time. I did not spot similar issues on the production images, but I have had some in the past with older kernels, so it may still be related to my cluster. I still need to investigate this.

In the source code, I noticed a couple of execs of rbd. It could be interesting to replace the map/unmap execs with writes to the rbd sysfs bus. To detect the device name, I rely on a combination of udev and symlink dereferencing. This is interesting if we want to avoid a dependency on ceph-common.
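The udev-plus-symlink approach mentioned above works because ceph ships udev rules that create a stable /dev/rbd/<pool>/<image> link for each mapped image. A sketch of resolving it in Go (assumes those udev rules are installed; pool and image names are illustrative):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// rbdUdevPath returns the stable symlink ceph's udev rules create for a
// mapped image; dereferencing it yields the real /dev/rbdN device node.
func rbdUdevPath(pool, image string) string {
	return filepath.Join("/dev/rbd", pool, image)
}

func main() {
	link := rbdUdevPath("docker-data", "base-image")
	fmt.Println(link) // /dev/rbd/docker-data/base-image
	// On a host where the image is actually mapped, resolve the symlink:
	if dev, err := filepath.EvalSymlinks(link); err == nil {
		fmt.Println("mapped at", dev)
	}
}
```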

Design question: what happens if say, 2 nodes of a cluster are pulling the same image / common layers simultaneously?

Btw, could you add some logging, especially when raising an error? It would be awesome for troubleshooting ceph-related errors.

Signed-off-by: Federico Simoncelli <fsimonce@redhat.com>
@simon3z
Author

simon3z commented Nov 18, 2014

@yadutaf I added --storage-opt ceph.client as you suggested.

At the moment I haven't wired up the unmap of the half-baked base image yet. I think I'll add it when I remove the need for the devices map and devicesLock.

I'd love to get rid of "rbd map/unmap", but their logic is not present in the rbd library at the moment, and removing them would mean duplicating too much logic inside the driver. We'll have to ask ceph to provide map/unmap through the libraries.

WRT multiple hosts: the initial idea was that multiple hosts would be able to share the same images. This requires a lot of work in the docker infrastructure, so I am postponing it and focusing on providing a regular graph driver for now. For this purpose I added the storage opt "ceph.imageprefix", which can be given different values on different hosts (the hostname?) to avoid collisions (unmanaged at this time).
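The ceph.imageprefix option amounts to namespacing image names within the shared pool; a trivial sketch (the helper name and underscore separator are assumptions, not necessarily what the patch does):

```go
package main

import "fmt"

// prefixedImageName namespaces a layer id with a per-host prefix so two
// hosts sharing one pool do not collide on image names.
func prefixedImageName(prefix, id string) string {
	if prefix == "" {
		return id
	}
	return prefix + "_" + id
}

func main() {
	fmt.Println(prefixedImageName("host1", "base-image")) // host1_base-image
}
```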

Errors should be logged already. Can you give me an example of more verbose reporting? Thanks.

@cpuguy83
Member

I'm thinking a better use of this would be for implementing as a backend for volume storage. Right now "vfs" is hardcoded as the storage for volumes, but ceph might be really nice there for certain things.

@simon3z
Author

simon3z commented Nov 18, 2014

@cpuguy83 yes, that makes perfect sense; this patch provides some bindings that you could use for that as well.

@thaJeztah
Member

implementing as a backend for volume storage.

Nice thinking!

Contributor

It could be interesting to log the output of strerror(-err) to help with debugging.

@yadutaf
Contributor

yadutaf commented Nov 19, 2014

Thanks for the new options, I'll give it a try.

IMHO, the main advantage of having a ceph graph driver inside Docker is precisely host coordination / data sharing, so that pulling an image on one host makes it immediately available on any other host with access to the cluster. Use cases for this include failover and fast application scaling (but probably not saving storage space, due to the common 3-replica setup).

Regarding multi-host safety, it could rely on advisory locking (https://github.com/ceph/ceph/blob/master/src/include/rbd/librbd.h#L232).
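The coordination pattern behind advisory locking can be illustrated without librbd: try to take an exclusive per-image lock before touching the image, and back off if another host holds it. An in-memory mock of that pattern (a real driver would call librbd's rbd_lock_exclusive / rbd_unlock instead of a local map):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// imageLocker mimics exclusive advisory locks on RBD images; the map
// stands in for lock state that ceph would keep cluster-wide.
type imageLocker struct {
	mu    sync.Mutex
	owner map[string]string // image name -> host holding the lock
}

var errLocked = errors.New("image is locked by another host")

func (l *imageLocker) tryLock(image, host string) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if holder, ok := l.owner[image]; ok && holder != host {
		return errLocked
	}
	l.owner[image] = host
	return nil
}

func (l *imageLocker) unlock(image, host string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.owner[image] == host {
		delete(l.owner, image)
	}
}

func main() {
	locks := &imageLocker{owner: make(map[string]string)}
	fmt.Println(locks.tryLock("base-image", "host-a")) // <nil>
	fmt.Println(locks.tryLock("base-image", "host-b")) // image is locked by another host
	locks.unlock("base-image", "host-a")
	fmt.Println(locks.tryLock("base-image", "host-b")) // <nil>
}
```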

@cpuguy83
Member

I personally feel it's a little weird for image/container storage since these things are static and there is the registry. Having multiple hosts accessing the same /var/lib/docker would be really bad. A registry backend driver would be really nice, too.

@lebauce
Contributor

lebauce commented Nov 19, 2014

@simon3z If I remember correctly, the functions related to map/unmap have been split out into a separate library (libkrbd) in Ceph master.

@lebauce
Contributor

lebauce commented Nov 19, 2014

@cpuguy83 I wrote this little patch on top of this branch and your 'add_volumes_command' branch to be able to use the Ceph graphdriver to create volumes

lebauce@20e842b

@simon3z
Author

simon3z commented Nov 20, 2014

@yadutaf @cpuguy83 a possible next step (on top of this patch) could be to provide drivers also for the metadata that we now store in sqlite. If we did that, we could use a ceph object store to maintain the metadata as well, and then share common images between hosts (using ceph locking). Anyway, this may be too far-fetched; it's just a possibility.

@lebauce thanks, I discovered libkrbd a few days ago, but I still want this patch to be easily consumable on fedora/el6/el7/ubuntu, where that lib is not available yet. I'll definitely use libkrbd in the future.

@cpuguy83
Member

@simon3z Does this still have an issue with static compilation? That will hinder merging.

@simon3z
Author

simon3z commented Nov 20, 2014

@cpuguy83 the problem is that ubuntu does not provide static libs for some packages (e.g. nss3, ceph, etc.). This means that in the Dockerfile we would have to do what we do for devmapper: download the sources and recompile them statically.

Where can I get some background on why this is strictly required? (Maybe a previous thread, a FAQ, etc.)

From my point of view, static compilation is not an issue in the driver itself; it's simply that no distribution provides the required static libraries today.

Any suggestions on how to proceed?

@jessfraz
Contributor

Hi @simon3z,
Thanks so much for the contribution; this is really a cool idea. There are some problems with it being bound to the ceph version: people could easily run into errors when taking a binary built against the wrong version and trying to use it.
This would make a great external plugin. Ceph is cool because it is a distributed filesystem, but we would need to be sure it can work like the other existing graph drivers with any version of ceph and make it easy for users. So we are going to close this for now, but we welcome you to reopen it when it is a solution that can work with any version.
Thanks so much, and sorry for leaving this hanging for so long.

@jessfraz jessfraz closed this Dec 16, 2014
@unclejack
Contributor

As an alternative to an external plugin, this could have two parts: 1) the part which is integrated into Docker, and 2) a tool which is linked against the ceph libraries.

The tool linked against the ceph libraries could then either a) become part of ceph itself and be provided by ceph, or b) be recompiled by the user / distribution packager.

The dependency on the exact version of the library it's linked against is a problem. The way it is right now, users would always have to recompile the Docker binaries whenever they install newer ceph packages or their distribution ships another version of ceph.



7 participants