Containerization¶
Bare Metal LXC 0.9 Notes¶
Note
For the most part these steps should be automated by higher-level tools, but they are still informative to walk through.
Needed to add to /etc/fstab:
cgroup /sys/fs/cgroup cgroup defaults 0 0
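After adding this line, the cgroup hierarchy can be mounted without a reboot. A minimal sketch (needs root, so the mount is guarded here):

```shell
#!/bin/sh
# mount anything in fstab not yet mounted; a no-op if cgroup is already mounted
if [ "$(id -u)" -eq 0 ]; then
    mount -a
fi
# verify a cgroup filesystem is now visible
grep cgroup /proc/mounts || true
```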
lxc-create:
lxc-create -B lvm --vgname raid1 --lvname bunker_rootfs --fssize 32GB --fstype ext4 -n bunker -t wheezy-nsa
A copy of the wheezy-nsa template should be included in this documentation
directory in lxc_templates.
Post lxc-create, edit /var/lib/lxc/<name>/config:
<delete line: lxc.network.type = empty>
<append the below>
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br0
lxc.network.name = eth0
lxc.network.ipv4 = <full ipv4>/<range> <broadcast>
lxc.network.ipv4.gateway = <gateway_ipv4>
lxc.network.ipv6 = <ipv6_addr... no prefixlen?>
lxc.network.ipv6.gateway = <ipv6_gateway>
## Limits
#lxc.cgroup.cpu.shares = 1024
#lxc.cgroup.cpuset.cpus = 0
lxc.cgroup.memory.limit_in_bytes = 16G
lxc.cgroup.memory.memsw.limit_in_bytes = 16G
## Hooks
lxc.hook.mount = /etc/lxc/locker_mount.sh
lxc.hook.pre-start = /etc/lxc/locker_pre-start.sh
lxc.hook.post-stop = /etc/lxc/locker_post-stop.sh
Warning
The lxc.hook.mount is crucial. Without it, all lockers for all
containers will be unmounted on lxc-start.
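As a concrete illustration, the network lines in the appended config might be filled in like the following (the addresses are entirely hypothetical, taken from documentation ranges; the prefixlen question for lxc.network.ipv6 noted above is left unresolved, so the address is shown bare):

```
lxc.network.ipv4 = 192.0.2.10/24 192.0.2.255
lxc.network.ipv4.gateway = 192.0.2.1
lxc.network.ipv6 = 2001:db8::10
lxc.network.ipv6.gateway = 2001:db8::1
```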
Next, edit etc/network/interfaces in the rootfs (delete the whole eth0
block).
The mount hook will try to install a fix for /dev/ptmx permissions in
/etc/rc.local. If this doesn’t work, you will need to manually chmod 666
/dev/ptmx inside the container.
In the future, an encrypted container rootfs might be possible; see also EncFS.
“Wormhole” Locker Mounting¶
This section is a brief overview; see the nitty-gritties below.
LVM is used for the rootfs partitions of containers, and is also used to create user- and project-specific “locker” data volumes. We would like to be able to create and mount any locker into any container (modulo permissions) at run time, without restarting containers. Because the container LVM rootfs partitions are not mounted at any mountpoint in the host (bare-metal) filesystem, we must achieve this by creating shared “wormhole” bind mountpoints in the host filesystem which get copied into individual containers. Lockers are mounted normally in the host filesystem, then bind mounted into wormholes to make them appear in the containers.
On the host, wormholes live in /var/locker-mappings and are created by
lxc-start hook scripts; regular locker mountpoints are in /lockers and
should be added to /etc/fstab. In the containers, the wormhole shows up at
/lockers and individual lockers show up beneath that. No files or
directories should ever be created in the base wormhole directory in
containers, or cleanup and re-mounting will fail.
After lxc-start, the host needs to set up the actual locker mounts. For the
case of lockers with the same name as the container (eg, ‘bunker’ for the
‘bunker’ vm, or substituting _vol for _vm in the container name), the mount
hook will try to mount the locker automatically. The sandbox locker also
gets mounted for every container. Additional mounts must be configured in the
host and will not persist across LXC restarts.
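Such an additional mount can be sketched as follows, run on the host while the container is up (‘bunker’ and ‘projdata’ are hypothetical names, and the actual mounting needs root, so it is guarded):

```shell
#!/bin/sh
# hypothetical container and locker names
NAME=bunker
LOCKER=projdata
# construct the wormhole path for this container/locker pair
TARGET=/var/locker-mappings/$NAME/$LOCKER
if [ "$(id -u)" -eq 0 ] && [ -d "/lockers/$LOCKER" ]; then
    mkdir -p "$TARGET"
    # the bind mount propagates through the shared wormhole and
    # appears inside the container at /lockers/projdata
    mount --bind "/lockers/$LOCKER" "$TARGET"
fi
echo "$TARGET"
```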
Refs:
http://s3hh.wordpress.com/2011/09/22/sharing-mounts-with-a-container
http://www.ibm.com/developerworks/linux/library/l-mount-namespaces/index.html
Wormhole Nitty-Gritties¶
A relatively large amount of pain was involved in getting the wormhole mounting to work. A bit of background on Linux mount namespaces is required as context.
Mount namespaces are part of a group of namespace features added to the Linux
kernel a few years back that make LXC and other containerization schemes
possible. The canonical documentation is in the kernel tree.
Similar namespaces include hostname (“UTS”), networking, etc. In a nutshell,
each Linux process lives in some mount namespace, and the mount events
(creation, deletion) that happen in that namespace may or may not propagate to
other namespaces. By default processes are in the same namespace as their
parent process; new namespaces are created using the clone() or
unshare() functions/syscalls. When a process creates a new namespace, it
gets copies of all the mount objects in the parent namespace. By default
every mount object is “private”, which means that changes to the mount in one
namespace will not propagate to other namespaces. So, eg, normally a child
process in a new namespace can mount or umount however it likes and this will
not affect what processes in the parent namespace see. It is important to note
that mount “objects” (aka, “mounts”, or “vfsmounts”) are internal to the kernel
and aren’t simply tied to a particular file path; a mountpoint can be moved or
renamed but it’s still the same “mount object”.
One feature of mount namespaces is the ability to put mount objects in “shared” modes, in which case events (changes) taking place on that object are propagated across namespace barriers. The modes of interest are private (no propagation), shared (events are passed in both directions), and slave (on the parent side the mount is shared, on the child side slave; events from the parent propagate into the child, but changes in the child do not propagate into the parent). Some details: a mount can be shared between a parent and many children (in which case actions in one child will propagate to the parent and all other children); the shared mode must be set up in the parent before the child is created; and the slave mode must be configured only in the child (not also in the parent), and thus must be enabled after the child is created.
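These modes map onto mount(8) invocations roughly as follows (a sketch with a hypothetical path; all of these need root, so the commands are guarded):

```shell
#!/bin/sh
DIR=/srv/wormhole-demo   # hypothetical directory
if [ "$(id -u)" -eq 0 ] && [ -d "$DIR" ]; then
    # bind the directory to itself so it becomes a mount object we can retune
    mount --bind "$DIR" "$DIR"
    mount --make-private "$DIR"   # no propagation (the default)
    mount --make-shared "$DIR"    # events flow both ways across namespaces
    # in a child namespace only (after clone/unshare, before pivot_root):
    # mount --make-slave "$DIR"   # receive events from the parent, send none
fi
```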
LXC creates new mount namespaces for each container. The order of operations is
that lxc-start first does some prep work in the host (“parent”) namespace (eg,
running the pre-start hook script), then calls clone/unshare to enter a new
container namespace (“child”). Right after entering the child namespace, the
process is still in a very host-like environment, and any mounts that were
shared in the host are still mounted and shared in the child. lxc-start then
sets up container-specific mounts (if using LVM, the rootfs is mounted at the
temporary folder /usr/lib/x86_64-linux-gnu/lxc, otherwise this is done at
/var/lib/lxc/<container>/rootfs) and the ‘mount’ hook is run. At this time
(not sure about the exact order) a bunch of other namespace stuff happens, eg
the hostname is changed and networking for the container is configured. Then
the pivot_root command is called, which basically moves the base root
(/) for the container to the rootfs mount point, and any unused mountpoints
are unmounted in the child. Lastly the ‘start’ hook is run (the host filesystem
is now gone/inaccessible, so the ‘start’ hook must be stored in the child
rootfs), any last mounts (eg, sysfs, devpts, proc) are mounted, and finally the
container’s ‘init’ script is run.
The pivot_root moment is what causes so much pain: any mounts in the host
namespace that are shared will get (recursively) unmounted at this moment. This
doesn’t affect, eg, the host’s rootfs, or mounts to the /lockers/
directory, because those mounts are private. It does affect all the other
wormhole mounts in /var/locker-mappings/ though, because those are (and
must remain) shared. Boo! To work around this, the child process must carefully
mark all the shared mounts as slave mode before the pivot_root is run. This
prevents the umount events from propagating to the host or other namespaces.
There were some initial problems getting this working because of bugs in the mount
program (part of the util-linux package in Debian) which prevent the
--make-rslave mode from actually applying recursively. The workaround was
to have a shell script manually find all locker mounts and individually mark
each one as slave. This is implemented in the ‘mount’ LXC hook, which is
operating in the child mount namespace but before pivot_root happens. An easy
gotcha here is that every LXC container must always include this mount
script; running lxc-start just once with a fresh container which doesn’t have
this hook configured (in its config file) will result in all locker mounts
being unmounted. Bummer!
There has been much discussion of this style of LXC use, and hopefully improvements/extensions will be made to lxc-start to make all this easier.
LXC Hooks¶
/etc/lxc/locker_pre-start.sh¶
Before every lxc-start, we need to ensure that the mapping directory exists and has its special mount configured. We achieve this by creating it fresh every time (cleaning up, or bailing out, if it already existed) using an LXC hook.
#!/bin/bash
# fail fast on any errors; verbose output
set -x
set -e

NAME=$1
echo "Entering pre-start hook for $NAME"

# bail if no container name was given
[ -n "$NAME" ] || exit 1

# If the mount point already exists, trouble...
if [ -e /var/locker-mappings/$1 ]; then
    # if not a directory, bail
    [ -d /var/locker-mappings/$1 ] || exit 1
    # try to clean up, eg from an unclean shutdown
    rmdir /var/locker-mappings/$1/* || true
    umount /var/locker-mappings/$1 || true
    # finally try to remove dir, else bail out
    rmdir /var/locker-mappings/$1 || exit 1
fi

mkdir -p /var/locker-mappings/$1

# bind to self so this becomes a mount object whose propagation we can set;
# there should be no sub-mounts, so the 'r' versions below do nothing extra
mount --rbind /var/locker-mappings/$1 /var/locker-mappings/$1
mount --make-runbindable /var/locker-mappings/$1
mount --make-rshared /var/locker-mappings/$1
touch /var/locker-mappings/$1/.stamp_created

# mount sandbox locker by default, if it exists
if [ -d /lockers/sandbox ]; then
    mkdir /var/locker-mappings/$1/sandbox
    mount --bind /lockers/sandbox /var/locker-mappings/$1/sandbox
fi

# try to mount the _vol locker if it exists; rename _vm -> _vol if needed
VOLNAME=$(echo $1 | sed "s/_vm$/_vol/")
if [ -n "$VOLNAME" -a -d "/lockers/$VOLNAME" ]; then
    mkdir /var/locker-mappings/$1/$VOLNAME
    mount --bind /lockers/$VOLNAME /var/locker-mappings/$1/$VOLNAME
fi

echo "Done creating mount points." > /dev/stderr
/etc/lxc/locker_mount.sh¶
#!/bin/bash
# fail fast on errors; verbose output
set -e
set -x

NAME=$1
PREFIX=/usr/lib/x86_64-linux-gnu/lxc

# echoing to stderr is sometimes necessary to punch through to logging?
echo "Entering mount hook for $NAME" > /dev/stderr

# install the ptmx fix as the container's rc.local
cp /etc/lxc/fixptmx.sh $PREFIX/etc/rc.local || true

# nullglob means "expand to nothing if nothing matches"
shopt -s nullglob

# slave all locker mounts so we don't clobber them with pivot_root
# IMPORTANT: --make-rslave IS BROKEN (?!?!), which is why each mount
# is slaved individually instead of recursively
for m in /var/locker-mappings/*/*; do
    mount --make-slave $m || true
done
for m in /var/locker-mappings/*; do
    mount --make-slave $m || true
done
touch /var/locker-mappings/$1/.stamp_slaved

# mount our wormhole
mkdir -p $PREFIX/lockers
mount --rbind /var/locker-mappings/$1 $PREFIX/lockers
# ... and make it a slave mount. this is defensive, might not be necessary
mount --make-rslave $PREFIX/lockers
touch /var/locker-mappings/$1/.stamp_mounted

echo "Done with bind mount."
/etc/lxc/fixptmx.sh¶
After the container starts, we need to fix permissions on /dev/ptmx to
allow pseudo-terminals to be created by regular users in the container.
#!/bin/sh
echo "Setting global r/w permissions on /dev/ptmx"
chmod 666 /dev/ptmx
ls -l /dev/ptmx > /dev/stderr
/etc/lxc/locker_post-stop.sh¶
After lxc-stop, we need to unmount everything and rmdir the empty directories (DO NOT rm -rf in case real data was somehow still mounted).
#!/bin/bash
NAME=$1
echo "Entering post-stop hook for $NAME"
# try to ensure everything is unmounted already
umount /var/locker-mappings/$1/* &> /dev/null || true
rmdir /var/locker-mappings/$1/* &> /dev/null || true
umount /var/locker-mappings/$1 || true
rm /var/locker-mappings/$1/.stamp* || true
rm /var/locker-mappings/$1/*.sock &> /dev/null || true
rmdir /var/locker-mappings/$1
echo "Done."
docker (aka ops)¶
The “ops” container is special in that it hosts many communal services using “docker”:
- runs Ubuntu 14.04 (not Debian) to be more docker-friendly
- also hosts gitolite (not in a container for now)
Initially we experimented with shipyard as a frontend/proxy, but currently just use raw docker commands and nginx for proxying.
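The raw-docker-plus-nginx pattern looks roughly like the following (a sketch; the image, container name, and ports are hypothetical, and the nginx side is shown as a comment):

```shell
#!/bin/sh
# publish a service container only on localhost (hypothetical image/name/port)
if command -v docker >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    docker run -d --name webapp -p 127.0.0.1:8081:80 nginx
fi
# then proxy it from the host's nginx, eg:
#   location /webapp/ { proxy_pass http://127.0.0.1:8081/; }
```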
Gotchas and Misc Helpers¶
The /dev/ptmx device must have read/write permissions for users or mosh
and screen will not work within containers. This device is configured by
the lxc-start program after other mounting takes place (so it can’t be
configured from the mount hook), and after the ‘start’ hook is executed, but
before the container’s init starts. It’s a bit unclear how to best set the
permissions in a safe and generic way (fstab entries somewhere? LXC’s autodev
option?), but as one workaround /etc/rc.local can be overwritten with a
script that does the appropriate chmod.
To list the running/not-running status and PIDs of all containers:
for l in $(lxc-ls); do echo "name: $l"; lxc-info -n $l; done