Containerization

Bare Metal LXC 0.9 Notes

Note

For the most part these directions should be automated by higher-level tools, but they are still informative.

Needed to add to /etc/fstab:

cgroup  /sys/fs/cgroup  cgroup  defaults  0   0

lxc-create:

lxc-create -B lvm --vgname raid1 --lvname bunker_rootfs --fssize 32GB --fstype ext4 -n bunker -t wheezy-nsa

A copy of the wheezy-nsa template should be included in this documentation directory in lxc_templates.

Post lxc-create, edit /var/lib/lxc/<name>/config:

<delete line: lxc.network.type = empty>
<append the below>
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br0
lxc.network.name = eth0
lxc.network.ipv4 = <full ipv4>/<range> <broadcast>
lxc.network.ipv4.gateway = <gateway_ipv4>
lxc.network.ipv6 = <ipv6_addr... no prefixlen?>
lxc.network.ipv6.gateway = <ipv6_gateway>
## Limits
#lxc.cgroup.cpu.shares                  = 1024
#lxc.cgroup.cpuset.cpus                 = 0
lxc.cgroup.memory.limit_in_bytes       = 16G
lxc.cgroup.memory.memsw.limit_in_bytes = 16G
## Hooks
lxc.hook.mount = /etc/lxc/locker_mount.sh
lxc.hook.pre-start = /etc/lxc/locker_pre-start.sh
lxc.hook.post-stop = /etc/lxc/locker_post-stop.sh

Warning

The lxc.hook.mount hook is crucial. Without it, all lockers for all containers will be unmounted on lxc-start.

Next, edit etc/network/interfaces in the rootfs (delete the whole eth0 block).

The mount hook will try to install a fix for /dev/ptmx permissions in /etc/rc.local. If this doesn’t work, you will need to manually chmod 666 /dev/ptmx inside the container.

In the future, encrypted container rootfs might be possible. See also EncFS or:

https://help.ubuntu.com/community/EncryptedFSOnLVMOnRAID

Containers with Shared Networking

Warning

Great care must be taken to ensure that containers configured this way do not clobber each other or the bare-metal server!

In particular, make sure containers like this don’t listen on 22 (SSH; move to a different port), 80/443 (move web services to other ports), or any email ports (use nullmailer).

There are a small number of special containers which do not get their own IP addresses; instead they share with the bare-metal host.

To configure a container this way, edit its LXC config file and comment out all lines starting with lxc.network; the default configuration is what we want.
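For example, the network section of such a container’s config would end up entirely commented out (a sketch; the exact lines depend on what the container had before):

```
## Shared networking: all lxc.network lines commented out, so the
## container shares the bare-metal host's network stack.
#lxc.network.type = veth
#lxc.network.flags = up
#lxc.network.link = br0
#lxc.network.name = eth0
```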

“Wormhole” Locker Mounting

This section is a brief overview; see the nitty-gritties below.

LVM is used for the rootfs partitions of containers, and also used to create user- and project-specific “locker” data volumes. We would like to be able to create and mount any locker into any container (modulo permissions) at run time without restarting containers. Because the container LVM rootfs partitions are not mounted to any mountpoint in the host (bare-metal) filesystem, we must achieve this by creating shared “wormhole” bind mountpoints in the host filesystem which get copied into individual containers. Lockers are mounted normally in the host filesystem, then bind mounted into wormholes to make them appear in the containers.

On the host, wormholes live in /var/locker-mappings and are created by lxc-start hook scripts; regular locker mountpoints are in /lockers and should be added to /etc/fstab. In the containers, the wormhole shows up at /lockers and individual lockers show up beneath that. No files or directories should ever be created in the base wormhole directory in containers, or cleanup and re-mounting will fail.
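As a concrete sketch of this layout (‘bunker’ is an example container name; the sandbox locker is shown since every container gets it):

```
host:       /lockers/sandbox                     normal locker mount (in /etc/fstab)
host:       /var/locker-mappings/bunker/         shared wormhole, created by hooks
host:       /var/locker-mappings/bunker/sandbox  bind mount of the locker
container:  /lockers/                            the wormhole, as seen inside
container:  /lockers/sandbox                     the locker, as seen inside
```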

After lxc-start, the host needs to set up the actual locker mounts. For lockers with the same name as the container (eg, ‘bunker’ for the ‘bunker’ vm, or substituting _vol for _vm in the container name), the mount hook will try to mount the locker automatically. The sandbox locker also gets mounted for every container. Additional mounts must be configured in the host and will not persist across LXC restarts.
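The container-name-to-locker-name convention described above can be sketched as a tiny function (a reimplementation of the sed substitution used in the pre-start hook; ‘bunker’ is just an example name):

```shell
#!/bin/sh
# Map a container name to its default locker name: a trailing "_vm"
# becomes "_vol"; any other name maps to itself.
name_to_locker() {
    echo "$1" | sed 's/_vm$/_vol/'
}

name_to_locker bunker      # -> bunker   (locker shares the container's name)
name_to_locker bunker_vm   # -> bunker_vol
```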

Refs:

http://s3hh.wordpress.com/2011/09/22/sharing-mounts-with-a-container

http://www.ibm.com/developerworks/linux/library/l-mount-namespaces/index.html

Wormhole Nitty-Gritties

A relatively large amount of pain was involved in getting the wormhole mounting to work. A bit of background on linux mount namespaces is required as context.

Mount namespaces are part of a group of namespace features added to the Linux kernel a few years back that make LXC and other containerization schemes possible. The canonical documentation is in the kernel tree. Similar namespaces include hostname (“UTS”), networking, etc. In a nutshell, each linux process lives in some mount namespace, and the mount events (creation, deletion) that happen in that namespace may or may not propagate to other namespaces. By default processes are in the same namespace as their parent process; new namespaces are created using the clone() or unshare() functions/syscalls. When a process creates a new namespace, it gets copies of all the mount objects in the parent namespace. By default every mount object is “private”, which means that changes to the mount in one namespace will not propagate to other namespaces. So, eg, normally a child process in a new namespace can mount or umount however it likes and this will not affect what processes in the parent namespace see. It is important to note that mount “objects” (aka, “mounts”, or “vfsmounts”) are internal to the kernel and aren’t simply tied to a particular file path; a mountpoint can be moved or renamed but it’s still the same “mount object”.
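On reasonably modern kernels, the namespace a process lives in can be observed from userspace: each process has namespace handles under /proc/&lt;pid&gt;/ns/ (a quick unprivileged check; nothing here is specific to LXC):

```shell
#!/bin/sh
# Every process exposes its mount namespace as a symlink in /proc.
# Two processes are in the same mount namespace exactly when these
# links resolve to the same mnt:[inode] identifier.
readlink /proc/self/ns/mnt
readlink /proc/$$/ns/mnt   # $$ is this shell; same namespace, same id
```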

One feature of mount namespaces is the ability to put mount objects in “shared” modes, in which case events (changes) taking place on that object are propagated across namespace barriers. The modes of interest are private (no propagation), shared (events are passed in both directions), and slave (on the parent side the mount is shared, on the child side slave; events from the parent propagate into the child, but changes in the child do not propagate into the parent). Some details: a mount can be shared between a parent and many children (in which case actions in one child will propagate to the parent and all other children); the shared mode must be set up in the parent before the child is created; and the slave mode must be configured only in the child (not also in the parent), and thus must be enabled after the child is created.
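The propagation state of existing mounts can be inspected without root: the flags appear in the optional fields of /proc/self/mountinfo, and util-linux’s findmnt can render them as a column (a hedged sketch; output varies by system):

```shell
#!/bin/sh
# Propagation flags appear as optional fields in mountinfo:
# "shared:N" marks shared mounts, "master:N" marks slave mounts,
# and private mounts carry neither.
head -n 3 /proc/self/mountinfo
# findmnt (util-linux) shows the same information more readably:
findmnt -o TARGET,PROPAGATION 2>/dev/null | head -n 5 || true
```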

LXC creates new mount namespaces for each container. The order of operations is that lxc-start first does some prep work in the host (“parent”) namespace (eg, running the pre-start hook script), then calls clone/unshare to enter a new container namespace (“child”). Right after entering the child namespace, the process is still in a very host-like environment, and any mounts that were shared in the host are still mounted and shared in the child. lxc-start then sets up container-specific mounts (if using LVM, the rootfs is mounted at the temporary folder /usr/lib/x86_64-linux-gnu/lxc, otherwise this is done at /var/lib/lxc/<container>/rootfs) and the ‘mount’ hook is run. At this time (not sure about the exact order) a bunch of other namespace stuff happens, eg the hostname is changed and networking for the container is configured. Then the pivot_root command is called, which basically moves the base root (/) for the container to the rootfs mount point, and any unused mountpoints are unmounted in the child. Lastly the ‘start’ hook is run (the host filesystem is now gone/inaccessible, so the ‘start’ hook must be stored in the child rootfs), any last mounts (eg, sysfs, devpts, proc) are mounted, and finally the container’s ‘init’ script is run.

The pivot_root moment is what causes so much pain: any mounts in the host namespace that are shared will get (recursively) unmounted at this moment. This doesn’t affect, eg, the host’s rootfs, or mounts to the /lockers/ directory, because those mounts are private. It does affect all the other wormhole mounts in /var/locker-mappings/ though, because those are (and must remain) shared. Boo! To work around this, the child process must carefully mark all the shared mounts as slave mode before the pivot_root is run. This prevents the umount events from propagating to the host or other namespaces.

There were some initial problems getting this working because of bugs in the mount program (part of the util-linux package in debian) which prevent the --make-rslave mode from actually applying recursively. The workaround was to have a shell script manually find all locker mounts and individually mark each one as slave. This is implemented in the ‘mount’ LXC hook, which operates in the child mount namespace but before pivot_root happens. An easy gotcha here is that every LXC container must always include this mount script; running lxc-start just once with a fresh container which doesn’t have this hook configured (in its config file) will result in all locker mounts being unmounted. Bummer!

There has been much discussion of this style of LXC use, and hopefully improvements/extensions will be made to lxc-start to make all this easier.

LXC Hooks

/etc/lxc/locker_pre-start.sh

Before every lxc-start, we need to ensure that the mapper directory exists and has the special mount configured. We achieve this by creating it fresh every time (cleaning up, or bailing, if it already exists) using an LXC hook.

#!/bin/bash

# fail fast on any errors; verbose output
set -x
set -e

NAME=$1

echo "Entering pre-start hook for $NAME"

# bail if no container name was given
[ -n "$NAME" ] || exit 1

# If the mount point already exists, trouble...
if [ -e "/var/locker-mappings/$1" ]; then
    # if not a directory, bail
    [ -d "/var/locker-mappings/$1" ] || exit 1
    # try to clean up, eg from an unclean shutdown
    rmdir "/var/locker-mappings/$1"/* || true
    umount "/var/locker-mappings/$1" || true
    # finally try to remove the dir, else bail out
    rmdir "/var/locker-mappings/$1" || exit 1
fi

mkdir -p "/var/locker-mappings/$1"
# bind to self
# there should be no sub-mounts, so the 'r' versions below do nothing extra
mount --rbind "/var/locker-mappings/$1" "/var/locker-mappings/$1"
mount --make-runbindable "/var/locker-mappings/$1"
mount --make-rshared "/var/locker-mappings/$1"
touch "/var/locker-mappings/$1/.stamp_created"

# mount the sandbox locker by default, if it exists
if [ -d /lockers/sandbox ]; then
    mkdir "/var/locker-mappings/$1/sandbox"
    mount --bind /lockers/sandbox "/var/locker-mappings/$1/sandbox"
fi

# try to mount the _vol locker if it exists; rename _vm -> _vol if needed
VOLNAME=`echo "$1" | sed "s/_vm$/_vol/"`
if [ -n "$VOLNAME" -a -d "/lockers/$VOLNAME" ]; then
    mkdir "/var/locker-mappings/$1/$VOLNAME"
    mount --bind "/lockers/$VOLNAME" "/var/locker-mappings/$1/$VOLNAME"
fi

echo "Done creating mount points." > /dev/stderr

/etc/lxc/locker_mount.sh

#!/bin/bash

# fail fast on errors; verbose output
set -e
set -x

NAME=$1
PREFIX=/usr/lib/x86_64-linux-gnu/lxc

# echoing to stderr is sometimes necessary to punch through to logging?
echo "Entering mount hook for $NAME" > /dev/stderr

# set up ptmx fix 'start' hook
cp /etc/lxc/fixptmx.sh $PREFIX/etc/rc.local || true

# nullglob means "don't return glob if nothing matches"
shopt -s nullglob
# slave all locker mounts so we don't clobber them with pivot_root
# IMPORTANT: --make-rslave IS BROKEN (?!?!), which is why this is done
for m in /var/locker-mappings/*/*; do
    mount --make-slave "$m" || true
done
for m in /var/locker-mappings/*; do
    mount --make-slave "$m" || true
done
touch /var/locker-mappings/$1/.stamp_slaved

# mount our wormhole
mkdir -p $PREFIX/lockers
mount --rbind /var/locker-mappings/$1 $PREFIX/lockers
# ... and make it a slave mount. this is defensive, might not be necessary
mount --make-rslave $PREFIX/lockers
touch /var/locker-mappings/$1/.stamp_mounted

echo "Done with bind mount."

/etc/lxc/fixptmx.sh

After the container starts, we need to fix permissions on /dev/ptmx to allow pseudo-terminals to be created by regular users in the container.

#!/bin/sh

echo "Setting global r/w permissions on /dev/ptmx"
chmod 666 /dev/ptmx
ls -l /dev/ptmx > /dev/stderr

/etc/lxc/locker_post-stop.sh

After lxc-stop, we need to unmount everything and rmdir the empty directories (DO NOT rm -rf in case real data was somehow still mounted).

#!/bin/bash

NAME=$1

echo "Entering post-stop hook for $NAME"

# These try to ensure everything is unmounted already; failures are ignored
umount /var/locker-mappings/$1/* &> /dev/null || true
rmdir /var/locker-mappings/$1/* &> /dev/null || true
umount /var/locker-mappings/$1 || true
rm /var/locker-mappings/$1/.stamp* || true
rm /var/locker-mappings/$1/*.sock &> /dev/null || true
rmdir /var/locker-mappings/$1

echo "Done."

docker (aka ops)

The “ops” container is special in that it hosts many communal services using “docker”.

  • runs ubuntu 14.04 (not debian) to be more docker-friendly
  • also hosts gitolite (not in a container for now)

Initially we experimented with shipyard as a frontend/proxy, but currently just use raw docker commands and nginx for proxying.

ulimits

In /etc/security/limits.conf on ops, to prevent problems with docker:

@docker         soft    nofile 4096
@docker         hard    nofile 10240
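To verify the limits took effect, a user in the docker group can check their shell’s file-descriptor limits (a quick check; the expected values assume the limits.conf entries above are in place):

```shell
#!/bin/sh
# Print this shell's open-file limits; for members of the docker
# group these should match the limits.conf entries (4096 / 10240).
ulimit -Sn   # soft nofile limit
ulimit -Hn   # hard nofile limit
```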

Gotchas and Misc Helpers

The /dev/ptmx device must have read/write permissions for users or mosh and screen will not work within containers. This device is configured by the lxc-start program after other mounting takes place (so it can’t be configured from the mount hook), and after the ‘start’ hook is executed, but before the container’s init starts. It’s a bit unclear how best to set the permissions in a safe and generic way (fstab entries somewhere? LXC’s autodev option?), but as one workaround /etc/rc.local can be overwritten with a script that does the appropriate chmod.
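A quick way to check whether the fix has been applied inside a container (assumes GNU coreutils stat):

```shell
#!/bin/sh
# Show the octal mode and ownership of /dev/ptmx; after the rc.local
# fix the mode should be 666.
stat -c '%a %U:%G %n' /dev/ptmx
```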

To list the running/not-running status and PIDs of all containers:

for l in `lxc-ls`; do echo "name: $l"; lxc-info -n $l; done