Skip to content

Conversation

@kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented Jul 12, 2017

Since the commit d88fe44 ("Add support for sharing /dev/shm/ and
/dev/mqueue between containers") container's /dev/shm is mounted on the
host first, then bind-mounted inside the container. This is done that
way in order to be able to share this container's IPC namespace
(and the /dev/shm mount point) with another container.

Unfortunately, this functionality breaks container checkpoint/restore
(even if IPC is not shared). Since /dev/shm is an external mount, its
contents is not saved by criu checkpoint, and so upon restore any
application that tries to access data under /dev/shm is severily
disappointed as the data is not there (which usually results in a fatal crash). For example, see
https://github.com/xemul/criu/issues/300 and https://github.com/xemul/criu/issues/320.

This commit solves the issue by introducing new IPC modes for containers
(in addition to host and container:ID). The new modes are:

  • shareable: enables sharing this container's IPC with others
    (this used to be a default behavior)

  • private: disables sharing this container's IPC.

In private mode, container's /dev/shm is truly mounted inside the
container, without any bind-mounting from the host, which solves the
issue for checkpoint/restore.

Surely, to ultimately solve it, we'd need to make private the default
mode, but it breaks the backward compatibility. So, let's make the
default container IPC mode per-daemon configurable (with built-in
default set to shareable for now). The default can be changed
either via a daemon CLI option (--default-shm-mode) or configuration
file parameter of the same name.

Surely, one can only set either shareable or private IPC modes as a
daemon default. Other modes (like 'host' or 'container') do not make
sense in this context and are therefore not allowed.

Some other changes this patch introduces are:

  1. A mount for /dev/shm is added to default OCI Linux spec.

  2. IpcMode.Valid() is simplified to remove duplicated code that parsed
    'container:ID' form. Note the old version used to check that ID does
    not contain a semicolon -- this is no longer the case (tests are
    modified accordingly). The motivation is we should either do a
    proper check for container ID validity, or don't check it at all
    (since it is checked in other places anyway). I chose the latter.

  3. IPC mode tests are modified to add checks for newly added values.

I did some manual testing, and it appears to be working.

TODO: write integration tests
Test cases come in the second patch, and they cover the scenarios I can think of.

@kolyshkin kolyshkin force-pushed the ipcmode branch 2 times, most recently from 42827a6 to ea5b4eb Compare July 20, 2017 17:45
@kolyshkin kolyshkin changed the title [WIP] [RFC] Implement private and shareable ipc modes Implement private and shareable ipc modes Jul 20, 2017
@kolyshkin
Copy link
Contributor Author

kolyshkin added a commit to kolyshkin/cli that referenced this pull request Jul 20, 2017
This builds (and depends) on moby/moby#34087

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kolyshkin - I left some comments (but interested in other reviews as well)

Wondering would this also reduce the chance of /shm "device in use" problems?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the default be set here? ("shareable" - perhaps a constant defaultIPCMode)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps mention the possible options as well, given that they're limited;

`Default mode for containers ipc ("debug" | "info")`

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, there's already a couple of these, but we should try to avoid the API being imported into the daemon package. The daemon should not be aware of the API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd change the wording a bit here.. something like;

"IPC mode (%v) is not supported as default value. Select either 'shareable' or 'private')".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look "right". The IpcMode should be stored in the container's hostconfig, not requested from the daemon. When creating the container, we should;

  • check if docker run --ipc=.. is set
  • if not, get the daemon's default, and set it to that value

Doing so prevents the container's IPC-mode to be updated if the daemon defaults are changed and the container restarted. Possibly have a similar function as runconfig.SetDefaultNetModeIfBlank(), and apply the defaults in daemon.create()



Unrelated to this PR, but looks like we should really evaluate where defaults should be set. I noticed the SetDefaultNetModeIfBlank() method is called in three different locations(!);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as above; at this point, c.HostConfig.IpcMode should already be set to a value

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the IpcMode is set during create, this will have to be updated to (e.g.) containertypes.IpcMode(config.IpcMode).Valid(). At this point we should not be checking the daemon defaults, but the container's setting.

verifyDefaultIpcMode() could still be needed, but only when verifying the daemon configuration (i.e. when starting the daemon and/or (re)loading the daemon configuration

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same remarks apply here as I made in container_operations_unix.go. The container should already have this configuration set at this point

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should also be set earlier, so that it's baked into the container's config

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c.HostConfig.IpcMode.IsPrivate() (related to earlier comments)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would it be more readable to use a struct for these? The second loop for parsing container names could be combined in that case;

func TestIpcModeTest(t *testing.T) {
	ipcModes := map[container.IpcMode]struct {
		private   bool
		host      bool
		container bool
		shareable bool
		valid     bool
		ctrName   string
	}{
		"":                      {valid: true},
		"private":               {private: true, valid: true},
		"something":             {},
		"something:weird":       {ctrName: "weird"},
		":weird":                {ctrName: "weird"},
		"host":                  {host: true, valid: true},
		"container":             {},
		"container:name":        {container: true, ctrName: "name", valid: true},
		"container:name1:name2": {container: true, ctrName: "name1:name2", valid: true},
		"container:":            {container: true, valid: true},
		"shareable":             {shareable: true, valid: true},
	}
	for ipcMode, state := range ipcModes {
		assert.Equal(t, state.private, ipcMode.IsPrivate(), "IpcMode.IsPrivate() parsing failed for %q", ipcMode)
		assert.Equal(t, state.host, ipcMode.IsHost(), "IpcMode.IsHost()  parsing failed for %q", ipcMode)
		assert.Equal(t, state.container, ipcMode.IsContainer(), "IpcMode.IsContainer()  parsing failed for %q", ipcMode)
		assert.Equal(t, state.shareable, ipcMode.IsShareable(), "IpcMode.IsShareable()  parsing failed for %q", ipcMode)
		assert.Equal(t, state.valid, ipcMode.Valid(), "IpcMode.Valid()  parsing failed for %q", ipcMode)
		assert.Equal(t, state.ctrName, ipcMode.Container(), "IpcMode.Container() parsing failed for %q", ipcMode)
	}
}

@rhatdan
Copy link
Contributor

rhatdan commented Jul 22, 2017

The way we have gone after this is to add a RUNC hook oci-umount which cleans up leaked mount points into a container.

https://github.com/projectatomic/oci-umount

@kolyshkin kolyshkin force-pushed the ipcmode branch 4 times, most recently from 1db6766 to aa438d4 Compare July 25, 2017 05:03
@kolyshkin kolyshkin force-pushed the ipcmode branch 5 times, most recently from ab5cac0 to e3be977 Compare July 25, 2017 23:46
@kolyshkin
Copy link
Contributor Author

@thaJeztah thank you for review! I think I addressed all your comments in the updated patches, can you please take another look?

@thaJeztah
Copy link
Member

Thank you @kolyshkin ! I gave it a quick glance, and definitely looks better; I'll do a full review soon.

@kolyshkin
Copy link
Contributor Author

kolyshkin commented Aug 14, 2017

It could do with a test (make sure it has no mount at least).

Done. Thank you for reviewing!

@kolyshkin
Copy link
Contributor Author

Windows test failure seems to be caused by some tests (or execution environment) race, not related to the patchset. Will restart

@thaJeztah
Copy link
Member

All green now; merging 👍

@thaJeztah thaJeztah merged commit bb6fc72 into moby:master Aug 14, 2017
andrewhsu pushed a commit to docker-archive/docker-ce that referenced this pull request Aug 14, 2017
It was noted[1] that container's HostConfig.ShmSize, if not set, should be
initialized to daemon default value during container creation.

In fact, it is already done in daemon.adaptContainerSettings, so we can use
value from container.HostConfig directly.

[1] moby/moby#34087 (comment)

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Upstream-commit: 0fb1fb1
Component: engine
andrewhsu pushed a commit to docker-archive/docker-ce that referenced this pull request Aug 14, 2017
Since the commit d88fe44 ("Add support for sharing /dev/shm/ and
/dev/mqueue between containers") container's /dev/shm is mounted on the
host first, then bind-mounted inside the container. This is done that
way in order to be able to share this container's IPC namespace
(and the /dev/shm mount point) with another container.

Unfortunately, this functionality breaks container checkpoint/restore
(even if IPC is not shared). Since /dev/shm is an external mount, its
contents is not saved by `criu checkpoint`, and so upon restore any
application that tries to access data under /dev/shm is severily
disappointed (which usually results in a fatal crash).

This commit solves the issue by introducing new IPC modes for containers
(in addition to 'host' and 'container:ID'). The new modes are:

 - 'shareable':	enables sharing this container's IPC with others
		(this used to be the implicit default);

 - 'private':	disables sharing this container's IPC.

In 'private' mode, container's /dev/shm is truly mounted inside the
container, without any bind-mounting from the host, which solves the
issue.

While at it, let's also implement 'none' mode. The motivation, as
eloquently put by Justin Cormack, is:

> I wondered a while back about having a none shm mode, as currently it is
> not possible to have a totally unwriteable container as there is always
> a /dev/shm writeable mount. It is a bit of a niche case (and clearly
> should never be allowed to be daemon default) but it would be trivial to
> add now so maybe we should...

...so here's yet yet another mode:

 - 'none':	no /dev/shm mount inside the container (though it still
		has its own private IPC namespace).

Now, to ultimately solve the abovementioned checkpoint/restore issue, we'd
need to make 'private' the default mode, but unfortunately it breaks the
backward compatibility. So, let's make the default container IPC mode
per-daemon configurable (with the built-in default set to 'shareable'
for now). The default can be changed either via a daemon CLI option
(--default-shm-mode) or a daemon.json configuration file parameter
of the same name.

Note one can only set either 'shareable' or 'private' IPC modes as a
daemon default (i.e. in this context 'host', 'container', or 'none'
do not make much sense).

Some other changes this patch introduces are:

1. A mount for /dev/shm is added to default OCI Linux spec.

2. IpcMode.Valid() is simplified to remove duplicated code that parsed
   'container:ID' form. Note the old version used to check that ID does
   not contain a semicolon -- this is no longer the case (tests are
   modified accordingly). The motivation is we should either do a
   proper check for container ID validity, or don't check it at all
   (since it is checked in other places anyway). I chose the latter.

3. IpcMode.Container() is modified to not return container ID if the
   mode value does not start with "container:", unifying the check to
   be the same as in IpcMode.IsContainer().

3. IPC mode unit tests (runconfig/hostconfig_test.go) are modified
   to add checks for newly added values.

[v2: addressed review at moby/moby#34087 (review)]
[v3: addressed review at moby/moby#34087 (review)]
[v4: addressed the case of upgrading from older daemon, in this case
     container.HostConfig.IpcMode is unset and this is valid]
[v5: document old and new IpcMode values in api/swagger.yaml]
[v6: add the 'none' mode, changelog entry to docs/api/version-history.md]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Upstream-commit: 7120976
Component: engine
@kolyshkin kolyshkin deleted the ipcmode branch August 14, 2017 17:19
andrewhsu pushed a commit to docker-archive/docker-ce that referenced this pull request Aug 30, 2017
This builds (and depends) on moby/moby#34087

Version 2:
 - remove --ipc argument validation (it is now done by daemon)
 - add/document 'none' value
 - docs/reference/run.md: add a table with better modes description
 - dockerd(8) typesetting fixes

Version 3:
 - remove ipc mode tests from cli/command/container/opts_test.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Upstream-commit: 9285db67525c4dce4308903dcf80627f693388dc
Component: cli
riyazdf pushed a commit to riyazdf/cli that referenced this pull request Sep 18, 2017
This builds (and depends) on moby/moby#34087

Version 2:
 - remove --ipc argument validation (it is now done by daemon)
 - add/document 'none' value
 - docs/reference/run.md: add a table with better modes description
 - dockerd(8) typesetting fixes

Version 3:
 - remove ipc mode tests from cli/command/container/opts_test.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
riyazdf pushed a commit to riyazdf/cli that referenced this pull request Sep 18, 2017
This builds (and depends) on moby/moby#34087

Version 2:
 - remove --ipc argument validation (it is now done by daemon)
 - add/document 'none' value
 - docs/reference/run.md: add a table with better modes description
 - dockerd(8) typesetting fixes

Version 3:
 - remove ipc mode tests from cli/command/container/opts_test.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
riyazdf pushed a commit to riyazdf/cli that referenced this pull request Sep 18, 2017
This builds (and depends) on moby/moby#34087

Version 2:
 - remove --ipc argument validation (it is now done by daemon)
 - add/document 'none' value
 - docs/reference/run.md: add a table with better modes description
 - dockerd(8) typesetting fixes

Version 3:
 - remove ipc mode tests from cli/command/container/opts_test.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
kolyshkin added a commit to kolyshkin/kubernetes that referenced this pull request Nov 28, 2018
Currently, Docker make IPC of every container shareable by default,
which means other containers can join it's IPC namespace. This is
implemented by creating a tmpfs mount on the host, and then
bind-mounting it to a container's /dev/shm. Other containers
that want to share the same IPC (and the same /dev/shm) can also
bind-mount the very same host's mount.

Now, since moby/moby@7120976d7
(moby/moby#34087) there is a possiblity
to have per-daemon default of having "private" IPC mode,
meaning all the containers created will have non-shareable
/dev/shm.

For shared IPC to work in the above scenario, we need to
explicitly make the "pause" container's IPC mode as "shareable",
which is what this commit does.

To test: add "default-ipc-mode: private" to /etc/docker/daemon.json,
try using kube as usual, there should be no errors.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
silvin-lubecki pushed a commit to silvin-lubecki/engine-extract that referenced this pull request Mar 16, 2020
It was noted[1] that container's HostConfig.ShmSize, if not set, should be
initialized to daemon default value during container creation.

In fact, it is already done in daemon.adaptContainerSettings, so we can use
value from container.HostConfig directly.

[1] moby/moby#34087 (comment)

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Upstream-commit: 0fb1fb1
Component: engine
silvin-lubecki pushed a commit to silvin-lubecki/engine-extract that referenced this pull request Mar 16, 2020
Since the commit d88fe44 ("Add support for sharing /dev/shm/ and
/dev/mqueue between containers") container's /dev/shm is mounted on the
host first, then bind-mounted inside the container. This is done that
way in order to be able to share this container's IPC namespace
(and the /dev/shm mount point) with another container.

Unfortunately, this functionality breaks container checkpoint/restore
(even if IPC is not shared). Since /dev/shm is an external mount, its
contents is not saved by `criu checkpoint`, and so upon restore any
application that tries to access data under /dev/shm is severily
disappointed (which usually results in a fatal crash).

This commit solves the issue by introducing new IPC modes for containers
(in addition to 'host' and 'container:ID'). The new modes are:

 - 'shareable':	enables sharing this container's IPC with others
		(this used to be the implicit default);

 - 'private':	disables sharing this container's IPC.

In 'private' mode, container's /dev/shm is truly mounted inside the
container, without any bind-mounting from the host, which solves the
issue.

While at it, let's also implement 'none' mode. The motivation, as
eloquently put by Justin Cormack, is:

> I wondered a while back about having a none shm mode, as currently it is
> not possible to have a totally unwriteable container as there is always
> a /dev/shm writeable mount. It is a bit of a niche case (and clearly
> should never be allowed to be daemon default) but it would be trivial to
> add now so maybe we should...

...so here's yet yet another mode:

 - 'none':	no /dev/shm mount inside the container (though it still
		has its own private IPC namespace).

Now, to ultimately solve the abovementioned checkpoint/restore issue, we'd
need to make 'private' the default mode, but unfortunately it breaks the
backward compatibility. So, let's make the default container IPC mode
per-daemon configurable (with the built-in default set to 'shareable'
for now). The default can be changed either via a daemon CLI option
(--default-shm-mode) or a daemon.json configuration file parameter
of the same name.

Note one can only set either 'shareable' or 'private' IPC modes as a
daemon default (i.e. in this context 'host', 'container', or 'none'
do not make much sense).

Some other changes this patch introduces are:

1. A mount for /dev/shm is added to default OCI Linux spec.

2. IpcMode.Valid() is simplified to remove duplicated code that parsed
   'container:ID' form. Note the old version used to check that ID does
   not contain a semicolon -- this is no longer the case (tests are
   modified accordingly). The motivation is we should either do a
   proper check for container ID validity, or don't check it at all
   (since it is checked in other places anyway). I chose the latter.

3. IpcMode.Container() is modified to not return container ID if the
   mode value does not start with "container:", unifying the check to
   be the same as in IpcMode.IsContainer().

3. IPC mode unit tests (runconfig/hostconfig_test.go) are modified
   to add checks for newly added values.

[v2: addressed review at moby/moby#34087 (review)]
[v3: addressed review at moby/moby#34087 (review)]
[v4: addressed the case of upgrading from older daemon, in this case
     container.HostConfig.IpcMode is unset and this is valid]
[v5: document old and new IpcMode values in api/swagger.yaml]
[v6: add the 'none' mode, changelog entry to docs/api/version-history.md]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Upstream-commit: 7120976
Component: engine
corhere pushed a commit to corhere/moby that referenced this pull request Aug 6, 2024
This builds (and depends) on moby#34087

Version 2:
 - remove --ipc argument validation (it is now done by daemon)
 - add/document 'none' value
 - docs/reference/run.md: add a table with better modes description
 - dockerd(8) typesetting fixes

Version 3:
 - remove ipc mode tests from cli/command/container/opts_test.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
corhere pushed a commit to corhere/moby that referenced this pull request Aug 26, 2024
This builds (and depends) on moby#34087

Version 2:
 - remove --ipc argument validation (it is now done by daemon)
 - add/document 'none' value
 - docs/reference/run.md: add a table with better modes description
 - dockerd(8) typesetting fixes

Version 3:
 - remove ipc mode tests from cli/command/container/opts_test.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit c23d4b0)
Signed-off-by: Cory Snider <csnider@mirantis.com>
corhere pushed a commit to corhere/moby that referenced this pull request Aug 26, 2024
This builds (and depends) on moby#34087

Version 2:
 - remove --ipc argument validation (it is now done by daemon)
 - add/document 'none' value
 - docs/reference/run.md: add a table with better modes description
 - dockerd(8) typesetting fixes

Version 3:
 - remove ipc mode tests from cli/command/container/opts_test.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit c23d4b0)
Signed-off-by: Cory Snider <csnider@mirantis.com>
corhere pushed a commit to corhere/moby that referenced this pull request Aug 29, 2024
This builds (and depends) on moby#34087

Version 2:
 - remove --ipc argument validation (it is now done by daemon)
 - add/document 'none' value
 - docs/reference/run.md: add a table with better modes description
 - dockerd(8) typesetting fixes

Version 3:
 - remove ipc mode tests from cli/command/container/opts_test.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit c23d4b0)
Signed-off-by: Cory Snider <csnider@mirantis.com>
maggie44 pushed a commit to maggie44/moby that referenced this pull request Dec 8, 2024
This builds (and depends) on moby#34087

Version 2:
 - remove --ipc argument validation (it is now done by daemon)
 - add/document 'none' value
 - docs/reference/run.md: add a table with better modes description
 - dockerd(8) typesetting fixes

Version 3:
 - remove ipc mode tests from cli/command/container/opts_test.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants