Skip to content

Conversation

@justincormack
Copy link
Contributor

@justincormack justincormack commented May 26, 2020

fixes #8460
fixes #38664

Currently default capability CAP_NET_RAW allows users to open ICMP echo
sockets, and CAP_NET_BIND_SERVICE allows binding to ports under 1024.
Both of these are safe operations, and Linux now provides ways that
these can be set, per container, to be allowed without any capabilties
for non root users. Enable these by default. Users can revert to the
previous behaviour by overriding the sysctl values explicitly.

Signed-off-by: Justin Cormack justin.cormack@docker.com

leverets

(picture by https://twitter.com/CountrysideBen)

@justincormack
Copy link
Contributor Author

Seems to be erroring in user namespace, need to exclude this case apparently.

@thaJeztah
Copy link
Member

added "fixes #8460" to the description

@tianon this one is for you! 🤗

Comment on lines 775 to 777
s.Linux.Sysctl["net.ipv4.ping_group_range"] = "0 2147483647"
// allow opening any port less than 1024 without CAP_NET_BIND_SERVICE
s.Linux.Sysctl["net.ipv4.ip_unprivileged_port_start"] = "0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are custom values already set at this point? If so, should we skip these if a custom value was set? (not sure if someone would have any reason to)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yosifkit looking over my shoulder while I'm cheering for this glorious PR also asked if one or both of these should be specifically excluded when using --network=host? (it's not clear from the code if this is already the case 🙈)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! 👍 (not sure what it does when manually specifying --sysctl 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it "works" just fine? 😅

$ docker run --init -it --rm --network host --sysctl net.ipv4.ip_unprivileged_port_start=0 --user 1000:1000 tianon/network-toolbox nc -vvv -l -p 123
Listening on [0.0.0.0] (family 2, port 123)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and persistently adjusts the value for my host: 🥳

$ cat /proc/sys/net/ipv4/ip_unprivileged_port_start
0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did it change your host's configuration? 😬

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOL; race condition in comments.

Looks like that's a .... "feature"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, better exclude net=host too...

Copy link
Member

@tianon tianon Jun 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow-up on this thread, @kolyshkin separately fixed the "--sysctl=net.* + --net=host changes the host's settings" part of this discussion in runc itself via opencontainers/runc#2706 (issue opencontainers/runc#2693). 😄 🎉

$ docker run --init -it --rm --network host --sysctl net.ipv4.ip_unprivileged_port_start=0 --user 1000:1000 tianon/network-toolbox nc -vvv -l -p 123
docker: Error response from daemon: OCI runtime create failed: sysctl "net.ipv4.ip_unprivileged_port_start" not allowed in host network namespace: unknown.

@thaJeztah
Copy link
Member

Looks like userns is still failing;


[2020-05-27T19:14:03.024Z] === FAIL: amd64.integration-cli TestDockerDaemonSuite/TestDaemonUserNamespaceRootSetting (1.70s)
[2020-05-27T19:14:03.024Z]     --- FAIL: TestDockerDaemonSuite/TestDaemonUserNamespaceRootSetting (1.70s)
[2020-05-27T19:14:03.024Z]         docker_cli_userns_test.go:50: assertion failed: error is not nil: exit status 125: Output: df409cf62ef426ddc2fc30b44eb45c79203b4a2c20ab46b9e587389f9cb5095d
[2020-05-27T19:14:03.024Z]             /usr/local/cli/docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"write sysctl key net.ipv4.ping_group_range: write /proc/sys/net/ipv4/ping_group_range: invalid argument\"": unknown.

Could that be because of the order in which options are applied (both WithSysctls and WithNamespaces are applied after this)?

moby/daemon/oci_linux.go

Lines 983 to 999 in 07e6b84

opts = append(opts,
WithCommonOptions(daemon, c),
WithCgroups(daemon, c),
WithResources(c),
WithSysctls(c),
WithDevices(daemon, c),
WithUser(c),
WithRlimits(daemon, c),
WithNamespaces(daemon, c),
WithCapabilities(c),
WithSeccomp(daemon, c),
WithMounts(daemon, c),
WithLibnetwork(daemon, c),
WithApparmor(c),
WithSelinux(c),
WithOOMScore(&c.HostConfig.OomScoreAdj),
)

Or is it something in how those tests are setup?

@thaJeztah
Copy link
Member

Also wondering if we should set the defaults inside WithSysctls 🤔 (might possibly be clearer, although there's something to be said for having the defaults in a central place)

@justincormack
Copy link
Contributor Author

I think WithSysctls is designed to be a generic method so we should not set defaults there. There was a typo in the userns maybe this will work. I also need to check rootless mode, and a couple of other tweaks (need to support old kernels without the unpriv support I think).

@thaJeztah
Copy link
Member

I think WithSysctls is designed to be a generic method so we should not set defaults there.

Yes, agreed.

Should we make WithSysctls return an error if the container is using host networking, and Linux.Sysctl is non-empty? (I'd have to check the flow; unfortunately our "validation" is handled in various places, so not sure if that would work / be the best place; also may be out of scope for this PR)

@justincormack
Copy link
Contributor Author

For host networking, it would be all sysctls starting with "net.*" that should be blocked. Thats independent of this PR but I think it does make sense. Maybe it should happen in runc though, which is where non namespaced sysctls are blocked now?

@justincormack justincormack force-pushed the default-sysctls branch 5 times, most recently from d3e1917 to bd4bd48 Compare June 2, 2020 11:41
@thaJeztah
Copy link
Member

thaJeztah commented Jun 4, 2020

Looks like there's a panic on ppc64le;


[2020-06-04T09:04:38.364Z] === FAIL: ppc64le.integration.build TestBuildMultiStageOnBuild (22.32s)
[2020-06-04T09:04:38.364Z]     build_test.go:377: assertion failed: string "{\"stream\":\"Step 1/7 : FROM busybox AS stage1\"}\r\n{\"stream\":\"\\n\"}\r\n{\"stream\":\" ---\\u003e f4279da41337\\n\"}\r\n{\"stream\":\"Step 2/7 : ONBUILD RUN echo 'foo' \\u003esomefile\"}\r\n{\"stream\":\"\\n\"}\r\n{\"stream\":\" ---\\u003e Running in 2dd830333d56\\n\"}\r\n{\"stream\":\"Removing intermediate container 2dd830333d56\\n\"}\r\n{\"stream\":\" ---\\u003e ed9596ea0ee0\\n\"}\r\n{\"stream\":\"Step 3/7 : ONBUILD ENV bar=baz\"}\r\n{\"stream\":\"\\n\"}\r\n{\"stream\":\" ---\\u003e Running in a65eb56bb730\\n\"}\r\n{\"stream\":\"Removing intermediate container a65eb56bb730\\n\"}\r\n{\"stream\":\" ---\\u003e 72c5aefcdca5\\n\"}\r\n{\"aux\":{\"ID\":\"sha256:72c5aefcdca5c616afb139e3bd6a90d5e68c22124645b9ea51fe358fe5038154\"}}\r\n{\"stream\":\"Step 4/7 : FROM stage1\"}\r\n{\"stream\":\"\\n\"}\r\n{\"stream\":\"# Executing 2 build trigger\"}\r\n{\"stream\":\"s\"}\r\n{\"stream\":\"\\n\"}\r\n{\"stream\":\" ---\\u003e Running in 670fc6466a09\\n\"}\r\n{\"stream\":\"Removing intermediate container 670fc6466a09\\n\"}\r\n{\"errorDetail\":{\"message\":\"OCI runtime create failed: container_linux.go:349: starting container process caused \\\"process_linux.go:449: container init caused \\\\\\\"write sysctl key net.ipv4.ip_unprivileged_port_start: open /proc/sys/net/ipv4/ip_unprivileged_port_start: no such file or directory\\\\\\\"\\\": unknown\"},\"error\":\"OCI runtime create failed: container_linux.go:349: starting container process caused \\\"process_linux.go:449: container init caused \\\\\\\"write sysctl key net.ipv4.ip_unprivileged_port_start: open /proc/sys/net/ipv4/ip_unprivileged_port_start: no such file or directory\\\\\\\"\\\": unknown\"}\r\n" does not contain "Successfully built"
[2020-06-04T09:04:38.364Z]     build_test.go:381: assertion failed: 3 (int) != 1 (int)
[2020-06-04T09:04:38.364Z] panic: runtime error: index out of range [2] with length 1 [recovered]
[2020-06-04T09:04:38.364Z] 	panic: runtime error: index out of range [2] with length 1
write sysctl key net.ipv4.ip_unprivileged_port_start: open /proc/sys/net/ipv4/ip_unprivileged_port_start: no such file or directory

@thaJeztah
Copy link
Member

That error seems to come from runc; @kolyshkin @AkihiroSuda

@justincormack
Copy link
Contributor Author

I am just adding checks for whether the sysctl node exists, which should work around this, although it was mainly designed around old kernels. However it shouldn't panic.

@justincormack justincormack force-pushed the default-sysctls branch 2 times, most recently from 65bcfa9 to ed291d9 Compare June 4, 2020 13:41
…o capabilities

Currently default capability CAP_NET_RAW allows users to open ICMP echo
sockets, and CAP_NET_BIND_SERVICE allows binding to ports under 1024.
Both of these are safe operations, and Linux now provides ways that
these can be set, per container, to be allowed without any capabilties
for non root users. Enable these by default. Users can revert to the
previous behaviour by overriding the sysctl values explicitly.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
@kolyshkin
Copy link
Contributor

The panic is not from runc, but from the test case itself, here:

imageIDs, err := getImageIDsFromBuild(out.Bytes())
assert.NilError(t, err)
assert.Check(t, is.Equal(3, len(imageIDs)))
image, _, err := apiclient.ImageInspectWithRaw(context.Background(), imageIDs[2])

It panics right there, line 383. The panic fix is

diff --git a/integration/build/build_test.go b/integration/build/build_test.go
index edbfb37935..cd682a5e58 100644
--- a/integration/build/build_test.go
+++ b/integration/build/build_test.go
@@ -378,7 +378,7 @@ RUN cat somefile`
 
        imageIDs, err := getImageIDsFromBuild(out.Bytes())
        assert.NilError(t, err)
-       assert.Check(t, is.Equal(3, len(imageIDs)))
+       assert.Assert(t, is.Equal(3, len(imageIDs)))
 
        image, _, err := apiclient.ImageInspectWithRaw(context.Background(), imageIDs[2])
        assert.NilError(t, err)

@thaJeztah
Copy link
Member

oh, thanks! I must've read the output wrong 😅

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

(asked @kolyshkin to double-check the approach for sysctlExists() 🤗)

joe4dev added a commit to localstack/lambda-runtime-init that referenced this pull request Mar 8, 2023
As @dfangl pointed out:
Binding port 53 might require root permissions for binding port < 1024 depending on the Docker version
moby/moby#41030
dfangl pushed a commit to localstack/lambda-runtime-init that referenced this pull request Mar 8, 2023
As @dfangl pointed out:
Binding port 53 might require root permissions for binding port < 1024 depending on the Docker version
moby/moby#41030
BlackDex added a commit to BlackDex/alloy that referenced this pull request May 1, 2024
In this PR grafana/agent#6817 the setcap was
added for the alloy/agent binary for some reason unknown to me.

Since the docker engine allows binding to unpriviliged ports already for a
long time via moby/moby#41030, which was
implemented in the Docker engine v20.10.x and also other container
runtimes.

This should solve the following issues.
Fixes grafana#117
Fixes grafana#303
BlackDex added a commit to BlackDex/alloy that referenced this pull request May 1, 2024
In this PR grafana/agent#6817 the setcap was
added for the alloy/agent binary for some reason unknown to me.

Since the docker engine allows binding to unpriviliged ports already for a
long time via moby/moby#41030, which was
implemented in the Docker engine v20.10.x and also other container
runtimes.

This should solve the following issues.
Fixes grafana#117
Fixes grafana#303
captncraig pushed a commit to grafana/alloy that referenced this pull request May 1, 2024
In this PR grafana/agent#6817 the setcap was
added for the alloy/agent binary for some reason unknown to me.

Since the docker engine allows binding to unpriviliged ports already for a
long time via moby/moby#41030, which was
implemented in the Docker engine v20.10.x and also other container
runtimes.

This should solve the following issues.
Fixes #177
Fixes #303
polyrain pushed a commit to polyrain/alloy that referenced this pull request May 2, 2024
In this PR grafana/agent#6817 the setcap was
added for the alloy/agent binary for some reason unknown to me.

Since the docker engine allows binding to unpriviliged ports already for a
long time via moby/moby#41030, which was
implemented in the Docker engine v20.10.x and also other container
runtimes.

This should solve the following issues.
Fixes grafana#177
Fixes grafana#303
@lukanoluwatobi
Copy link

For docker compose file use this instead

...
sysctls:
      net.ipv4.ip_unprivileged_port_start: 0
    user: "${UID}:${GID}"
...

@thaJeztah
Copy link
Member

@lukanoluwatobi this option is set by default on current versions of the engine (20.10 and up), so I don't think there should be a need to manually pass that option. Was there a reason you had to set it manually?

Randy88-art

This comment was marked as spam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Docker privileged / capabilities not working with non-root user Can't bind to privileged ports as non-root

7 participants