libct: fix resetting CPU affinity#5025

Merged
kolyshkin merged 1 commit into opencontainers:main from askervin:5eA-workaround-max-1kcpus
Mar 4, 2026

Conversation

@askervin
Contributor

unix.CPUSet is limited to 1024 CPUs. Calling
unix.SchedSetaffinity(pid, cpuset) removes all CPUs from 1024 upwards from the allowed CPUs of pid, even if cpuset is all ones. As a consequence, runc's default attempt to reset CPU affinity prevents all containers from using those CPUs.

This change is a quick fix that brings runc behavior on 1024+ CPU systems back to what it was in v1.3.0. The real fix requires calling sched_setaffinity with a cpusetsize large enough for all CPUs in the system, which cannot be done with the current unix.SchedSetaffinity.

Fixes: #5023

@ningmingxiao
Contributor

ningmingxiao commented Nov 18, 2025

How about this, @askervin:

	if runtime.NumCPU() > 1024 {
		return
	}

@askervin
Contributor Author

	if runtime.NumCPU() > 1024 {
		return
	}

NumCPU() returns the number of CPUs usable by the current process. The purpose of tryResetCPUAffinity() is to make that number bigger, in case an external entity has made it smaller than the number of enabled CPUs in the whole system. Now assume that the external entity has set the affinity to cpuset 1023-1122, giving NumCPU()==100. The logic above would continue to SchedSetaffinity(pid, cpuset(0-1023)) and allow using only one CPU, namely 1023.

Besides, I would avoid adding any magic values (like 1024) to this logic. If, for instance, unix.CPUSet were changed to [64]uint64 instead of the current [16]uint64, the current workaround calling SchedGetaffinity() would start working on 4096-CPU systems; or if unix.SchedGetaffinity/SchedSetaffinity were updated to work with dynamic (large enough) cpuset sizes, then this fix would work as is. With magic numbers we would only have introduced a new place that needs to be fixed at some point.

@cyphar
Member

cyphar commented Nov 19, 2025

I'm not sure this completely fixes #5023 -- yes, it stops the reset issue, but it still leaves you with the same problem that #4858 was trying to solve. If you explicitly request CPUs >= 1024, you will still not be able to get them AFAICS, because we still use unix.SchedSetaffinity, which doesn't support CPUs >= 1024 (this appears to be what #5023 is talking about, but the reset case could also cause problems if you are forcing the affinity using cgroups with cpuset).

I think we should just call sched_setaffinity directly. We can either call it directly for this one case (i.e., pass an array of 0xFF bytes that is "long enough" -- the current upstream kernel maximum is 8192 CPUs, which would be a 128-element uint64 array), or we can copy the code from golang.org/x/sys and make it use slices instead of a fixed-size array, so that we can support larger CPU counts for the explicit CPU affinity configuration.

For what it's worth, I think even the current behaviour of resetting to use the first 1024 CPUs by default is better than regressing #4858.

@askervin askervin force-pushed the 5eA-workaround-max-1kcpus branch 2 times, most recently from c92d043 to be8dbc6 on November 19, 2025 09:04
@askervin
Contributor Author

askervin commented Nov 19, 2025

@cyphar, I think you're right: why make a quick fix when the proper fix is not really that much harder?

Updated. Playing it as safe as the latest Go runtime in the size of the CPU mask.

@askervin askervin force-pushed the 5eA-workaround-max-1kcpus branch from be8dbc6 to 8618a16 on November 19, 2025 11:19
@askervin askervin changed the title libct: do not reset CPU affinity if it prevents using cpu >= 1024 libct: fix resetting CPU affinity Nov 19, 2025
@cyphar cyphar added this to the 1.4.1 milestone Nov 20, 2025
Contributor

@kolyshkin kolyshkin left a comment


A single nit, otherwise LGTM.

@askervin askervin force-pushed the 5eA-workaround-max-1kcpus branch from 8618a16 to 016fac8 on December 5, 2025 15:56
@kolyshkin
Contributor

Hmm, should we try to fix unix.CPUSet instead?

@askervin
Contributor Author

askervin commented Dec 8, 2025

Hmm, should we try to fix unix.CPUSet instead?

I'm afraid trying to make unix.CPUSet dynamic would break lots of code. At least I can't come up with any nice change that would make it just work without modifications in applications. The code that uses it is already broken on 1024+ CPU systems, and trivial changes that would make it dynamic can easily break it on smaller systems.

One possibility could be adding unix.CPUSetS, following the spirit of the C library macros
CPU_SET_S,
CPU_CLEAR_S,
CPU_ZERO_S,
...
which are dynamic-size variants of the simple CPU_SET, CPU_CLEAR, CPU_ZERO, ... macros. Maybe we could implement CPUSetS with an interface as similar to CPUSet as possible, and use it.

We could perhaps try it out in runc internals or libcontainer first, and then propose unix.CPUSetS upstream once we are happy and runc works fine?

How would this sound to you?

A prototype of CPUSetS is in the topmost commit (WIP: dynamic cpuset) in this branch:
https://github.com/askervin/runc/tree/5eU-cpuset

@kolyshkin
Contributor

FWIW I opened https://go.dev/cl/727540 and https://go.dev/cl/727541 and later replaced these two with https://go-review.googlesource.com/c/sys/+/735380 which is currently in review. If/when approved it can be reused to solve this.

@askervin
Contributor Author

FWIW I opened https://go.dev/cl/727540 and https://go.dev/cl/727541 and later replaced these two with https://go-review.googlesource.com/c/sys/+/735380 which is currently in review. If/when approved it can be reused to solve this.

This is great, @kolyshkin, thanks!

Contributor

@kolyshkin kolyshkin left a comment


Given that https://go-review.googlesource.com/c/sys/+/735380 is stuck in review and we want 1.4.1 out, I'm going to give this one a go (plus, most probably, we'll still need the retryOnEINTR wrapper anyway).

Just a nit to SchedSetaffinity

@askervin askervin force-pushed the 5eA-workaround-max-1kcpus branch 2 times, most recently from 9a0a6cc to 3307324 on February 17, 2026 07:33
@AkihiroSuda
Member

ping @kolyshkin

@kolyshkin kolyshkin force-pushed the 5eA-workaround-max-1kcpus branch from 3307324 to e66c563 on March 2, 2026 17:40
Contributor

@kolyshkin kolyshkin left a comment


I wonder if we can maybe use /sys/devices/system/cpu/possible or /sys/devices/system/cpu/kernel_max, and fall back to 1024 CPUs if we can't read the above file(s) -- assuming that systems with more than 1024 CPUs do have /sys/devices/system/cpu available.

@kolyshkin
Contributor

I wonder if we can maybe use /sys/devices/system/cpu/possible or /sys/devices/system/cpu/kernel_max, and fall back to 1024 CPUs if we can't read the above file(s) -- assuming that systems with more than 1024 CPUs do have /sys/devices/system/cpu available.

Or will it be too slow?

@askervin
Contributor Author

askervin commented Mar 3, 2026

I wonder if we can maybe use /sys/devices/system/cpu/possible or /sys/devices/system/cpu/kernel_max, and fall back to 1024 CPUs if we can't read the above file(s) -- assuming that systems with more than 1024 CPUs do have /sys/devices/system/cpu available.

Or will it be too slow?

@kolyshkin, I considered .../cpu/possible as the first implementation option, too. But I ended up skipping the extra open+read+close syscalls, as I expected them to be too slow compared to the single unavoidable sched_setaffinity syscall. I did not measure the difference, but I assumed that behaving exactly like the Go runtime does in getCPUCount() would be a safe bet from a performance perspective.

@cyphar
Member

cyphar commented Mar 3, 2026

I did a quick benchmark; if you are really concerned about performance, then using [N]uint64 is 50% faster than the bytes.Repeat approach used here. Reading from sysfs is 5x slower.

goos: linux
goarch: amd64
pkg: schedaffinity
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics
BenchmarkResetAffinity
BenchmarkResetAffinity/naiveResetAffinity
BenchmarkResetAffinity/naiveResetAffinity-16              395606              2671 ns/op
BenchmarkResetAffinity/naiveResetAffinity-16              407454              2654 ns/op
BenchmarkResetAffinity/naiveResetAffinity-16              398432              2665 ns/op
BenchmarkResetAffinity/naiveResetAffinity-16              417326              2650 ns/op
BenchmarkResetAffinity/naiveResetAffinity-16              399789              2733 ns/op
BenchmarkResetAffinity/naiveResetAffinity-16              382095              2745 ns/op
BenchmarkResetAffinity/naiveResetAffinity-16              415849              2763 ns/op
BenchmarkResetAffinity/naiveResetAffinity-16              404832              2851 ns/op
BenchmarkResetAffinity/naiveResetAffinity-16              368491              2857 ns/op
BenchmarkResetAffinity/naiveResetAffinity-16              387709              2777 ns/op
BenchmarkResetAffinity/uint64ResetAffinity
BenchmarkResetAffinity/uint64ResetAffinity-16             913251              1107 ns/op
BenchmarkResetAffinity/uint64ResetAffinity-16            1073331              1121 ns/op
BenchmarkResetAffinity/uint64ResetAffinity-16            1058706              1133 ns/op
BenchmarkResetAffinity/uint64ResetAffinity-16            1032540              1166 ns/op
BenchmarkResetAffinity/uint64ResetAffinity-16            1030147              1171 ns/op
BenchmarkResetAffinity/uint64ResetAffinity-16             879085              1148 ns/op
BenchmarkResetAffinity/uint64ResetAffinity-16            1042898              1150 ns/op
BenchmarkResetAffinity/uint64ResetAffinity-16             954447              1083 ns/op
BenchmarkResetAffinity/uint64ResetAffinity-16            1102652              1085 ns/op
BenchmarkResetAffinity/uint64ResetAffinity-16             949170              1099 ns/op
BenchmarkResetAffinity/bytesRepeatResetAffinity
BenchmarkResetAffinity/bytesRepeatResetAffinity-16        527901              1993 ns/op
BenchmarkResetAffinity/bytesRepeatResetAffinity-16        576104              2013 ns/op
BenchmarkResetAffinity/bytesRepeatResetAffinity-16        541650              2052 ns/op
BenchmarkResetAffinity/bytesRepeatResetAffinity-16        542589              2101 ns/op
BenchmarkResetAffinity/bytesRepeatResetAffinity-16        517498              2046 ns/op
BenchmarkResetAffinity/bytesRepeatResetAffinity-16        504320              2107 ns/op
BenchmarkResetAffinity/bytesRepeatResetAffinity-16        545593              2093 ns/op
BenchmarkResetAffinity/bytesRepeatResetAffinity-16        564112              2090 ns/op
BenchmarkResetAffinity/bytesRepeatResetAffinity-16        529090              2070 ns/op
BenchmarkResetAffinity/bytesRepeatResetAffinity-16        536346              2096 ns/op
BenchmarkResetAffinity/sysfsResetAffinity
BenchmarkResetAffinity/sysfsResetAffinity-16               92196             11997 ns/op
BenchmarkResetAffinity/sysfsResetAffinity-16               93710             12098 ns/op
BenchmarkResetAffinity/sysfsResetAffinity-16              104853             11198 ns/op
BenchmarkResetAffinity/sysfsResetAffinity-16              101617             11791 ns/op
BenchmarkResetAffinity/sysfsResetAffinity-16              102028             11816 ns/op
BenchmarkResetAffinity/sysfsResetAffinity-16              105645             11530 ns/op
BenchmarkResetAffinity/sysfsResetAffinity-16               95286             12206 ns/op
BenchmarkResetAffinity/sysfsResetAffinity-16               96847             11930 ns/op
BenchmarkResetAffinity/sysfsResetAffinity-16              101791             12040 ns/op
BenchmarkResetAffinity/sysfsResetAffinity-16               91552             12287 ns/op
PASS
ok      schedaffinity   45.031s
Test code:
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os"
	"strconv"
	"testing"
	"unsafe"

	"golang.org/x/sys/unix"
)

// retryOnEINTR takes a function that returns an error and calls it
// until the error returned is not EINTR.
func retryOnEINTR(fn func() error) error {
	for {
		err := fn()
		if !errors.Is(err, unix.EINTR) {
			return err
		}
	}
}

func schedSetaffinity(pid int, buf []byte) error {
	err := retryOnEINTR(func() error {
		_, _, errno := unix.Syscall(
			unix.SYS_SCHED_SETAFFINITY,
			uintptr(pid),
			uintptr(len(buf)),
			uintptr((unsafe.Pointer)(&buf[0])))
		if errno != 0 {
			return errno
		}
		return nil
	})
	return os.NewSyscallError("sched_setaffinity", err)
}

func naiveResetAffinity(pid int) error {
	const maxCPUs = 64 * 1024
	var buf [maxCPUs / 8]byte
	for i := range buf {
		buf[i] = 0xFF
	}
	return schedSetaffinity(pid, buf[:])
}

func uint64ResetAffinity(pid int) error {
	const maxCPUs = 64 * 1024
	var buf [maxCPUs / 64]uint64
	for i := range buf {
		buf[i] = 0xFFFF_FFFF_FFFF_FFFF // all ones across the full 64-bit word
	}

	err := retryOnEINTR(func() error {
		_, _, errno := unix.Syscall(
			unix.SYS_SCHED_SETAFFINITY,
			uintptr(pid),
			unsafe.Sizeof(buf),
			uintptr((unsafe.Pointer)(&buf[0])))
		if errno != 0 {
			return errno
		}
		return nil
	})
	return os.NewSyscallError("sched_setaffinity", err)
}

func bytesRepeatResetAffinity(pid int) error {
	const maxCPUs = 64 * 1024
	buf := bytes.Repeat([]byte{0xff}, maxCPUs/8)
	return schedSetaffinity(pid, buf)
}

func sysfsResetAffinity(pid int) error {
	maxStr, err := os.ReadFile("/sys/devices/system/cpu/kernel_max")
	if err != nil {
		return fmt.Errorf("failed to get max CPUS supported by kernel: %w", err)
	}
	maxCPUs, err := strconv.Atoi(string(bytes.TrimSpace(maxStr)))
	if err != nil {
		return fmt.Errorf("failed to parse max CPUS supported by kernel: %w", err)
	}
	buf := bytes.Repeat([]byte{0xff}, maxCPUs/8)
	return schedSetaffinity(pid, buf)
}

func BenchmarkResetAffinity(b *testing.B) {
	for _, test := range []struct {
		name    string
		benchFn func(pid int) error
	}{
		{"naiveResetAffinity", naiveResetAffinity},
		{"uint64ResetAffinity", uint64ResetAffinity},
		{"bytesRepeatResetAffinity", bytesRepeatResetAffinity},
		{"sysfsResetAffinity", sysfsResetAffinity},
	} {
		b.Run(test.name, func(b *testing.B) {
			pid := os.Getpid()
			benchFn := test.benchFn
			for b.Loop() {
				benchFn(pid)
			}
		})
	}
}

@askervin askervin force-pushed the 5eA-workaround-max-1kcpus branch from e66c563 to 69693f0 on March 4, 2026 09:47
@askervin
Contributor Author

askervin commented Mar 4, 2026

Big thanks for the benchmark code, @cyphar! I ran it and a few additional variants on a couple of different systems. While uint64 performed best on those as well, the absolute time difference was not very big; only the sysfs access looked a bit bad in my eyes. So I decided to stick with bytes.Repeat and buf []byte as the SchedSetaffinity parameter for now. Of course, if wanted, I can switch this to buf []uint64, too.

Member

@cyphar cyphar left a comment


Looks good. I would like us to use our homegrown linux.SchedSetaffinity for the rest of our affinity setting operations (to allow us to fully support >1024 CPU systems) but that can be done in a later PR.

unix.CPUSet is limited to 1024 CPUs. Calling
unix.SchedSetaffinity(pid, cpuset) removes all CPUs starting from 1024
from allowed CPUs of pid, even if cpuset is all ones. As a
consequence, when runc tries to reset CPU affinity to "allow all" by
default, it prevents all containers from using CPUs 1024 onwards.

This change uses a huge CPU mask to play safe and get all possible
CPUs enabled with a single sched_setaffinity call.

Fixes: opencontainers#5023

Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
@kolyshkin kolyshkin force-pushed the 5eA-workaround-max-1kcpus branch from 69693f0 to 700c944 on March 4, 2026 21:06
@kolyshkin kolyshkin enabled auto-merge March 4, 2026 21:07
@kolyshkin kolyshkin added the backport/1.4-todo label (a PR in main branch which needs to be backported to release-1.4) Mar 4, 2026
@kolyshkin kolyshkin merged commit bc7f1c2 into opencontainers:main Mar 4, 2026
40 of 41 checks passed
@kolyshkin kolyshkin added the backport/1.4-done label (a PR in main branch which has been backported to release-1.4) and removed the backport/1.4-todo label Mar 4, 2026
@kolyshkin kolyshkin removed this from the 1.4.1 milestone Mar 4, 2026
@kolyshkin
Contributor

1.4 backport: #5149

@DrLucifer19

Nice work



Development

Successfully merging this pull request may close these issues.

Resetting CPU affinity does the opposite on 1024+ CPU systems

6 participants