libct: fix resetting CPU affinity #5025
Conversation
How about @askervin?
NumCPU() returns the number of CPUs usable by the current process, which is not what is needed here. Besides, I would avoid adding any magic values (like 1024) to this logic. If, for instance, unix.CPUSet were changed to [64]uint64 instead of the current [16]uint64, the current workaround calling SchedGetaffinity() would start working on 4096-CPU systems; or if unix.SchedGetaffinity/SchedSetaffinity were updated to work with dynamic (large enough) cpuset sizes, then this fix would work as is. With magic numbers we would only introduce a new place that needs to be fixed at some point.
I'm not sure this completely fixes #5023 -- yes, it stops the reset issue, but it still leaves you with the same problem that #4858 was trying to solve. If you are explicitly requesting CPUs >= 1024 then you will still not be able to get them, AFAICS, because we still go through unix.CPUSet. I think we should just be calling sched_setaffinity ourselves with a large enough mask. For what it's worth, I think even the current behaviour of resetting to use the first 1024 CPUs by default is better than regressing #4858.
c92d043 to be8dbc6 (force-push)
@cyphar, I think you're right: why make a quick fix when the proper fix is not really that much harder. Updated. Playing it as safe as the latest Go runtime with regard to the size of the CPU mask.
be8dbc6 to 8618a16 (force-push)
kolyshkin
left a comment
a single nit, otherwise LGTM
8618a16 to 016fac8 (force-push)
Hmm, should we try to fix unix.CPUSet instead?
I'm afraid trying to make unix.CPUSet dynamic will break lots of code. At least I can't come up with any nice change that would make it just work without modifications in applications. The code that uses it is already broken on 1024+ CPU systems, yet trivial changes that would make it dynamic can easily break it on smaller systems. One possibility could be adding unix.CPUSetS, following the spirit of the C library's *_S macros. We could perhaps first use it in runc internals or libcontainer, and then propose unix.CPUSetS upstream once we are happy and runc works fine. How would this sound to you? A prototype of CPUSetS is in the topmost commit (WIP: dynamic cpuset) in this branch:
FWIW I opened https://go.dev/cl/727540 and https://go.dev/cl/727541 and later replaced these two with https://go-review.googlesource.com/c/sys/+/735380 which is currently in review. If/when approved it can be reused to solve this. |
This is great, @kolyshkin, thanks! |
kolyshkin
left a comment
Given the fact that https://go-review.googlesource.com/c/sys/+/735380 is stuck in review and we want 1.4.1 out, I'm going to give this one a go (plus, most probably, we'll still need retryOnEINTR wrapper anyway).
Just a nit about SchedSetaffinity.
9a0a6cc to 3307324 (force-push)
ping @kolyshkin
3307324 to e66c563 (force-push)
kolyshkin
left a comment
I wonder if we can maybe use /sys/devices/system/cpu/possible or /sys/devices/system/cpu/kernel_max, and fall back to 1024 CPUs if we can't read the above file(s) -- assuming that systems with more than 1024 CPUs do have /sys/devices/system/cpu available.
Or will it be too slow?
@kolyshkin, I considered
I did a quick benchmark, if you are really concerned about performance. Test code:

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os"
	"strconv"
	"testing"
	"unsafe"

	"golang.org/x/sys/unix"
)

// retryOnEINTR takes a function that returns an error and calls it
// until the error returned is not EINTR.
func retryOnEINTR(fn func() error) error {
	for {
		err := fn()
		if !errors.Is(err, unix.EINTR) {
			return err
		}
	}
}

func schedSetaffinity(pid int, buf []byte) error {
	err := retryOnEINTR(func() error {
		_, _, errno := unix.Syscall(
			unix.SYS_SCHED_SETAFFINITY,
			uintptr(pid),
			uintptr(len(buf)),
			uintptr(unsafe.Pointer(&buf[0])))
		if errno != 0 {
			return errno
		}
		return nil
	})
	return os.NewSyscallError("sched_setaffinity", err)
}

func naiveResetAffinity(pid int) error {
	const maxCPUs = 64 * 1024
	var buf [maxCPUs / 8]byte
	for i := range buf {
		buf[i] = 0xFF
	}
	return schedSetaffinity(pid, buf[:])
}

func uint64ResetAffinity(pid int) error {
	const maxCPUs = 64 * 1024
	var buf [maxCPUs / 64]uint64
	for i := range buf {
		buf[i] = 0xFFFF_FFFF_FFFF_FFFF // all 64 bits set
	}
	err := retryOnEINTR(func() error {
		_, _, errno := unix.Syscall(
			unix.SYS_SCHED_SETAFFINITY,
			uintptr(pid),
			unsafe.Sizeof(buf),
			uintptr(unsafe.Pointer(&buf[0])))
		if errno != 0 {
			return errno
		}
		return nil
	})
	return os.NewSyscallError("sched_setaffinity", err)
}

func bytesRepeatResetAffinity(pid int) error {
	const maxCPUs = 64 * 1024
	buf := bytes.Repeat([]byte{0xff}, maxCPUs/8)
	return schedSetaffinity(pid, buf)
}

func sysfsResetAffinity(pid int) error {
	maxStr, err := os.ReadFile("/sys/devices/system/cpu/kernel_max")
	if err != nil {
		return fmt.Errorf("failed to get max CPUS supported by kernel: %w", err)
	}
	maxCPUs, err := strconv.Atoi(string(bytes.TrimSpace(maxStr)))
	if err != nil {
		return fmt.Errorf("failed to parse max CPUS supported by kernel: %w", err)
	}
	buf := bytes.Repeat([]byte{0xff}, maxCPUs/8)
	return schedSetaffinity(pid, buf)
}

func BenchmarkResetAffinity(b *testing.B) {
	for _, test := range []struct {
		name    string
		benchFn func(pid int) error
	}{
		{"naiveResetAffinity", naiveResetAffinity},
		{"uint64ResetAffinity", uint64ResetAffinity},
		{"bytesRepeatResetAffinity", bytesRepeatResetAffinity},
		{"sysfsResetAffinity", sysfsResetAffinity},
	} {
		b.Run(test.name, func(b *testing.B) {
			pid := os.Getpid()
			benchFn := test.benchFn
			for b.Loop() {
				benchFn(pid)
			}
		})
	}
}
```
e66c563 to 69693f0 (force-push)
Big thanks for the benchmark code, @cyphar! I ran it and a few additional variants on a couple of different systems. uint64 performed best on those as well; only for sysfs access was the absolute time difference very big, which looked a bit bad in my eyes. So I decided to stick with it.
cyphar
left a comment
Looks good. I would like us to use our homegrown linux.SchedSetaffinity for the rest of our affinity setting operations (to allow us to fully support >1024 CPU systems) but that can be done in a later PR.
unix.CPUSet is limited to 1024 CPUs. Calling unix.SchedSetaffinity(pid, cpuset) removes all CPUs starting from 1024 from the allowed CPUs of pid, even if cpuset is all ones. As a consequence, when runc tries to reset CPU affinity to "allow all" by default, it prevents all containers from using CPUs 1024 onwards.

This change uses a huge CPU mask to play it safe and get all possible CPUs enabled with a single sched_setaffinity call.

Fixes: opencontainers#5023

Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
69693f0 to 700c944 (force-push)
1.4 backport: #5149
Nice work
unix.CPUSet is limited to 1024 CPUs. Calling unix.SchedSetaffinity(pid, cpuset) removes all CPUs starting from 1024 from the allowed CPUs of pid, even if cpuset is all ones. As a consequence of runc resetting CPU affinity by default, all containers are prevented from using those CPUs.

This change is a quick fix that brings runc behavior back to what it was in v1.3.0 on systems with more than 1024 CPUs. The real fix requires calling sched_setaffinity with a cpusetsize that fits all CPUs in the system, which cannot be done with the current unix.SchedSetaffinity.
Fixes: #5023