Conda Pytorch set processor affinity to the first physical core after fork #99625
Description
🐛 Describe the bug
This issue may be related to #98836 and #91989.
Background
The issue arises when using PyTorch with Ray. The raylet process is forked from the main process after torch has been imported, and raylet then uses execvpe to create worker processes. However, taskset -pc `pgrep raylet` shows a current affinity list of 0,1, and the worker processes inherit it. As a result, all Ray worker processes run on a single physical core, which causes a significant performance penalty.
Investigation
This issue boils down to the use of fork. A minimal Python example is:
# Env: conda create -n torch -c conda-forge -c pytorch --override-channels python=3.9 pytorch cpuonly
import torch
import time
import os
if os.fork() == 0:
    print(f"Child PID: {os.getpid()}")
    time.sleep(100)

Then taskset -pc <child pid> shows the bad affinity. Running under strace, e.g. strace -e sched_setaffinity -f python ..., also clearly shows this.
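The same comparison can be made from inside Python instead of with taskset. The following is a minimal sketch, assuming Linux (os.sched_getaffinity is not available on every platform); the helper name child_affinity_after_fork is hypothetical and not part of the original report. In a healthy environment the two lists should match; in the affected conda environment, with the torch import enabled, the child list would be expected to be narrower.

```python
import os


def child_affinity_after_fork():
    """Fork once and return (parent_cpus, child_cpus) as sorted lists."""
    parent_cpus = sorted(os.sched_getaffinity(0))
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report its own affinity through the pipe, then exit at once.
        os.close(read_fd)
        cpus = sorted(os.sched_getaffinity(0))
        os.write(write_fd, ",".join(map(str, cpus)).encode())
        os.close(write_fd)
        os._exit(0)
    # Parent: read the child's report until EOF, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        data = reader.read()
    os.waitpid(pid, 0)
    child_cpus = [int(c) for c in data.split(",") if c]
    return parent_cpus, child_cpus


if __name__ == "__main__":
    # import torch  # enable this import to check the affected environment
    parent, child = child_affinity_after_fork()
    print("parent:", parent)
    print("child: ", child)
```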
I further narrowed this down with a C-only program:
// Env: conda create -n mkl -c conda-forge --override-channels mkl mkl-include
// Build: gcc -I ~/miniconda/envs/mkl/include -L ~/miniconda/envs/mkl/lib -o test test-mkl.c -lgomp
// Run: LD_LIBRARY_PATH=~/miniconda/envs/mkl/lib strace -e sched_setaffinity -f ./test
#include <unistd.h>
#include <omp.h>
int main() {
    omp_get_max_threads();
    if (fork() == 0) {
        omp_get_max_threads();
        // sleep(100);
    }
    return 0;
}

So it turns out that this may not be specific to PyTorch. Linking in an environment that has libgomp but not mkl does not reproduce the issue; only mkl calls sched_setaffinity. Because mkl does not ship with source code, I was not able to investigate further.
Workaround
There are several ways to work around this issue:
- Fork before importing torch
  import ray; ray.init()
  import torch
- Install the PyPI version of PyTorch
- (not tested) manually reset affinity in the child processes
- Pin llvm-openmp to 14.0.* (found this when comparing dependencies with an older, working conda environment)
  conda create -n torch2 -c conda-forge -c pytorch --override-channels python=3.9 pytorch cpuonly llvm-openmp=14

So this may be an issue caused by an unpinned, upgraded dependency.
Versions
The mkl conda environment:
# Name          Version    Build            Channel
_libgcc_mutex   0.1        conda_forge      conda-forge
_openmp_mutex   4.5        2_kmp_llvm       conda-forge
icu             72.1       hcb278e6_0       conda-forge
libgcc-ng       12.2.0     h65d4601_19      conda-forge
libhwloc        2.9.1      hd6dc26d_0       conda-forge
libiconv        1.17       h166bdaf_0       conda-forge
libstdcxx-ng    12.2.0     h46fd767_19      conda-forge
libxml2         2.10.4     hfdac1af_0       conda-forge
libzlib         1.2.13     h166bdaf_4       conda-forge
llvm-openmp     16.0.1     h417c0b6_0       conda-forge
mkl             2022.1.0   h84fe81f_915     conda-forge
mkl-include     2022.1.0   h84fe81f_915     conda-forge
tbb             2021.9.0   hf52228f_0       conda-forge
xz              5.2.6      h166bdaf_0       conda-forge
zstd            1.5.2      h3eb15da_6       conda-forge
cc @ezyang @gchanan @zou3519 @seemethere @malfet @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10