Conda Pytorch set processor affinity to the first physical core after fork #99625

@jjyyxx

🐛 Describe the bug

This issue may be related to #98836 and #91989.

Background

The issue arises when using PyTorch with Ray. The raylet process is forked from the main process after torch is imported, and raylet then uses execvpe to create worker processes. However, taskset -pc `pgrep raylet` shows a current affinity list of 0,1, and the worker processes inherit it. As a result, all Ray worker processes run on a single physical core, which causes a significant performance penalty.

Investigation

This issue boils down to the use of fork. A minimal Python example is:

# Env: conda create -n torch -c conda-forge -c pytorch --override-channels python=3.9 pytorch cpuonly
import torch
import time
import os
if os.fork() == 0:
    print(f"Child PID: {os.getpid()}")
    time.sleep(100)

Running taskset -pc <child pid> then shows the restricted affinity. Tracing with strace, e.g. strace -e sched_setaffinity -f python ..., also clearly shows the sched_setaffinity call.
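The same check can be made from Python without taskset. The following is a minimal sketch, assuming Linux (os.sched_getaffinity is not available elsewhere). The helper name affinity_in_forked_child is made up for illustration. It forks, has the child report its own affinity mask through a pipe, and returns that mask to the parent. To reproduce the problem, uncomment the torch import in the affected conda environment. Without it, parent and child should report the same CPUs.

```python
import os

# import torch  # uncomment in the affected conda environment to reproduce


def affinity_in_forked_child():
    """Fork and return the set of CPUs the child is allowed to run on."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: send our affinity mask to the parent, then exit immediately
        # without running any of the parent's cleanup handlers.
        try:
            os.close(read_fd)
            cpus = ",".join(str(c) for c in sorted(os.sched_getaffinity(0)))
            os.write(write_fd, cpus.encode())
        finally:
            os._exit(0)
    # Parent: read the child's report and reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        data = reader.read()
    os.waitpid(pid, 0)
    return {int(c) for c in data.split(",")}


if __name__ == "__main__":
    print("parent:", sorted(os.sched_getaffinity(0)))
    print("child: ", sorted(affinity_in_forked_child()))
```

If the child line lists fewer CPUs than the parent line, the affinity was changed at or after the fork.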

I further narrowed the issue down with a C-only program:

// Env: conda create -n mkl -c conda-forge --override-channels mkl mkl-include
// Build: gcc -I ~/miniconda/envs/mkl/include -L ~/miniconda/envs/mkl/lib -o test test-mkl.c -lgomp
// Run: LD_LIBRARY_PATH=~/miniconda/envs/mkl/lib strace -e sched_setaffinity -f ./test
#include <unistd.h>
#include <omp.h>
int main() {
    omp_get_max_threads();
    if (fork() == 0) {
        omp_get_max_threads();
        // sleep(100);
    }
    return 0;
}

So it turns out that this may not be an issue specific to PyTorch. Linking in an environment with libgomp and without mkl does not reproduce the issue; only mkl calls sched_setaffinity. As mkl does not come with source code, I was not able to investigate further.

Workaround

There are several ways to work around this issue:

  1. Fork before importing torch
    import ray; ray.init()
    import torch
  2. Install the PyPI version of PyTorch
  3. (Not tested) Manually reset the affinity in the child processes
  4. Pin llvm-openmp to 14.0.* (found by comparing dependencies with an older, working conda environment)
    conda create -n torch2 -c conda-forge -c pytorch --override-channels python=3.9 pytorch cpuonly llvm-openmp=14
    
    So this may be an issue caused by an unpinned dependency that was upgraded.
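Workaround 3 could look roughly like the sketch below. It has not been verified against the affected environment, and the names SAVED_AFFINITY, restore_affinity and run_in_child are hypothetical. It assumes Linux, that the module is imported before torch so the saved mask is still unrestricted, and that your own code controls the fork. It would not apply directly to processes that Ray spawns itself.

```python
import os

# Assumption: this line runs before `import torch`, so the mask captured here
# is the original one.
SAVED_AFFINITY = os.sched_getaffinity(0)


def restore_affinity(mask=None):
    """Re-apply a saved CPU affinity mask to the calling process."""
    os.sched_setaffinity(0, SAVED_AFFINITY if mask is None else mask)
    return os.sched_getaffinity(0)


def run_in_child(fn):
    """Fork, restore the saved affinity in the child, run fn there.

    Returns the child's exit status: 0 if fn returned, 1 if it raised.
    """
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            restore_affinity()
            fn()
            status = 0
        finally:
            # Leave the child without returning into the parent's code path.
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) if os.WIFEXITED(wait_status) else -1
```

Whether the restored mask sticks depends on when the OpenMP runtime applies its own mask in the child, so it is worth re-checking with taskset -pc <child pid> after the child has started its work.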

Versions

The mkl conda environment

_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                  2_kmp_llvm    conda-forge
icu                       72.1                 hcb278e6_0    conda-forge
libgcc-ng                 12.2.0              h65d4601_19    conda-forge
libhwloc                  2.9.1                hd6dc26d_0    conda-forge
libiconv                  1.17                 h166bdaf_0    conda-forge
libstdcxx-ng              12.2.0              h46fd767_19    conda-forge
libxml2                   2.10.4               hfdac1af_0    conda-forge
libzlib                   1.2.13               h166bdaf_4    conda-forge
llvm-openmp               16.0.1               h417c0b6_0    conda-forge
mkl                       2022.1.0           h84fe81f_915    conda-forge
mkl-include               2022.1.0           h84fe81f_915    conda-forge
tbb                       2021.9.0             hf52228f_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
zstd                      1.5.2                h3eb15da_6    conda-forge

cc @ezyang @gchanan @zou3519 @seemethere @malfet @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

Labels: high priority, module: binaries, module: dependency bug, module: intel, module: mkl, module: third_party, triaged

Status: Done
