
[BUG] run unit test shows 'Cannot re-initialize CUDA in forked subprocess' #1987

@delock


Describe the bug
When running the unit test test_zero2_reduce_scatter_off, it fails with the error message: "Cannot re-initialize CUDA in forked subprocess"

To Reproduce
Steps to reproduce the behavior:

  1. Go to tests/unit
  2. Run the command 'pytest -k test_zero2_reduce_scatter_off'
  3. The test fails with the message "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method"
  4. Most of the unit tests hit the same error when they use the @distributed_test decorator
  5. Manually modifying common.py to use the spawn start method does not solve the problem, because spawn cannot pickle the local test function (see the sketch after this list)
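
For context, below is a minimal sketch (independent of DeepSpeed's actual test harness) of the two failure modes listed above, assuming a PyTorch installation with a CUDA device; the helper names `worker`, `run_forked`, and `run_spawned_with_local_fn` are made up for illustration.

```python
# Minimal sketch of both failure modes, assuming PyTorch + CUDA.
import multiprocessing as mp
import torch

def worker(rank):
    # Raises "Cannot re-initialize CUDA in forked subprocess" when the parent
    # process already initialized CUDA and the start method is 'fork'.
    torch.cuda.set_device(0)
    print(f"rank {rank} sees {torch.cuda.get_device_name(0)}")

def run_forked():
    torch.cuda.current_device()            # parent initializes CUDA first
    ctx = mp.get_context("fork")
    p = ctx.Process(target=worker, args=(0,))
    p.start()
    p.join()                               # child dies with the RuntimeError above

def run_spawned_with_local_fn():
    # Switching to 'spawn' avoids the CUDA error, but a decorator such as
    # @distributed_test wraps a *local* test function. Spawn must pickle the
    # target, and local (nested) functions are not picklable, so p.start()
    # raises "AttributeError: Can't pickle local object ...".
    def local_worker(rank):
        torch.cuda.set_device(0)
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=local_worker, args=(0,))
    p.start()
    p.join()

if __name__ == "__main__":
    run_forked()                   # reproduces the fork/CUDA error
    # run_spawned_with_local_fn()  # reproduces the spawn/pickle error
```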

Expected behavior
The test should not fail with the error message "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method"

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/guizili/miniconda3/envs/ds/lib/python3.9/site-packages/torch']
torch version .................... 1.11.0+cu115
torch cuda version ............... 11.5
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed install path ........... ['/home/guizili/frameworks.ai.benchmarking.other.deepspeed/deepspeed']
deepspeed info ................... 0.6.6+3da84185, 3da84185, upmaster
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5


System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: one machine with x1 A100
  • Interconnects (if applicable): N/A
  • Python version: 3.9.7

Launcher context
N/A, I'm running unit tests

Docker context
N/A

