Report of increased memory overhead during cudagraph capture with nccl >= 2.19 #1234

@youkaichao

Description

Hi, I would like to report a memory issue with nccl. A reproducible example is attached below:

On a GCP g2-standard-24 instance (with 2 L4 GPUs):

docker pull us-central1-docker.pkg.dev/vllm-405802/vllm-ci-test-repo/vllm-test:a3c2340ae36ce8ee782691d30111377eaf7ae6ce
docker run --gpus all --shm-size=2g -it us-central1-docker.pkg.dev/vllm-405802/vllm-ci-test-repo/vllm-test:a3c2340ae36ce8ee782691d30111377eaf7ae6ce /bin/bash

# inside docker
cd /vllm-workspace/tests/distributed
export NCCL_DEBUG=TRACE
TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf pytest -v -s --forked test_basic_distributed_correctness.py

Note that the code manually loads a pre-downloaded nccl==2.18.3. There is also nccl==2.19.3 available inside the image, at /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 .

By adding the following line in /vllm-workspace/vllm/model_executor/parallel_utils/pynccl.py, before nccl = ctypes.CDLL(so_file):

so_file = "/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2"

we can force the program to use nccl 2.19.3, which then triggers an OOM error.
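The override works because pynccl.py binds to whatever shared object path it hands to ctypes.CDLL. A minimal, runnable sketch of the same mechanism is below; libm stands in for libnccl.so.2 so the example runs without a GPU, but the loading pattern (explicit path into ctypes.CDLL, then declaring function signatures) is identical:

```python
import ctypes
import ctypes.util

# Stand-in for a hard-coded path such as
# /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
so_file = ctypes.util.find_library("m")

# Whatever path is assigned to so_file right before this call wins;
# that is why inserting one assignment line swaps the nccl version.
lib = ctypes.CDLL(so_file)

# Declare the C signature before calling, as pynccl does for nccl symbols.
lib.sqrt.restype = ctypes.c_double
lib.sqrt.argtypes = [ctypes.c_double]
print(lib.sqrt(9.0))  # 3.0
```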

The background:

In distributed inference, https://github.com/vllm-project/vllm uses nccl together with cudagraph. We capture about 30 graphs with different batch sizes. The memory overhead with pytorch 2.1.2 (with nccl==2.18.3) is nearly zero (about 10MB per graph, sometimes zero); however, after upgrading to pytorch 2.2.0 (with nccl==2.19.3), the overhead grows to more than 100MB per graph.
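To put those per-graph numbers in perspective, a back-of-envelope total over the ~30 captured graphs (hypothetical arithmetic from the figures above, not a measurement):

```python
# Rough totals implied by the observed per-graph overheads.
num_graphs = 30
overhead_2_18_mib = num_graphs * 10    # nccl==2.18.3: ~10 MiB per graph
overhead_2_19_mib = num_graphs * 100   # nccl>=2.19: >100 MiB per graph
print(overhead_2_18_mib, overhead_2_19_mib)  # 300 3000
```

An extra ~3 GiB is substantial on a 24 GiB L4, which is consistent with the OOM described above.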

We spent more than a week (honestly, more time than one would feel comfortable with) investigating the issue. We initially suspected pytorch, but ultimately found that the problem comes from the nccl library.

For more code measuring the memory overhead, please see vllm-project/vllm#3442 (comment) .

It would be very helpful if the nccl team could point out the root cause of the memory overhead and any potential knobs to control it (e.g. via environment variables). The problem occurs with both nccl==2.19.3 and nccl==2.20.5 .

Thank you for your time.
