Hi, I would like to report a memory issue with nccl. A reproducible example is attached below:
In a gcp g2-standard-24 instance (with 2 L4 GPUs):
docker pull us-central1-docker.pkg.dev/vllm-405802/vllm-ci-test-repo/vllm-test:a3c2340ae36ce8ee782691d30111377eaf7ae6ce
docker run --gpus all --shm-size=2g -it us-central1-docker.pkg.dev/vllm-405802/vllm-ci-test-repo/vllm-test:a3c2340ae36ce8ee782691d30111377eaf7ae6ce /bin/bash
# inside docker
cd /vllm-workspace/tests/distributed
export NCCL_DEBUG=TRACE
TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf pytest -v -s --forked test_basic_distributed_correctness.py
Note that the code manually links against a pre-downloaded nccl==2.18.3. There is also an nccl==2.19.3 available inside the image, at /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 .
By adding the following line in /vllm-workspace/vllm/model_executor/parallel_utils/pynccl.py, before nccl = ctypes.CDLL(so_file):
so_file = "/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2"
we can force the program to use nccl 2.19.3, and it then fails with an OOM error.
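To confirm which library actually got loaded, one can query the version through the public ncclGetVersion API via ctypes. A minimal sketch (the so_file path is the one from the image above; the decoding assumes NCCL >= 2.9, where the version code is major*10000 + minor*100 + patch):

```python
import ctypes

def decode_nccl_version(code: int) -> str:
    # For NCCL >= 2.9, the version code is major*10000 + minor*100 + patch,
    # e.g. 21903 -> "2.19.3".
    major, rest = divmod(code, 10000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"

def query_nccl_version(so_file: str) -> str:
    # Load the shared object the same way pynccl.py does.
    lib = ctypes.CDLL(so_file)
    ver = ctypes.c_int()
    # ncclGetVersion(int*) is part of the public NCCL API.
    lib.ncclGetVersion(ctypes.byref(ver))
    return decode_nccl_version(ver.value)

# Example (path from the image above):
# query_nccl_version("/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2")
```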
The background:
In distributed inference, https://github.com/vllm-project/vllm uses nccl together with CUDA graphs. We capture about 30 graphs with different batch sizes. With pytorch 2.1.2 (which bundles nccl==2.18.3), the memory overhead is nearly zero (about 10MB per graph, and sometimes zero); after upgrading to pytorch 2.2.0 (which bundles nccl==2.19.3), the overhead is more than 100MB per graph.
We spent more than a week (to be honest, more time than one would feel comfortable with) investigating the issue. We initially suspected pytorch, but eventually traced the problem to the nccl library.
For more code on measuring the memory overhead, please check vllm-project/vllm#3442 (comment) .
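For reference, the per-graph numbers above come from free-memory snapshots (e.g. torch.cuda.mem_get_info() or cudaMemGetInfo) taken before capture and after each graph capture. A minimal sketch of that bookkeeping (the helper name is ours, not from the vllm code):

```python
def per_graph_overhead_mib(free_before: int, free_after_each: list[int]) -> list[float]:
    """Given GPU free memory in bytes sampled once before any capture and
    then after each CUDA graph capture, return the extra memory each
    capture consumed, in MiB."""
    overheads = []
    prev = free_before
    for free in free_after_each:
        overheads.append((prev - free) / (1 << 20))
        prev = free
    return overheads

# With nccl==2.18.3 the deltas are ~10 MiB; with nccl==2.19.3 they exceed 100 MiB:
# per_graph_overhead_mib(free_before, [snap1, snap2, ...])
```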
It would be very helpful if the nccl team could point out the root cause of the memory overhead, and any potential knobs to control it (e.g. via environment variables). The problem occurs with both nccl==2.19.3 and nccl==2.20.5.
Thank you for your time.