Hi, I would like to report a memory issue with nccl. A reproducible example is attached below:
In a gcp g2-standard-24 instance (with 2 L4 GPUs):
docker pull us-central1-docker.pkg.dev/vllm-405802/vllm-ci-test-repo/vllm-test:a3c2340ae36ce8ee782691d30111377eaf7ae6ce
docker run --gpus all --shm-size=2g -it us-central1-docker.pkg.dev/vllm-405802/vllm-ci-test-repo/vllm-test:a3c2340ae36ce8ee782691d30111377eaf7ae6ce /bin/bash
# inside docker
cd /vllm-workspace/tests/distributed
export NCCL_DEBUG=TRACE
TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf pytest -v -s --forked test_basic_distributed_correctness.py
Note that the code manually links against a pre-downloaded nccl==2.18.3. There is also an nccl==2.19.3 available inside the image, at /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 .
By adding the following line in /vllm-workspace/vllm/model_executor/parallel_utils/pynccl.py, before nccl = ctypes.CDLL(so_file):
so_file = "/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2"
we can force the program to use nccl 2.19.3, and it then fails with an OOM error.
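To confirm which library actually got loaded, one can query the version through the public ncclGetVersion API via ctypes. A minimal sketch (the so_file path is the one from the image above; the decoding assumes NCCL >= 2.9, where the version code is major*10000 + minor*100 + patch):

```python
import ctypes

def decode_nccl_version(code: int) -> str:
    # For NCCL >= 2.9, the version code is major*10000 + minor*100 + patch,
    # e.g. 21903 -> "2.19.3".
    major, rest = divmod(code, 10000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"

def query_nccl_version(so_file: str) -> str:
    # Load the shared object the same way pynccl.py does.
    lib = ctypes.CDLL(so_file)
    ver = ctypes.c_int()
    # ncclGetVersion(int*) is part of the public NCCL API.
    lib.ncclGetVersion(ctypes.byref(ver))
    return decode_nccl_version(ver.value)

# Example (path from the image above):
# query_nccl_version("/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2")
```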
The background:
In distributed inference, https://github.com/vllm-project/vllm uses nccl together with CUDA graphs. We capture about 30 graphs with different batch sizes. With pytorch 2.1.2 (which bundles nccl==2.18.3), the memory overhead is nearly zero (about 10MB per graph, and sometimes zero); after upgrading to pytorch 2.2.0 (which bundles nccl==2.19.3), the overhead is more than 100MB per graph.
We spent more than a week (to be honest, more time than one would feel comfortable with) investigating the issue. We initially suspected pytorch, but eventually traced the problem to the nccl library.
For more code on measuring the memory overhead, please check vllm-project/vllm#3442 (comment) .
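For reference, the per-graph numbers above come from free-memory snapshots (e.g. torch.cuda.mem_get_info() or cudaMemGetInfo) taken before capture and after each graph capture. A minimal sketch of that bookkeeping (the helper name is ours, not from the vllm code):

```python
def per_graph_overhead_mib(free_before: int, free_after_each: list[int]) -> list[float]:
    """Given GPU free memory in bytes sampled once before any capture and
    then after each CUDA graph capture, return the extra memory each
    capture consumed, in MiB."""
    overheads = []
    prev = free_before
    for free in free_after_each:
        overheads.append((prev - free) / (1 << 20))
        prev = free
    return overheads

# With nccl==2.18.3 the deltas are ~10 MiB; with nccl==2.19.3 they exceed 100 MiB:
# per_graph_overhead_mib(free_before, [snap1, snap2, ...])
```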
It would be very helpful if the nccl team could point out the root cause of the memory overhead, and any potential knobs to control it (e.g. via environment variables). The problem occurs with both nccl==2.19.3 and nccl==2.20.5.
Thank you for your time.