[train] Refactor `AcceleratorSetupCallback` to use `before_init_train_context` #56509
Conversation
…ntext Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>
Code Review
This pull request refactors AcceleratorSetupCallback to use the before_init_train_context hook instead of after_worker_group_start. This change is crucial for correctly setting up the CUDA context on workers before the training context is initialized, which resolves an import deserialization issue with PyTorch. The refactoring correctly passes the list of workers down through _maybe_share_cuda_visible_devices, _share_cuda_visible_devices, and _share_accelerator_ids, and updates the remote execution calls accordingly. My feedback includes a minor style improvement for a docstring. As noted in the PR description, tests will need to be updated to reflect these changes, as the current tests for AcceleratorSetupCallback will likely fail.
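As a rough illustration of the sharing chain the review describes (`before_init_train_context` calling down through `_maybe_share_cuda_visible_devices`, `_share_cuda_visible_devices`, and `_share_accelerator_ids`), here is a minimal, hypothetical sketch. The `Worker` class, method signatures, and env handling are assumptions for illustration, not the actual Ray source:

```python
from collections import defaultdict


class Worker:
    """Hypothetical stand-in for a Ray Train worker actor handle."""

    def __init__(self, node_id, gpu_ids):
        self.node_id = node_id
        self.gpu_ids = gpu_ids
        self.env = {}  # stand-in for the worker's os.environ


class AcceleratorSetupCallback:
    def before_init_train_context(self, workers):
        # Runs before WorkerGroup._init_train_context_on_workers, so the env
        # var is in place before deserialization-time torch imports happen.
        self._maybe_share_cuda_visible_devices(workers)

    def _maybe_share_cuda_visible_devices(self, workers):
        if any(w.gpu_ids for w in workers):
            self._share_cuda_visible_devices(workers)

    def _share_cuda_visible_devices(self, workers):
        self._share_accelerator_ids(workers, env_var="CUDA_VISIBLE_DEVICES")

    def _share_accelerator_ids(self, workers, env_var):
        # Union the accelerator IDs of all workers colocated on a node,
        # then set the shared env var on each of those workers.
        ids_per_node = defaultdict(set)
        for w in workers:
            ids_per_node[w.node_id].update(w.gpu_ids)
        for w in workers:
            w.env[env_var] = ",".join(sorted(ids_per_node[w.node_id]))


# Two workers on one node, each initially seeing only its own GPU:
workers = [Worker("node1", ["0"]), Worker("node1", ["1"])]
AcceleratorSetupCallback().before_init_train_context(workers)
print(workers[0].env["CUDA_VISIBLE_DEVICES"])  # "0,1"
```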
justinvyu
left a comment
I feel this fix could be related to the ordering of the TorchBackend setup and the cuda visible device sharing callback. But the AcceleratorSetupCallback already happens before the BackendSetupCallback based on the default callback ordering:
Another hypothesis I have is that the first torch import on the Worker actor happens on init_train_context, through the deserialization of something in the train run context depending on torch.
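This hypothesis can be demonstrated without torch or Ray: unpickling an object defined in a module re-imports that module, running any import-time side effects with whatever environment is current at deserialization time. The sketch below uses a fake env var (`FAKE_CUDA_INITIALIZED_WITH`) as a stand-in for CUDA locking in `CUDA_VISIBLE_DEVICES`; the module and variable names are hypothetical:

```python
import os
import pickle
import sys
import tempfile
import textwrap

# A stand-in for a user module (like helper.py) with an import-time side
# effect, analogous to calling torch.cuda.is_available() at import level.
module_src = textwrap.dedent("""
    import os
    # Import-time side effect: capture ("lock in") the current env value.
    os.environ["FAKE_CUDA_INITIALIZED_WITH"] = os.environ.get(
        "CUDA_VISIBLE_DEVICES", ""
    )

    def noop(batch):
        return batch
""")

tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "user_helper.py"), "w") as f:
    f.write(module_src)
sys.path.insert(0, tmpdir)

# The "wrong" initial value, like the single device Ray Core assigns:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import user_helper  # noqa: E402

# Serialize a reference to the module-level function, like putting it
# in train_loop_config:
payload = pickle.dumps(user_helper.noop)

# Simulate a fresh worker process where the module is not yet imported:
del sys.modules["user_helper"]
os.environ.pop("FAKE_CUDA_INITIALIZED_WITH", None)

# Deserialization re-imports user_helper, re-running the import-time side
# effect with the env value that is current at that moment:
fn = pickle.loads(payload)
print(os.environ["FAKE_CUDA_INITIALIZED_WITH"])  # "0", the stale value
```

This mirrors the bug: if the env var is only corrected after deserialization, the import-time side effect has already captured the stale value.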
…lerator-callback
More minimal repro without Ray datasets confusion:

```python
import os

os.environ["RAY_TRAIN_V2_ENABLED"] = "1"

import ray

ray.init(ignore_reinit_error=True)

from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

from helper import noop


def train_func():
    print(os.environ["CUDA_VISIBLE_DEVICES"])
    # Capturing in the train function scope only doesn't fail:
    print(noop)
    ...


trainer = TorchTrainer(
    train_func,
    # This fails:
    # train_loop_config={"asdf": noop},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
trainer.fit()
```

`helper.py`:

```python
import torch

torch.cuda.is_available()


def noop(batch):
    return batch
```
More minimal repro with just torch, mimicking what happens on the worker actor initialization:

```python
import os

import torch

# Fresh RayTrainWorker state at the beginning, ex: CUDA_VISIBLE_DEVICES=0
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
print(f"BEFORE SETTING! {os.environ['CUDA_VISIBLE_DEVICES']=}")

# init_train_context(train_run_context) gets called and a bunch of imports
# happen on deserialization.
# A local module import with a torch.cuda.is_available() call initializes
# CUDA using this incorrect CUDA_VISIBLE_DEVICES, which "locks in" the invalid
# state and won't be re-initialized.
print(f"{torch.cuda.is_available()=}")

# AcceleratorSetupCallback updates the CUDA_VISIBLE_DEVICES AFTER the CUDA init
# has already happened. Ex: CUDA_VISIBLE_DEVICES=0,1
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
print(f"AFTER SETTING! {os.environ['CUDA_VISIBLE_DEVICES']=}")

# Setting the CUDA device now errors, probably an assertion that fails due to
# the mismatch between the "locked in" state with the old CUDA_VISIBLE_DEVICES
# and the updated CUDA_VISIBLE_DEVICES:
# torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization
# with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at
# "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.
# device=, num_gpus=
device = "cuda:0"
print(f"{device=}")
torch.cuda.set_device(device)
```
…lerator-callback
Can you also describe why we don't fix with
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…_context` (ray-project#56509)

This fixes an issue in which the CUDA context is not properly configured during import deserialization.

```
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":49, please report a bug to PyTorch. device=1, num_gpus=
```

The fix is to update the `CUDA_VISIBLE_DEVICES` sharing logic to be implemented in `before_init_train_context` instead of `after_worker_group_start`, so that torch.cuda initialization happens after the environment variable is set up properly.

---------

Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: zac <zac@anyscale.com>
This fixes an issue in which the CUDA context is not properly configured during import deserialization.
Context
The relevant logic happens in `WorkerGroup._start_impl`:

`ray/python/ray/train/v2/_internal/execution/worker_group/worker_group.py`, lines 328 to 347 in ce4e473
The logic is as follows:

1. `WorkerGroupCallback.before_init_train_context`
2. `WorkerGroup._init_train_context_on_workers`
3. `WorkerGroupCallback.after_worker_group_start`

Problem
The error occurs when `CUDA_VISIBLE_DEVICES` is not properly configured before torch.cuda initialization happens (when the `TrainContext` is initialized). `torch.cuda.is_available()` forces CUDA to be initialized, which reads the `CUDA_VISIBLE_DEVICES` at that time and locks that state in.

Here's the order of events of the original issue:

1. The worker actor starts with `CUDA_VISIBLE_DEVICES=X`, since Ray Core sets the environment variable automatically.
2. The user creates a `TrainRunContext`, which holds user code such as `datasets` and `train_loop_config`, which can depend on a user module dependency.
3. `init_train_context` deserializes all of the arguments on the `RayTrainWorker`, which triggers a bunch of imports, including torch and the user modules.
4. If a user module calls `torch.cuda.is_available()` at the import level, then the CUDA initialization locks in the `CUDA_VISIBLE_DEVICES=X` state.

Solution
The fix is to update the `CUDA_VISIBLE_DEVICES` sharing logic to be implemented in `before_init_train_context` instead of `after_worker_group_start`, so that any calls to `torch.cuda` will happen after the devices are set up properly.

Alternative Solution
There is another option: set the `EXPERIMENTAL_NOSET_CUDA` environment variable on the `TrainWorker`s, so that when they are first scheduled they are not restricted to just the single GPU device. However, this would also expose them to more devices, which may not be desired if the user wants to restrict the GPU devices to those required for the training job. The solution implemented in this PR gives the least access while still solving the problem.

Repro
Run on a GPU node with multiple GPUs with `repro.py`:

```shell
RAY_TRAIN_V2_ENABLED=1 python repro.py
```

Failure
Repro Notes: The repro requires the following characteristics, and will not result in failure if any of these are false:

- `use_gpu=True`
- `num_workers >= 2`
- `torch.cuda.is_available()` (`torch.cuda` instantiation) on import
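Putting the pieces together, the essence of the fix is an ordering change: the env var sharing must run before `init_train_context` triggers deserialization-time imports. Below is a minimal, torch-free sketch of that ordering; all class and function names are hypothetical stand-ins for the real Ray Train hooks:

```python
import os


class Callback:
    """Hypothetical hook interface, mirroring the worker group startup order."""

    def before_init_train_context(self, workers): ...

    def after_worker_group_start(self, workers): ...


class AcceleratorSetup(Callback):
    """Shares accelerator IDs so each worker sees all GPUs on its node."""

    def before_init_train_context(self, workers):
        # e.g. broadcast the union of colocated workers' GPU IDs.
        os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"


def init_train_context():
    # Deserializing the train run context triggers imports, which may
    # initialize CUDA and lock in the env var's value at this moment.
    return os.environ.get("CUDA_VISIBLE_DEVICES")


callbacks = [AcceleratorSetup()]
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # initial value set by Ray Core

# Worker group startup order after this PR: share devices first...
for cb in callbacks:
    cb.before_init_train_context(workers=[])

# ...so deserialization-time CUDA init sees the corrected value.
locked_in = init_train_context()
print(locked_in)  # "0,1"
```

Before this PR, the sharing happened in `after_worker_group_start`, i.e. after `init_train_context`, so the stale single-device value would be locked in instead.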