
Cannot use Unsloth DDP #421

@magnusdtd

Describe the bug
I installed uv on Kaggle to run my script, which uses DDP to fine-tune the Qwen3-VL model on multiple GPUs (Tesla T4 x2).

The error is raised by the following code:

model, tokenizer = FastVisionModel.from_pretrained(
    model_id,
    load_in_4bit = True,
    use_gradient_checkpointing = "unsloth",
)

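Error: NameError: name 'dist' is not defined. Did you mean: 'dict'?
I think this error occurs because torch.distributed is never imported as dist in unsloth_zoo/utils.py, where distributed_function calls dist.broadcast_object_list (line 119 in the traceback below).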
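To illustrate the suspected failure mode, here is a minimal sketch that needs neither torch nor unsloth_zoo. It builds a stand-in module whose function uses the name dist without importing it, shows the resulting NameError, and then defines the missing name in that module's namespace. The names fake_utils, FakeDist, and inject_missing_global are hypothetical and exist only for this example; this is not the actual unsloth_zoo source.

```python
import types

# Stand-in for a library module whose function uses `dist` without importing it.
# Hypothetical source text, not copied from unsloth_zoo.
_SOURCE = '''
def distributed_function(obj_list):
    dist.broadcast_object_list(obj_list, src = 0)
    return obj_list
'''


def make_broken_module():
    """Create a module object whose function references an undefined global."""
    module = types.ModuleType("fake_utils")
    exec(_SOURCE, module.__dict__)
    return module


class FakeDist:
    """Minimal stand-in for torch.distributed; the broadcast is a no-op here."""

    @staticmethod
    def broadcast_object_list(obj_list, src = 0):
        return None


def inject_missing_global(module, name, value):
    """Define `name` in the module's namespace only if it is not already there."""
    if not hasattr(module, name):
        setattr(module, name, value)
        return True
    return False


def demo():
    module = make_broken_module()
    try:
        module.distributed_function(["x"])
        failed = False
    except NameError:
        # Same class of error as in the traceback: `dist` is not defined.
        failed = True
    patched = inject_missing_global(module, "dist", FakeDist)
    result = module.distributed_function(["x"])
    return failed, patched, result


if __name__ == "__main__":
    print(demo())
```

If the diagnosis is right, a possible workaround (untested, and an assumption on my side) would be to run `import torch.distributed as dist`, `import unsloth_zoo.utils`, and `unsloth_zoo.utils.dist = dist` before calling FastVisionModel.from_pretrained. The proper fix would be adding the import in the library itself.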

Command:

!uv run torchrun --nproc_per_node=2 main.py --mode train \
    --model_id unsloth/Qwen3-VL-4B-Instruct-unsloth-bnb-4bit \
    --save_dir Medico2025-Qwen3-VL-4B-Instruct-QLoRA-SFT \
    --save_repo_id magnusdtd/Medico2025-Qwen3-VL-4B-Instruct-QLoRA-SFT \
    --token ${HF_TOKEN} \
    --val_size 0.01 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --per_device_eval_batch_size 8 \
    --eval_accumulation_steps 4 \
    --max_steps -1 \
    --num_train_epochs 1 \
    --save_steps 100 \
    --eval_steps 100 \
    --early_stopping_patience 5

Environment

  • unsloth==2025.11.6
  • unsloth-zoo==2025.12.7
  • torch==2.9.1
  • CUDA: 12.8
  • Python: 3.12
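For anyone trying to reproduce this, the package versions above can be checked from inside the same virtual environment with a small script like the one below. It is a sketch that only uses the standard library; the helper name installed_versions is made up for this example, and the package list is taken from the environment section above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = installed_versions(["unsloth", "unsloth-zoo", "torch"])
    for name, version in report.items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Running it with `uv run python` makes sure the versions come from the project's .venv rather than the Kaggle base image.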

Traceback

W1230 15:37:24.137000 254 torch/distributed/run.py:803] 
W1230 15:37:24.137000 254 torch/distributed/run.py:803] *****************************************
W1230 15:37:24.137000 254 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1230 15:37:24.137000 254 torch/distributed/run.py:803] *****************************************
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/torchao/quantization/quant_api.py:2511: SyntaxWarning: invalid escape sequence '\.'
  """Configuration class for applying different quantization configs to modules or parameters based on their fully qualified names (FQNs).
/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/torchao/quantization/quant_api.py:2511: SyntaxWarning: invalid escape sequence '\.'
  """Configuration class for applying different quantization configs to modules or parameters based on their fully qualified names (FQNs).
🦥 Unsloth Zoo will now patch everything to make training faster!
🦥 Unsloth Zoo will now patch everything to make training faster!
DDP initialized with world_size=2
README.md: 4.42kB [00:00, 10.8MB/s]
data/train-00000-of-00001.parquet: 100%|███| 14.9M/14.9M [00:00<00:00, 14.9MB/s]
data/test-00000-of-00001.parquet: 100%|████| 1.65M/1.65M [00:00<00:00, 8.95MB/s]
Generating train split: 100%|█| 143594/143594 [00:00<00:00, 742410.60 examples/s
Generating test split: 100%|███| 15955/15955 [00:00<00:00, 715884.00 examples/s]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/kaggle/working/Medico2025/main.py", line 74, in <module>
[rank1]:     main(parse_args())
[rank1]:   File "/kaggle/working/Medico2025/main.py", line 36, in main
[rank1]:     train(
[rank1]:   File "/kaggle/working/Medico2025/src/train.py", line 38, in train
[rank1]:     model, tokenizer = FastVisionModel.from_pretrained(
[rank1]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth/models/loader.py", line 1094, in from_pretrained
[rank1]:     model_types, supports_sdpa = unsloth_compile_transformers(
[rank1]:                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth/models/_utils.py", line 1875, in unsloth_compile_transformers
[rank1]:     _unsloth_compile_transformers(
[rank1]:   File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 2258, in unsloth_compile_transformers
[rank1]:     patch_lora_forwards(torch_compile_options)
[rank1]:   File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 1855, in patch_lora_forwards
[rank1]:     forward = create_new_function(
[rank1]:               ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 572, in create_new_function
[rank1]:     compile_folder, UNSLOTH_COMPILE_USE_TEMP = get_compile_folder(use_tempfile = False)
[rank1]:                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 311, in get_compile_folder
[rank1]:     location, UNSLOTH_COMPILE_USE_TEMP = distributed_function(2, _get_compile_folder, use_tempfile)
[rank1]:                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/utils.py", line 119, in distributed_function
[rank1]:     dist.broadcast_object_list(obj_list, src = 0)
[rank1]:     ^^^^
[rank1]: NameError: name 'dist' is not defined. Did you mean: 'dict'?
[rank0]: Traceback (most recent call last):
[rank0]:   File "/kaggle/working/Medico2025/main.py", line 74, in <module>
[rank0]:     main(parse_args())
[rank0]:   File "/kaggle/working/Medico2025/main.py", line 36, in main
[rank0]:     train(
[rank0]:   File "/kaggle/working/Medico2025/src/train.py", line 38, in train
[rank0]:     model, tokenizer = FastVisionModel.from_pretrained(
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth/models/loader.py", line 1094, in from_pretrained
[rank0]:     model_types, supports_sdpa = unsloth_compile_transformers(
[rank0]:                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth/models/_utils.py", line 1875, in unsloth_compile_transformers
[rank0]:     _unsloth_compile_transformers(
[rank0]:   File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 2258, in unsloth_compile_transformers
[rank0]:     patch_lora_forwards(torch_compile_options)
[rank0]:   File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 1855, in patch_lora_forwards
[rank0]:     forward = create_new_function(
[rank0]:               ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 572, in create_new_function
[rank0]:     compile_folder, UNSLOTH_COMPILE_USE_TEMP = get_compile_folder(use_tempfile = False)
[rank0]:                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 311, in get_compile_folder
[rank0]:     location, UNSLOTH_COMPILE_USE_TEMP = distributed_function(2, _get_compile_folder, use_tempfile)
[rank0]:                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/utils.py", line 119, in distributed_function
[rank0]:     dist.broadcast_object_list(obj_list, src = 0)
[rank0]:     ^^^^
[rank0]: NameError: name 'dist' is not defined. Did you mean: 'dict'?
[rank0]:[W1230 15:38:02.207200923 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1230 15:38:03.734000 254 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 260 closing signal SIGTERM
E1230 15:38:03.799000 254 torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 1 (pid: 261) of binary: /kaggle/working/Medico2025/.venv/bin/python
Traceback (most recent call last):
  File "/kaggle/working/Medico2025/.venv/bin/torchrun", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 936, in main
    run(args)
  File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-12-30_15:38:03
  host      : 80b701afd174
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 261)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
