W1230 15:37:24.137000 254 torch/distributed/run.py:803]
W1230 15:37:24.137000 254 torch/distributed/run.py:803] *****************************************
W1230 15:37:24.137000 254 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1230 15:37:24.137000 254 torch/distributed/run.py:803] *****************************************
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/torchao/quantization/quant_api.py:2511: SyntaxWarning: invalid escape sequence '\.'
"""Configuration class for applying different quantization configs to modules or parameters based on their fully qualified names (FQNs).
/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/torchao/quantization/quant_api.py:2511: SyntaxWarning: invalid escape sequence '\.'
"""Configuration class for applying different quantization configs to modules or parameters based on their fully qualified names (FQNs).
🦥 Unsloth Zoo will now patch everything to make training faster!
🦥 Unsloth Zoo will now patch everything to make training faster!
DDP initialized with world_size=2
README.md: 4.42kB [00:00, 10.8MB/s]
data/train-00000-of-00001.parquet: 100%|███| 14.9M/14.9M [00:00<00:00, 14.9MB/s]
data/test-00000-of-00001.parquet: 100%|████| 1.65M/1.65M [00:00<00:00, 8.95MB/s]
Generating train split: 100%|█| 143594/143594 [00:00<00:00, 742410.60 examples/s
Generating test split: 100%|███| 15955/15955 [00:00<00:00, 715884.00 examples/s]
[rank1]: Traceback (most recent call last):
[rank1]: File "/kaggle/working/Medico2025/main.py", line 74, in <module>
[rank1]: main(parse_args())
[rank1]: File "/kaggle/working/Medico2025/main.py", line 36, in main
[rank1]: train(
[rank1]: File "/kaggle/working/Medico2025/src/train.py", line 38, in train
[rank1]: model, tokenizer = FastVisionModel.from_pretrained(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth/models/loader.py", line 1094, in from_pretrained
[rank1]: model_types, supports_sdpa = unsloth_compile_transformers(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth/models/_utils.py", line 1875, in unsloth_compile_transformers
[rank1]: _unsloth_compile_transformers(
[rank1]: File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 2258, in unsloth_compile_transformers
[rank1]: patch_lora_forwards(torch_compile_options)
[rank1]: File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 1855, in patch_lora_forwards
[rank1]: forward = create_new_function(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 572, in create_new_function
[rank1]: compile_folder, UNSLOTH_COMPILE_USE_TEMP = get_compile_folder(use_tempfile = False)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 311, in get_compile_folder
[rank1]: location, UNSLOTH_COMPILE_USE_TEMP = distributed_function(2, _get_compile_folder, use_tempfile)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/utils.py", line 119, in distributed_function
[rank1]: dist.broadcast_object_list(obj_list, src = 0)
[rank1]: ^^^^
[rank1]: NameError: name 'dist' is not defined. Did you mean: 'dict'?
[rank0]: Traceback (most recent call last):
[rank0]: File "/kaggle/working/Medico2025/main.py", line 74, in <module>
[rank0]: main(parse_args())
[rank0]: File "/kaggle/working/Medico2025/main.py", line 36, in main
[rank0]: train(
[rank0]: File "/kaggle/working/Medico2025/src/train.py", line 38, in train
[rank0]: model, tokenizer = FastVisionModel.from_pretrained(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth/models/loader.py", line 1094, in from_pretrained
[rank0]: model_types, supports_sdpa = unsloth_compile_transformers(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth/models/_utils.py", line 1875, in unsloth_compile_transformers
[rank0]: _unsloth_compile_transformers(
[rank0]: File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 2258, in unsloth_compile_transformers
[rank0]: patch_lora_forwards(torch_compile_options)
[rank0]: File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 1855, in patch_lora_forwards
[rank0]: forward = create_new_function(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 572, in create_new_function
[rank0]: compile_folder, UNSLOTH_COMPILE_USE_TEMP = get_compile_folder(use_tempfile = False)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 311, in get_compile_folder
[rank0]: location, UNSLOTH_COMPILE_USE_TEMP = distributed_function(2, _get_compile_folder, use_tempfile)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/unsloth_zoo/utils.py", line 119, in distributed_function
[rank0]: dist.broadcast_object_list(obj_list, src = 0)
[rank0]: ^^^^
[rank0]: NameError: name 'dist' is not defined. Did you mean: 'dict'?
[rank0]:[W1230 15:38:02.207200923 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1230 15:38:03.734000 254 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 260 closing signal SIGTERM
E1230 15:38:03.799000 254 torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 1 (pid: 261) of binary: /kaggle/working/Medico2025/.venv/bin/python
Traceback (most recent call last):
File "/kaggle/working/Medico2025/.venv/bin/torchrun", line 10, in <module>
sys.exit(main())
^^^^^^
File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 936, in main
run(args)
File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 927, in run
elastic_launch(
File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/kaggle/working/Medico2025/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-12-30_15:38:03
host : 80b701afd174
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 261)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Describe the bug
I installed uv on Kaggle to run my script, which uses DDP to fine-tune the Qwen3-VL model on multiple GPUs (Tesla T4 x2).
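The error is raised in distributed_function (unsloth_zoo/utils.py, line 119, per the traceback), at the call dist.broadcast_object_list(obj_list, src = 0):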
Error: NameError: name 'dist' is not defined. Did you mean: 'dict'?
I think this error occurs due to a missing import of torch.distributed at this line.
Command:
Environment
Traceback
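To illustrate the suspected cause from "Describe the bug", here is a minimal sketch of how a function that uses the global name dist without importing it raises this NameError, and how defining that name on the module makes the call go through. Everything in it is a hypothetical stand-in: buggy_utils, FakeDist, and call_reports_name_error are made up for the demonstration and are not the actual unsloth_zoo code, and FakeDist replaces torch.distributed only so the sketch runs without GPUs.

```python
import types

# Hypothetical stand-in for a library module whose function refers to the
# global name `dist` without ever importing it (same pattern as the traceback).
buggy_utils = types.ModuleType("buggy_utils")
exec(
    "def distributed_function(obj_list):\n"
    "    dist.broadcast_object_list(obj_list, src=0)\n"
    "    return obj_list\n",
    buggy_utils.__dict__,
)


class FakeDist:
    """Stand-in for torch.distributed; it only records the calls it receives."""

    calls = []

    @staticmethod
    def broadcast_object_list(obj_list, src=0):
        FakeDist.calls.append((list(obj_list), src))


def call_reports_name_error(module):
    """Call the function; return the NameError message, or None on success."""
    try:
        module.distributed_function(["compile_folder"])
    except NameError as exc:
        return str(exc)
    return None


# Before: `dist` is not defined in the module's globals, so the call fails.
before = call_reports_name_error(buggy_utils)

# Workaround sketch: define the missing global on the module itself.
# Assumption (untested here): the real-world analogue would be
#   import torch.distributed as dist
#   import unsloth_zoo.utils
#   unsloth_zoo.utils.dist = dist
buggy_utils.dist = FakeDist

# After: the global lookup succeeds and the call completes.
after = call_reports_name_error(buggy_utils)

print("before:", before)
print("after: ", after)
```

If the real module behaves the same way, assigning the name before FastVisionModel.from_pretrained is called might serve as a temporary workaround, but the proper fix would be adding the import in the library itself.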