[Bug] ROCm hip_global.cpp Module Error. #3526

@CarlosR759

Description

  1. Did you update? pip install --upgrade unsloth unsloth_zoo
    Yes. It produces another error, which I think is worse:
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
[W1029 14:05:37.832849026 OperatorEntry.cpp:218] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: quantized::embedding_bag_byte_unpack(Tensor weight) -> Tensor
    registered at /pytorch/aten/src/ATen/native/quantized/library.cpp:4
  dispatch key: CUDA
  previous kernel: registered at /pytorch/aten/src/ATen/native/quantized/cpu/qembeddingbag_unpack.cpp:265
       new kernel: registered at /build/python-pytorch/src/pytorch-rocm/aten/src/ATen/native/quantized/hip/EmbeddingBag.hip:566 (function operator())
Key already registered with the same priority: CUDA
[W1029 14:05:38.628794535 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())

  2. Number GPUs used, use nvidia-smi
    One AMD RX series GPU
  3. Which Unsloth version, TRL version, transformers version, PyTorch version?
    PyTorch ROCm build; the packages in my uv venv are listed below
  4. Which trainer? SFTTrainer, GRPOTrainer etc
    SFTTrainer

Hi, I'm getting errors when using ROCm to run fine-tuning with my code. When Unsloth is about to start fine-tuning, I get this error in the output:

🦥 Unsloth Zoo will now patch everything to make training faster!
You are going to fine tune your model ^^!
Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.
==((====))==  Unsloth 2025.10.11: Fast Qwen3 patching. Transformers: 4.57.1.
   \\   /|    AMD Radeon Graphics. Num GPUs = 1. Max memory: 15.984 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+rocm6.4. ROCm Toolkit: 6.4.43482-0f2d60242. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00,  2.23s/it]
Unsloth: Will map <|im_end|> to EOS = <|im_end|>.
Unsloth 2025.10.11 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
Model device: cuda:0
num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
[datasets.arrow_dataset|WARNING]num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
[datasets.arrow_dataset|WARNING]num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.
The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1 | Num Epochs = 3 | Total steps = 3
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 2 x 1) = 8
 "-____-"     Trainable parameters = 33,030,144 of 4,055,498,240 (0.81% trained)
  0%|                                                                                                                  | 0/3 [00:00<?, ?it/s]
:0:/longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/clr/hipamd/src/hip_global.cpp:158 : 24004827963 us:  Module not initialized

Where the error is basically this:

:0:/longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/clr/hipamd/src/hip_global.cpp:158 : 24004827963 us:  Module not initialized

I installed following the Unsloth guide for AMD GPUs, https://docs.unsloth.ai/new/fine-tuning-llms-on-amd-gpus-with-unsloth, with the only difference that I used uv. I set up Python 3.13 in my uv environment and installed everything with uv pip install, passing the packages the Unsloth documentation lists, in the order it lists them.

After that, before posting, I followed the suggestion from the issue template and ran uv pip install --upgrade unsloth unsloth_zoo, but that switched the install over to CUDA, as you can see at the beginning of the post.
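A quick way to confirm which backend the installed torch wheel targets, before and after an upgrade, is to inspect torch.version. This is only a minimal sketch: describe_torch_build is a hypothetical helper name, not part of torch or Unsloth, and it reports the build type without diagnosing the "Module not initialized" error itself.

```python
import importlib.util


def describe_torch_build():
    """Return a small dict describing which GPU backend the installed torch targets.

    Hypothetical helper for illustration; it only reads attributes torch exposes.
    """
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch

    return {
        "installed": True,
        "version": torch.__version__,
        # torch.version.hip is a version string on ROCm builds, None otherwise.
        "hip": getattr(torch.version, "hip", None),
        # torch.version.cuda is a version string on CUDA builds, None otherwise.
        "cuda": getattr(torch.version, "cuda", None),
        # ROCm builds also report devices through the torch.cuda namespace.
        "gpu_visible": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch_build())
```

If "hip" comes back as None after the upgrade, the ROCm wheel was most likely replaced by a CUDA or CPU wheel.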

Before running uv pip install --upgrade unsloth unsloth_zoo, these were the packages in my uv environment:

accelerate==1.11.0
aiohappyeyeballs==2.6.1
aiohttp==3.13.2
aiosignal==1.4.0
anyio==4.11.0
attrs==25.4.0
bitsandbytes @ https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl
certifi==2025.10.5
charset-normalizer==3.4.4
datasets==4.3.0
diffusers==0.35.2
dill==0.4.0
docstring-parser==0.17.0
filelock==3.20.0
frozenlist==1.8.0
fsspec==2025.9.0
h11==0.16.0
hf-transfer==0.1.9
hf-xet==1.2.0
httpcore==1.0.9
httpx==0.28.1
huggingface-hub==0.36.0
idna==3.11
importlib-metadata==8.7.0
jinja2==3.1.6
markdown-it-py==4.0.0
markupsafe==3.0.3
mdurl==0.1.2
mpmath==1.3.0
multidict==6.7.0
multiprocess==0.70.16
networkx==3.5
numpy==2.3.4
nvidia-cublas-cu12==12.8.4.1
nvidia-cuda-cupti-cu12==12.8.90
nvidia-cuda-nvrtc-cu12==12.8.93
nvidia-cuda-runtime-cu12==12.8.90
nvidia-cudnn-cu12==9.10.2.21
nvidia-cufft-cu12==11.3.3.83
nvidia-cufile-cu12==1.13.1.3
nvidia-curand-cu12==10.3.9.90
nvidia-cusolver-cu12==11.7.3.90
nvidia-cusparse-cu12==12.5.8.93
nvidia-cusparselt-cu12==0.7.1
nvidia-nccl-cu12==2.27.5
nvidia-nvjitlink-cu12==12.8.93
nvidia-nvshmem-cu12==3.3.20
nvidia-nvtx-cu12==12.8.90
packaging==25.0
pandas==2.3.3
peft==0.17.1
pillow==12.0.0
propcache==0.4.1
protobuf==6.33.0
psutil==7.1.2
pyarrow==22.0.0
pygments==2.19.2
python-dateutil==2.9.0.post0
pytorch-triton-rocm==3.4.0
pytz==2025.2
pyyaml==6.0.3
regex==2025.10.23
requests==2.32.5
rich==14.2.0
safetensors==0.6.2
sentencepiece==0.2.1
setuptools==80.9.0
shtab==1.7.2
six==1.17.0
sniffio==1.3.1
sympy==1.14.0
tokenizers==0.22.1
torch==2.8.0+rocm6.4
torchao==0.13.0+rocm6.4
torchaudio==2.8.0+rocm6.4
torchvision==0.23.0+rocm6.4
tqdm==4.67.1
transformers==4.57.1
triton==3.5.0
trl==0.23.0
typeguard==4.4.4
typing-extensions==4.15.0
tyro==0.9.35
tzdata==2025.2
unsloth @ git+https://github.com/unslothai/unsloth@5314c214d21a387791decc6b0f7715ebd7c1eeb7
unsloth-zoo @ git+https://github.com/unslothai/unsloth-zoo.git@f690a5aaa3eccab272f6b64c990a93a7a64a0b60
urllib3==2.5.0
wheel==0.45.1
xformers==0.0.32.post2
xxhash==3.6.0
yarl==1.22.0
zipp==3.23.0

As you can see, I seem to have all the dependencies needed for it to work, at least according to this page: https://docs.unsloth.ai/get-started/beginner-start-here/unsloth-requirements
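The list above contains both ROCm builds (torch==2.8.0+rocm6.4, pytorch-triton-rocm) and nvidia-*-cu12 packages. A rough sketch for spotting that kind of mix in a pip freeze listing follows. The function name and the name-based heuristics are assumptions made for illustration, and whether the mix is related to the error is not established here.

```python
def find_backend_packages(freeze_lines):
    """Group package names by the GPU backend their name or version suggests.

    Heuristic only: "rocm" in the name or version counts as ROCm; an
    "nvidia-" prefix or a "+cu" local version tag counts as CUDA.
    """
    found = {"rocm": [], "cuda": []}
    for line in freeze_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Handle both "name==version" and "name @ url" entries.
        if "==" in line:
            name, _, version = line.partition("==")
        else:
            name, _, version = line.partition(" @ ")
        name = name.strip().lower()
        version = version.lower()
        if "rocm" in name or "rocm" in version:
            found["rocm"].append(name)
        elif name.startswith("nvidia-") or "+cu" in version:
            found["cuda"].append(name)
    return found


# A few entries copied from the package list above.
sample = [
    "torch==2.8.0+rocm6.4",
    "pytorch-triton-rocm==3.4.0",
    "triton==3.5.0",
    "nvidia-cublas-cu12==12.8.4.1",
    "xformers==0.0.32.post2",
]
backends = find_backend_packages(sample)
print(backends)
```

To check a real environment, the output of uv pip freeze can be passed in as a list of lines.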

Here is the code on which I'm currently working:

from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer, SFTConfig
from unsloth import is_bfloat16_supported
from unsloth.trainer import TrainingArguments
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset
from transformers import EarlyStoppingCallback
from accelerate import Accelerator
import os
import sys

# Parameters for Unsloth fine-tuning. Change according to your needs; the defaults are okay.
max_seq_length = 2048
dtype = None
load_in_4bit = True  # Enables QLoRA; set to False to use LoRA instead


def main():
    print("You are going to fine tune your model ^^!")

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Qwen3-4B-unsloth-bnb-4bit",
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,  # Enables 4-bit quantization training
        load_in_8bit=False,  # Set to True to enable 8-bit quantization
        full_finetuning=False,  # Set to True to enable full fine-tuning
    )

    # Set up the chat template for the tokenizer
    tokenizer = get_chat_template(
        tokenizer,
        chat_template="chatml",
    )

    def formatting_prompts_func(examples):
        convos = []
        for messages in examples["messages"]:
            user_msg = next(
                (msg["content"] for msg in messages if msg["role"] == "user"), ""
            )
            assistant_msg = next(
                (msg["content"] for msg in messages if msg["role"] == "assistant"), ""
            )

            convos.append(
                [
                    {"role": "user", "content": user_msg},
                    {"role": "assistant", "content": assistant_msg},
                ]
            )
        texts = [
            tokenizer.decode(
                tokenizer.apply_chat_template(
                    convo, tokenizer=False, add_generation_prompt=False
                )
            )
            for convo in convos
        ]
        return {"text": texts}

    pass

    # data loading
    dataset = load_dataset(
        "json", data_files="data.json", split="train"
    )  # ,split = "train" needed for working with huggingface repos
    dataset = dataset.map(formatting_prompts_func, batched=True)
    

    # LoRA hyperparameters tuning
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,  # Lora Rank value
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
        ],
        lora_alpha=16,  # Should equal r, or be double r for more aggressive learning
        lora_dropout=0,  # Dropout for [Q]LoRA. Set to 0, change it if you suspect overfitting
        # use_gradient_checkpointing="False",  # True or "unsloth" for very long context
        random_state=3407,  # Seed for deterministic, reproducible training runs
        use_rslora=False,  # Set to True to enable rank-stabilized LoRA
        loftq_config=None,  # LoftQ configuration (None disables it)
    )

    model.to("cuda")
    print(f"Model device: {model.device}")

    trainer = SFTTrainer(
        args=SFTConfig(
            fp16_full_eval=True,
            per_device_eval_batch_size=2,
            eval_accumulation_steps=4,
            output_dir="training_checkpoints",  # Location for saved checkpoints. Needed for early stopping
            save_strategy="steps",  # Save the model every N steps
            save_steps=10,
            save_total_limit=1,  # Number of checkpoints kept on disk. A lower number reduces disk usage
            eval_strategy="steps",
            eval_steps=10,
            load_best_model_at_end=True,  # Load the best model when training ends
            metric_for_best_model="eval_loss",  # Metric used to pick the best model
            greater_is_better=False,  # False because a lower loss is better
        ),
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        eval_dataset=dataset,
    )
    """In the trainer above we evaluate on the same dataset used for
    training. That should not be the case when you are working
    with production models, where you should evaluate on a separate
    dataset to detect overfitting. This is just for learning how to work with unsloth ^^
    """

    early_stopping_callback = EarlyStoppingCallback(
        early_stopping_patience=10,  # Number of evaluations to wait while the eval loss doesn't improve
        early_stopping_threshold=0.03,  # Minimum eval-loss improvement that counts as progress
    )

    accelerator = Accelerator()
    model, trainer = accelerator.prepare(model, trainer)
    accelerator.wait_for_everyone()
    trainer.train()
    accelerator.end_training()

    model.save_pretrained("lora_model")
    tokenizer.save_pretrained("lora_model")
   
    print("done ^^")


if __name__ == "__main__":
    main()

So in the end as I said before, this is the error:

:0:/longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/clr/hipamd/src/hip_global.cpp:158 : 24004827963 us:  Module not initialized

Any help on this would be very much appreciated ^^
