DoLa decoding fails on cuda #31996

@kosstbarz

Description

DoLa decoding on a Mixtral model with a multi-GPU setup raises the following error:

Traceback (most recent call last):
  File "~/src/evaluation/test.py", line 8, in <module>
    generate_ids = model.generate(inputs.input_ids, max_length=200, dola_layers='low')
  File "~/src/evaluation/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "~/src/evaluation/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1921, in generate
    result = self._dola_decoding(
  File "~/src/evaluation/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2366, in _dola_decoding
    next_token_logits = _dola_select_contrast(
  File "~/src/evaluation/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 4440, in _dola_select_contrast
    kl1 = F.kl_div(log_softmax_mature_layer[None, :, :], avg_dist, reduction="none").mean(-1)
  File "~/src/evaluation/.venv/lib/python3.10/site-packages/torch/nn/functional.py", line 2988, in kl_div
    reduced = torch.kl_div(input, target, reduction_enum, log_target=log_target)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
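The failing call is `torch.nn.functional.kl_div`, which requires `input` (log-probabilities) and `target` (probabilities) to live on the same device; here the mature-layer log-softmax sits on `cuda:0` while the averaged premature-layer distribution ended up on the CPU. As a device-independent sketch of the same computation (hypothetical toy values, stdlib only — not the transformers implementation), the contrast score reduces to a KL divergence between the mature layer and an averaged distribution:

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of raw scores.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def kl_div(log_p, q):
    # Pointwise target * (log(target) - input), summed — the same
    # semantics as F.kl_div(input=log_p, target=q) with a sum reduction.
    return sum(qi * (math.log(qi) - lpi) for lpi, qi in zip(log_p, q) if qi > 0)

# Mature-layer log-probabilities vs. an averaged premature-layer
# distribution (toy numbers, for illustration only).
mature = log_softmax([2.0, 1.0, 0.5])
avg_dist = [0.2, 0.5, 0.3]
divergence = kl_div(mature, avg_dist)
```

In the PyTorch version, both tensors would additionally need a `.to(same_device)` before the call — which is exactly the invariant that breaks here when the model is sharded across GPUs with `device_map="auto"`.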

System Info

  • transformers version: 4.43.0.dev0
  • Platform: Linux-5.15.0-1042-oracle-x86_64-with-glibc2.31
  • Python version: 3.10.14
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: Yes
  • Using GPU in script?: Yes
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

This script exercises the new DoLa decoding feature with a multi-GPU setup. Save the following as test.py:

import transformers

model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
# device_map="auto" shards the model across all visible GPUs
model = transformers.AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

prompt = "Hey, are you conscious? Can you talk to me?\n\n"
inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(inputs.input_ids, max_length=200, dola_layers='low')
result = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(result)

Then run it with:

CUDA_VISIBLE_DEVICES=0,1,2,3 python test.py

Expected behavior

Generation completes without errors.
