Description
DoLa decoding on a Mixtral model with a multi-GPU setup returns an error:
Traceback (most recent call last):
File "~/src/evaluation/test.py", line 8, in <module>
generate_ids = model.generate(inputs.input_ids, max_length=200, dola_layers='low')
File "~/src/evaluation/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "~/src/evaluation/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1921, in generate
result = self._dola_decoding(
File "~/src/evaluation/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2366, in _dola_decoding
next_token_logits = _dola_select_contrast(
File "~/src/evaluation/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 4440, in _dola_select_contrast
kl1 = F.kl_div(log_softmax_mature_layer[None, :, :], avg_dist, reduction="none").mean(-1)
File "~/src/evaluation/.venv/lib/python3.10/site-packages/torch/nn/functional.py", line 2988, in kl_div
reduced = torch.kl_div(input, target, reduction_enum, log_target=log_target)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
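The failing call is the `F.kl_div` inside `_dola_select_contrast`, where the mature-layer log-probs and the averaged candidate distribution end up on different devices. A minimal standalone sketch of the same operation, with the operands explicitly aligned on one device (tensor names and shapes here are illustrative, not the library's internals):

```python
import torch
import torch.nn.functional as F

# Pick a GPU if one is visible, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative stand-ins for the mature-layer log-probs and the
# averaged distribution over candidate premature layers.
log_p = F.log_softmax(torch.randn(1, 4, 32), dim=-1).to(device)
avg_dist = torch.softmax(torch.randn(1, 4, 32), dim=-1)  # stays on CPU

# Moving both operands onto the same device avoids the RuntimeError.
kl = F.kl_div(log_p, avg_dist.to(log_p.device), reduction="none").mean(-1)
print(kl.shape)
```

With `device_map="auto"`, hidden states from different layers can live on different GPUs (or on CPU), so the fix presumably belongs inside `_dola_select_contrast` rather than in user code.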
System Info
- transformers version: 4.43.0.dev0
- Platform: Linux-5.15.0-1042-oracle-x86_64-with-glibc2.31
- Python version: 3.10.14
- Huggingface_hub version: 0.23.4
- Safetensors version: 0.4.3
- Accelerate version: 0.29.2
- Accelerate config: not found
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: Yes
- Using GPU in script?: Yes
- GPU type: NVIDIA A100-SXM4-80GB
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
This script uses the new DoLa decoding feature with a multi-GPU setup. Save the following as test.py:
import transformers

model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
# device_map="auto" shards the model across all visible GPUs
model = transformers.AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

prompt = "Hey, are you conscious? Can you talk to me?\n\n"
inputs = tokenizer(prompt, return_tensors="pt")

generate_ids = model.generate(inputs.input_ids, max_length=200, dola_layers='low')
result = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(result)
and run it with:
CUDA_VISIBLE_DEVICES=0,1,2,3 python test.py
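This is not a fix for the internal DoLa tensors, but a quick way to rule out a caller-side mismatch is to move every tensor in the tokenizer output onto one device before calling `generate()`. The `to_device` helper below is hypothetical (not part of transformers), shown on CPU tensors for illustration:

```python
import torch

def to_device(batch: dict, device: torch.device) -> dict:
    # Move every tensor in a tokenizer-style batch dict to one device,
    # passing non-tensor values through unchanged.
    return {k: v.to(device) if torch.is_tensor(v) else v for k, v in batch.items()}

batch = {"input_ids": torch.tensor([[1, 2, 3]]),
         "attention_mask": torch.tensor([[1, 1, 1]])}
moved = to_device(batch, torch.device("cpu"))
print(moved["input_ids"].device)
```

In the repro above, the same traceback may well persist even with aligned inputs, since the mismatch arises between internal layer outputs rather than at the call site.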
Expected behavior
Generation completes without errors.