Description
System Info
- `transformers` version: 4.28.1
- Platform: Linux-4.18.0-305.25.1.el8_4.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.16
- Huggingface_hub version: 0.13.3
- Safetensors version: not installed
- PyTorch version (GPU?): 2.0.0a0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: 4 x A100 40GB
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Note: a workaround/fix with manual device mapping is attached below, but I'm wondering whether there could be an official fix for the bug.
Code sample
infer.py (Mostly from the HF Hub sample with some modifications to load with multi-GPU and quantization)
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def main():
    model_name = "facebook/nllb-moe-54b"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
        load_in_8bit=True,
    )
    batched_input = [
        'We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added.',
        "Dr. Ehud Ur, professor of medicine at Dalhousie University in Halifax, Nova Scotia and chair of the clinical and scientific division of the Canadian Diabetes Association cautioned that the research is still in its early days."
        "Like some other experts, he is skeptical about whether diabetes can be cured, noting that these findings have no relevance to people who already have Type 1 diabetes."
        "On Monday, Sara Danius, permanent secretary of the Nobel Committee for Literature at the Swedish Academy, publicly announced during a radio program on Sveriges Radio in Sweden the committee, unable to reach Bob Dylan directly about winning the 2016 Nobel Prize in Literature, had abandoned its efforts to reach him.",
        'Danius said, "Right now we are doing nothing. I have called and sent emails to his closest collaborator and received very friendly replies. For now, that is certainly enough."',
        "Previously, Ring's CEO, Jamie Siminoff, remarked the company started when his doorbell wasn't audible from his shop in his garage.",
    ]
    inputs = tokenizer(batched_input, return_tensors="pt", padding=True)
    translated_tokens = model.generate(
        **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"]
    )
    outputs = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
    print(outputs)


if __name__ == "__main__":
    main()
```

Steps:
- Run `CUDA_VISIBLE_DEVICES=0,1,2,3 python infer.py`
- See error
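As an aside, one way to see where `accelerate` placed the modules before the failing `generate` call is to group the model's `hf_device_map` (which `from_pretrained` records when a `device_map` is used) by device. A minimal sketch with a toy map, since the real 54B map is large and the module names here are illustrative:

```python
from collections import defaultdict


def summarize_device_map(device_map):
    """Group a {module_name: device} map by device for a quick overview."""
    by_device = defaultdict(list)
    for module, device in device_map.items():
        by_device[device].append(module)
    return dict(by_device)


# Toy map of the shape recorded on model.hf_device_map by
# from_pretrained(..., device_map="auto"); module names are illustrative.
toy_map = {
    "model.encoder.embed_tokens": 0,
    "model.encoder.layers.0": 0,
    "model.encoder.layers.1": 1,
    "model.decoder.layers.0": 1,
}
print(summarize_device_map(toy_map))
# {0: ['model.encoder.embed_tokens', 'model.encoder.layers.0'],
#  1: ['model.encoder.layers.1', 'model.decoder.layers.0']}
```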
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ <path>/code/nscc_working/engr/multi_node/nllb_inference/error_infer.py:38 in │
│ <module> │
│ │
│ 35 │
│ 36 │
│ 37 if __name__ == "__main__": │
│ ❱ 38 │ main() │
│ 39 │
│ │
│ <path>/code/nscc_working/engr/multi_node/nllb_inference/error_infer.py:30 in main │
│ │
│ 27 │ ] │
│ 28 │ inputs = tokenizer(batched_input, return_tensors="pt", padding=True) │
│ 29 │ │
│ ❱ 30 │ translated_tokens = model.generate( │
│ 31 │ │ **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"] │
│ 32 │ ) │
│ 33 │ outputs = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True) │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/torch/utils/_contextlib.py │
│ :115 in decorate_context │
│ │
│ 112 │ @functools.wraps(func) │
│ 113 │ def decorate_context(*args, **kwargs): │
│ 114 │ │ with ctx_factory(): │
│ ❱ 115 │ │ │ return func(*args, **kwargs) │
│ 116 │ │
│ 117 │ return decorate_context │
│ 118 │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/transformers/generation/ut │
│ ils.py:1286 in generate │
│ │
│ 1283 │ │ if self.config.is_encoder_decoder and "encoder_outputs" not in model_kwargs: │
│ 1284 │ │ │ # if model is encoder decoder encoder_outputs are created │
│ 1285 │ │ │ # and added to `model_kwargs` │
│ ❱ 1286 │ │ │ model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation( │
│ 1287 │ │ │ │ inputs_tensor, model_kwargs, model_input_name │
│ 1288 │ │ │ ) │
│ 1289 │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/transformers/generation/ut │
│ ils.py:638 in _prepare_encoder_decoder_kwargs_for_generation │
│ │
│ 635 │ │ model_input_name = model_input_name if model_input_name is not None else self.ma │
│ 636 │ │ encoder_kwargs["return_dict"] = True │
│ 637 │ │ encoder_kwargs[model_input_name] = inputs_tensor │
│ ❱ 638 │ │ model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs) │
│ 639 │ │ │
│ 640 │ │ return model_kwargs │
│ 641 │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/torch/nn/modules/module.py │
│ :1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/transformers/models/nllb_m │
│ oe/modeling_nllb_moe.py:1165 in forward │
│ │
│ 1162 │ │ │ │ │ │ (head_mask[idx] if head_mask is not None else None), │
│ 1163 │ │ │ │ │ ) │
│ 1164 │ │ │ │ else: │
│ ❱ 1165 │ │ │ │ │ layer_outputs = encoder_layer( │
│ 1166 │ │ │ │ │ │ hidden_states, │
│ 1167 │ │ │ │ │ │ attention_mask, │
│ 1168 │ │ │ │ │ │ layer_head_mask=(head_mask[idx] if head_mask is not None else No │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/torch/nn/modules/module.py │
│ :1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/transformers/models/nllb_m │
│ oe/modeling_nllb_moe.py:701 in forward │
│ │
│ 698 │ │ │
│ 699 │ │ hidden_states = self.ff_layer_norm(hidden_states) │
│ 700 │ │ if self.is_sparse: │
│ ❱ 701 │ │ │ hidden_states, router_states = self.ffn(hidden_states, attention_mask) │
│ 702 │ │ else: │
│ 703 │ │ │ hidden_states = self.ffn(hidden_states) │
│ 704 │ │ hidden_states = self.ff_dropout(hidden_states) │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/torch/nn/modules/module.py │
│ :1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/transformers/models/nllb_m │
│ oe/modeling_nllb_moe.py:474 in forward │
│ │
│ 471 │ │ top_1_mask, router_probs = self.router(hidden_states, padding_mask) │
│ 472 │ │ router_mask = router_probs.bool() │
│ 473 │ │ hidden_states = hidden_states.reshape((batch_size * sequence_length), hidden_dim │
│ ❱ 474 │ │ masked_hidden_states = torch.einsum("bm,be->ebm", hidden_states, router_mask) │
│ 475 │ │ for idx, expert in enumerate(self.experts.values()): │
│ 476 │ │ │ token_indices = router_mask[:, idx] │
│ 477 │ │ │ combining_weights = router_probs[token_indices, idx] │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/torch/functional.py:378 in │
│ einsum │
│ │
│ 375 │ if len(operands) <= 2 or not opt_einsum.enabled: │
│ 376 │ │ # the path for contracting 0 or 1 time(s) is already optimized │
│ 377 │ │ # or the user has disabled using opt_einsum │
│ ❱ 378 │ │ return _VF.einsum(equation, operands) # type: ignore[attr-defined] │
│ 379 │ │
│ 380 │ path = None │
│ 381 │ if opt_einsum.is_available(): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and
cuda:0!
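For reference, the failing line combines the hidden states with the router mask, producing one masked copy of the hidden states per expert. A minimal CPU sketch of the same contraction (shapes and variable names are illustrative, not the library's exact code) shows the shapes involved; under `device_map="auto"` the two operands can land on different GPUs, which is what triggers the device-mismatch error:

```python
import torch

# Illustrative shapes: 6 routed tokens (batch * seq_len), hidden size 8, 4 experts.
tokens, hidden_dim, num_experts = 6, 8, 4
hidden_states = torch.randn(tokens, hidden_dim)  # "bm": token x hidden

# One-hot routing mask ("be": token x expert); cast to float for this sketch.
expert_choice = torch.randint(0, num_experts, (tokens,))
router_mask = torch.nn.functional.one_hot(expert_choice, num_experts).to(hidden_states.dtype)

# The contraction from modeling_nllb_moe.py: one masked view of the hidden
# states per expert. Both operands must sit on the same device; if the mask
# is on cuda:1 while the hidden states are on cuda:0, einsum raises the
# RuntimeError shown in the traceback above.
masked_hidden_states = torch.einsum("bm,be->ebm", hidden_states, router_mask)
print(masked_hidden_states.shape)  # torch.Size([4, 6, 8])
```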
Expected behavior
A list of translated text.
The following code contains a workaround that prevents certain module splits and moves certain modules onto the same device as the input, so that inference runs without errors.
Code
```python
import torch
from accelerate.big_modeling import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer


def main():
    model_name = "facebook/nllb-moe-54b"
    config = AutoConfig.from_pretrained(model_name)
    with init_empty_weights():
        model = AutoModelForSeq2SeqLM.from_config(config)
    model.tie_weights()
    device_map = infer_auto_device_map(
        model,
        # Force model.encoder to be split into separate layers across devices
        max_memory={0: "6GIB", 1: "30GIB", 2: "30GIB", 3: "30GIB"},
        no_split_module_classes=model._no_split_modules
        + ["NllbMoeEncoderLayer", "NllbMoeDecoderLayer"],
        dtype="int8",
    )
    # Demonstrate that only "model.encoder.layer_norm" and "model.encoder.embed_tokens"
    # need to be on the same device as the input
    for module, device in device_map.items():
        if module in {"model.encoder.layer_norm", "model.encoder.embed_tokens"}:
            if device != 0:
                device_map[module] = 0
        else:
            if device == 0:
                device_map[module] = 1
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map=device_map,  # Use the custom device map
        load_in_8bit=True,
    )
    batched_input = [
        'We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added.',
        "Dr. Ehud Ur, professor of medicine at Dalhousie University in Halifax, Nova Scotia and chair of the clinical and scientific division of the Canadian Diabetes Association cautioned that the research is still in its early days."
        "Like some other experts, he is skeptical about whether diabetes can be cured, noting that these findings have no relevance to people who already have Type 1 diabetes."
        "On Monday, Sara Danius, permanent secretary of the Nobel Committee for Literature at the Swedish Academy, publicly announced during a radio program on Sveriges Radio in Sweden the committee, unable to reach Bob Dylan directly about winning the 2016 Nobel Prize in Literature, had abandoned its efforts to reach him.",
        'Danius said, "Right now we are doing nothing. I have called and sent emails to his closest collaborator and received very friendly replies. For now, that is certainly enough."',
        "Previously, Ring's CEO, Jamie Siminoff, remarked the company started when his doorbell wasn't audible from his shop in his garage.",
    ]
    inputs = tokenizer(batched_input, return_tensors="pt", padding=True)
    for i in inputs:
        if torch.is_tensor(inputs[i]):
            inputs[i] = inputs[i].to("cuda:0")
    translated_tokens = model.generate(
        **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"]
    )
    outputs = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
    print(outputs)


if __name__ == "__main__":
    main()
```

Output:
['Nous avons maintenant des souris de 4 mois qui ne sont pas diabétiques mais qui l\'étaient", a-t-il ajouté.', "Le Dr Ehud Ur, professeur de médecine à l'Université Dalhousie à Halifax, en Nouvelle-Écosse, et président de la division clinique et scientifique de l'Association canadienne du diabète, a averti que la recherche en était encore à ses débuts. Comme d'autres experts, il est sceptique quant à la possibilité de guérir le diabète, notant que ces résultats n'ont aucune pertinence pour les personnes atteintes de diabète de type 1.", 'Danius a déclaré: "Pour le moment, nous ne faisons rien. J\'ai appelé et envoyé des courriels à son plus proche collaborateur et j\'ai reçu des réponses très amicales. Pour l\'instant, c\'est certainement suffisant".', "Auparavant, le PDG de Ring, Jamie Siminoff, a déclaré que la société avait commencé lorsque sa sonnette n'était pas audible depuis son magasin dans son garage."]
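The key part of the workaround is the remapping loop over `device_map`: it pins the two input-facing encoder modules to the input device (`cuda:0`) and evicts everything else from device 0. Applied to a toy device map (with illustrative module names mirroring the real ones), it behaves like this:

```python
# Toy {module_name: device_index} map of the kind infer_auto_device_map
# returns; the module names are illustrative.
device_map = {
    "model.encoder.embed_tokens": 1,
    "model.encoder.layer_norm": 2,
    "model.encoder.layers.0": 0,
    "model.decoder.layers.0": 0,
    "model.decoder.layers.1": 3,
}

# Same logic as in the workaround: keep only the two input-facing encoder
# modules on device 0, move any other module off device 0.
pinned = {"model.encoder.layer_norm", "model.encoder.embed_tokens"}
for module, device in device_map.items():
    if module in pinned:
        if device != 0:
            device_map[module] = 0
    else:
        if device == 0:
            device_map[module] = 1

print(device_map)
# {'model.encoder.embed_tokens': 0, 'model.encoder.layer_norm': 0,
#  'model.encoder.layers.0': 1, 'model.decoder.layers.0': 1,
#  'model.decoder.layers.1': 3}
```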