Description
System Info
- `transformers` version: 4.28.1
- Platform: Linux-4.18.0-305.25.1.el8_4.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.16
- Huggingface_hub version: 0.13.3
- Safetensors version: not installed
- PyTorch version (GPU?): 2.0.0a0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: 4 x A100 40GB
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Note: a workaround/fix with manual device mapping is attached below, but I'm wondering whether there could be an official fix for the bug.
Code sample
infer.py (Mostly from the HF Hub sample with some modifications to load with multi-GPU and quantization)
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def main():
    model_name = "facebook/nllb-moe-54b"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
        load_in_8bit=True,
    )
    batched_input = [
        'We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added.',
        "Dr. Ehud Ur, professor of medicine at Dalhousie University in Halifax, Nova Scotia and chair of the clinical and scientific division of the Canadian Diabetes Association cautioned that the research is still in its early days."
        "Like some other experts, he is skeptical about whether diabetes can be cured, noting that these findings have no relevance to people who already have Type 1 diabetes."
        "On Monday, Sara Danius, permanent secretary of the Nobel Committee for Literature at the Swedish Academy, publicly announced during a radio program on Sveriges Radio in Sweden the committee, unable to reach Bob Dylan directly about winning the 2016 Nobel Prize in Literature, had abandoned its efforts to reach him.",
        'Danius said, "Right now we are doing nothing. I have called and sent emails to his closest collaborator and received very friendly replies. For now, that is certainly enough."',
        "Previously, Ring's CEO, Jamie Siminoff, remarked the company started when his doorbell wasn't audible from his shop in his garage.",
    ]
    inputs = tokenizer(batched_input, return_tensors="pt", padding=True)
    translated_tokens = model.generate(
        **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"]
    )
    outputs = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
    print(outputs)


if __name__ == "__main__":
    main()
```

Steps:
- Run `CUDA_VISIBLE_DEVICES=0,1,2,3 python infer.py`
- See error
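As an aside, one way to see where `accelerate` placed the modules before the failing `generate` call is to group the model's `hf_device_map` (which `from_pretrained` records when a `device_map` is used) by device. A minimal sketch with a toy map, since the real 54B map is large and the module names here are illustrative:

```python
from collections import defaultdict


def summarize_device_map(device_map):
    """Group a {module_name: device} map by device for a quick overview."""
    by_device = defaultdict(list)
    for module, device in device_map.items():
        by_device[device].append(module)
    return dict(by_device)


# Toy map of the shape recorded on model.hf_device_map by
# from_pretrained(..., device_map="auto"); module names are illustrative.
toy_map = {
    "model.encoder.embed_tokens": 0,
    "model.encoder.layers.0": 0,
    "model.encoder.layers.1": 1,
    "model.decoder.layers.0": 1,
}
print(summarize_device_map(toy_map))
# {0: ['model.encoder.embed_tokens', 'model.encoder.layers.0'],
#  1: ['model.encoder.layers.1', 'model.decoder.layers.0']}
```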
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ <path>/code/nscc_working/engr/multi_node/nllb_inference/error_infer.py:38 in │
│ <module> │
│ │
│ 35 │
│ 36 │
│ 37 if __name__ == "__main__": │
│ ❱ 38 │ main() │
│ 39 │
│ │
│ <path>/code/nscc_working/engr/multi_node/nllb_inference/error_infer.py:30 in main │
│ │
│ 27 │ ] │
│ 28 │ inputs = tokenizer(batched_input, return_tensors="pt", padding=True) │
│ 29 │ │
│ ❱ 30 │ translated_tokens = model.generate( │
│ 31 │ │ **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"] │
│ 32 │ ) │
│ 33 │ outputs = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True) │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/torch/utils/_contextlib.py │
│ :115 in decorate_context │
│ │
│ 112 │ @functools.wraps(func) │
│ 113 │ def decorate_context(*args, **kwargs): │
│ 114 │ │ with ctx_factory(): │
│ ❱ 115 │ │ │ return func(*args, **kwargs) │
│ 116 │ │
│ 117 │ return decorate_context │
│ 118 │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/transformers/generation/ut │
│ ils.py:1286 in generate │
│ │
│ 1283 │ │ if self.config.is_encoder_decoder and "encoder_outputs" not in model_kwargs: │
│ 1284 │ │ │ # if model is encoder decoder encoder_outputs are created │
│ 1285 │ │ │ # and added to `model_kwargs` │
│ ❱ 1286 │ │ │ model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation( │
│ 1287 │ │ │ │ inputs_tensor, model_kwargs, model_input_name │
│ 1288 │ │ │ ) │
│ 1289 │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/transformers/generation/ut │
│ ils.py:638 in _prepare_encoder_decoder_kwargs_for_generation │
│ │
│ 635 │ │ model_input_name = model_input_name if model_input_name is not None else self.ma │
│ 636 │ │ encoder_kwargs["return_dict"] = True │
│ 637 │ │ encoder_kwargs[model_input_name] = inputs_tensor │
│ ❱ 638 │ │ model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs) │
│ 639 │ │ │
│ 640 │ │ return model_kwargs │
│ 641 │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/torch/nn/modules/module.py │
│ :1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/transformers/models/nllb_m │
│ oe/modeling_nllb_moe.py:1165 in forward │
│ │
│ 1162 │ │ │ │ │ │ (head_mask[idx] if head_mask is not None else None), │
│ 1163 │ │ │ │ │ ) │
│ 1164 │ │ │ │ else: │
│ ❱ 1165 │ │ │ │ │ layer_outputs = encoder_layer( │
│ 1166 │ │ │ │ │ │ hidden_states, │
│ 1167 │ │ │ │ │ │ attention_mask, │
│ 1168 │ │ │ │ │ │ layer_head_mask=(head_mask[idx] if head_mask is not None else No │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/torch/nn/modules/module.py │
│ :1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/transformers/models/nllb_m │
│ oe/modeling_nllb_moe.py:701 in forward │
│ │
│ 698 │ │ │
│ 699 │ │ hidden_states = self.ff_layer_norm(hidden_states) │
│ 700 │ │ if self.is_sparse: │
│ ❱ 701 │ │ │ hidden_states, router_states = self.ffn(hidden_states, attention_mask) │
│ 702 │ │ else: │
│ 703 │ │ │ hidden_states = self.ffn(hidden_states) │
│ 704 │ │ hidden_states = self.ff_dropout(hidden_states) │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/torch/nn/modules/module.py │
│ :1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/transformers/models/nllb_m │
│ oe/modeling_nllb_moe.py:474 in forward │
│ │
│ 471 │ │ top_1_mask, router_probs = self.router(hidden_states, padding_mask) │
│ 472 │ │ router_mask = router_probs.bool() │
│ 473 │ │ hidden_states = hidden_states.reshape((batch_size * sequence_length), hidden_dim │
│ ❱ 474 │ │ masked_hidden_states = torch.einsum("bm,be->ebm", hidden_states, router_mask) │
│ 475 │ │ for idx, expert in enumerate(self.experts.values()): │
│ 476 │ │ │ token_indices = router_mask[:, idx] │
│ 477 │ │ │ combining_weights = router_probs[token_indices, idx] │
│ │
│ <path>/.conda/envs/megatron/lib/python3.8/site-packages/torch/functional.py:378 in │
│ einsum │
│ │
│ 375 │ if len(operands) <= 2 or not opt_einsum.enabled: │
│ 376 │ │ # the path for contracting 0 or 1 time(s) is already optimized │
│ 377 │ │ # or the user has disabled using opt_einsum │
│ ❱ 378 │ │ return _VF.einsum(equation, operands) # type: ignore[attr-defined] │
│ 379 │ │
│ 380 │ path = None │
│ 381 │ if opt_einsum.is_available(): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and
cuda:0!
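For reference, the failing line combines the hidden states with the router mask, producing one masked copy of the hidden states per expert. A minimal CPU sketch of the same contraction (shapes and variable names are illustrative, not the library's exact code) shows the shapes involved; under `device_map="auto"` the two operands can land on different GPUs, which is what triggers the device-mismatch error:

```python
import torch

# Illustrative shapes: 6 routed tokens (batch * seq_len), hidden size 8, 4 experts.
tokens, hidden_dim, num_experts = 6, 8, 4
hidden_states = torch.randn(tokens, hidden_dim)  # "bm": token x hidden

# One-hot routing mask ("be": token x expert); cast to float for this sketch.
expert_choice = torch.randint(0, num_experts, (tokens,))
router_mask = torch.nn.functional.one_hot(expert_choice, num_experts).to(hidden_states.dtype)

# The contraction from modeling_nllb_moe.py: one masked view of the hidden
# states per expert. Both operands must sit on the same device; if the mask
# is on cuda:1 while the hidden states are on cuda:0, einsum raises the
# RuntimeError shown in the traceback above.
masked_hidden_states = torch.einsum("bm,be->ebm", hidden_states, router_mask)
print(masked_hidden_states.shape)  # torch.Size([4, 6, 8])
```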
Expected behavior
A list of translated text.
The following code contains a workaround that prevents certain module splits and moves certain modules onto the same device as the input, so that inference runs without errors.
Code
```python
import torch
from accelerate.big_modeling import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer


def main():
    model_name = "facebook/nllb-moe-54b"
    config = AutoConfig.from_pretrained(model_name)
    with init_empty_weights():
        model = AutoModelForSeq2SeqLM.from_config(config)
    model.tie_weights()
    device_map = infer_auto_device_map(
        model,
        # Force model.encoder to be split into separate layers across devices
        max_memory={0: "6GIB", 1: "30GIB", 2: "30GIB", 3: "30GIB"},
        no_split_module_classes=model._no_split_modules
        + ["NllbMoeEncoderLayer", "NllbMoeDecoderLayer"],
        dtype="int8",
    )
    # Demonstrate that only "model.encoder.layer_norm" and "model.encoder.embed_tokens"
    # need to be on the same device as the input
    for module, device in device_map.items():
        if module in {"model.encoder.layer_norm", "model.encoder.embed_tokens"}:
            if device != 0:
                device_map[module] = 0
        else:
            if device == 0:
                device_map[module] = 1
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map=device_map,  # Use the custom device map
        load_in_8bit=True,
    )
    batched_input = [
        'We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added.',
        "Dr. Ehud Ur, professor of medicine at Dalhousie University in Halifax, Nova Scotia and chair of the clinical and scientific division of the Canadian Diabetes Association cautioned that the research is still in its early days."
        "Like some other experts, he is skeptical about whether diabetes can be cured, noting that these findings have no relevance to people who already have Type 1 diabetes."
        "On Monday, Sara Danius, permanent secretary of the Nobel Committee for Literature at the Swedish Academy, publicly announced during a radio program on Sveriges Radio in Sweden the committee, unable to reach Bob Dylan directly about winning the 2016 Nobel Prize in Literature, had abandoned its efforts to reach him.",
        'Danius said, "Right now we are doing nothing. I have called and sent emails to his closest collaborator and received very friendly replies. For now, that is certainly enough."',
        "Previously, Ring's CEO, Jamie Siminoff, remarked the company started when his doorbell wasn't audible from his shop in his garage.",
    ]
    inputs = tokenizer(batched_input, return_tensors="pt", padding=True)
    for i in inputs:
        if torch.is_tensor(inputs[i]):
            inputs[i] = inputs[i].to("cuda:0")
    translated_tokens = model.generate(
        **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"]
    )
    outputs = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
    print(outputs)


if __name__ == "__main__":
    main()
```

Output:
['Nous avons maintenant des souris de 4 mois qui ne sont pas diabétiques mais qui l\'étaient", a-t-il ajouté.', "Le Dr Ehud Ur, professeur de médecine à l'Université Dalhousie à Halifax, en Nouvelle-Écosse, et président de la division clinique et scientifique de l'Association canadienne du diabète, a averti que la recherche en était encore à ses débuts. Comme d'autres experts, il est sceptique quant à la possibilité de guérir le diabète, notant que ces résultats n'ont aucune pertinence pour les personnes atteintes de diabète de type 1.", 'Danius a déclaré: "Pour le moment, nous ne faisons rien. J\'ai appelé et envoyé des courriels à son plus proche collaborateur et j\'ai reçu des réponses très amicales. Pour l\'instant, c\'est certainement suffisant".', "Auparavant, le PDG de Ring, Jamie Siminoff, a déclaré que la société avait commencé lorsque sa sonnette n'était pas audible depuis son magasin dans son garage."]
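The key part of the workaround is the remapping loop over `device_map`: it pins the two input-facing encoder modules to the input device (`cuda:0`) and evicts everything else from device 0. Applied to a toy device map (with illustrative module names mirroring the real ones), it behaves like this:

```python
# Toy {module_name: device_index} map of the kind infer_auto_device_map
# returns; the module names are illustrative.
device_map = {
    "model.encoder.embed_tokens": 1,
    "model.encoder.layer_norm": 2,
    "model.encoder.layers.0": 0,
    "model.decoder.layers.0": 0,
    "model.decoder.layers.1": 3,
}

# Same logic as in the workaround: keep only the two input-facing encoder
# modules on device 0, move any other module off device 0.
pinned = {"model.encoder.layer_norm", "model.encoder.embed_tokens"}
for module, device in device_map.items():
    if module in pinned:
        if device != 0:
            device_map[module] = 0
    else:
        if device == 0:
            device_map[module] = 1

print(device_map)
# {'model.encoder.embed_tokens': 0, 'model.encoder.layer_norm': 0,
#  'model.encoder.layers.0': 1, 'model.decoder.layers.0': 1,
#  'model.decoder.layers.1': 3}
```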