WavLM returns empty hidden states when loaded directly to GPU

### System Info

- `transformers` version: 4.42.4
- Platform: Linux-6.5.0-41-generic-x86_64-with-glibc2.35
- Python version: 3.9.19
- Huggingface_hub version: 0.23.4
- Safetensors version: 0.4.3
- Accelerate version: 0.31.0
- PyTorch version (GPU?): 2.3.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: No
- Using GPU in script?: Yes
- GPU type: NVIDIA RTX A6000

### Who can help?

@sanchit-gandhi @gant

### Information

- [X] The official example scripts
- [x] My own modified scripts

### Tasks

- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

The hidden states are NaN when the model is loaded directly onto the GPU. The outputs are correct when the model runs on the CPU, or when it is first loaded on the CPU and then moved to the GPU.

This can be reproduced with the following code, taken from WavLM's Hugging Face documentation.

```python
from transformers import WavLMModel, AutoFeatureExtractor
import torch
from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation", trust_remote_code=True)
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

processor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-large")
model = WavLMModel.from_pretrained("microsoft/wavlm-large", device_map="cuda:4")
model.eval()

# audio file is decoded on the fly
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs.to("cuda:4"), output_hidden_states=True)

last_hidden_states = outputs.last_hidden_state
print(last_hidden_states)
```

The code above prints a tensor that contains only NaNs. This does not happen if the model is first loaded on the CPU and then moved to the GPU with `model.to("cuda:4")`.
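To help narrow this down, here is a small diagnostic sketch. It is my own helper, not part of `transformers`; the name `find_nonfinite_tensors` is hypothetical. It lists which tensors in a set of named tensors contain NaN or Inf values. I am only assuming the NaNs might already be present in the loaded weights; if the list comes back empty, they must be introduced during the forward pass instead.

```python
import torch


def find_nonfinite_tensors(named_tensors):
    """Return the names of tensors that contain any NaN or Inf value.

    `named_tensors` is any iterable of (name, tensor) pairs, for example
    `model.state_dict().items()`.
    """
    bad = []
    for name, tensor in named_tensors:
        # torch.isfinite is False for NaN and +/-Inf entries.
        if not torch.isfinite(tensor).all():
            bad.append(name)
    return bad


# Tiny self-check with hand-made tensors (not real model weights).
example = {
    "ok.weight": torch.ones(2, 2),
    "broken.weight": torch.tensor([1.0, float("nan")]),
}
print(find_nonfinite_tensors(example.items()))  # ['broken.weight']
```

Running `find_nonfinite_tensors(model.state_dict().items())` right after `from_pretrained(..., device_map="cuda:4")`, and again after the CPU-then-`model.to("cuda:4")` path, should show whether the two loading paths end up with different weights.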

### Expected behavior

The hidden states should not be NaN when the model is loaded directly onto the GPU.
