[BUG] T5 error with 2 GPUs #2385

@molly-smith

Description

Describe the bug
T5 errors out when more than one GPU is used.

File "/home/mosm/.local/lib/python3.8/site-packages/deepspeed/module_inject/layers.py", line 42, in forward
    output = torch.matmul(input, self.weight.transpose(-1, -2))
RuntimeError: mat1 and mat2 shapes cannot be multiplied (22x192 and 384x256)
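The failing shapes hint at the likely cause: 384 = 2 × 192, so with `--num_gpus 2` the activation's hidden dimension appears to have been partitioned across ranks while the weight's input dimension was not (or vice versa). A minimal NumPy sketch of the mismatch, with shapes taken from the traceback (the sharding interpretation is an assumption, not confirmed against the DeepSpeed tensor-parallel code):

```python
import numpy as np

seq = 22
hidden_full, out_dim, world_size = 384, 256, 2
hidden_shard = hidden_full // world_size  # 192 per GPU

x_shard = np.zeros((seq, hidden_shard))    # activation already split across ranks
w_full = np.zeros((out_dim, hidden_full))  # weight left unsplit -> mismatch

try:
    # Mirrors the failing line: torch.matmul(input, self.weight.transpose(-1, -2))
    np.matmul(x_shard, w_full.T)
except ValueError:
    print("shape mismatch:", x_shard.shape, "x", w_full.T.shape)

# A consistent tensor-parallel split would also slice the weight's input
# dimension, so each rank multiplies (22, 192) by (192, 256):
w_shard = w_full[:, :hidden_shard]
y = np.matmul(x_shard, w_shard.T)
print(y.shape)  # (22, 256)
```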

To Reproduce
Steps to reproduce the behavior:

  1. Script: DeepSpeedExamples/inference/huggingface/test-t5.py
  2. Packages: Deepspeed 0.7.4, torch 1.12, cuda 11.6, transformers 4.21.2
  3. Command: deepspeed --num_gpus 2 test-t5.py

Expected behavior
Output generated text.

ds_report output

DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.12.0+cu116
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed install path ........... ['/home/mosm/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.7.4+ff427438, ff427438, master
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6

System info

  • OS: Ubuntu 20.04
  • GPU count and types: 2x Tesla V100-SXM2
  • Transformers 4.21.2
  • Python 3.8

Docker context
deepspeed.azurecr.io/deepspeed:latest_torch112-cuda11.6

Additional context
The same issue is observed when running test-electra.py and test-roberta.py (see the corresponding GitHub issues).
