[BUG] test-electra.py failing with improper matmul shapes when num gpus > 1  #2386

@lokoppakmsft

Describe the bug
test-electra.py fails with the following error when run with more than one GPU:

File "/home/deepspeed/data/DeepSpeed/deepspeed/module_inject/layers.py", line 42, in forward
output = torch.matmul(input, self.weight.transpose(-1, -2))
RuntimeError: mat1 and mat2 shapes cannot be multiplied (18x128 and 256x128)
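
For anyone reading the shapes in that message: mat1 is `input` (18x128) and mat2 is `self.weight.transpose(-1, -2)` (256x128), so the inner dimensions (128 and 256) do not match. Below is a minimal sketch of that shape arithmetic in plain Python. The helpers `matmul_shape` and `transposed` are hypothetical names for illustration, not part of DeepSpeed or PyTorch.

```python
# Hypothetical helpers for illustration only (not DeepSpeed or PyTorch APIs).

def matmul_shape(mat1, mat2):
    """Return the result shape of a 2-D mat1 @ mat2, or raise if inner dims differ."""
    rows1, cols1 = mat1
    rows2, cols2 = mat2
    if cols1 != rows2:
        raise ValueError(
            f"shapes cannot be multiplied ({rows1}x{cols1} and {rows2}x{cols2})"
        )
    return (rows1, cols2)


def transposed(shape):
    """Shape after swapping the two dimensions, as transpose(-1, -2) does for 2-D."""
    return (shape[1], shape[0])


# Shapes taken from the error message above.
input_shape = (18, 128)
transposed_weight = (256, 128)                  # self.weight.transpose(-1, -2)
stored_weight = transposed(transposed_weight)   # so self.weight itself is 128x256

try:
    matmul_shape(input_shape, transposed_weight)
except ValueError as err:
    print("fails:", err)

# The untransposed shape would multiply cleanly.
print("ok:", matmul_shape(input_shape, stored_weight))
```

So `self.weight` appears to be 128x256 at the point of failure, and `input @ self.weight` without the transpose would be shape-compatible. This is only an inference from the reported shapes, not a confirmed root cause.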

To Reproduce
Steps to reproduce the behavior:

  1. Inference Script : https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/test-electra.py
  2. Packages: DeepSpeed from master (ff42743), torch 1.12, CUDA 11.6, transformers 4.21.2
  3. deepspeed --num_gpus 2 test-electra.py

Expected behavior
image

ds_report output
image

System info:

  • OS: Ubuntu 20.04.5 LTS
  • GPU count and types: 2x RTX A6000
  • Python version: 3.8.10

Additional context
This test does not fail with DeepSpeed 0.7.3.
