[Bug] RuntimeError: expected input to be on cuda #2186

@trajepl

Describe the bug
When I try to run the Megatron-DeepSpeed example (https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples) with the latest master branch of DeepSpeed, I get the following error:
image

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'Megatron Example'
  2. Install the latest DeepSpeed (0.7.0+9dcfb93a) and Megatron (0.5.1)
  3. Prepare the data:
  • bash dataset/download_books.sh
  • bash dataset/download_vocab.sh
  4. Start the training:
    bash examples/azure/run-benchmark-model.sh

The environment I used:
1 node with 8 A100 GPUs.

Expected behavior
A clear and concise description of what you expected to happen.

ds_report output
Please run ds_report to give us details about your setup.
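
No ds_report output is attached here. As a rough stand-in, the following is a minimal Python sketch that collects a few comparable basics (Python, torch and DeepSpeed versions, CUDA visibility, GPU count). The helper name `collect_env_info` is hypothetical, and this is not a replacement for ds_report, which reports much more (for example op compatibility).

```python
import platform
from importlib import metadata


def collect_env_info():
    """Return basic environment details; entries stay None when unavailable."""
    info = {
        "python": platform.python_version(),
        "torch": None,
        "deepspeed": None,
        "cuda_available": None,
        "gpu_count": None,
    }
    # Read the installed DeepSpeed version from package metadata,
    # without importing the package itself.
    try:
        info["deepspeed"] = metadata.version("deepspeed")
    except metadata.PackageNotFoundError:
        pass
    # torch must be imported to query CUDA visibility.
    try:
        import torch
    except ImportError:
        return info
    info["torch"] = torch.__version__
    info["cuda_available"] = torch.cuda.is_available()
    info["gpu_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

Running this on the same node used for training would show whether all 8 GPUs are visible to PyTorch, which may help narrow down a device-placement error like the one in the title.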

Screenshots
Error:
image

DS_CONFIG:
image

System info (please complete the following information):

  • ubuntu2004-cu115-py38-torch1110
  • 1 machine with 8 A100s
