[BUG] /tmp/torch_extensions directory created without global write permission #2244

@skiingpacman

Describe the bug
The directory /tmp/torch_extensions and its sub-directories are created without global write permissions, i.e., only the user and group have write permission. On shared machines this can prevent other users from running deepspeed, because it fails at compile time.

Possibly related to this pytorch issue: pytorch/pytorch#34238

Workaround: export TORCH_EXTENSIONS_DIR=/tmp/$USER/torch_extensions/
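
For anyone who launches from a Python script rather than a shell, the same workaround can be sketched in Python. This is only a sketch: use_per_user_extensions_dir is a hypothetical helper, not part of DeepSpeed or PyTorch, and it assumes the variable is set before anything triggers an extension build.

```python
import os
import tempfile


def use_per_user_extensions_dir(base=None):
    """Point TORCH_EXTENSIONS_DIR at a directory named after the current user.

    Hypothetical helper mirroring the shell workaround above.
    """
    base = base or tempfile.gettempdir()
    # Fall back to the numeric uid if $USER is not set (Linux/Unix only).
    user = os.environ.get("USER") or str(os.getuid())
    path = os.path.join(base, user, "torch_extensions")
    os.makedirs(path, exist_ok=True)
    os.environ["TORCH_EXTENSIONS_DIR"] = path
    return path


# Intended usage (assumption: call this before importing deepspeed):
#   use_per_user_extensions_dir()
#   import deepspeed
```

Because each user builds into a directory they own, the permissions on the shared /tmp/torch_extensions no longer matter.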

To Reproduce
Steps to reproduce the behavior:

  1. On a shared Linux machine or cluster node, user1 (in group1) runs deepspeed, e.g. by following https://www.deepspeed.ai/tutorials/inference-tutorial/
  2. On the same machine or cluster node, user2 (in group2) tries to run deepspeed but gets a permission error on /tmp/torch_extensions/
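
To confirm that the steps above hit this problem, user2 can inspect the directory before running deepspeed. The snippet below is a minimal diagnostic sketch; describe_write_access is a hypothetical helper name and it uses only the Python standard library.

```python
import os
import stat


def describe_write_access(path):
    """Summarize who owns `path` and who is allowed to write to it."""
    st = os.stat(path)
    return {
        "mode": stat.filemode(st.st_mode),  # ls-style string such as drwxrwxr-x
        "uid": st.st_uid,
        "gid": st.st_gid,
        # True only if the "other" write bit is set
        "other_writable": bool(st.st_mode & stat.S_IWOTH),
        # True if the current process can create entries in the directory
        "writable_by_me": os.access(path, os.W_OK | os.X_OK),
    }


# Example (the directory must already exist):
#   print(describe_write_access("/tmp/torch_extensions"))
```

If other_writable is False and the directory belongs to another user and group, the permission error in step 2 is expected.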

Expected behavior
deepspeed should either set "other" write permissions on the /tmp/torch_extensions/ directories or create separate directories per user.
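
As an illustration of the first option only, here is a minimal sketch of creating a directory that every user can write to. make_shared_dir is a hypothetical helper, not how DeepSpeed or PyTorch actually creates the directory. The sticky bit is my own addition, borrowed from how /tmp itself is normally configured.

```python
import os
import stat


def make_shared_dir(path):
    """Create `path` if needed and make it writable by every user."""
    os.makedirs(path, exist_ok=True)
    # os.makedirs applies the process umask to its mode argument, so the
    # permissions are set explicitly afterwards. Only the owner (or root)
    # may chmod an existing directory.
    # 0o1777 = rwx for user, group and other, plus the sticky bit, so users
    # cannot delete or rename each other's entries.
    os.chmod(path, 0o1777)
    return path
```

The same treatment would be needed for each sub-directory, which is one reason the per-user option may be simpler.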

System info (please complete the following information):

  • OS: 20.04.1-Ubuntu
  • GPU count and types: AWS cluster nodes with A100s

Launcher context
Launched with the deepspeed command at the command line.


Labels: bug
