Closed
Labels
bug (Something isn't working)
Description
Describe the bug
The directory /tmp/torch_extensions and its sub-directories are created without world ("other") write permission, i.e., only the owning user and group can write to them. On shared machines, other users are then unable to run deepspeed because it fails at compile time.
Possibly related to this pytorch issue: pytorch/pytorch#34238
Workaround: export TORCH_EXTENSIONS_DIR=/tmp/$USER/torch_extensions/
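For scripts that cannot rely on the shell environment, the same workaround can be sketched in Python. This is only an illustration: `per_user_extensions_dir` and `use_per_user_extensions_dir` are hypothetical helper names, not part of DeepSpeed or PyTorch, and the sketch assumes a POSIX system where `TORCH_EXTENSIONS_DIR` is read when an extension is built.

```python
import os
import tempfile


def per_user_extensions_dir(base=None):
    """Return <base>/<user>/torch_extensions, creating it if it is missing."""
    base = base or tempfile.gettempdir()
    # Mirror $USER from the shell workaround; fall back to the numeric uid.
    user = os.environ.get("USER") or str(os.getuid())
    path = os.path.join(base, user, "torch_extensions")
    # exist_ok avoids an error when the directory is already there.
    os.makedirs(path, exist_ok=True)
    return path


def use_per_user_extensions_dir():
    """Point TORCH_EXTENSIONS_DIR at the per-user directory unless already set."""
    if "TORCH_EXTENSIONS_DIR" not in os.environ:
        os.environ["TORCH_EXTENSIONS_DIR"] = per_user_extensions_dir()
    return os.environ["TORCH_EXTENSIONS_DIR"]
```

Calling `use_per_user_extensions_dir()` at the top of a script, before deepspeed is imported, should have the same effect as the `export` line; an existing value of the variable is left untouched.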
To Reproduce
Steps to reproduce the behavior:
- On a shared Linux machine or cluster node, user1 (a member of group1) runs deepspeed, e.g. by following https://www.deepspeed.ai/tutorials/inference-tutorial/
- On the same machine or cluster node, user2 (a member of group2) tries to run deepspeed but gets a permission error on /tmp/torch_extensions/
Expected behavior
deepspeed should either set "other" write permissions on the /tmp/torch_extensions/ directories or create separate directories per user.
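The first option could look roughly like the sketch below. It is not DeepSpeed's or PyTorch's actual code; `make_shared_dir` is a hypothetical helper, and the choice of mode 1777 (world-writable with the sticky bit, as /tmp itself normally is) is an assumption about what a safe shared directory would need. The explicit `chmod` matters because the mode passed to `os.makedirs` is filtered by the process umask.

```python
import os
import stat


def make_shared_dir(path):
    """Create path (if missing) and make it world-writable with the sticky bit."""
    os.makedirs(path, exist_ok=True)
    # chmod is not subject to the umask, unlike the mode argument of makedirs.
    # Only the directory's owner (or root) is allowed to change its mode.
    os.chmod(path, 0o1777)
    return path


def is_world_writable(path):
    """Report whether 'other' users have write permission on path."""
    return bool(os.stat(path).st_mode & stat.S_IWOTH)
```

Sub-directories created later by one user would still belong to that user and follow that user's umask, so this approach would need the same treatment at every level; separate per-user directories avoid that problem.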
System info (please complete the following information):
- OS: 20.04.1-Ubuntu
- GPU count and types: AWS cluster nodes with A100s
Launcher context
Launched with the deepspeed command at the command line.