Add distributed multi-node cpu only support (MULTI_CPU)#63
Add distributed multi-node cpu only support (MULTI_CPU)#63sgugger merged 1 commit intohuggingface:mainfrom
Conversation
sgugger
left a comment
There was a problem hiding this comment.
Thanks a lot for digging into this! It looks great and with a little bit of polishing (mostly naming), it should be ready to be merged soon!
You should add a line in the main README in the list at the end with - multinode CPU! Also it looks like your suggested integration supports different types of launchers (from the number of env variables looked at). Could you document this somewhere?
| print("Warning: Looks like distributed multinode run but MASTER_ADDR env not set, using '127.0.0.1' as default") | ||
| print("If this run hangs, try exporting rank 0's hostname as MASTER_ADDR") |
There was a problem hiding this comment.
I thin we should raise a ValueError here telling the user to pass a MASTER_ADDR instead of relying on a default.
There was a problem hiding this comment.
Some MPI-like custom backends can initialize without MASTER_ADDR so I just put a warning. I can change to ValueError if you think that's a better choice.
There was a problem hiding this comment.
We are in a test where backend != "mpi", so I would adapt the error message (to say the backend needs a MASTER_ADDR) and raise an error. For MPI, it will still work without the MASTER_ADDR.
bf91de5 to
276bbb8
Compare
|
I have made the suggested changes. Please take a look. |
sgugger
left a comment
There was a problem hiding this comment.
Thanks for all the adjustments, two more little things and we should be good to merge!
| On your cluster just run: | ||
|
|
||
| ```bash | ||
| mpirun -np 2 python examples/nlp_example.py |
There was a problem hiding this comment.
Does this require an install of some library? Let's add how to!
There was a problem hiding this comment.
Ok, added a line on how to get OpenMPI...
|
Last step, could you run |
|
Sorry, my system doesn't have black. Make gave error. Can you please go ahead and push it to PR? |
|
Just installed black and fixed formatting... It is still giving an error about torch_ccl import but I don't know how to fix it... |
|
Fixed issue with flake8 and squashed to single commit... Should be good to merge now. |
|
Thanks again for all your work on this! |
Possible fix for #32