Changing type and name of local_used_maps to reflect that it is only one map#65380
Changing type and name of local_used_maps to reflect that it is only one map#65380jaceyca wants to merge 4 commits intogh/jaceyca/7/basefrom
Conversation
…one map Fixing bugs that arise when running setup.py develop [ghstack-poisoned]
🔗 Helpful links
💊 CI failures summary and remediationsAs of commit eb037b0 (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions to the (internal) Dr. CI Users group. |
…it is only one map" Fixing bugs that arise when running setup.py develop cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @gcramer23 [ghstack-poisoned]
…it is only one map" Fixing bugs that arise when running setup.py develop cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @gcramer23 [ghstack-poisoned]
mrshenli
left a comment
There was a problem hiding this comment.
Thanks for further cleaning up the code.
Fixing bugs that arise when running setup.py develop
Which bug are you referring to?
| } | ||
| local_used_work_ = process_group_->allreduce(local_used_maps_dev_); | ||
| std::vector<at::Tensor> temp_local_used_map_dev_vec_ = {local_used_map_dev_}; | ||
| local_used_work_ = process_group_->allreduce(temp_local_used_map_dev_vec_); |
There was a problem hiding this comment.
curious, does the following work?
local_used_work_ = process_group_->allreduce({local_used_map_dev_});There was a problem hiding this comment.
Nope, that causes the python setup.py develop command to error out:
../torch/csrc/distributed/c10d/reducer.cpp: In member function ‘void c10d::Reducer::all_reduce_local_used_map()’:
../torch/csrc/distributed/c10d/reducer.cpp:688:67: error: no matching function for call to ‘c10d::ProcessGroup::allreduce(at::Tensor&)’
local_used_work_ = process_group_->allreduce(local_used_map_dev_);
^
In file included from ../torch/csrc/distributed/c10d/reducer.hpp:13,
from ../torch/csrc/distributed/c10d/reducer.cpp:1:
../torch/csrc/distributed/c10d/ProcessGroup.hpp:212:50: note: candidate: ‘virtual c10::intrusive_ptr<c10d::ProcessGroup::Work> c10d::ProcessGroup::allreduce(std::vector<at::Tensor>&, const c10d::AllreduceOptions&)’
virtual c10::intrusive_ptr<ProcessGroup::Work> allreduce(
^~~~~~~~~
../torch/csrc/distributed/c10d/ProcessGroup.hpp:212:50: note: no known conversion for argument 1 from ‘at::Tensor’ to ‘std::vector<at::Tensor>&’
This is the bug I was referring to in my commit message, I should've been more clear.
Yanli suggested this fix of pushing local_used_maps_dev_ into a temporary vector and passing it to allreduce()
…it is only one map" Fixing bugs that arise when running setup.py develop cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @gcramer23 [ghstack-poisoned]
|
@jaceyca has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Stack from ghstack:
Fixing bugs that arise when running setup.py develop
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @gcramer23
Differential Revision: D31104844