[WIP] per-RPC device mapping#64901
Conversation
[ghstack-poisoned]
🔗 Helpful links
💊 CI failures summary and remediationsAs of commit a018e7d (more details on the Dr. CI page):
🕵️ 8 new failures recognized by patternsThe following CI failures do not appear to be due to upstream breakages:
|
| Job | Step | Action |
|---|---|---|
| Test | 🔁 rerun | |
| Fail if there were any warnings | 🔁 rerun | |
| Run mypy | 🔁 rerun |
This comment was automatically generated by Dr. CI (expand for details).
Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions to the (internal) Dr. CI Users group.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23 [ghstack-poisoned]
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23 [ghstack-poisoned]
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23 [ghstack-poisoned]
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23 [ghstack-poisoned]
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23 [ghstack-poisoned]
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23 [ghstack-poisoned]
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23 [ghstack-poisoned]
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23 [ghstack-poisoned]
mrshenli
left a comment
There was a problem hiding this comment.
Just a minor comment that, it might be easier for review and test if we have incremental PRs that targets one API at a time.
| py::arg("device_map") = DeviceMap(), | ||
| py::arg("timeout") = py::cast(kUnsetRpcTimeout), | ||
| py::call_guard<py::gil_scoped_release>(), | ||
| R"( |
There was a problem hiding this comment.
add this new argument to docstring?
| .def( | ||
| "to_here", | ||
| &PyRRef::toHere, | ||
| py::arg("device_map") = DeviceMap(), |
There was a problem hiding this comment.
What's the default behavior here? Will it inherit the global device map or fallback to CPU RPC?
| : payload_(std::move(payload)), | ||
| tensors_(std::move(tensors)), | ||
| type_(type), | ||
| deviceMap_(deviceMap) {} |
There was a problem hiding this comment.
do we need std::move here?
| } | ||
|
|
||
| void Message::setDeviceMap(DeviceMap&& deviceMap) { | ||
| deviceMap_ = deviceMap; |
| pc.serializedPyObj(), pc.isAsyncExecution()); | ||
| pc.serializedPyObj(), | ||
| std::move(pc).moveDeviceMap(), | ||
| pc.isAsyncExecution()); |
| namespace distributed { | ||
| namespace rpc { | ||
|
|
||
| // TODO(pbelevich) |
There was a problem hiding this comment.
I assume the TODO here means consolidation? :)
| #include <torch/csrc/distributed/rpc/rpc_agent.h> | ||
| #include <torch/csrc/distributed/rpc/types.h> | ||
| #include <torch/csrc/jit/serialization/pickle.h> |
There was a problem hiding this comment.
I saw these two headers are included in the .cpp files in the derived classes. Do we still need them here? Will this create circular dependency?
|
|
||
| c10::intrusive_ptr<Message> ScriptRRefFetchRet::toMessageImpl() && { | ||
| auto res = fromIValues(values_, type_); | ||
| res->setDeviceMap(std::move(deviceMap_)); |
There was a problem hiding this comment.
curious, why do we set after creating the message instead of letting fromIValues handle deviceMap_ as well?
| RRefFetchRet( | ||
| std::vector<at::IValue> values, | ||
| MessageType type, | ||
| DeviceMap&& deviceMap) |
There was a problem hiding this comment.
does the RRefFetchRet need a deviceMap?
There was a problem hiding this comment.
rref.to_here(device_map) propagates device_map to ScriptRRefFetchRet/PythonRRefFetchRet to put the result to the target device
| public: | ||
| RRefMessageBase(const RRefId& rrefId, MessageType type) | ||
| : rrefId_(rrefId), type_(type) {} | ||
| RRefMessageBase(const RRefId& rrefId, MessageType type, DeviceMap&& deviceMap) |
There was a problem hiding this comment.
not 100% sure, but looks like only Script/PythonRRefFetchCall needs this device map. Does it make sense to only add this device map to those two messages?
| # C++ (see python_rpc_handler.cpp). | ||
| class RRefProxy: | ||
| def __init__(self, rref, rpc_api, timeout=UNSET_RPC_TIMEOUT): | ||
| def __init__(self, rref, rpc_api, device_map, timeout=UNSET_RPC_TIMEOUT): |
There was a problem hiding this comment.
curious, why do we need device_map in RRefProxy instead of letting the RPCs (_invoke_rpc above) on this proxy handling that?
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23 [ghstack-poisoned]
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23 [ghstack-poisoned]
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23 [ghstack-poisoned]
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23 [ghstack-poisoned]
CI Flow Status⚛️ CI FlowRuleset - Version:
You can add a comment to the PR and tag @pytorchbot with the following commands: # ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun
# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slowFor more information, please take a look at the CI Flow Wiki. |
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23 [ghstack-poisoned]
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23 [ghstack-poisoned]
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23 [ghstack-poisoned]
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23 [ghstack-poisoned]
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23 [ghstack-poisoned]
|
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as |
Stack from ghstack:
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @cbalioglu @gcramer23