[RPC] Support timeout for RRef proxy functions#50499
[RPC] Support timeout for RRef proxy functions#50499rohan-varma wants to merge 7 commits intogh/rohan-varma/216/basefrom
Conversation
Adds a timeout API to the following functions: ``` rref.rpc_sync() rref.rpc_async() rref.remote() ``` so that RPCs initiated by these proxy calls can be appropriately timed out similar to the regular RPC APIs. Timeouts are supported in the following use cases: 1. rpc.remote finishes in time and successfully, but function run by rref.rpc_async() is slow and times out. Timeout error will be raised 2. rref.rpc_async() function is fast, but rpc.remote() is slow/hanging. Then when rref.rpc_async() is called, it will still timeout with the passed in timeout (and won't block for the rpc.remote() to succeed, which is what happens currently). Although, the timeout will occur during the future creation itself (and not the wait) since it calls `rref._get_type` which blocks. We can consider making this nonblocking by modifying rref._get_type to return a future, although that is likely a larger change. Differential Revision: [D25897495](https://our.internmc.facebook.com/intern/diff/D25897495/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25897495/)! [ghstack-poisoned]
💊 CI failures summary and remediationsAs of commit 4a218f3 (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions to the (internal) Dr. CI Users group. |
Adds a timeout API to the following functions: ``` rref.rpc_sync() rref.rpc_async() rref.remote() ``` so that RPCs initiated by these proxy calls can be appropriately timed out similar to the regular RPC APIs. Timeouts are supported in the following use cases: 1. rpc.remote finishes in time and successfully, but function run by rref.rpc_async() is slow and times out. Timeout error will be raised 2. rref.rpc_async() function is fast, but rpc.remote() is slow/hanging. Then when rref.rpc_async() is called, it will still timeout with the passed in timeout (and won't block for the rpc.remote() to succeed, which is what happens currently). Although, the timeout will occur during the future creation itself (and not the wait) since it calls `rref._get_type` which blocks. We can consider making this nonblocking by modifying rref._get_type to return a future, although that is likely a larger change. Differential Revision: [D25897495](https://our.internmc.facebook.com/intern/diff/D25897495/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25897495/)! [ghstack-poisoned]
Pull Request resolved: #50499 Adds a timeout API to the following functions: ``` rref.rpc_sync() rref.rpc_async() rref.remote() ``` so that RPCs initiated by these proxy calls can be appropriately timed out similar to the regular RPC APIs. Timeouts are supported in the following use cases: 1. rpc.remote finishes in time and successfully, but function run by rref.rpc_async() is slow and times out. Timeout error will be raised 2. rref.rpc_async() function is fast, but rpc.remote() is slow/hanging. Then when rref.rpc_async() is called, it will still timeout with the passed in timeout (and won't block for the rpc.remote() to succeed, which is what happens currently). Although, the timeout will occur during the future creation itself (and not the wait) since it calls `rref._get_type` which blocks. We can consider making this nonblocking by modifying rref._get_type to return a future, although that is likely a larger change. Differential Revision: [D25897495](https://our.internmc.facebook.com/intern/diff/D25897495/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25897495/)! ghstack-source-id: 119791109
Adds a timeout API to the following functions: ``` rref.rpc_sync() rref.rpc_async() rref.remote() ``` so that RPCs initiated by these proxy calls can be appropriately timed out similar to the regular RPC APIs. Timeouts are supported in the following use cases: 1. rpc.remote finishes in time and successfully, but function run by rref.rpc_async() is slow and times out. Timeout error will be raised 2. rref.rpc_async() function is fast, but rpc.remote() is slow/hanging. Then when rref.rpc_async() is called, it will still timeout with the passed in timeout (and won't block for the rpc.remote() to succeed, which is what happens currently). Although, the timeout will occur during the future creation itself (and not the wait) since it calls `rref._get_type` which blocks. We can consider making this nonblocking by modifying rref._get_type to return a future, although that is likely a larger change. Differential Revision: [D25897495](https://our.internmc.facebook.com/intern/diff/D25897495/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25897495/)! [ghstack-poisoned]
Pull Request resolved: #50499 Adds a timeout API to the following functions: ``` rref.rpc_sync() rref.rpc_async() rref.remote() ``` so that RPCs initiated by these proxy calls can be appropriately timed out similar to the regular RPC APIs. Timeouts are supported in the following use cases: 1. rpc.remote finishes in time and successfully, but function run by rref.rpc_async() is slow and times out. Timeout error will be raised 2. rref.rpc_async() function is fast, but rpc.remote() is slow/hanging. Then when rref.rpc_async() is called, it will still timeout with the passed in timeout (and won't block for the rpc.remote() to succeed, which is what happens currently). Although, the timeout will occur during the future creation itself (and not the wait) since it calls `rref._get_type` which blocks. We can consider making this nonblocking by modifying rref._get_type to return a future, although that is likely a larger change. Differential Revision: [D25897495](https://our.internmc.facebook.com/intern/diff/D25897495/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25897495/)! ghstack-source-id: 119829871
| "rpc_sync", | ||
| [](const PyRRef& self) { | ||
| return self.createRRefProxy(RRefProxyType::RPC_SYNC); | ||
| [](const PyRRef& self, float timeoutSeconds) { |
There was a problem hiding this comment.
shouldn't this be int64_t or std::chrono::milliseconds instead of float ?
There was a problem hiding this comment.
I think this is float so that we can support it in torchscript? I think we used float for regular rpc timeout as well for this reason?
There was a problem hiding this comment.
Yes, I believe the convention we're using in RPC code is for timeout to be a float indicating the number of seconds. There was a PR to make that consistent across the codebase a while back. The float allows for a fractional timeout.
There was a problem hiding this comment.
hmmm i see, yeah if rpc timeouts are already float, then let's follow that convention.
| "rpc_async", | ||
| [](const PyRRef& self) { | ||
| return self.createRRefProxy(RRefProxyType::RPC_ASYNC); | ||
| [](const PyRRef& self, float timeoutSeconds) { |
| "remote", | ||
| [](const PyRRef& self) { | ||
| return self.createRRefProxy(RRefProxyType::REMOTE); | ||
| [](const PyRRef& self, float timeoutSeconds) { |
| "rpc_sync", | ||
| [](const PyRRef& self) { | ||
| return self.createRRefProxy(RRefProxyType::RPC_SYNC); | ||
| [](const PyRRef& self, float timeoutSeconds) { |
There was a problem hiding this comment.
I think this is float so that we can support it in torchscript? I think we used float for regular rpc timeout as well for this reason?
| return self.createRRefProxy( | ||
| RRefProxyType::RPC_SYNC, timeoutSeconds); | ||
| }, | ||
| py::arg("timeout") = kUnsetRpcTimeout, |
There was a problem hiding this comment.
Shouldn't we update all the docs to include this new parameter?
There was a problem hiding this comment.
Added docs in the latest update.
Adds a timeout API to the following functions: ``` rref.rpc_sync() rref.rpc_async() rref.remote() ``` so that RPCs initiated by these proxy calls can be appropriately timed out similar to the regular RPC APIs. Timeouts are supported in the following use cases: 1. rpc.remote finishes in time and successfully, but function run by rref.rpc_async() is slow and times out. Timeout error will be raised 2. rref.rpc_async() function is fast, but rpc.remote() is slow/hanging. Then when rref.rpc_async() is called, it will still timeout with the passed in timeout (and won't block for the rpc.remote() to succeed, which is what happens currently). Although, the timeout will occur during the future creation itself (and not the wait) since it calls `rref._get_type` which blocks. We can consider making this nonblocking by modifying rref._get_type to return a future, although that is likely a larger change. Differential Revision: [D25897495](https://our.internmc.facebook.com/intern/diff/D25897495/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25897495/)! [ghstack-poisoned]
Adds a timeout API to the following functions: ``` rref.rpc_sync() rref.rpc_async() rref.remote() ``` so that RPCs initiated by these proxy calls can be appropriately timed out similar to the regular RPC APIs. Timeouts are supported in the following use cases: 1. rpc.remote finishes in time and successfully, but function run by rref.rpc_async() is slow and times out. Timeout error will be raised 2. rref.rpc_async() function is fast, but rpc.remote() is slow/hanging. Then when rref.rpc_async() is called, it will still timeout with the passed in timeout (and won't block for the rpc.remote() to succeed, which is what happens currently). Although, the timeout will occur during the future creation itself (and not the wait) since it calls `rref._get_type` which blocks. We can consider making this nonblocking by modifying rref._get_type to return a future, although that is likely a larger change. Differential Revision: [D25897495](https://our.internmc.facebook.com/intern/diff/D25897495/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25897495/)! [ghstack-poisoned]
Adds a timeout API to the following functions: ``` rref.rpc_sync() rref.rpc_async() rref.remote() ``` so that RPCs initiated by these proxy calls can be appropriately timed out similar to the regular RPC APIs. Timeouts are supported in the following use cases: 1. rpc.remote finishes in time and successfully, but function run by rref.rpc_async() is slow and times out. Timeout error will be raised 2. rref.rpc_async() function is fast, but rpc.remote() is slow/hanging. Then when rref.rpc_async() is called, it will still timeout with the passed in timeout (and won't block for the rpc.remote() to succeed, which is what happens currently). Although, the timeout will occur during the future creation itself (and not the wait) since it calls `rref._get_type` which blocks. We can consider making this nonblocking by modifying rref._get_type to return a future, although that is likely a larger change. Differential Revision: [D25897495](https://our.internmc.facebook.com/intern/diff/D25897495/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25897495/)! [ghstack-poisoned]
Adds a timeout API to the following functions: ``` rref.rpc_sync() rref.rpc_async() rref.remote() ``` so that RPCs initiated by these proxy calls can be appropriately timed out similar to the regular RPC APIs. Timeouts are supported in the following use cases: 1. rpc.remote finishes in time and successfully, but function run by rref.rpc_async() is slow and times out. Timeout error will be raised 2. rref.rpc_async() function is fast, but rpc.remote() is slow/hanging. Then when rref.rpc_async() is called, it will still timeout with the passed in timeout (and won't block for the rpc.remote() to succeed, which is what happens currently). Although, the timeout will occur during the future creation itself (and not the wait) since it calls `rref._get_type` which blocks. We can consider making this nonblocking by modifying rref._get_type to return a future, although that is likely a larger change. Differential Revision: [D25897495](https://our.internmc.facebook.com/intern/diff/D25897495/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25897495/)! [ghstack-poisoned]
Pull Request resolved: #50499 Adds a timeout API to the following functions: ``` rref.rpc_sync() rref.rpc_async() rref.remote() ``` so that RPCs initiated by these proxy calls can be appropriately timed out similar to the regular RPC APIs. Timeouts are supported in the following use cases: 1. rpc.remote finishes in time and successfully, but function run by rref.rpc_async() is slow and times out. Timeout error will be raised 2. rref.rpc_async() function is fast, but rpc.remote() is slow/hanging. Then when rref.rpc_async() is called, it will still timeout with the passed in timeout (and won't block for the rpc.remote() to succeed, which is what happens currently). Although, the timeout will occur during the future creation itself (and not the wait) since it calls `rref._get_type` which blocks. We can consider making this nonblocking by modifying rref._get_type to return a future, although that is likely a larger change. Differential Revision: [D25897495](https://our.internmc.facebook.com/intern/diff/D25897495/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25897495/)! ghstack-source-id: 119860419
|
This pull request has been merged in d64184e. |
Summary: Pull Request resolved: pytorch#50499 Adds a timeout API to the following functions: ``` rref.rpc_sync() rref.rpc_async() rref.remote() ``` so that RPCs initiated by these proxy calls can be appropriately timed out similar to the regular RPC APIs. Timeouts are supported in the following use cases: 1. rpc.remote finishes in time and successfully, but function run by rref.rpc_async() is slow and times out. Timeout error will be raised 2. rref.rpc_async() function is fast, but rpc.remote() is slow/hanging. Then when rref.rpc_async() is called, it will still timeout with the passed in timeout (and won't block for the rpc.remote() to succeed, which is what happens currently). Although, the timeout will occur during the future creation itself (and not the wait) since it calls `rref._get_type` which blocks. We can consider making this nonblocking by modifying rref._get_type to return a future, although that is likely a larger change. Test Plan: Added UT Reviewed By: wanchaol Differential Revision: D25897495 fbshipit-source-id: f9ad5b8f75121f50537677056a5ab16cf262847e
Stack from ghstack:
Adds a timeout API to the following functions:
so that RPCs initiated by these proxy calls can be appropriately timed out similar to the regular RPC APIs. Timeouts are supported in the following use cases:
rref._get_typewhich blocks. We can consider making this nonblocking by modifying rref._get_type to return a future, although that is likely a larger change.Differential Revision: D25897495
NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on Phabricator!