[xray] Handle duplicate actor creation notifications nicely#2043
Conversation
fd4fde2 to
e136a32
Compare
|
Test FAILed. |
robertnishihara
left a comment
There was a problem hiding this comment.
This looks good to me, what are the situations where a double notification can happen (without a failure)? Are actor notifications broadcast multiple times?
|
jenkins retest this please |
|
The scenario where I've seen it happen so far is when a node manager receives an actor task that it doesn't have a location for yet, so it does a lookup. Then it receives the published notification along with the lookup response, so it gets a duplicate notification. I think it's necessary to keep the lookup step, though, in case there's a node that joins the cluster after some actor notifications have already been published. |
|
Got it, thanks. |
|
Test PASSed. |
* master: Create RemoteFunction class, remove FunctionProperties, simplify worker Python code. (ray-project#2052) Don't crash on duplicate actor notifications (ray-project#2043) Fixed attribute name in code example (ray-project#2054) [xray] Add Travis build for testing xray on Linux. (ray-project#2047) Added missing comma to code example (ray-project#2050) Use more CPUs for testMultipleWaitsAndGets. (ray-project#2051) use jobid_nil (ray-project#2044) Fix typo in tune. (ray-project#2046) Fix error in api.rst. (ray-project#2048) Improve shared_ptr usage (ray-project#2030)
* master: Create RemoteFunction class, remove FunctionProperties, simplify worker Python code. (ray-project#2052) Don't crash on duplicate actor notifications (ray-project#2043) Fixed attribute name in code example (ray-project#2054) [xray] Add Travis build for testing xray on Linux. (ray-project#2047) Added missing comma to code example (ray-project#2050) Use more CPUs for testMultipleWaitsAndGets. (ray-project#2051) use jobid_nil (ray-project#2044) Fix typo in tune. (ray-project#2046) Fix error in api.rst. (ray-project#2048) Improve shared_ptr usage (ray-project#2030)
* fix-a3c-torch: Fix shape error in conv nets Ensure that values are flat list rm all use of torch Variables Create RemoteFunction class, remove FunctionProperties, simplify worker Python code. (ray-project#2052) Don't crash on duplicate actor notifications (ray-project#2043) rm unnecessary Variable wrapper Fixed attribute name in code example (ray-project#2054) [xray] Add Travis build for testing xray on Linux. (ray-project#2047) Added missing comma to code example (ray-project#2050) Use more CPUs for testMultipleWaitsAndGets. (ray-project#2051) use jobid_nil (ray-project#2044) Fix typo in tune. (ray-project#2046) Fix error in api.rst. (ray-project#2048) Improve shared_ptr usage (ray-project#2030) replace deprecated function Fmt Fix shapes of tensors Rename argument name to out_size Use correct pytorch functions Use F.softmax instead of a pointless network layer
What do these changes do?
An actor creation notification may be processed twice by a raylet if the initial publication is slow. The check for duplicate actor notifications is too strict, causing the raylet to crash if a duplicate notification is received. This PR relaxes the check to just make sure that if a duplicate notification is received, the node manager IDs match.