[xray] Lineage cache requests notifications from the GCS about remote tasks#1834

Merged
robertnishihara merged 13 commits into ray-project:master from stephanie-wang:lineage-cache-gcs on Apr 16, 2018

Conversation

@stephanie-wang
Contributor

What do these changes do?

This extends the lineage cache to request notifications about tasks that are scheduled to execute at a remote node. This allows each node to evict tasks from its local lineage cache once a notification about the remote task's commit is received.
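A minimal sketch of the eviction idea described above, using hypothetical names (only `subscribed_tasks_`-style bookkeeping appears in the real diff; the class and methods here are illustrative, not the PR's actual code):

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Toy model of the lineage cache behavior: a node caches uncommitted
// tasks, and when a task is scheduled at a remote node it subscribes to
// commit notifications from the GCS. On receiving the notification, it
// stops listening and evicts the entry from the local cache.
class ToyLineageCache {
 public:
  void AddUncommittedTask(const std::string &task_id) {
    uncommitted_.insert(task_id);
  }
  // Called when the task is forwarded to a remote node: start listening.
  void SubscribeToCommit(const std::string &task_id) {
    subscribed_tasks_.insert(task_id);
  }
  // Called when the GCS notification about the remote commit arrives.
  void HandleCommitNotification(const std::string &task_id) {
    subscribed_tasks_.erase(task_id);  // stop listening for this task
    uncommitted_.erase(task_id);       // evict from the local cache
  }
  bool Contains(const std::string &task_id) const {
    return uncommitted_.count(task_id) > 0;
  }

 private:
  std::unordered_set<std::string> uncommitted_;
  std::unordered_set<std::string> subscribed_tasks_;
};
```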

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4674/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4675/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4820/

// Stop listening for notifications about this task.
auto it = subscribed_tasks_.find(task_id);
if (it != subscribed_tasks_.end()) {
task_pubsub_.CancelNotifications(JobID::nil(), task_id, client_id_);
Collaborator

It seems like for this use case, it would be fine for the Cancel to happen automatically once the GCS sends the notification, right?

The issue with that is that it involves special-casing the behavior in the redis module?

Contributor Author

Hmm yeah, for the Table interface, we could just call Cancel as soon as an entry is found since table entries are supposed to be immutable. Should I change that in this PR?

Collaborator

It can be a subsequent PR if that makes sense.
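The "automatic Cancel" idea discussed above can be sketched as follows. This is a hypothetical toy model, not the GCS implementation: since table entries are immutable, a single notification per key carries all the information a subscriber will ever get, so the subscription can be dropped as soon as the entry is delivered.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>

// Toy pubsub that auto-cancels a subscription after delivering the first
// notification, which is safe when table entries are immutable.
class ToyNotifier {
 public:
  using Callback = std::function<void(const std::string &)>;

  void RequestNotifications(const std::string &key, Callback cb) {
    callbacks_[key] = std::move(cb);
  }
  void CancelNotifications(const std::string &key) { callbacks_.erase(key); }

  // Deliver a notification, then cancel automatically: an immutable entry
  // will never be published again with different contents.
  void Publish(const std::string &key) {
    auto it = callbacks_.find(key);
    if (it != callbacks_.end()) {
      it->second(key);
      CancelNotifications(key);  // the automatic Cancel discussed above
    }
  }
  bool IsSubscribed(const std::string &key) const {
    return callbacks_.count(key) > 0;
  }

 private:
  std::unordered_map<std::string, Callback> callbacks_;
};
```

With this behavior, the subscriber no longer needs to issue an explicit `CancelNotifications` call from its notification handler.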

// Stop listening for notifications about this task.
auto it = subscribed_tasks_.find(task_id);
if (it != subscribed_tasks_.end()) {
task_pubsub_.CancelNotifications(JobID::nil(), task_id, client_id_);
Collaborator

We should probably do something with the returned Status.

/// will remain unchanged.
///
/// \param task_id The ID of the waiting task to remove.
void RemoveWaitingTask(const TaskID &task_id);
Collaborator

Should all of these methods return a status? Or should they handle failures internally?

@robertnishihara
Collaborator

I added some status checks, but I wasn't really sure if they should be RAY_CHECK_OK or RAY_RETURN_NOT_OK.
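The trade-off between the two macros can be illustrated with a toy `Status` type. This is a sketch of the general pattern only (the macro names and `Status` type below are stand-ins, not Ray's actual definitions): a `RAY_CHECK_OK`-style macro treats any failure as fatal, while a `RAY_RETURN_NOT_OK`-style macro propagates the failure to the caller.

```cpp
#include <cassert>
#include <cstdlib>

// Toy Status type standing in for Ray's real one.
struct Status {
  bool ok_;
  static Status OK() { return {true}; }
  static Status Error() { return {false}; }
  bool ok() const { return ok_; }
};

// CHECK_OK style: any failure crashes the process immediately.
#define TOY_CHECK_OK(s)          \
  do {                           \
    if (!(s).ok()) std::abort(); \
  } while (0)

// RETURN_NOT_OK style: propagate the failure; the enclosing function
// must itself return a Status.
#define TOY_RETURN_NOT_OK(s)    \
  do {                          \
    Status _s = (s);            \
    if (!_s.ok()) return _s;    \
  } while (0)

Status MightFail(bool fail) { return fail ? Status::Error() : Status::OK(); }

Status Caller(bool fail) {
  TOY_RETURN_NOT_OK(MightFail(fail));  // propagates instead of crashing
  return Status::OK();
}
```

The choice hinges on whether a failure at that call site is a programmer error (crash loudly) or an expected runtime condition the caller should handle.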

}

void LineageCache::HandleEntryCommitted(const UniqueID &task_id) {
RAY_LOG(DEBUG) << "task committed: " << task_id;
Collaborator

Under the current design, we can't actually guarantee that this method won't fire twice (or more) for the same task ID, right? Couldn't that cause issues here?

Contributor Author

Hmm, it should be okay if this method fires twice. I think this function is idempotent.

Contributor Author

Also, the only case I can think of right now where this would fire twice is if a task gets reconstructed and added to the table again. That should be pretty rare.

Contributor Author

You were right :) I changed this to check that the status is already COMMITTED if the call to set the status fails.
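The fix described above can be sketched like this (a hypothetical simplified model, not the PR's actual code): if the commit notification fires a second time, the attempt to set the status fails, and the handler verifies that the existing status is already COMMITTED rather than treating the duplicate as an error.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

enum class GcsStatus { UNCOMMITTED, COMMITTED };

// Toy cache showing an idempotent commit handler.
class ToyCache {
 public:
  void Add(const std::string &task_id) {
    entries_[task_id] = GcsStatus::UNCOMMITTED;
  }
  // Returns false if the entry is missing or already has this status
  // (e.g. a duplicate notification).
  bool SetStatus(const std::string &task_id, GcsStatus status) {
    auto it = entries_.find(task_id);
    if (it == entries_.end() || it->second == status) return false;
    it->second = status;
    return true;
  }
  GcsStatus Get(const std::string &task_id) { return entries_.at(task_id); }

  void HandleEntryCommitted(const std::string &task_id) {
    if (!SetStatus(task_id, GcsStatus::COMMITTED)) {
      // Duplicate notification: tolerate it, but insist the entry is
      // already COMMITTED (any other state would be a real bug).
      assert(Get(task_id) == GcsStatus::COMMITTED);
    }
  }

 private:
  std::unordered_map<std::string, GcsStatus> entries_;
};
```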

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4914/

@stephanie-wang
Contributor Author

Thanks, @robertnishihara, I think the code you added looks good. We're basically calling CHECK_OK on all GCS operations besides the initial Subscribe calls, which I think is fine for now. We can revisit that once we start to explicitly handle GCS failure.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4922/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4920/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4926/

@robertnishihara
Collaborator

Currently seeing

[ 63%] Linking CXX executable object_manager_test
In file included from /home/travis/build/ray-project/ray/src/ray/gcs/client_test.cc:5:0:
/home/travis/build/ray-project/ray/src/common/cmake/../thirdparty/hiredis/adapters/ae.h:102:12: warning: ‘int redisAeAttach(aeEventLoop*, redisAsyncContext*)’ defined but not used [-Wunused-function]
 static int redisAeAttach(aeEventLoop *loop, redisAsyncContext *ac) {
            ^
../libray.a(client.cc.o):(.data.rel.ro._ZTVN3ray3gcs15PubsubInterfaceINS_8UniqueIDEEE[_ZTVN3ray3gcs15PubsubInterfaceINS_8UniqueIDEEE]+0x10): undefined reference to `ray::gcs::PubsubInterface<ray::UniqueID>::RequestNotifications(ray::UniqueID const&, ray::UniqueID const&, ray::UniqueID const&)'
../libray.a(client.cc.o):(.data.rel.ro._ZTVN3ray3gcs15PubsubInterfaceINS_8UniqueIDEEE[_ZTVN3ray3gcs15PubsubInterfaceINS_8UniqueIDEEE]+0x18): undefined reference to `ray::gcs::PubsubInterface<ray::UniqueID>::CancelNotifications(ray::UniqueID const&, ray::UniqueID const&, ray::UniqueID const&)'
collect2: error: ld returned 1 exit status
make[2]: *** [src/ray/object_manager/object_manager_test] Error 1
make[1]: *** [src/ray/object_manager/CMakeFiles/object_manager_test.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 64%] Linking CXX executable client_test
../libray.a(client.cc.o):(.data.rel.ro._ZTVN3ray3gcs15PubsubInterfaceINS_8UniqueIDEEE[_ZTVN3ray3gcs15PubsubInterfaceINS_8UniqueIDEEE]+0x10): undefined reference to `ray::gcs::PubsubInterface<ray::UniqueID>::RequestNotifications(ray::UniqueID const&, ray::UniqueID const&, ray::UniqueID const&)'
../libray.a(client.cc.o):(.data.rel.ro._ZTVN3ray3gcs15PubsubInterfaceINS_8UniqueIDEEE[_ZTVN3ray3gcs15PubsubInterfaceINS_8UniqueIDEEE]+0x18): undefined reference to `ray::gcs::PubsubInterface<ray::UniqueID>::CancelNotifications(ray::UniqueID const&, ray::UniqueID const&, ray::UniqueID const&)'
collect2: error: ld returned 1 exit status
make[2]: *** [src/ray/gcs/client_test] Error 1
make[1]: *** [src/ray/gcs/CMakeFiles/client_test.dir/all] Error 2
make: *** [all] Error 2
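Reading the mangled symbols, the "undefined reference" errors point at `PubsubInterface<ray::UniqueID>::RequestNotifications` and `CancelNotifications`. A common cause of this pattern (an inference from the symbols, not necessarily what this PR's fix was) is a class template that declares plain virtual methods without bodies: the vtable then references definitions that don't exist. Declaring the methods pure virtual removes the need for out-of-line definitions, as in this sketch:

```cpp
#include <cassert>
#include <string>

// Sketch: pure virtual methods need no definitions, so the vtable for the
// template instantiation links cleanly against implementing subclasses.
template <typename ID>
class PubsubInterface {
 public:
  virtual ~PubsubInterface() = default;
  virtual void RequestNotifications(const ID &job_id, const ID &id,
                                    const ID &client_id) = 0;
  virtual void CancelNotifications(const ID &job_id, const ID &id,
                                   const ID &client_id) = 0;
};

// Minimal implementer used to show that the interface links and dispatches.
class MockPubsub : public PubsubInterface<std::string> {
 public:
  void RequestNotifications(const std::string &, const std::string &,
                            const std::string &) override { ++requests; }
  void CancelNotifications(const std::string &, const std::string &,
                           const std::string &) override { ++cancels; }
  int requests = 0;
  int cancels = 0;
};
```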

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4928/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4929/

@robertnishihara robertnishihara merged commit 6bd944a into ray-project:master Apr 16, 2018
@robertnishihara robertnishihara deleted the lineage-cache-gcs branch April 16, 2018 03:16
royf added a commit to royf/ray that referenced this pull request Apr 22, 2018
* master:
  Handle interrupts correctly for ASIO synchronous reads and writes. (ray-project#1929)
  [DataFrame] Adding read methods and tests (ray-project#1712)
  Allow task_table_update to fail when tasks are finished. (ray-project#1927)
  [rllib] Contribute DDPG to RLlib (ray-project#1877)
  [xray] Workers blocked in a `ray.get` release their resources (ray-project#1920)
  Raylet task dispatch and throttling worker startup (ray-project#1912)
  [DataFrame] Eval fix (ray-project#1903)
  [tune] Polishing docs (ray-project#1846)
  [tune] [rllib] Automatically determine RLlib resources and add queueing mechanism for autoscaling (ray-project#1848)
  Preemptively push local arguments for actor tasks (ray-project#1901)
  [tune] Allow fetching pinned objects from trainable functions (ray-project#1895)
  Multithreading refactor for ObjectManager. (ray-project#1911)
  Add slice functionality (ray-project#1832)
  [DataFrame] Pass read_csv kwargs to _infer_column (ray-project#1894)
  Addresses missed comments from multichunk object transfer PR. (ray-project#1908)
  Allow numpy arrays to be passed by value into tasks (and inlined in the task spec). (ray-project#1816)
  [xray] Lineage cache requests notifications from the GCS about remote tasks (ray-project#1834)
  Fix UI issue for non-json-serializable task arguments. (ray-project#1892)
  Remove unnecessary calls to .hex() for object IDs. (ray-project#1910)
  Allow multiple raylets to be started on a single machine. (ray-project#1904)

# Conflicts:
#	python/ray/rllib/__init__.py
#	python/ray/rllib/dqn/dqn.py
3 participants