Single Big Object Parallel Transfer #1827
Conversation
Test FAILed.

Test PASSed.
| create_failure_buffers_;
| /// Tracks the number of calls to GetChunk that should fail after the first call
| /// fails to obtain the buffer.
| std::unordered_map<ray::ObjectID, uint64_t, ray::UniqueIDHasher> get_failed_count_;
I don't really understand why this is necessary. Can we just have each of the GetChunk calls fail/succeed independently of each other?
| }
| std::shared_ptr<plasma::PlasmaClient> ObjectBufferPool::GetObjectStore() {
|   if (available_clients.empty()) {
Now that the pool is at the granularity of object buffers instead of plasma clients, can we have a single client that is shared by all of the buffers?
| RAY_LOG(DEBUG) << "SealOrAbortBuffer " << object_id << " "
|                << buffer_state_[object_id].references;
| if (!succeeded) {
|   buffer_state_[object_id].one_failed = true;
Thinking about it now, it might make sense to only abort the chunk that failed, but keep any work that was done on the other chunks.
| /// Information needed about each object chunk.
| /// This is the structure returned whenever an object chunk is
| /// retrieved.
| struct ChunkInfo {
Would it work if we just returned a uint8_t * to the data buffer instead of defining this other struct?
We still need the buffer length, but I did remove the unnecessary fields from this struct.
| uint64_t chunk_size_;
| /// A vector for each object maintaining information about the chunks which comprise
| /// an object. ChunkInfo is used to transfer
| std::unordered_map<ray::ObjectID, std::vector<ChunkInfo>, ray::UniqueIDHasher>
Similar to above, can this be a map to std::vector<uint8_t *>?
| object_buffer.metadata_size));
| auto *data = const_cast<uint8_t *>(object_buffer.data->data());
| uint64_t num_chunks = BuildChunks(object_id, data, data_size, metadata_size);
| buffer_state_.emplace(
This seems to assume that if we do a Get on one chunk of an object, we are also going to call Get on every other chunk of the object. This also seems like it will break if we do simultaneous Gets on the same chunk of the same object. A less error-prone way of doing this might be to keep an actual reference count of the number of Get callers of each object. Then, this method would just increment the count, instead of setting the initial count to num_chunks.
| std::shared_ptr<plasma::PlasmaClient> client;
| /// The number of references that currently rely on this buffer.
| /// We expect this many calls to Release or SealOrAbortBuffer.
| uint64_t references;
An alternative design that comes to mind is to combine the chunk_info_ and buffer_state_ maps. Also, instead of storing both a std::vector of the chunk information and a references count, how about we just make it a std::list of the chunk information, and the reference count is just the size of the list?
stephanie-wang left a comment
Only partway through.
| /// Mutex for thread-safe operations.
| std::mutex pool_mutex_;
| /// Determines the maximum chunk size to be transferred by a single thread.
| uint64_t chunk_size_;
| /// This is the structure returned whenever an object chunk is
| /// retrieved.
| struct ChunkInfo {
|   ChunkInfo() {}
Is this ever called? Remove if not.
| /// \param chunk_index The chunk index for which to obtain the buffer length.
| /// \param data_size The size of the object + metadata.
| /// \return The number of chunks into which the object will be split.
Update this documentation. Also, generally it's nice to start with one line that describes the method's purpose, before the params.
| std::pair<const ObjectBufferPool::ChunkInfo &, ray::Status> CreateChunk(
|     const ObjectID &object_id, uint64_t data_size, uint64_t metadata_size,
|     uint64_t chunk_index);
| ray::Status ReleaseCreateChunk(const ObjectID &object_id, uint64_t chunk_index);
Document. Also, given its usage in the object manager, AbortCreateChunk might be a more appropriate name.
| ray::Status ObjectBufferPool::SealChunk(const ObjectID &object_id,
|                                         const uint64_t chunk_index) {
|   std::lock_guard<std::mutex> lock(pool_mutex_);
|   create_buffer_state_[object_id].chunk_references[chunk_index]--;
I don't think we can decrement the chunk_references here. Consider two threads that both create the first chunk of a two-chunk object:
T1 creates chunk 1
T1 seals chunk 1
T2 creates chunk 1
T2 seals chunk 1 --> object is sealed in the object store, but only one chunk was created
It would be really nice if we had unit tests for these kinds of cases.
| ray::Status SealChunk(const ObjectID &object_id, uint64_t chunk_index);
| /// Abort the create operation associated with an object.
| ray::Status AbortCreate(const ObjectID &object_id);
This looks like it's only called in the destructor. Should this be private? Also, please be clear in the documentation that this aborts all chunks of the object, including outstanding ones.
| /// an object.
| std::vector<ChunkInfo> chunk_info;
| /// Reference counts for each chunk.
| std::vector<uint64_t> chunk_references;
Is this field necessary? It's only incremented and decremented, never read.
| const uint64_t chunk_index) {
|   std::lock_guard<std::mutex> lock(pool_mutex_);
|   create_buffer_state_[object_id].chunk_references[chunk_index]--;
|   // Make sure ReleaseCreateChunk OR SealChunk is called.
Might want to AbortObject here if chunk_references is equal to 0 at every chunk_index and num_chunks_remaining == num_chunks, since this is back to the initial state right before the first CreateChunk.
| }
| std::vector<ObjectBufferPool::ChunkInfo> ObjectBufferPool::BuildChunks(
|     const ObjectID &object_id, uint8_t *data, uint64_t data_size,
|     uint64_t metadata_size) {
metadata_size doesn't seem to be used in this method.
stephanie-wang left a comment
Looks good for the most part, mostly requested some cleanups.
| /// state, including create operations in progress for all chunks of the object.
| ///
| /// \param object_id The ObjectID.
| /// \return The status of invoking this method.
For methods that return ray::Status, we should be clear about when/how they might fail. For this method and some others on this class, it looks like the only time they should fail is if there's a bug, which we can catch with RAY_CHECK calls. I would make any of these methods return void instead.
| /// Holds the state of a get buffer.
| struct GetBufferState {
|   GetBufferState() {}
Why the empty constructor?
The compiler won't allow an unordered_map whose value type lacks a default constructor, since operator[] must be able to default-construct values for non-existent keys.
| /// Holds the state of a create buffer.
| struct CreateBufferState {
|   CreateBufferState() {}
Why the empty constructor?
| };
| /// Returned when Get fails.
| ChunkInfo errored_chunk_ = {0, nullptr, 0};
| object_directory_(new ObjectDirectory(gcs_client)),
| store_notification_(main_service, config.store_socket_name),
| store_pool_(config.store_socket_name),
| buffer_pool_(config.store_socket_name, config.object_chunk_size, 2*config.max_sends),
Add a comment explaining the rationale for 2*config.max_sends. The style guide also recommends adding a `/*release_delay=*/` comment right before the argument value for parameters whose purpose isn't clear. Also, I'm a little surprised linting didn't catch this, but add spaces around the *.
| connection_pool_.ReleaseSender(ConnectionPool::ConnectionType::TRANSFER, conn));
| return ray::Status::IOError(
|     "Unable to transfer object to requesting plasma manager, object not local.");
| std::pair<const ObjectBufferPool::ChunkInfo &, ray::Status> pair =
|     buffer_pool_.GetChunk(object_id, data_size, metadata_size, chunk_index);
Please use a more descriptive variable name than pair.
| ObjectBufferPool::ChunkInfo chunk_info = pair.first;
| if (!pair.second.ok()) {
|   // This is the first thread to invoke GetChunk => Get failed on the
Update this comment.
| RAY_LOG(DEBUG) << "ExecuteReceiveObject " << client_id << " " << object_id << " "
|                << chunk_index;
| std::pair<const ObjectBufferPool::ChunkInfo &, ray::Status> pair =
|     buffer_pool_.CreateChunk(object_id, data_size, metadata_size, chunk_index);
Please use a more descriptive variable name than pair.
| mutable_vec.resize(buffer_length);
| uint8_t *mutable_data = mutable_vec.data();
| std::vector<boost::asio::mutable_buffer> buffer;
| buffer.push_back(asio::buffer(mutable_data, buffer_length));
asio::buffer also has a constructor that can take in the mutable_vec vector directly. Also, probably should've pointed this out earlier, but is it necessary to have the std::vector<boost::asio::mutable_buffer> buffer? Why not pass in the mutable buffer directly?
| /// \param client_id The ClientID to which the object needs to be sent.
| /// \param object_id The ObjectID of the object to be sent.
| void QueueSend(const ClientID &client_id, const ObjectID &object_id,
| void QueueSend(const ClientID &client_id, const ObjectID &object_id, uint64_t data_size,
Update the documentation for the params here and in other places in the file.
These have been updated.
stephanie-wang left a comment
Looks good to merge, once the last changes are fixed and Travis tests pass (make sure to check the linting test on the merged build, not just the build for this branch).
Why check in all the pprof visualizations?
robertnishihara left a comment
Ah, good catch @concretevitamin. Please remove everything under src/ray/object_manager/test/profile and any other unnecessary files.
| }
| uint64_t ObjectBufferPool::GetNumChunks(uint64_t data_size) {
|   return static_cast<uint64_t>(ceil(static_cast<double>(data_size) / chunk_size_));
The canonical way to do this is integer division: (x + y - 1) / y.
This reverts commit 4af43b1.
| namespace ray {
| ObjectBufferPool::ObjectBufferPool(const std::string &store_socket_name,
|                                    const uint64_t chunk_size, const int release_delay)
It doesn't make sense for the uint64_t and int arguments to be const, does it?
| // Maximum number of receives allowed.
| object_manager_config.max_receives = 2;
| // Object chunk size, in bytes.
| object_manager_config.object_chunk_size = static_cast<uint64_t>(std::pow(10, 8));
I realize this has already been merged, but constants need to go in ray_config.h.
| killall plasma_store || true
| $REDIS_DIR/redis-cli -p 6379 shutdown || true
| # killall plasma_store || true
| # $REDIS_DIR/redis-cli -p 6379 shutdown || true
Why are we commenting these out?
They'll be removed or uncommented. The tests now start and remove plasma stores by pid.
I see that this has already been merged, but I left a few more comments. Also note that @stephanie-wang had some comments that were not addressed but should be.
| Lock context_mutex;
| std::mutex send_mutex_;
| std::mutex receive_mutex_;
Any lock must be documented very clearly. We need to know exactly what it is protecting and why.
| std::shared_ptr<TcpClientConnection> conn) {
| WriteLock guard(receive_mutex);
| ReceiveRequest req = {client_id, object_id, object_size, conn};
| std::unique_lock<std::mutex> guard(receive_mutex_);
We should be using lock_guard instead of unique_lock.
From http://jakascorner.com/blog/2016/02/lock_guard-and-unique_lock.html:
The rule of thumb is to always use std::lock_guard. But if we need some higher level functionalities, which are available by std::unique_lock, then we should use the std::unique_lock.
@robertnishihara I will address your comments in a separate PR.
object manager config bug fix. addresses other comments from ray-project#1827.
* Move object manager parameters to ray config, object manager config bug fix. Addresses other comments from #1827.
* linting and uint?
* typos
* remove uint.
* master: (56 commits)
  [xray] Turn on flushing to the GCS for the lineage cache (ray-project#1907)
  Single Big Object Parallel Transfer. (ray-project#1827)
  Remove num_threads as a parameter. (ray-project#1891)
  Adds Valgrind tests for multi-threaded object manager. (ray-project#1890)
  Pin cython version in docker base dependencies file. (ray-project#1898)
  Update arrow to efficiently serialize more types of numpy arrays. (ray-project#1889)
  updates (ray-project#1896)
  [DataFrame] Inherit documentation from Pandas (ray-project#1727)
  Update arrow and parquet-cpp. (ray-project#1875)
  raylet command line resource configuration plumbing (ray-project#1882)
  use raylet for remote ray nodes (ray-project#1880)
  [rllib] Propagate dim option to deepmind wrappers (ray-project#1876)
  [RLLib] DDPG (ray-project#1685)
  Lint Python files with Yapf (ray-project#1872)
  [DataFrame] Fixed repr, info, and memory_usage (ray-project#1874)
  Fix getattr compat (ray-project#1871)
  check if arrow build dir exists (ray-project#1863)
  [DataFrame] Encapsulate index and lengths into separate class (ray-project#1849)
  [DataFrame] Implemented __getattr__ (ray-project#1753)
  Add better analytics to docs (ray-project#1854)
  ...

# Conflicts:
#   python/ray/rllib/__init__.py
#   python/setup.py
Enables parallel transfer of a single big object.