[Object Spilling] Initial Iteration of S3 adapter.#11379

Merged
rkooo567 merged 13 commits into ray-project:master from rkooo567:s3-object-spilling
Oct 30, 2020

Conversation

@rkooo567
Contributor

Why are these changes needed?

This implements the initial (most primitive) prototype to support the S3 backend. Unfortunately, it does not yet work well (it is extremely slow). There are three limitations in this PR, and I will open separate PRs to fix each of them.

This PR

  • Builds a basic test framework. (Tests will run twice with different object spilling configs.)
  • Makes sure it "works" (though it has really bad performance).
  • Adds some useful validation logic to the ray.init path.
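The validation bullet can be pictured with a small sketch. This is a hypothetical helper, not Ray's actual internals: it parses the spilling config JSON and rejects unsupported backends or missing parameters, mirroring the kind of checks this PR adds to the ray.init path. The backend names and required params follow the configs shown later in this thread.

```python
import json

# Hypothetical sketch of config validation on the ray.init path.
# Backend names and required params mirror the configs shown in this
# thread; none of these identifiers are Ray's real internals.
SUPPORTED_BACKENDS = {
    "filesystem": {"directory_path"},
    "s3": {"bucket_name"},
}

def validate_object_spilling_config(config_str: str) -> dict:
    """Parse a spilling config JSON string and validate it."""
    config = json.loads(config_str)
    backend = config.get("type")
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unsupported spilling backend: {backend!r}")
    missing = SUPPORTED_BACKENDS[backend] - set(config.get("params", {}))
    if missing:
        raise ValueError(f"Missing params for {backend!r}: {sorted(missing)}")
    return config
```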

Follow up

  • It doesn't properly mock out S3 in tests, so we cannot test it in our CI. There are several S3 mocking libraries (for example, moto: https://github.com/spulec/moto). We should adopt one so that we can at least properly mock-test the S3 backend.
  • It has very poor performance. It doesn't implement fancy optimizations like small-object fusion, of course, but it also skips basic ones like a thread pool or an event queue. I tried ThreadPoolExecutor and realized it wouldn't work, because the raylet pings the core worker C++ endpoint, which subsequently invokes the restore_objects method. I will write a design doc before I implement this.
  • It neither modularizes the different adapters nor defines proper interfaces. This also needs more discussion (about the right "interface").

Related issue number

#9850

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 14, 2020
@ericl
Contributor

ericl commented Oct 15, 2020 via email

@rkooo567
Contributor Author

Finished the smart_open implementation. Here are the results: S3 is about 500 times slower for get, and about 170 times slower for put, with multi-part download for 96MB objects.

fs

Object spilling benchmark for the config {'type': 'filesystem', 'params': {'directory_path': '/tmp'}}
Spilling 50 number of objects of size 12582912B takes 4.317760701999999 seconds with 10 number of io workers.
Getting all objects takes 2.3372979060000008 seconds.

Smart open

Object spilling benchmark for the config {'type': 's3', 'params': {'bucket_name': 'sang-object-spilling-test'}}
Spilling 50 number of objects of size 12582912B takes 743.244563962 seconds with 10 number of io workers.
Getting all objects takes 1165.6987561679998 seconds.
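Converting the reported numbers into throughput makes the gap concrete (simple arithmetic over the figures above, with the seconds rounded):

```python
# Throughput implied by the benchmark output above.
num_objects, obj_size = 50, 12582912           # 50 objects of 12 MiB each
total_mib = num_objects * obj_size / 2**20     # 600 MiB in total

fs_spill, fs_get = 4.32, 2.34                  # filesystem backend, seconds
s3_spill, s3_get = 743.24, 1165.70             # smart_open/S3 backend, seconds

fs_spill_tp = total_mib / fs_spill   # ~139 MiB/s
s3_spill_tp = total_mib / s3_spill   # ~0.8 MiB/s
fs_get_tp = total_mib / fs_get       # ~256 MiB/s
s3_get_tp = total_mib / s3_get       # ~0.5 MiB/s
```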

There are some possible root causes, but we need more performance analysis, and I won't handle them in this PR.

That said, after this PR I will:

  1. Fix the IO worker's issue.
  2. Benchmark within an EC2 instance with different object sizes.
  3. Implement object fusion.
  4. Do more performance analysis and optimization.
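Step 3, object fusion, means batching many small objects into one spill file so each S3 request amortizes its fixed latency. A minimal sketch of the bookkeeping (hypothetical, not the design Ray ended up with): batches accumulate until a size threshold, and an index records where each object lives for later restore.

```python
def fuse_objects(objects, min_batch_bytes=100 * 1024 * 1024):
    """Group (object_id, payload) pairs into batches of at least
    min_batch_bytes each, so every batch becomes one spill file.

    Returns (batches, index) where index maps object_id ->
    (batch_idx, offset, length) for restoring individual objects.
    """
    batches, index = [], {}
    current, current_size = [], 0
    for obj_id, payload in objects:
        index[obj_id] = (len(batches), current_size, len(payload))
        current.append(payload)
        current_size += len(payload)
        if current_size >= min_batch_bytes:
            batches.append(b"".join(current))
            current, current_size = [], 0
    if current:  # flush the trailing partial batch
        batches.append(b"".join(current))
    return batches, index
```

Restore then becomes a single ranged read per object: slice `batches[batch_idx][offset:offset + length]` instead of one GET per object.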

@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 20, 2020
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 21, 2020
@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 28, 2020
@rkooo567
Contributor Author

Addressed all reviews. The next steps:

  • Finish writing a design for fusing small objects.
  • Figure out why only 2 IO workers are used although I specified a higher number.
  • Run an actual performance benchmark within an Amazon VPC in the same region as the S3 bucket (and tune).

Contributor

@ericl ericl left a comment

Looks good but I think it would be good to simplify the error handling.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 29, 2020
@rkooo567 rkooo567 merged commit 71c5089 into ray-project:master Oct 30, 2020
