[Object Spilling] Initial Iteration of S3 adapter.#11379

Merged
rkooo567 merged 13 commits into ray-project:master from rkooo567:s3-object-spilling
Oct 30, 2020

Conversation

@rkooo567
Contributor

Why are these changes needed?

This implements the initial (most primitive) prototype to support the S3 backend. Unfortunately, it does not yet work well (it is extremely slow). There are three limitations in this PR, and I will open separate PRs to fix each of them.

This PR

  • Builds a basic test framework. (Tests will run twice with different object spilling configs.)
  • Makes sure it "works" (though it has really bad performance).
  • Adds some useful validation logic to the ray.init path.
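The validation bullet can be pictured with a small sketch. This is a hypothetical helper, not Ray's actual internals: it parses the spilling config JSON and rejects unsupported backends or missing parameters, mirroring the kind of checks this PR adds to the ray.init path. The backend names and required params follow the configs shown later in this thread.

```python
import json

# Hypothetical sketch of config validation on the ray.init path.
# Backend names and required params mirror the configs shown in this
# thread; none of these identifiers are Ray's real internals.
SUPPORTED_BACKENDS = {
    "filesystem": {"directory_path"},
    "s3": {"bucket_name"},
}

def validate_object_spilling_config(config_str: str) -> dict:
    """Parse a spilling config JSON string and validate it."""
    config = json.loads(config_str)
    backend = config.get("type")
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unsupported spilling backend: {backend!r}")
    missing = SUPPORTED_BACKENDS[backend] - set(config.get("params", {}))
    if missing:
        raise ValueError(f"Missing params for {backend!r}: {sorted(missing)}")
    return config
```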

Follow up

  • It doesn't properly mock out S3 in tests, so we cannot test it in our CI. There are several S3 mocking libraries (for example, moto: https://github.com/spulec/moto). We should adopt one so that we can at least properly mock-test the S3 backend.
  • It has very poor performance. It doesn't implement fancy optimizations like small-object fusion, of course, but it also skips basic ones like a thread pool or an event queue. I tried ThreadPoolExecutor and realized it wouldn't work, because the raylet pings the core worker C++ endpoint, which subsequently invokes the restore_objects method. I will write a design doc before I implement this.
  • It neither modularizes the different adapters nor defines proper interfaces. This also needs more discussion (about the right "interface").

Related issue number

#9850

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 14, 2020
@ericl
Contributor

ericl commented Oct 15, 2020 via email

@rkooo567
Contributor Author

Finished the smart_open implementation. Here are the results: S3 is about 500 times slower for get, and about 170 times slower for put, with multi-part download for 96MB objects.

fs

Object spilling benchmark for the config {'type': 'filesystem', 'params': {'directory_path': '/tmp'}}
Spilling 50 number of objects of size 12582912B takes 4.317760701999999 seconds with 10 number of io workers.
Getting all objects takes 2.3372979060000008 seconds.

Smart open

Object spilling benchmark for the config {'type': 's3', 'params': {'bucket_name': 'sang-object-spilling-test'}}
Spilling 50 number of objects of size 12582912B takes 743.244563962 seconds with 10 number of io workers.
Getting all objects takes 1165.6987561679998 seconds.
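Converting the reported numbers into throughput makes the gap concrete (simple arithmetic over the figures above, with the seconds rounded):

```python
# Throughput implied by the benchmark output above.
num_objects, obj_size = 50, 12582912           # 50 objects of 12 MiB each
total_mib = num_objects * obj_size / 2**20     # 600 MiB in total

fs_spill, fs_get = 4.32, 2.34                  # filesystem backend, seconds
s3_spill, s3_get = 743.24, 1165.70             # smart_open/S3 backend, seconds

fs_spill_tp = total_mib / fs_spill   # ~139 MiB/s
s3_spill_tp = total_mib / s3_spill   # ~0.8 MiB/s
fs_get_tp = total_mib / fs_get       # ~256 MiB/s
s3_get_tp = total_mib / s3_get       # ~0.5 MiB/s
```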

There are some possible root causes, but we need more performance analysis, and I won't handle them in this PR.

That said, after this PR I will:

  1. Fix the IO worker's issue.
  2. Benchmark within an EC2 instance with different object sizes.
  3. Implement object fusion.
  4. Do more performance analysis and optimization.
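Step 3, object fusion, means batching many small objects into one spill file so each S3 request amortizes its fixed latency. A minimal sketch of the bookkeeping (hypothetical, not the design Ray ended up with): batches accumulate until a size threshold, and an index records where each object lives for later restore.

```python
def fuse_objects(objects, min_batch_bytes=100 * 1024 * 1024):
    """Group (object_id, payload) pairs into batches of at least
    min_batch_bytes each, so every batch becomes one spill file.

    Returns (batches, index) where index maps object_id ->
    (batch_idx, offset, length) for restoring individual objects.
    """
    batches, index = [], {}
    current, current_size = [], 0
    for obj_id, payload in objects:
        index[obj_id] = (len(batches), current_size, len(payload))
        current.append(payload)
        current_size += len(payload)
        if current_size >= min_batch_bytes:
            batches.append(b"".join(current))
            current, current_size = [], 0
    if current:  # flush the trailing partial batch
        batches.append(b"".join(current))
    return batches, index
```

Restore then becomes a single ranged read per object: slice `batches[batch_idx][offset:offset + length]` instead of one GET per object.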

@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 20, 2020
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 21, 2020
@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 28, 2020
@rkooo567
Contributor Author

Addressed all reviews. The next steps:

  • Finish writing a design for fusing small objects.
  • Figure out why only 2 IO workers are used although I specified a higher number.
  • Run an actual performance benchmark within an Amazon VPC in the same region as the S3 bucket (and tune).

Contributor

@ericl ericl left a comment

Looks good but I think it would be good to simplify the error handling.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 29, 2020
@rkooo567 rkooo567 merged commit 71c5089 into ray-project:master Oct 30, 2020
