Support metadata for passing by value task arguments#5527
Support metadata for passing by value task arguments#5527raulchen merged 15 commits intoray-project:masterfrom
Conversation
|
Test FAILed. |
|
Test PASSed. |
|
Test PASSed. |
|
Test PASSed. |
|
Test PASSed. |
zhijunfu
left a comment
There was a problem hiding this comment.
Thanks, I left a few comments.
src/ray/core_worker/common.h
Outdated
| /// Data of the argument, if passed by value, otherwise nullptr. | ||
| const std::shared_ptr<Buffer> data_; | ||
| /// Metadata of the argument, if passed by value, otherwise nullptr. | ||
| const std::shared_ptr<Buffer> metadata_; |
There was a problem hiding this comment.
Is it possible to reuse the existing RayObject to present the by-value param in the TaskArg? That way it should be more clear and more consistent.
There was a problem hiding this comment.
also for the data_ and metadata_, could we maintain the following invariants to make the code simpler?
- one of them must be non-null;
- if it's not null, then the buffer must be valid, that is, the size of the buffer must be greater than 0. Thus can only reply on the validity of the pointer itself, without need to further check its size.
There was a problem hiding this comment.
I'm not sure whether forbidding empty buffers is a better choice. Maybe we can add two more functions, HasData and HasMetadata.
Do you have any concerns about processing empty buffers?
There was a problem hiding this comment.
Just found that RayObject already has the HasMetadata function. I think it's better to ensure that data_ cannot be nullptr yet is allowed to be empty. As for metadata_, we don't set any restrictions.
There was a problem hiding this comment.
Could you explain a bit more on why it's better to ensure data_ cannot be nullptr?
I think for error conditions data_ could nullptr?
I preferred to not have empty buffers because:
- it's more intuitive and more clear conceptually. When it's not a nullptr, it's always valid.
- in most cases we only have
data_but notmetadata_, and we can save a memory allocation/free in this case. - I'm a bit concerned that having empty buffers might be error-prone. e.g. it's possible that people might only check the validity of the pointer, but don't check if buffer size is empty, then it might cause some bugs.
There was a problem hiding this comment.
I've changed my mind. I think we should ensure that the same set of rules apply to both data_ and metadata_ to avoid confusion. I tried to add checks in the constructor of RayObject to avoid empty buffers, but it just made it worse. Tests failed due to the checks and I can't find out the source of empty buffers. And it cost me hours.
I suggest to keep it as is. We can add more checks after we have an agreement on nullptr vs. empty buffer.
There was a problem hiding this comment.
I see. Then how about just keep it for now, and leave a todo that we can revisit this later.
There was a problem hiding this comment.
I've added checks to avoid empty buffers when constructing a RayObject instance.
|
Test FAILed. |
|
Test PASSed. |
|
Test PASSed. |
edoakes
left a comment
There was a problem hiding this comment.
Mostly looks good, just please be very clear about the semantics of the RayObject as it is getting more and more complicated.
src/ray/core_worker/common.h
Outdated
| } | ||
|
|
||
| RAY_CHECK((data_ && data_->Size()) || (metadata_ && metadata_->Size())) | ||
| << "Data and metadat cannot be both empty."; |
There was a problem hiding this comment.
| << "Data and metadat cannot be both empty."; | |
| << "Data and metadata cannot both empty."; |
src/ray/core_worker/common.h
Outdated
| } | ||
|
|
||
| /// Whether this object has metadata. | ||
| bool HasMetadata() const { return metadata_ != nullptr; } |
There was a problem hiding this comment.
If the invariant is that the object has either data_ or metadata_, please also add a HasData method.
src/ray/core_worker/common.h
Outdated
| }; | ||
|
|
||
| /// Binary representation of ray object. | ||
| class RayObject { |
There was a problem hiding this comment.
Please comment that only one of data_ or metadata_ must be present.
src/ray/core_worker/common.h
Outdated
| // If this object is required to hold a copy of the data, | ||
| // make a copy if the passed in buffers don't already have a copy. | ||
| if (data_ && !data_->OwnsData()) { | ||
| data_ = std::make_shared<LocalMemoryBuffer>(data_->Data(), data_->Size(), true); |
There was a problem hiding this comment.
| data_ = std::make_shared<LocalMemoryBuffer>(data_->Data(), data_->Size(), true); | |
| data_ = std::make_shared<LocalMemoryBuffer>(data_->Data(), data_->Size(), /*flag_name=*/true); |
Same for below.
src/ray/core_worker/common.h
Outdated
| RayObject(const std::shared_ptr<Buffer> &data, | ||
| const std::shared_ptr<Buffer> &metadata = nullptr, bool copy_data = false) | ||
| : data_(data), metadata_(metadata), has_data_copy_(copy_data) { | ||
| if (has_data_copy_) { |
There was a problem hiding this comment.
I know that this PR didn't introduce this flag, but can you please explain why we need the has_data_copy_ flag and its semantics in the header comment?
There was a problem hiding this comment.
What happens when it's set, when should the caller set it, what could go wrong if it isn't set when it should be, etc.
There was a problem hiding this comment.
This is related to Buffer definition, please refer to src/ray/common/buffer.h.
For example, by default when you initialize a LocalMemoryBuffer with a data pointer and a length, it just assigns the pointer and length, but doesn't copy the data content, this is for performance reasons, but in this case the buffer cannot ensure data validity, it instead relies the lifetime passed in data pointer.
This is fine for most cases - for example when you put an object into store, you create a temporary RayObject and don't want to do an extra copy. But in some cases you do want to always hold a valid data - for example, memory store uses RayObject to represent objects, in this case you actually want the object data to remain valid after user puts it into store.
There was a problem hiding this comment.
I see, that seems reasonable but we need to document it better (basically, the content of this comment). The comment in src/ray/common/buffer.h also doesn't have much more information.
src/ray/core_worker/common.h
Outdated
| const std::vector<std::string> function_descriptor; | ||
| }; | ||
|
|
||
| /// Binary representation of ray object. |
There was a problem hiding this comment.
Make this comment more useful or remove it
There was a problem hiding this comment.
Please check the updated comment.
src/ray/core_worker/common.h
Outdated
| bool HasMetadata() const { return metadata_ != nullptr; } | ||
|
|
||
| private: | ||
| // TODO (kfstorm): Currently both a null pointer and a pointer points to a buffer with |
There was a problem hiding this comment.
Please fix this before merging. Shouldn't be too hard and can easily get lost afterwards.
There was a problem hiding this comment.
As I said in the discussion with @zhijunfu, I tried it, but a lot of tests failed and I can't find out all the sources of empty buffers after hours. You may continue to discuss it on that thread.
src/ray/core_worker/common.h
Outdated
| // zero size means empty data/metadata. We'd better pick one and treat the other as | ||
| // invalid. | ||
|
|
||
| /// Data of the ray object. |
src/ray/core_worker/common.h
Outdated
|
|
||
| /// Data of the ray object. | ||
| std::shared_ptr<Buffer> data_; | ||
| /// Metadata of the ray object. |
|
|
||
| public NativeRayObject(byte[] data, byte[] metadata) { | ||
| Preconditions.checkNotNull(data); | ||
| Preconditions.checkNotNull(metadata); |
There was a problem hiding this comment.
should metadata be a zero-length array when there's no metadata? If so, we can document this.
There was a problem hiding this comment.
Oops. This is something I didn't keep consistent with the C++ code. I'll make null pointer and zero-length array both valid. If you prefer to allow only one of them, we can discuss it in the first thread.
| byte[] data = nativeRayObject.data; | ||
|
|
||
| // If meta is not null, deserialize the object from meta. | ||
| if (meta != null && meta.length > 0) { |
There was a problem hiding this comment.
meta != null isn't needed now?
There was a problem hiding this comment.
As I'll make null pointer valid, it is needed.
| public final NativeRayObject value; | ||
|
|
||
| private FunctionArg(ObjectId id, byte[] data) { | ||
| private FunctionArg(ObjectId id, NativeRayObject value) { |
There was a problem hiding this comment.
check (id == null) != (value == null)
There was a problem hiding this comment.
This is a private constructor, but OK.
There was a problem hiding this comment.
Adding this check can help catch bugs eariler
src/ray/core_worker/common.h
Outdated
| }; | ||
|
|
||
| /// Binary representation of ray object. | ||
| class RayObject { |
There was a problem hiding this comment.
better put this RayObject class in src/common/, because raylet may also need to use this.
|
Test PASSed. |
raulchen
left a comment
There was a problem hiding this comment.
LGTM. can you fix the conflict? thx
|
Test PASSed. |
|
@raulchen Conflicts solved. |
| /// \param[in] data Data of the ray object. | ||
| /// \param[in] metadata Metadata of the ray object. | ||
| /// \param[in] copy_data Whether this class should hold a copy of data. | ||
| RayObject(const std::shared_ptr<Buffer> &data, const std::shared_ptr<Buffer> &metadata, |
There was a problem hiding this comment.
Please add a comment explaining the has_data_copy flag and why it's necessary.
src/ray/common/ray_object.h
Outdated
| bool HasMetadata() const { return metadata_ != nullptr && metadata_->Size() > 0; } | ||
|
|
||
| private: | ||
| // TODO (kfstorm): Currently both a null pointer and a pointer points to a buffer with |
There was a problem hiding this comment.
Please address this before we merge.
There was a problem hiding this comment.
I've added checks to avoid empty buffers.
| RAY_CHECK(data_ != nullptr) << "This argument isn't passed by value."; | ||
| return data_; | ||
| const RayObject &GetValue() const { | ||
| RAY_CHECK(value_ != nullptr) << "This argument isn't passed by value."; |
There was a problem hiding this comment.
Let's try not to use RAY_CHECK in the core worker as it causes an Abort trap: 6, making it very inconvenient to debug (and horrible for users if it happens somehow). Better to use RAY_RETURN_NO_OK.
There was a problem hiding this comment.
How is that possible? RAY_RETURN_NOT_OK is used for functions with the return type of Status.
There was a problem hiding this comment.
This RAY_CHECK should not get triggered by user code. If it fails, it must be a bug in ray codebase.
There was a problem hiding this comment.
Yeah, I think it's better to use RAY_CHECK to catch system bugs.
There was a problem hiding this comment.
Oops, sorry - of course you can't use RAY_RETURN_NOT_OK. Actually, isn't it better to define separate classes for by value and by reference arguments instead of having these checks? That way the compiler can catch these bugs for us...
|
Test FAILed. |
|
Test PASSed. |
|
@edoakes All comments should be addressed. Please let us know if you have other comments. I'll postpone merging this PR until tomorrow. |
|
Test PASSed. |
|
LGTM aside from my comment about defining separate structs/classes for by value and by reference args. You can push that to another PR if this one is blocking things, but seems much better than |
|
@edoakes By-value and by-reference arguments need different methods ( I'll merge this PR first, as the refactor may need more discussions. |
Why are these changes needed?
This feature is needed by direct actor call, as it only accepts passing by value arguments. When passing byte array between Java and Python, we need metadata in task arguments to make passing byte array as value possible. Hence, this PR is a prerequisite of #5504.
What do these changes do?
Add
matadatafield forTaskArgincommon.proto.Update the object serialization code of both Java and Python.
Extract
serializeanddeserialziemethods ofObjectStoreto a new classObjectSerializer.Related issue number
#5029
Linter
scripts/format.shto lint the changes in this PR.