[Data] support batch_format for Sort and Aggregate#48287
[Data] support batch_format for Sort and Aggregate#48287scottjlee merged 8 commits intoray-project:masterfrom
Conversation
a2b2bc3 to
0045e27
Compare
|
Hi @scottjlee, could you take a look when you have time? Thanks a lot! |
scottjlee
left a comment
There was a problem hiding this comment.
Great start, have a suggestion on how we can set the batch_format without having it explicitly set at Dataset level.
python/ray/data/_internal/logical/rules/inherit_batch_format.py
Outdated
Show resolved
Hide resolved
python/ray/data/_internal/logical/rules/inherit_batch_format.py
Outdated
Show resolved
Hide resolved
6fc8b80 to
bc0caf0
Compare
bc0caf0 to
e3f14e7
Compare
python/ray/data/_internal/logical/rules/inherit_batch_format.py
Outdated
Show resolved
Hide resolved
81097ac to
f39498b
Compare
Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
f39498b to
e4f5009
Compare
scottjlee
left a comment
There was a problem hiding this comment.
Thanks for your contribution @xingyu-long !
python/ray/data/_internal/logical/rules/inherit_batch_format.py
Outdated
Show resolved
Hide resolved
python/ray/data/_internal/logical/rules/inherit_batch_format.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
20a24af to
afeb03c
Compare
| for node in op.post_order_iter(): | ||
| nodes.appendleft(node) | ||
|
|
||
| while len(nodes) > 0: |
There was a problem hiding this comment.
Let's combine these 2 loops
There was a problem hiding this comment.
Looking through the codebase, I was trying to use same coding style for this matter. it seems not worth it to combine? Thanks!
There was a problem hiding this comment.
We can't just blindly follow the patterns, these need to make sense, right?
What's the motivation for first collecting the nodes into a queue, instead of traversing the iterator directly?
Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
## Why are these changes needed? While we calling `xxx.map_groups(..., batch_format="...")`, we may invoke sort function and creating empty blocks which still uses pyarrow by default. And, when we invoke another sort call on top of it, we will hit `AttributeError: 'DataFrame' object has no attribute 'num_rows'` since we uses first block type. (However, we may have different blocks). See more details in ray-project#46748 ## Related issue number Close ray-project#46748 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Xingyu Long <xingyulong97@gmail.com> Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>
Why are these changes needed?
While we calling
xxx.map_groups(..., batch_format="..."), we may invoke sort function and creating empty blocks which still uses pyarrow by default. And, when we invoke another sort call on top of it, we will hitAttributeError: 'DataFrame' object has no attribute 'num_rows'since we uses first block type. (However, we may have different blocks). See more details in #46748Related issue number
Close #46748
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.