[data] Abstractions for joins #57022

Merged
alexeykudinkin merged 1 commit into ray-project:master from iamjustinhsu:jhsu/abstractions-for-joins on Oct 1, 2025

Conversation

Contributor

iamjustinhsu commented Sep 29, 2025

Why are these changes needed?

As titled: refactors the join implementation so that joins can be overridden in the future.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Note

Refactors join execution to split preprocessing/postprocessing, encapsulate helper logic as class methods, and standardize index column handling.

  • Join execution refactor:
    • Introduce _DatasetPreprocessingResult for splitting tables into supported_projection and unsupported_projection.
    • Add _preprocess and _postprocess flow in finalize() to handle supported-column joins and restore unsupported columns.
    • Convert helper utilities to instance methods: _split_unsupported_columns, _append_index_column, _add_back_unsupported_columns, _is_pa_join_not_supported.
    • Standardize index column naming via _index_name(suffix) and remove hardcoded names.
    • Maintain join verb mapping and join semantics while reorganizing logic.

Written by Cursor Bugbot for commit 74ae0a3. This will update automatically on new commits.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
iamjustinhsu force-pushed the jhsu/abstractions-for-joins branch from de49c44 to 74ae0a3 on September 30, 2025 18:28
iamjustinhsu added the "go" label (add ONLY when ready to merge, run all tests) on Oct 1, 2025
iamjustinhsu marked this pull request as ready for review on October 1, 2025 00:30
iamjustinhsu requested a review from a team as a code owner on October 1, 2025 00:30
    ):
        unsupported.append(idx)
    else:
        supported.append(idx)

Bug: PyArrow Extension Types Incorrectly Unjoinable

The _split_unsupported_columns method incorrectly treats all PyArrow extension types as unjoinable. It lost the logic to unwrap extension types to their storage type, which means types with joinable underlying data are now unnecessarily indexed. This can break joins or degrade performance.

ray-gardener bot added the "data" label (Ray Data-related issues) on Oct 1, 2025
alexeykudinkin merged commit 84da583 into ray-project:master on Oct 1, 2025
7 checks passed
eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
liulehui pushed a commit to liulehui/ray that referenced this pull request Oct 9, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: Josh Kodi <joshkodi@gmail.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Labels

data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)

3 participants