[Data] Fixing remaining issues with custom tensor extensions #56918

alexeykudinkin merged 87 commits into ray-project:master
Conversation
Code Review
This pull request does a great job of streamlining the __eq__ and __hash__ implementations for Arrow tensor types. Centralizing the logic for fixed-shape tensor types into the base class is a good simplification, and adding the missing __eq__ method for ArrowVariableShapedTensorType improves correctness. I have a couple of suggestions to further improve the correctness of the __eq__ implementations by ensuring they are symmetric.
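A symmetric `__eq__` typically guards on the concrete type and returns `NotImplemented` for foreign types, so `a == b` and `b == a` always agree. A minimal sketch using plain Python stand-ins for the Arrow extension types (the class and field names here are illustrative, not Ray's actual code):

```python
class FixedShapeTensorType:
    def __init__(self, shape, dtype):
        self.shape = tuple(shape)
        self.dtype = dtype

    def __eq__(self, other):
        # Return NotImplemented (not False) for foreign types so Python can
        # try the reflected operation; this keeps equality symmetric.
        if isinstance(other, FixedShapeTensorType):
            return self.shape == other.shape and self.dtype == other.dtype
        return NotImplemented

    def __hash__(self):
        # Must agree with __eq__: equal objects hash equal.
        return hash((self.shape, self.dtype))


class VariableShapedTensorType:
    def __init__(self, ndim, dtype):
        self.ndim = ndim
        self.dtype = dtype

    def __eq__(self, other):
        if isinstance(other, VariableShapedTensorType):
            return self.ndim == other.ndim and self.dtype == other.dtype
        return NotImplemented

    def __hash__(self):
        return hash((self.ndim, self.dtype))
```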
__eq__ to ArrowVariableShapedTensorType, streamlining __hash__ impls
python/ray/data/tests/test_tensor.py (outdated):

    ]

    def test_tensor_type_equality_checks():
Nit: Use parameterized tests and add ids to each test.
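The suggestion could look something like this (the cases and ids are illustrative, with tuples standing in for the actual Arrow tensor types being compared; this is not the PR's actual test data):

```python
import pytest

# Hypothetical parameterized version of the equality test, with a readable
# id attached to each case as the reviewer suggests.
@pytest.mark.parametrize(
    "left, right, expected",
    [
        (("int64", (2, 3)), ("int64", (2, 3)), True),
        (("int64", (2, 3)), ("int64", (4,)), False),
        (("int64", (2, 3)), ("float32", (2, 3)), False),
    ],
    ids=["same-dtype-same-shape", "same-dtype-diff-shape", "diff-dtype-same-shape"],
)
def test_tensor_type_equality_checks(left, right, expected):
    assert (left == right) is expected
```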
    )
    all_dumped_bytes.append(dumped_bytes)
    arr = pa.array(all_dumped_bytes, type=type_.storage_type)
    return ArrowPythonObjectArray.from_storage(type_, arr)
Drive-by fixes
    return str(self)

    @classmethod
    def _need_variable_shaped_tensor_array(
Replaced by unify_tensor_type and unify_tensor_arrays
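A hedged sketch of what "unifying" tensor types means here, operating on bare shape tuples rather than Ray's actual extension types: if all fixed shapes agree, the fixed-shape type can be kept; otherwise a variable-shaped representation is needed.

```python
from typing import Optional, Sequence, Tuple


def unify_tensor_shapes(
    shapes: Sequence[Tuple[int, ...]]
) -> Optional[Tuple[int, ...]]:
    """Return the common fixed shape, or None if a var-shaped type is needed."""
    unique = set(shapes)
    # A single distinct shape means all arrays can stay fixed-shape.
    return next(iter(unique)) if len(unique) == 1 else None
```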
    # TODO(Clark): Remove this mixin once we only support Arrow 9.0.0+.
    class _ArrowTensorScalarIndexingMixin:
Deleting dead code
    # Create offsets buffer
    offsets = np.arange(
        0,
        (outer_len + 1) * num_items_per_element,
        num_items_per_element,
        dtype=pa_type_.OFFSET_DTYPE.to_pandas_dtype(),
    )
    offset_buffer = pa.py_buffer(offsets)
Using ndarrays instead of Python lists
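The hunk above can be sketched in isolation with NumPy alone; `outer_len` and `num_items_per_element` below are stand-in values, and `int32` is assumed where the diff uses the type's own `OFFSET_DTYPE`:

```python
import numpy as np

outer_len = 4              # number of tensors in the array (assumed value)
num_items_per_element = 6  # flattened size of each (2, 3) tensor (assumed)

# One vectorized np.arange call yields the monotonically increasing offsets
# directly, instead of building a Python list element by element.
offsets = np.arange(
    0,
    (outer_len + 1) * num_items_per_element,
    num_items_per_element,
    dtype=np.int32,
)
# The diff then wraps this buffer via pa.py_buffer(offsets).
```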
    def to_numpy(self, zero_copy_only: bool = True):
        """
        Convert the entire array of tensors into a single ndarray.

        Args:
            zero_copy_only: If True, an exception will be raised if the
                conversion to a NumPy array would require copying the
                underlying data (e.g. in presence of nulls, or for
                non-primitive types). This argument is currently ignored, so
                zero-copy isn't enforced even if this argument is true.

        Returns:
            A single ndarray representing the entire array of tensors.
        """
        return self._to_numpy(zero_copy_only=zero_copy_only)

        else:
            ext_dtype = value_type.to_pandas_dtype()

        return np.ndarray(shape, dtype=ext_dtype, buffer=data_buffer, offset=offset)

    def to_var_shaped_tensor_array(
        self,
        ndim: int,
    ) -> "ArrowVariableShapedTensorArray":
        """
Combined to_numpy and _to_numpy
| def _concat_same_type( | ||
| cls, | ||
| to_concat: Sequence[ | ||
| Union["ArrowTensorArray", "ArrowVariableShapedTensorArray"] | ||
| ], | ||
| ensure_copy: bool = False, | ||
| ) -> Union["ArrowTensorArray", "ArrowVariableShapedTensorArray"]: | ||
| Convert this tensor array to a variable-shaped tensor array. | ||
| """ | ||
| Concatenate multiple tensor arrays. | ||
|
|
||
| If one or more of the tensor arrays in to_concat are variable-shaped and/or any | ||
| of the tensor arrays have a different shape than the others, a variable-shaped | ||
| tensor array will be returned. | ||
|
|
||
| Args: | ||
| to_concat: Tensor arrays to concat | ||
| ensure_copy: Skip copying when ensure_copy is False and there is exactly 1 chunk. | ||
| """ | ||
| to_concat_types = [arr.type for arr in to_concat] | ||
| if ArrowTensorType._need_variable_shaped_tensor_array(to_concat_types): | ||
| # Need variable-shaped tensor array. | ||
| # TODO(Clark): Eliminate this NumPy roundtrip by directly constructing the | ||
| # underlying storage array buffers (NumPy roundtrip will not be zero-copy | ||
| # for e.g. boolean arrays). | ||
| # NOTE(Clark): Iterating over a tensor extension array converts each element | ||
| # to an ndarray view. | ||
| return ArrowVariableShapedTensorArray.from_numpy( | ||
| [e for a in to_concat for e in a] | ||
| shape = self.type.shape | ||
| if ndim < len(shape): | ||
| raise ValueError( | ||
| f"Can't convert {self.type} to var-shaped tensor type with {ndim=}" | ||
| ) | ||
| elif not ensure_copy and len(to_concat) == 1: | ||
| # Skip copying | ||
| return to_concat[0] | ||
| else: | ||
| storage = pa.concat_arrays([c.storage for c in to_concat]) | ||
|
|
||
| return ArrowTensorArray.from_storage(to_concat[0].type, storage) |
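The branch structure above can be distilled with plain Python lists standing in for Arrow arrays (the names here are illustrative): a single input with `ensure_copy=False` is returned as-is (zero-copy), otherwise contents are merged into a new container.

```python
def concat_arrays(to_concat, ensure_copy=False):
    if not ensure_copy and len(to_concat) == 1:
        # Skip copying: hand back the original object untouched.
        return to_concat[0]
    merged = []
    for a in to_concat:
        merged.extend(a)  # analogous to pa.concat_arrays over storage
    return merged
```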
| def _chunk_tensor_arrays( | ||
| cls, arrs: Sequence[Union["ArrowTensorArray", "ArrowVariableShapedTensorArray"]] | ||
| ) -> pa.ChunkedArray: | ||
| """ | ||
| Create a ChunkedArray from multiple tensor arrays. | ||
| """ | ||
| arrs_types = [arr.type for arr in arrs] | ||
| if ArrowTensorType._need_variable_shaped_tensor_array(arrs_types): | ||
| new_arrs = [] | ||
| for a in arrs: | ||
| if isinstance(a.type, get_arrow_extension_fixed_shape_tensor_types()): | ||
| a = a.to_variable_shaped_tensor_array() | ||
| assert isinstance(a.type, ArrowVariableShapedTensorType) | ||
| new_arrs.append(a) | ||
| arrs = new_arrs | ||
| return pa.chunked_array(arrs) |
    # Pre-allocate arrays for better performance
    raveled = np.empty(len(arr), dtype=np.object_)
    shapes = np.empty(len(arr), dtype=np.object_)

    sizes = np.arange(len(arr), dtype=np.int64)
Replaced dynamically allocated Python lists w/ preallocated ndarrays
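The preallocation pattern above can be sketched on a hypothetical ragged input; `arr`, `raveled`, `shapes`, and `sizes` mirror the names in the diff, but the data and the loop body are assumptions for illustration:

```python
import numpy as np

# A hypothetical ragged input: tensor elements of differing shapes.
arr = [np.zeros((2, 2)), np.ones((3,)), np.full((1, 4), 7.0)]

# Allocate the object-dtype containers once, instead of growing Python
# lists with repeated append() calls inside the loop.
raveled = np.empty(len(arr), dtype=np.object_)
shapes = np.empty(len(arr), dtype=np.object_)

for i, t in enumerate(arr):
    raveled[i] = t.ravel()  # flattened data for the storage buffer
    shapes[i] = t.shape     # kept so each tensor can be reconstructed

sizes = np.array([t.size for t in arr], dtype=np.int64)
```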
    def _to_numpy(self, index: Optional[int] = None, zero_copy_only: bool = False):
        """
        Helper for getting either an element of the array of tensors as an ndarray, or
        the entire array of tensors as a single ndarray.

        Args:
            index: The index of the tensor element that we wish to return as an
                ndarray. If not given, the entire array of tensors is returned as an
                ndarray.
            zero_copy_only: If True, an exception will be raised if the conversion to a
                NumPy array would require copying the underlying data (e.g. in presence
                of nulls, or for non-primitive types). This argument is currently
                ignored, so zero-copy isn't enforced even if this argument is true.

        Returns:
            The corresponding tensor element as an ndarray if an index was given, or
            the entire array of tensors as an ndarray otherwise.
        """
        # TODO(Clark): Enforce zero_copy_only.
        # TODO(Clark): Support strides?
        if index is None:
            # Get individual ndarrays for each tensor element.
            arrs = [self._to_numpy(i, zero_copy_only) for i in range(len(self))]
            # Return ragged NumPy ndarray in the ndarray of ndarray pointers
            # representation.
            return create_ragged_ndarray(arrs)
        data = self.storage.field("data")
Combining _to_numpy w/ to_numpy below
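The "ndarray of ndarray pointers" representation mentioned above can be illustrated with a stand-in for the `create_ragged_ndarray` helper (the real one lives in Ray's tensor extension utils; this is a sketch of the idea, not its actual code):

```python
import numpy as np

def create_ragged_ndarray(arrs):
    # A 1D object-dtype ndarray whose elements are the per-tensor ndarrays.
    out = np.empty(len(arrs), dtype=np.object_)
    for i, a in enumerate(arrs):
        out[i] = a  # per-element assignment avoids NumPy trying to broadcast
    return out

ragged = create_ragged_ndarray([np.zeros((2, 2)), np.ones((3,))])
```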
__eq__ to ArrowVariableShapedTensorType, streamlining __hash__ impls

    logger = logging.getLogger(__name__)


    def _check_pyarrow_version():
Does it need to be inside core, since it's only used by the library?
Should be shared by all libs
Then we should put it in the `_common` folder.
I think that makes sense, but I think we'd move both `get_pyarrow_version` and `_check_pyarrow_version` together (it doesn't make sense to have them in different places).
jjyao left a comment:
Please move arrow_utils.py (at least part of it) to _common as a follow-up.
…so that logic can conditionally execute or skip for doc building. Fixes doc build failure introduced by ray-project#56918. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ject#56918)

## Why are these changes needed?

While resolving issues that surfaced recently, more issues have come up, which prompted me to review the implementations of our (Arrow) tensor extensions and address a variety of issues discovered:

1. Added missing `ArrowVariableShapedTensorType.__eq__` (to make sure we can concat blocks holding these)
2. Fixed concatenation of AVSTT to properly reconcile different dimensions
3. Cleaned up and abstracted common utils to unify tensor types and provided arrays
4. Replaced Python arrays w/ ndarrays wherever possible
5. Deleted a lot of dead code
6. Rebased `ExtensionArray.from_storage` w/ `ExtensionType.wrap_array`

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

> [!NOTE]
> Refactors Ray Data's Arrow tensor extensions with type unification and zero-copy concat, replaces legacy APIs with wrap_array, and enforces PyArrow>=9 across codepaths with updated concat/schema alignment and tests.
>
> - **Tensor Extensions (Arrow)**:
>   - Introduce `unify_tensor_types`, `unify_tensor_arrays`, and `concat_tensor_arrays` with zero-copy helpers (`_concat_ndarrays`, `_are_contiguous_1d_views`).
>   - Add robust equality/hash for `ArrowTensorType`/`V2`/`ArrowVariableShapedTensorType`; simplify `ArrowTensorScalar`.
>   - Replace `ExtensionArray.from_storage(...)` with `ExtensionType.wrap_array(...)`; remove `_concat_same_type` and old chunking utilities.
>   - Add `to_var_shaped_tensor_array` and shape-padding utilities; optimize `to_numpy`/`from_numpy` and boolean handling.
> - **PyArrow Version Enforcement**:
>   - Add `_check_pyarrow_version` (min `9.0.0`, env override) in `ray/_private/arrow_utils.py`; integrate across Data (object/tensor extensions, util proxy).
>   - Update tests to validate failure on `pyarrow==8.0.0`; remove version-spoofing fixtures.
> - **Arrow Ops & Schema Handling**:
>   - Update `concat`, schema unification, and struct-field alignment to use tensor-type unification; improve error messages.
>   - Use `concat_tensor_arrays` in extension column combining.
> - **Other**:
>   - Simplify tensor scalar extraction in Arrow block accessor.
>   - Tests: add thorough tensor equality/concat/zero-copy cases; set `preserve_order` in limit/split tests; adjust bazel test size for `test_consumption`.
>
> Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 529de7a.

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
… be combined (#57240)

## Why are these changes needed?

The original [PR](#56918), while fixing all of the infra, missed deleting the line at the end that relaxed this constraint:

1. Removed constraint allowing AVSTT w/ diverging `ndim`s to be merged
2. Added tests

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…ject#56918) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? While resolving that surfaced recently, more issues have come up which prompted me to review implementations of our tensor (Arrow's) extensions and address a variety of issues discovered: 1. Added missing `ArrowVariableShapedTensorType.__eq__ ` (to make sure we can concat blocks holding these) 2. Fixed concatenation of AVSTT to properly reconcile different dimensions 3. Cleaned up and abstracted common utils to unify tensor types and provided arrays 4. Replaced Python arrays w/ ndarrays wherever possible 5. Deleted a lot of dead code 6. Rebased `ExtensionArray.from_storage` w/ `ExtensionType.wrap_array` ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Refactors Ray Data’s Arrow tensor extensions with type unification and zero-copy concat, replaces legacy APIs with wrap_array, and enforces PyArrow>=9 across codepaths with updated concat/schema alignment and tests. 
> > - **Tensor Extensions (Arrow)**: > - Introduce `unify_tensor_types`, `unify_tensor_arrays`, and `concat_tensor_arrays` with zero-copy helpers (`_concat_ndarrays`, `_are_contiguous_1d_views`). > - Add robust equality/hash for `ArrowTensorType`/`V2`/`ArrowVariableShapedTensorType`; simplify `ArrowTensorScalar`. > - Replace `ExtensionArray.from_storage(...)` with `ExtensionType.wrap_array(...)`; remove `_concat_same_type` and old chunking utilities. > - Add `to_var_shaped_tensor_array` and shape-padding utilities; optimize `to_numpy`/`from_numpy` and boolean handling. > - **PyArrow Version Enforcement**: > - Add `_check_pyarrow_version` (min `9.0.0`, env override) in `ray/_private/arrow_utils.py`; integrate across Data (object/tensor extensions, util proxy). > - Update tests to validate failure on `pyarrow==8.0.0`; remove version-spoofing fixtures. > - **Arrow Ops & Schema Handling**: > - Update `concat`, schema unification, and struct-field alignment to use tensor-type unification; improved error messages. > - Use `concat_tensor_arrays` in extension column combining. > - **Other**: > - Simplify tensor scalar extraction in Arrow block accessor. > - Tests: add thorough tensor equality/concat/zero-copy cases; set `preserve_order` in limit/split tests; adjust bazel test size for `test_consumption`. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 529de7a. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
so that logic can conditionally execute or skip for doc building fixes doc build failure introduced by ray-project#56918 Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
… be combined (ray-project#57240) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? Original [PR](ray-project#56918) had while fixing all of the infra missed to delete the line in the end relaxing this constraint: 1. Removed constraint allowing AVSTT w/ diverging `ndim`s to be merged 2. Added tests ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…ject#56918) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? While resolving that surfaced recently, more issues have come up which prompted me to review implementations of our tensor (Arrow's) extensions and address a variety of issues discovered: 1. Added missing `ArrowVariableShapedTensorType.__eq__ ` (to make sure we can concat blocks holding these) 2. Fixed concatenation of AVSTT to properly reconcile different dimensions 3. Cleaned up and abstracted common utils to unify tensor types and provided arrays 4. Replaced Python arrays w/ ndarrays wherever possible 5. Deleted a lot of dead code 6. Rebased `ExtensionArray.from_storage` w/ `ExtensionType.wrap_array` ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Refactors Ray Data’s Arrow tensor extensions with type unification and zero-copy concat, replaces legacy APIs with wrap_array, and enforces PyArrow>=9 across codepaths with updated concat/schema alignment and tests. 
> > - **Tensor Extensions (Arrow)**: > - Introduce `unify_tensor_types`, `unify_tensor_arrays`, and `concat_tensor_arrays` with zero-copy helpers (`_concat_ndarrays`, `_are_contiguous_1d_views`). > - Add robust equality/hash for `ArrowTensorType`/`V2`/`ArrowVariableShapedTensorType`; simplify `ArrowTensorScalar`. > - Replace `ExtensionArray.from_storage(...)` with `ExtensionType.wrap_array(...)`; remove `_concat_same_type` and old chunking utilities. > - Add `to_var_shaped_tensor_array` and shape-padding utilities; optimize `to_numpy`/`from_numpy` and boolean handling. > - **PyArrow Version Enforcement**: > - Add `_check_pyarrow_version` (min `9.0.0`, env override) in `ray/_private/arrow_utils.py`; integrate across Data (object/tensor extensions, util proxy). > - Update tests to validate failure on `pyarrow==8.0.0`; remove version-spoofing fixtures. > - **Arrow Ops & Schema Handling**: > - Update `concat`, schema unification, and struct-field alignment to use tensor-type unification; improved error messages. > - Use `concat_tensor_arrays` in extension column combining. > - **Other**: > - Simplify tensor scalar extraction in Arrow block accessor. > - Tests: add thorough tensor equality/concat/zero-copy cases; set `preserve_order` in limit/split tests; adjust bazel test size for `test_consumption`. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 529de7a. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: Josh Kodi <joshkodi@gmail.com>
so that logic can conditionally execute or skip for doc building fixes doc build failure introduced by ray-project#56918 Signed-off-by: Lonnie Liu <lonnie@anyscale.com> Signed-off-by: Josh Kodi <joshkodi@gmail.com>
… be combined (ray-project#57240) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? Original [PR](ray-project#56918) had while fixing all of the infra missed to delete the line in the end relaxing this constraint: 1. Removed constraint allowing AVSTT w/ diverging `ndim`s to be merged 2. Added tests ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: Josh Kodi <joshkodi@gmail.com>
…ject#56918) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? While resolving that surfaced recently, more issues have come up which prompted me to review implementations of our tensor (Arrow's) extensions and address a variety of issues discovered: 1. Added missing `ArrowVariableShapedTensorType.__eq__ ` (to make sure we can concat blocks holding these) 2. Fixed concatenation of AVSTT to properly reconcile different dimensions 3. Cleaned up and abstracted common utils to unify tensor types and provided arrays 4. Replaced Python arrays w/ ndarrays wherever possible 5. Deleted a lot of dead code 6. Rebased `ExtensionArray.from_storage` w/ `ExtensionType.wrap_array` ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Refactors Ray Data’s Arrow tensor extensions with type unification and zero-copy concat, replaces legacy APIs with wrap_array, and enforces PyArrow>=9 across codepaths with updated concat/schema alignment and tests. 
> - **Tensor Extensions (Arrow)**:
>   - Introduce `unify_tensor_types`, `unify_tensor_arrays`, and `concat_tensor_arrays` with zero-copy helpers (`_concat_ndarrays`, `_are_contiguous_1d_views`).
>   - Add robust equality/hash for `ArrowTensorType`/`V2`/`ArrowVariableShapedTensorType`; simplify `ArrowTensorScalar`.
>   - Replace `ExtensionArray.from_storage(...)` with `ExtensionType.wrap_array(...)`; remove `_concat_same_type` and old chunking utilities.
>   - Add `to_var_shaped_tensor_array` and shape-padding utilities; optimize `to_numpy`/`from_numpy` and boolean handling.
> - **PyArrow Version Enforcement**:
>   - Add `_check_pyarrow_version` (min `9.0.0`, env override) in `ray/_private/arrow_utils.py`; integrate across Data (object/tensor extensions, util proxy).
>   - Update tests to validate failure on `pyarrow==8.0.0`; remove version-spoofing fixtures.
> - **Arrow Ops & Schema Handling**:
>   - Update `concat`, schema unification, and struct-field alignment to use tensor-type unification; improve error messages.
>   - Use `concat_tensor_arrays` in extension column combining.
> - **Other**:
>   - Simplify tensor scalar extraction in Arrow block accessor.
>   - Tests: add thorough tensor equality/concat/zero-copy cases; set `preserve_order` in limit/split tests; adjust bazel test size for `test_consumption`.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 529de7a. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
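One way a zero-copy helper like `_are_contiguous_1d_views` can work is sketched below (a minimal version assuming NumPy views; Ray's implementation may differ in detail): detect whether candidate chunks are back-to-back 1-D views over a single base buffer, in which case "concatenation" needs no byte copying.

```python
import numpy as np


def are_contiguous_1d_views(arrs) -> bool:
    """True iff all arrays are C-contiguous 1-D views laid out back to back
    over the same base buffer (so concatenating them can be zero-copy)."""
    if any(a.ndim != 1 or not a.flags["C_CONTIGUOUS"] for a in arrs):
        return False
    base = arrs[0].base
    if base is None or any(a.base is not base for a in arrs):
        return False
    expected_ptr = arrs[0].__array_interface__["data"][0]
    for a in arrs:
        if a.__array_interface__["data"][0] != expected_ptr:
            return False
        expected_ptr += a.nbytes  # next chunk must start right where this ends
    return True
```

When the check fails, the caller would fall back to a copying path such as `np.concatenate`.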
…so that logic can conditionally execute or skip for doc building. Fixes the doc build failure introduced by ray-project#56918.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
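The conditional execute-or-skip logic can be sketched as follows (the env var and function names here are illustrative; Ray's actual `_check_pyarrow_version` lives in `ray/_private/arrow_utils.py`): an environment override lets lightweight environments such as doc builds bypass the minimum-version check instead of installing a matching pyarrow.

```python
import os

MIN_PYARROW_VERSION = (9, 0, 0)
SKIP_ENV_VAR = "RAY_SKIP_PYARROW_VERSION_CHECK"  # illustrative name


def parse_version(version: str) -> tuple:
    # "9.0.0.dev1" -> (9, 0, 0): keep only the leading numeric components.
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_pyarrow_version() -> None:
    if os.environ.get(SKIP_ENV_VAR) == "1":
        return  # e.g. doc builds importing modules without a pinned pyarrow
    try:
        import pyarrow
    except ImportError:
        return  # pyarrow absent; nothing to validate
    if parse_version(pyarrow.__version__) < MIN_PYARROW_VERSION:
        raise ImportError(
            f"Ray Data requires pyarrow >= 9.0.0, found {pyarrow.__version__}"
        )
```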