[Data] Fixing remaining issues with custom tensor extensions#56918

Merged
alexeykudinkin merged 87 commits into ray-project:master from
alexeykudinkin:ak/vr-shp-tnr-eq-fix
Oct 3, 2025

Conversation

@alexeykudinkin
Contributor

@alexeykudinkin alexeykudinkin commented Sep 25, 2025

Why are these changes needed?

While resolving an issue that surfaced recently, more issues came up, which prompted me to review the implementations of our tensor (Arrow) extensions and address a variety of the issues discovered:

  1. Added the missing ArrowVariableShapedTensorType.__eq__ (to make sure we can concat blocks holding these)
  2. Fixed concatenation of ArrowVariableShapedTensorType (AVSTT) arrays to properly reconcile different dimensions
  3. Cleaned up and abstracted common utils to unify tensor types and the provided arrays
  4. Replaced Python arrays w/ ndarrays wherever possible
  5. Deleted a lot of dead code
  6. Replaced ExtensionArray.from_storage w/ ExtensionType.wrap_array

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Note

Refactors Ray Data’s Arrow tensor extensions with type unification and zero-copy concat, replaces legacy APIs with wrap_array, and enforces PyArrow>=9 across codepaths with updated concat/schema alignment and tests.

  • Tensor Extensions (Arrow):
    • Introduce unify_tensor_types, unify_tensor_arrays, and concat_tensor_arrays with zero-copy helpers (_concat_ndarrays, _are_contiguous_1d_views).
    • Add robust equality/hash for ArrowTensorType/V2/ArrowVariableShapedTensorType; simplify ArrowTensorScalar.
    • Replace ExtensionArray.from_storage(...) with ExtensionType.wrap_array(...); remove _concat_same_type and old chunking utilities.
    • Add to_var_shaped_tensor_array and shape-padding utilities; optimize to_numpy/from_numpy and boolean handling.
  • PyArrow Version Enforcement:
    • Add _check_pyarrow_version (min 9.0.0, env override) in ray/_private/arrow_utils.py; integrate across Data (object/tensor extensions, util proxy).
    • Update tests to validate failure on pyarrow==8.0.0; remove version-spoofing fixtures.
  • Arrow Ops & Schema Handling:
    • Update concat, schema unification, and struct-field alignment to use tensor-type unification; improved error messages.
    • Use concat_tensor_arrays in extension column combining.
  • Other:
    • Simplify tensor scalar extraction in Arrow block accessor.
    • Tests: add thorough tensor equality/concat/zero-copy cases; set preserve_order in limit/split tests; adjust bazel test size for test_consumption.
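
As a rough illustration of the zero-copy concat idea named in the summary above, the sketch below checks whether a list of 1D NumPy arrays are back-to-back views into one base buffer and, if so, returns a single view instead of copying. This is not Ray's implementation: the function names, the checks, and the fallback are assumptions made for this example.

```python
import numpy as np


def are_contiguous_1d_views(arrs):
    """Hypothetical check: are `arrs` adjacent 1D views of one base ndarray?"""
    if not arrs:
        return False
    base = arrs[0].base
    if not isinstance(base, np.ndarray) or not base.flags.c_contiguous:
        return False
    expected_addr = arrs[0].__array_interface__["data"][0]
    for a in arrs:
        if a.ndim != 1 or a.base is not base or a.dtype != arrs[0].dtype:
            return False
        if not a.flags.c_contiguous:
            return False
        addr = a.__array_interface__["data"][0]
        if addr != expected_addr:
            return False
        # The next view has to start exactly where this one ends.
        expected_addr = addr + a.nbytes
    return True


def concat_1d(arrs):
    """Return one view over the base buffer if possible, else a copy."""
    if are_contiguous_1d_views(arrs):
        base = arrs[0].base
        start = (
            arrs[0].__array_interface__["data"][0]
            - base.__array_interface__["data"][0]
        )
        total = sum(len(a) for a in arrs)
        return np.ndarray((total,), dtype=arrs[0].dtype, buffer=base, offset=start)
    return np.concatenate(arrs)


buf = np.arange(10)
adjacent = concat_1d([buf[0:3], buf[3:7], buf[7:10]])  # view, no copy
gapped = concat_1d([buf[0:3], buf[4:7]])  # gap at index 3, falls back to a copy
```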

Written by Cursor Bugbot for commit 529de7a. This will update automatically on new commits. Configure here.

@alexeykudinkin alexeykudinkin requested a review from a team as a code owner September 25, 2025 04:20
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request does a great job of streamlining the __eq__ and __hash__ implementations for Arrow tensor types. Centralizing the logic for fixed-shape tensor types into the base class is a good simplification, and adding the missing __eq__ method for ArrowVariableShapedTensorType improves correctness. I have a couple of suggestions to further improve the correctness of the __eq__ implementations by ensuring they are symmetric.
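
A minimal sketch of what a symmetric `__eq__` with a consistent `__hash__` can look like. The two classes are simplified stand-ins invented for this example, not the Ray tensor types; the point is only the pattern of returning `NotImplemented` for foreign types so that `a == b` and `b == a` always agree.

```python
class FixedShapeType:
    """Hypothetical stand-in for a fixed-shape tensor type."""

    def __init__(self, shape, dtype):
        self.shape = tuple(shape)
        self.dtype = dtype

    def __eq__(self, other):
        # Exact-type check: returning NotImplemented lets Python try the
        # reflected comparison, so both directions give the same answer.
        if type(other) is not type(self):
            return NotImplemented
        return self.shape == other.shape and self.dtype == other.dtype

    def __hash__(self):
        # Hash exactly the fields that __eq__ compares.
        return hash((type(self).__name__, self.shape, self.dtype))


class VariableShapedType:
    """Hypothetical stand-in for a variable-shaped tensor type."""

    def __init__(self, dtype, ndim):
        self.dtype = dtype
        self.ndim = ndim

    def __eq__(self, other):
        if type(other) is not type(self):
            return NotImplemented
        return self.dtype == other.dtype and self.ndim == other.ndim

    def __hash__(self):
        return hash((type(self).__name__, self.dtype, self.ndim))


fixed_a = FixedShapeType((2, 3), "int64")
fixed_b = FixedShapeType((2, 3), "int64")
var = VariableShapedType("int64", 2)
```

When both sides return `NotImplemented`, Python falls back to identity comparison, so `fixed_a == var` and `var == fixed_a` are both `False`.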

cursor[bot]

This comment was marked as outdated.

@ray-gardener ray-gardener bot added the data Ray Data-related issues label Sep 25, 2025
@alexeykudinkin alexeykudinkin changed the title [WIP][Data] Adding missing __eq__ to ArrowVariableShapedTensorType, streamlining __hash__ impls [Data] Adding missing __eq__ to ArrowVariableShapedTensorType, streamlining __hash__ impls Sep 26, 2025
@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Sep 26, 2025
@alexeykudinkin alexeykudinkin requested a review from a team as a code owner September 26, 2025 18:49
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) September 26, 2025 20:24
@github-actions github-actions bot disabled auto-merge September 26, 2025 20:24
]


def test_tensor_type_equality_checks():
Contributor

Nit: Use parameterized tests and add ids to each test.
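
For reference, the suggestion could look roughly like the sketch below. The case data and the `types_equal` helper are placeholders invented for this example (plain `(shape, dtype)` tuples rather than the real tensor types); only `pytest.mark.parametrize` and `pytest.param(..., id=...)` are real pytest APIs.

```python
import pytest


def types_equal(left, right):
    """Hypothetical stand-in for comparing two tensor types."""
    return left == right


@pytest.mark.parametrize(
    "left, right, expected",
    [
        pytest.param(((2, 3), "int64"), ((2, 3), "int64"), True, id="same-shape-same-dtype"),
        pytest.param(((2, 3), "int64"), ((3, 2), "int64"), False, id="different-shape"),
        pytest.param(((2, 3), "int64"), ((2, 3), "float32"), False, id="different-dtype"),
    ],
)
def test_tensor_type_equality_checks(left, right, expected):
    assert types_equal(left, right) is expected
```

With ids in place, a failure is reported as, for example, `test_tensor_type_equality_checks[different-dtype]` rather than by positional index.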

cursor[bot]

This comment was marked as outdated.

)
all_dumped_bytes.append(dumped_bytes)
arr = pa.array(all_dumped_bytes, type=type_.storage_type)
return ArrowPythonObjectArray.from_storage(type_, arr)
Contributor Author

Drive by fixes

return str(self)

@classmethod
def _need_variable_shaped_tensor_array(
Contributor Author

Replaced by unify_tensor_type and unify_tensor_arrays
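
A rough sketch of the kind of decision such a unification step has to make, based on the rule in the removed `_concat_same_type` docstring: if any input is variable-shaped, or the fixed shapes differ, the result is variable-shaped. Everything here is hypothetical: `TensorSpec`, `unify_specs`, and `pad_shape` are invented names, and left-padding shapes with 1s is an assumption, not something this PR page confirms.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class TensorSpec(NamedTuple):
    """Hypothetical, simplified description of a tensor type."""

    dtype: str
    shape: Optional[Tuple[int, ...]]  # None means variable-shaped
    ndim: int


def unify_specs(specs: Sequence[TensorSpec]) -> TensorSpec:
    dtypes = {s.dtype for s in specs}
    if len(dtypes) != 1:
        raise ValueError(f"Can't unify tensor types with dtypes {sorted(dtypes)}")
    shapes = {s.shape for s in specs}
    if len(shapes) == 1 and None not in shapes:
        # All inputs share one fixed shape: keep the fixed-shape type.
        return specs[0]
    # Otherwise fall back to variable-shaped, wide enough for every input.
    return TensorSpec(specs[0].dtype, None, max(s.ndim for s in specs))


def pad_shape(shape: Tuple[int, ...], ndim: int) -> Tuple[int, ...]:
    # Assumption: missing leading dimensions are filled with 1s.
    if ndim < len(shape):
        raise ValueError(f"Can't pad shape {shape} to ndim={ndim}")
    return (1,) * (ndim - len(shape)) + tuple(shape)


same = unify_specs([TensorSpec("int64", (2, 3), 2), TensorSpec("int64", (2, 3), 2)])
mixed = unify_specs([TensorSpec("int64", (2, 3), 2), TensorSpec("int64", (4, 2, 3), 3)])
```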



# TODO(Clark): Remove this mixin once we only support Arrow 9.0.0+.
class _ArrowTensorScalarIndexingMixin:
Contributor Author

Deleting dead code

Comment on lines +728 to +735
# Create offsets buffer
offsets = np.arange(
    0,
    (outer_len + 1) * num_items_per_element,
    num_items_per_element,
    dtype=pa_type_.OFFSET_DTYPE.to_pandas_dtype(),
)
offset_buffer = pa.py_buffer(offsets)
Contributor Author

Using ndarrays instead of python ones
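
A small worked example of the offsets computation from the snippet above, with concrete numbers. The values of `outer_len` and `num_items_per_element` are arbitrary, and `np.int32` is used here on the assumption of a regular Arrow list type (32-bit offsets); the real code takes the dtype from `pa_type_.OFFSET_DTYPE`.

```python
import numpy as np

outer_len = 3  # number of tensors in the array
num_items_per_element = 4  # e.g. each tensor has shape (2, 2)

# One offset per element boundary: outer_len + 1 entries in total.
offsets = np.arange(
    0,
    (outer_len + 1) * num_items_per_element,
    num_items_per_element,
    dtype=np.int32,
)

# Equivalent pure-Python construction, for comparison.
offsets_list = [i * num_items_per_element for i in range(outer_len + 1)]

# Element i occupies flat_values[offsets[i]:offsets[i + 1]].
flat_values = np.arange(outer_len * num_items_per_element)
second = flat_values[offsets[1]:offsets[2]]
```

Here `offsets` is `[0, 4, 8, 12]`, and the ndarray can be handed to `pa.py_buffer` directly because it exposes the buffer protocol.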

Comment on lines -930 to -944
def to_numpy(self, zero_copy_only: bool = True):
"""
Convert the entire array of tensors into a single ndarray.
else:
ext_dtype = value_type.to_pandas_dtype()

Args:
zero_copy_only: If True, an exception will be raised if the
conversion to a NumPy array would require copying the
underlying data (e.g. in presence of nulls, or for
non-primitive types). This argument is currently ignored, so
zero-copy isn't enforced even if this argument is true.
return np.ndarray(shape, dtype=ext_dtype, buffer=data_buffer, offset=offset)

Returns:
A single ndarray representing the entire array of tensors.
def to_var_shaped_tensor_array(
self,
ndim: int,
) -> "ArrowVariableShapedTensorArray":
"""
return self._to_numpy(zero_copy_only=zero_copy_only)
Contributor Author

Combined to_numpy and _to_numpy

Comment on lines -947 to -982
def _concat_same_type(
cls,
to_concat: Sequence[
Union["ArrowTensorArray", "ArrowVariableShapedTensorArray"]
],
ensure_copy: bool = False,
) -> Union["ArrowTensorArray", "ArrowVariableShapedTensorArray"]:
Convert this tensor array to a variable-shaped tensor array.
"""
Concatenate multiple tensor arrays.

If one or more of the tensor arrays in to_concat are variable-shaped and/or any
of the tensor arrays have a different shape than the others, a variable-shaped
tensor array will be returned.

Args:
to_concat: Tensor arrays to concat
ensure_copy: Skip copying when ensure_copy is False and there is exactly 1 chunk.
"""
to_concat_types = [arr.type for arr in to_concat]
if ArrowTensorType._need_variable_shaped_tensor_array(to_concat_types):
# Need variable-shaped tensor array.
# TODO(Clark): Eliminate this NumPy roundtrip by directly constructing the
# underlying storage array buffers (NumPy roundtrip will not be zero-copy
# for e.g. boolean arrays).
# NOTE(Clark): Iterating over a tensor extension array converts each element
# to an ndarray view.
return ArrowVariableShapedTensorArray.from_numpy(
[e for a in to_concat for e in a]
shape = self.type.shape
if ndim < len(shape):
raise ValueError(
f"Can't convert {self.type} to var-shaped tensor type with {ndim=}"
)
elif not ensure_copy and len(to_concat) == 1:
# Skip copying
return to_concat[0]
else:
storage = pa.concat_arrays([c.storage for c in to_concat])

return ArrowTensorArray.from_storage(to_concat[0].type, storage)
Contributor Author

Deleting

Comment on lines -985 to -1000
def _chunk_tensor_arrays(
cls, arrs: Sequence[Union["ArrowTensorArray", "ArrowVariableShapedTensorArray"]]
) -> pa.ChunkedArray:
"""
Create a ChunkedArray from multiple tensor arrays.
"""
arrs_types = [arr.type for arr in arrs]
if ArrowTensorType._need_variable_shaped_tensor_array(arrs_types):
new_arrs = []
for a in arrs:
if isinstance(a.type, get_arrow_extension_fixed_shape_tensor_types()):
a = a.to_variable_shaped_tensor_array()
assert isinstance(a.type, ArrowVariableShapedTensorType)
new_arrs.append(a)
arrs = new_arrs
return pa.chunked_array(arrs)
Contributor Author

Deleting

Comment on lines +1065 to +1069
# Pre-allocate lists for better performance
raveled = np.empty(len(arr), dtype=np.object_)
shapes = np.empty(len(arr), dtype=np.object_)

sizes = np.arange(len(arr), dtype=np.int64)
Contributor Author

Replaced dynamically allocated Python lists w/ preallocated ndarrays
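
A minimal sketch of the preallocation pattern, assuming the goal is to collect flattened values, shapes, and sizes for a batch of differently shaped tensors. The input tensors and the derived `offsets` are invented for this example and are not taken from the PR's code.

```python
import numpy as np

# Hypothetical input: two tensors with different shapes.
tensors = [np.arange(6).reshape(2, 3), np.arange(4).reshape(4, 1)]
n = len(tensors)

# Preallocate fixed-size containers instead of appending to Python lists.
raveled = np.empty(n, dtype=np.object_)
shapes = np.empty(n, dtype=np.object_)
sizes = np.empty(n, dtype=np.int64)

for i, t in enumerate(tensors):
    raveled[i] = t.ravel()  # 1D view of the tensor's data
    shapes[i] = t.shape  # stored as a tuple inside the object array
    sizes[i] = t.size

# Flat data buffer plus element boundaries derived from the sizes.
flat = np.concatenate(list(raveled))
offsets = np.concatenate(([0], np.cumsum(sizes)))
```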

Comment on lines -1262 to -1288
def _to_numpy(self, index: Optional[int] = None, zero_copy_only: bool = False):
"""
Helper for getting either an element of the array of tensors as an ndarray, or
the entire array of tensors as a single ndarray.

Args:
index: The index of the tensor element that we wish to return as an
ndarray. If not given, the entire array of tensors is returned as an
ndarray.
zero_copy_only: If True, an exception will be raised if the conversion to a
NumPy array would require copying the underlying data (e.g. in presence
of nulls, or for non-primitive types). This argument is currently
ignored, so zero-copy isn't enforced even if this argument is true.

Returns:
The corresponding tensor element as an ndarray if an index was given, or
the entire array of tensors as an ndarray otherwise.
"""
# TODO(Clark): Enforce zero_copy_only.
# TODO(Clark): Support strides?
if index is None:
# Get individual ndarrays for each tensor element.
arrs = [self._to_numpy(i, zero_copy_only) for i in range(len(self))]
# Return ragged NumPy ndarray in the ndarray of ndarray pointers
# representation.
return create_ragged_ndarray(arrs)
data = self.storage.field("data")
Contributor Author

Combining _to_numpy w/ to_numpy below
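
For readers unfamiliar with the "ndarray of ndarray pointers" representation mentioned in the removed comment, here is a small sketch. `make_ragged` is a hypothetical stand-in for Ray's `create_ragged_ndarray`, written only to show why element-wise assignment is used instead of `np.array(..., dtype=object)`.

```python
import numpy as np


def make_ragged(arrs):
    """Hypothetical helper: 1D object ndarray whose items are ndarrays."""
    out = np.empty(len(arrs), dtype=object)
    for i, a in enumerate(arrs):
        out[i] = a
    return out


# Tensors that happen to share a shape.
same_shape = [np.zeros((2, 2)), np.ones((2, 2))]

# np.array stacks equal-shaped inputs into one multi-dimensional array ...
stacked = np.array(same_shape, dtype=object)

# ... while element-wise assignment always yields one slot per tensor.
ragged = make_ragged(same_shape)
mixed = make_ragged([np.zeros((2, 2)), np.ones(3)])
```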

@alexeykudinkin alexeykudinkin changed the title [Data] Adding missing __eq__ to ArrowVariableShapedTensorType, streamlining __hash__ impls [Data] Fixing remaining issues with custom tensor extensions Sep 30, 2025
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) September 30, 2025 02:04
@alexeykudinkin alexeykudinkin requested a review from a team as a code owner September 30, 2025 07:17
@github-actions github-actions bot disabled auto-merge September 30, 2025 07:17
cursor[bot]

This comment was marked as outdated.

logger = logging.getLogger(__name__)


def _check_pyarrow_version():
Contributor

Does it need to be inside core, given that it's only used by a library?

Contributor Author

Should be shared by all libs

Contributor

Then we should put it in the _common folder.

Contributor Author

@alexeykudinkin alexeykudinkin Oct 1, 2025

I think that makes sense, but I think we'd move both get_pyarrow_version and _check_pyarrow_version together (I don't think it makes sense to have them in different places).

Contributor

@jjyao jjyao left a comment

Please move arrow_utils.py (at least part of it) to _common as a follow-up.
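
For context on what is being moved, a version gate like the one discussed in this thread could look roughly like the sketch below. It is not the code from `arrow_utils.py`: the environment variable name, the parsing helper, and the error message are assumptions for illustration, and the installed version is passed in as a string so the example runs without PyArrow. Only the minimum of 9.0.0 and the existence of an env override come from the PR summary.

```python
import os

MIN_PYARROW_VERSION = (9, 0, 0)
# Hypothetical variable name; the real override may be spelled differently.
DISABLE_CHECK_ENV_VAR = "EXAMPLE_DISABLE_PYARROW_VERSION_CHECK"


def parse_version(version):
    """Parse 'X.Y.Z[suffix]' into a 3-tuple of ints, ignoring suffixes."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_pyarrow_version(installed, environ=None):
    """Raise ImportError if `installed` is below the minimum version."""
    environ = os.environ if environ is None else environ
    if environ.get(DISABLE_CHECK_ENV_VAR, "0") == "1":
        return  # explicit opt-out via the environment
    if parse_version(installed) < MIN_PYARROW_VERSION:
        minimum = ".".join(map(str, MIN_PYARROW_VERSION))
        raise ImportError(
            f"pyarrow>={minimum} is required, but found pyarrow=={installed}"
        )


check_pyarrow_version("9.0.0", environ={})  # passes silently
```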

@alexeykudinkin alexeykudinkin enabled auto-merge (squash) October 2, 2025 00:04
cursor[bot]

This comment was marked as outdated.

@github-actions github-actions bot disabled auto-merge October 2, 2025 01:54
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) October 2, 2025 01:56
eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
so that logic can conditionally execute or skip for doc building

fixes doc build failure introduced by
ray-project#56918

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
…ject#56918)


Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
alexeykudinkin added a commit that referenced this pull request Oct 7, 2025
… be combined (#57240)

## Why are these changes needed?

The original [PR](#56918) fixed all of the supporting infra but missed
deleting the line at the end whose removal relaxes this constraint:

1. Removed the constraint, allowing AVSTT w/ diverging `ndim`s to be merged
2. Added tests

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
srinathk10 added a commit that referenced this pull request Oct 7, 2025
@gemini-code-assist gemini-code-assist bot mentioned this pull request Oct 7, 2025
8 tasks
srinathk10 added a commit that referenced this pull request Oct 7, 2025
…56918)"

This reverts commit efb3075.

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
liulehui pushed a commit to liulehui/ray that referenced this pull request Oct 9, 2025
…ject#56918)


Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
liulehui pushed a commit to liulehui/ray that referenced this pull request Oct 9, 2025
so that logic can conditionally execute or skip for doc building

fixes doc build failure introduced by
ray-project#56918

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
liulehui pushed a commit to liulehui/ray that referenced this pull request Oct 9, 2025
… be combined (ray-project#57240)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

The original [PR](ray-project#56918) fixed all of the surrounding infra
but missed deleting the final line whose removal relaxes this
constraint:

1. Removed the constraint, allowing AVSTT w/ diverging `ndim`s to be merged
2. Added tests
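
One way tensors with diverging `ndim`s can be reconciled is to pad the shorter shapes with singleton axes, which leaves each tensor's element count unchanged. The sketch below illustrates that idea only. The helper names `pad_shape` and `unify_shapes` are hypothetical, and padding on the left is an assumption rather than a statement about what Ray's shape-padding utilities do.

```python
from typing import List, Sequence, Tuple


def pad_shape(shape: Sequence[int], ndim: int) -> Tuple[int, ...]:
    """Left-pad `shape` with singleton axes until it has `ndim` dimensions."""
    if len(shape) > ndim:
        raise ValueError(f"shape {tuple(shape)} has more than {ndim} dimensions")
    return (1,) * (ndim - len(shape)) + tuple(shape)


def unify_shapes(shapes: Sequence[Sequence[int]]) -> List[Tuple[int, ...]]:
    """Pad every shape to the largest ndim found among `shapes`."""
    target_ndim = max((len(s) for s in shapes), default=0)
    return [pad_shape(s, target_ndim) for s in shapes]


# Shapes with 1, 2, and 3 dimensions are all brought to 3 dimensions.
unified = unify_shapes([(3,), (2, 4), (5, 1, 2)])
```

Because a singleton axis multiplies the element count by one, the flattened data buffer of each tensor can stay as it is; only the stored shape changes.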

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
   - [ ] I've added any new APIs to the API Reference. For example, if I
added a method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
…ject#56918)

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Josh Kodi <joshkodi@gmail.com>
joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
so that logic can conditionally execute or skip for doc building

fixes doc build failure introduced by
ray-project#56918

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: Josh Kodi <joshkodi@gmail.com>
joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
… be combined (ray-project#57240)

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Josh Kodi <joshkodi@gmail.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
…ject#56918)

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
so that logic can conditionally execute or skip for doc building

fixes doc build failure introduced by
ray-project#56918

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
… be combined (ray-project#57240)

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ject#56918)

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
so that logic can conditionally execute or skip for doc building

fixes doc build failure introduced by
ray-project#56918

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
… be combined (ray-project#57240)

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ject#56918)

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
so that logic can conditionally execute or skip for doc building

fixes doc build failure introduced by
ray-project#56918

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
… be combined (ray-project#57240)

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…ject#56918)

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
so that logic can conditionally execute or skip for doc building

fixes doc build failure introduced by
ray-project#56918

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
… be combined (ray-project#57240)

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>

Labels

data: Ray Data-related issues
go: add ONLY when ready to merge, run all tests

6 participants