[Data] Fixing remaining issues with custom tensor extensions #56918

alexeykudinkin merged 87 commits into ray-project:master
Conversation
Code Review
This pull request does a great job of streamlining the __eq__ and __hash__ implementations for Arrow tensor types. Centralizing the logic for fixed-shape tensor types into the base class is a good simplification, and adding the missing __eq__ method for ArrowVariableShapedTensorType improves correctness. I have a couple of suggestions to further improve the correctness of the __eq__ implementations by ensuring they are symmetric.
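A symmetric `__eq__` typically guards on the concrete type and returns `NotImplemented` for foreign types, so `a == b` and `b == a` always agree. A minimal sketch using plain Python stand-ins for the Arrow extension types (the class and field names here are illustrative, not Ray's actual code):

```python
class FixedShapeTensorType:
    def __init__(self, shape, dtype):
        self.shape = tuple(shape)
        self.dtype = dtype

    def __eq__(self, other):
        # Return NotImplemented (not False) for foreign types so Python can
        # try the reflected operation; this keeps equality symmetric.
        if isinstance(other, FixedShapeTensorType):
            return self.shape == other.shape and self.dtype == other.dtype
        return NotImplemented

    def __hash__(self):
        # Must agree with __eq__: equal objects hash equal.
        return hash((self.shape, self.dtype))


class VariableShapedTensorType:
    def __init__(self, ndim, dtype):
        self.ndim = ndim
        self.dtype = dtype

    def __eq__(self, other):
        if isinstance(other, VariableShapedTensorType):
            return self.ndim == other.ndim and self.dtype == other.dtype
        return NotImplemented

    def __hash__(self):
        return hash((self.ndim, self.dtype))
```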
__eq__ to ArrowVariableShapedTensorType, streamlining __hash__ impls
python/ray/data/tests/test_tensor.py (outdated):

    ]

    def test_tensor_type_equality_checks():
Nit: Use parameterized tests and add ids to each test.
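The suggestion could look something like this (the cases and ids are illustrative, with tuples standing in for the actual Arrow tensor types being compared; this is not the PR's actual test data):

```python
import pytest

# Hypothetical parameterized version of the equality test, with a readable
# id attached to each case as the reviewer suggests.
@pytest.mark.parametrize(
    "left, right, expected",
    [
        (("int64", (2, 3)), ("int64", (2, 3)), True),
        (("int64", (2, 3)), ("int64", (4,)), False),
        (("int64", (2, 3)), ("float32", (2, 3)), False),
    ],
    ids=["same-dtype-same-shape", "same-dtype-diff-shape", "diff-dtype-same-shape"],
)
def test_tensor_type_equality_checks(left, right, expected):
    assert (left == right) is expected
```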
    )
    all_dumped_bytes.append(dumped_bytes)
    arr = pa.array(all_dumped_bytes, type=type_.storage_type)
    return ArrowPythonObjectArray.from_storage(type_, arr)
Drive-by fixes
    return str(self)

    @classmethod
    def _need_variable_shaped_tensor_array(
Replaced by unify_tensor_type and unify_tensor_arrays
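A hedged sketch of what "unifying" tensor types means here, operating on bare shape tuples rather than Ray's actual extension types: if all fixed shapes agree, the fixed-shape type can be kept; otherwise a variable-shaped representation is needed.

```python
from typing import Optional, Sequence, Tuple


def unify_tensor_shapes(
    shapes: Sequence[Tuple[int, ...]]
) -> Optional[Tuple[int, ...]]:
    """Return the common fixed shape, or None if a var-shaped type is needed."""
    unique = set(shapes)
    # A single distinct shape means all arrays can stay fixed-shape.
    return next(iter(unique)) if len(unique) == 1 else None
```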
    # TODO(Clark): Remove this mixin once we only support Arrow 9.0.0+.
    class _ArrowTensorScalarIndexingMixin:
Deleting dead code
    # Create offsets buffer
    offsets = np.arange(
        0,
        (outer_len + 1) * num_items_per_element,
        num_items_per_element,
        dtype=pa_type_.OFFSET_DTYPE.to_pandas_dtype(),
    )
    offset_buffer = pa.py_buffer(offsets)
Using ndarrays instead of Python lists
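The hunk above can be sketched in isolation with NumPy alone; `outer_len` and `num_items_per_element` below are stand-in values, and `int32` is assumed where the diff uses the type's own `OFFSET_DTYPE`:

```python
import numpy as np

outer_len = 4              # number of tensors in the array (assumed value)
num_items_per_element = 6  # flattened size of each (2, 3) tensor (assumed)

# One vectorized np.arange call yields the monotonically increasing offsets
# directly, instead of building a Python list element by element.
offsets = np.arange(
    0,
    (outer_len + 1) * num_items_per_element,
    num_items_per_element,
    dtype=np.int32,
)
# The diff then wraps this buffer via pa.py_buffer(offsets).
```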
    def to_numpy(self, zero_copy_only: bool = True):
        """
        Convert the entire array of tensors into a single ndarray.

        Args:
            zero_copy_only: If True, an exception will be raised if the
                conversion to a NumPy array would require copying the
                underlying data (e.g. in presence of nulls, or for
                non-primitive types). This argument is currently ignored, so
                zero-copy isn't enforced even if this argument is true.

        Returns:
            A single ndarray representing the entire array of tensors.
        """
        return self._to_numpy(zero_copy_only=zero_copy_only)

        else:
            ext_dtype = value_type.to_pandas_dtype()

        return np.ndarray(shape, dtype=ext_dtype, buffer=data_buffer, offset=offset)

    def to_var_shaped_tensor_array(
        self,
        ndim: int,
    ) -> "ArrowVariableShapedTensorArray":
        """
Combined to_numpy and _to_numpy
| def _concat_same_type( | ||
| cls, | ||
| to_concat: Sequence[ | ||
| Union["ArrowTensorArray", "ArrowVariableShapedTensorArray"] | ||
| ], | ||
| ensure_copy: bool = False, | ||
| ) -> Union["ArrowTensorArray", "ArrowVariableShapedTensorArray"]: | ||
| Convert this tensor array to a variable-shaped tensor array. | ||
| """ | ||
| Concatenate multiple tensor arrays. | ||
|
|
||
| If one or more of the tensor arrays in to_concat are variable-shaped and/or any | ||
| of the tensor arrays have a different shape than the others, a variable-shaped | ||
| tensor array will be returned. | ||
|
|
||
| Args: | ||
| to_concat: Tensor arrays to concat | ||
| ensure_copy: Skip copying when ensure_copy is False and there is exactly 1 chunk. | ||
| """ | ||
| to_concat_types = [arr.type for arr in to_concat] | ||
| if ArrowTensorType._need_variable_shaped_tensor_array(to_concat_types): | ||
| # Need variable-shaped tensor array. | ||
| # TODO(Clark): Eliminate this NumPy roundtrip by directly constructing the | ||
| # underlying storage array buffers (NumPy roundtrip will not be zero-copy | ||
| # for e.g. boolean arrays). | ||
| # NOTE(Clark): Iterating over a tensor extension array converts each element | ||
| # to an ndarray view. | ||
| return ArrowVariableShapedTensorArray.from_numpy( | ||
| [e for a in to_concat for e in a] | ||
| shape = self.type.shape | ||
| if ndim < len(shape): | ||
| raise ValueError( | ||
| f"Can't convert {self.type} to var-shaped tensor type with {ndim=}" | ||
| ) | ||
| elif not ensure_copy and len(to_concat) == 1: | ||
| # Skip copying | ||
| return to_concat[0] | ||
| else: | ||
| storage = pa.concat_arrays([c.storage for c in to_concat]) | ||
|
|
||
| return ArrowTensorArray.from_storage(to_concat[0].type, storage) |
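The branch structure above can be distilled with plain Python lists standing in for Arrow arrays (the names here are illustrative): a single input with `ensure_copy=False` is returned as-is (zero-copy), otherwise contents are merged into a new container.

```python
def concat_arrays(to_concat, ensure_copy=False):
    if not ensure_copy and len(to_concat) == 1:
        # Skip copying: hand back the original object untouched.
        return to_concat[0]
    merged = []
    for a in to_concat:
        merged.extend(a)  # analogous to pa.concat_arrays over storage
    return merged
```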
| def _chunk_tensor_arrays( | ||
| cls, arrs: Sequence[Union["ArrowTensorArray", "ArrowVariableShapedTensorArray"]] | ||
| ) -> pa.ChunkedArray: | ||
| """ | ||
| Create a ChunkedArray from multiple tensor arrays. | ||
| """ | ||
| arrs_types = [arr.type for arr in arrs] | ||
| if ArrowTensorType._need_variable_shaped_tensor_array(arrs_types): | ||
| new_arrs = [] | ||
| for a in arrs: | ||
| if isinstance(a.type, get_arrow_extension_fixed_shape_tensor_types()): | ||
| a = a.to_variable_shaped_tensor_array() | ||
| assert isinstance(a.type, ArrowVariableShapedTensorType) | ||
| new_arrs.append(a) | ||
| arrs = new_arrs | ||
| return pa.chunked_array(arrs) |
    # Pre-allocate arrays for better performance
    raveled = np.empty(len(arr), dtype=np.object_)
    shapes = np.empty(len(arr), dtype=np.object_)

    sizes = np.arange(len(arr), dtype=np.int64)
Replaced dynamically allocated Python lists w/ preallocated ndarrays
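The preallocation pattern above can be sketched on a hypothetical ragged input; `arr`, `raveled`, `shapes`, and `sizes` mirror the names in the diff, but the data and the loop body are assumptions for illustration:

```python
import numpy as np

# A hypothetical ragged input: tensor elements of differing shapes.
arr = [np.zeros((2, 2)), np.ones((3,)), np.full((1, 4), 7.0)]

# Allocate the object-dtype containers once, instead of growing Python
# lists with repeated append() calls inside the loop.
raveled = np.empty(len(arr), dtype=np.object_)
shapes = np.empty(len(arr), dtype=np.object_)

for i, t in enumerate(arr):
    raveled[i] = t.ravel()  # flattened data for the storage buffer
    shapes[i] = t.shape     # kept so each tensor can be reconstructed

sizes = np.array([t.size for t in arr], dtype=np.int64)
```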
    def _to_numpy(self, index: Optional[int] = None, zero_copy_only: bool = False):
        """
        Helper for getting either an element of the array of tensors as an ndarray, or
        the entire array of tensors as a single ndarray.

        Args:
            index: The index of the tensor element that we wish to return as an
                ndarray. If not given, the entire array of tensors is returned as an
                ndarray.
            zero_copy_only: If True, an exception will be raised if the conversion to a
                NumPy array would require copying the underlying data (e.g. in presence
                of nulls, or for non-primitive types). This argument is currently
                ignored, so zero-copy isn't enforced even if this argument is true.

        Returns:
            The corresponding tensor element as an ndarray if an index was given, or
            the entire array of tensors as an ndarray otherwise.
        """
        # TODO(Clark): Enforce zero_copy_only.
        # TODO(Clark): Support strides?
        if index is None:
            # Get individual ndarrays for each tensor element.
            arrs = [self._to_numpy(i, zero_copy_only) for i in range(len(self))]
            # Return ragged NumPy ndarray in the ndarray of ndarray pointers
            # representation.
            return create_ragged_ndarray(arrs)
        data = self.storage.field("data")
Combining _to_numpy w/ to_numpy below
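The "ndarray of ndarray pointers" representation mentioned above can be illustrated with a stand-in for the `create_ragged_ndarray` helper (the real one lives in Ray's tensor extension utils; this is a sketch of the idea, not its actual code):

```python
import numpy as np

def create_ragged_ndarray(arrs):
    # A 1D object-dtype ndarray whose elements are the per-tensor ndarrays.
    out = np.empty(len(arrs), dtype=np.object_)
    for i, a in enumerate(arrs):
        out[i] = a  # per-element assignment avoids NumPy trying to broadcast
    return out

ragged = create_ragged_ndarray([np.zeros((2, 2)), np.ones((3,))])
```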
__eq__ to ArrowVariableShapedTensorType, streamlining __hash__ impls

    logger = logging.getLogger(__name__)


    def _check_pyarrow_version():
Does it need to be inside core, since it's only used by the library?
Should be shared by all libs
Then we should put it in the `_common` folder.
I think that makes sense, but I think we'd move both `get_pyarrow_version` and `_check_pyarrow_version` together (it doesn't make sense to have them in different places).
jjyao left a comment:
Please move arrow_utils.py (at least part of it) to _common as a follow-up.
…so that logic can conditionally execute or skip for doc building. Fixes doc build failure introduced by ray-project#56918. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ject#56918)

## Why are these changes needed?

While resolving issues that surfaced recently, more issues have come up, which prompted me to review the implementations of our (Arrow) tensor extensions and address a variety of issues discovered:

1. Added missing `ArrowVariableShapedTensorType.__eq__` (to make sure we can concat blocks holding these)
2. Fixed concatenation of AVSTT to properly reconcile different dimensions
3. Cleaned up and abstracted common utils to unify tensor types and provided arrays
4. Replaced Python arrays w/ ndarrays wherever possible
5. Deleted a lot of dead code
6. Rebased `ExtensionArray.from_storage` w/ `ExtensionType.wrap_array`

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

> [!NOTE]
> Refactors Ray Data's Arrow tensor extensions with type unification and zero-copy concat, replaces legacy APIs with wrap_array, and enforces PyArrow>=9 across codepaths with updated concat/schema alignment and tests.
>
> - **Tensor Extensions (Arrow)**:
>   - Introduce `unify_tensor_types`, `unify_tensor_arrays`, and `concat_tensor_arrays` with zero-copy helpers (`_concat_ndarrays`, `_are_contiguous_1d_views`).
>   - Add robust equality/hash for `ArrowTensorType`/`V2`/`ArrowVariableShapedTensorType`; simplify `ArrowTensorScalar`.
>   - Replace `ExtensionArray.from_storage(...)` with `ExtensionType.wrap_array(...)`; remove `_concat_same_type` and old chunking utilities.
>   - Add `to_var_shaped_tensor_array` and shape-padding utilities; optimize `to_numpy`/`from_numpy` and boolean handling.
> - **PyArrow Version Enforcement**:
>   - Add `_check_pyarrow_version` (min `9.0.0`, env override) in `ray/_private/arrow_utils.py`; integrate across Data (object/tensor extensions, util proxy).
>   - Update tests to validate failure on `pyarrow==8.0.0`; remove version-spoofing fixtures.
> - **Arrow Ops & Schema Handling**:
>   - Update `concat`, schema unification, and struct-field alignment to use tensor-type unification; improve error messages.
>   - Use `concat_tensor_arrays` in extension column combining.
> - **Other**:
>   - Simplify tensor scalar extraction in Arrow block accessor.
>   - Tests: add thorough tensor equality/concat/zero-copy cases; set `preserve_order` in limit/split tests; adjust bazel test size for `test_consumption`.
>
> Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 529de7a.

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
… be combined (#57240)

## Why are these changes needed?

The original [PR](#56918), while fixing all of the infra, missed deleting the line at the end that relaxed this constraint:

1. Removed constraint allowing AVSTT w/ diverging `ndim`s to be merged
2. Added tests

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…ject#56918) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? While resolving that surfaced recently, more issues have come up which prompted me to review implementations of our tensor (Arrow's) extensions and address a variety of issues discovered: 1. Added missing `ArrowVariableShapedTensorType.__eq__ ` (to make sure we can concat blocks holding these) 2. Fixed concatenation of AVSTT to properly reconcile different dimensions 3. Cleaned up and abstracted common utils to unify tensor types and provided arrays 4. Replaced Python arrays w/ ndarrays wherever possible 5. Deleted a lot of dead code 6. Rebased `ExtensionArray.from_storage` w/ `ExtensionType.wrap_array` ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Refactors Ray Data’s Arrow tensor extensions with type unification and zero-copy concat, replaces legacy APIs with wrap_array, and enforces PyArrow>=9 across codepaths with updated concat/schema alignment and tests. 
> > - **Tensor Extensions (Arrow)**: > - Introduce `unify_tensor_types`, `unify_tensor_arrays`, and `concat_tensor_arrays` with zero-copy helpers (`_concat_ndarrays`, `_are_contiguous_1d_views`). > - Add robust equality/hash for `ArrowTensorType`/`V2`/`ArrowVariableShapedTensorType`; simplify `ArrowTensorScalar`. > - Replace `ExtensionArray.from_storage(...)` with `ExtensionType.wrap_array(...)`; remove `_concat_same_type` and old chunking utilities. > - Add `to_var_shaped_tensor_array` and shape-padding utilities; optimize `to_numpy`/`from_numpy` and boolean handling. > - **PyArrow Version Enforcement**: > - Add `_check_pyarrow_version` (min `9.0.0`, env override) in `ray/_private/arrow_utils.py`; integrate across Data (object/tensor extensions, util proxy). > - Update tests to validate failure on `pyarrow==8.0.0`; remove version-spoofing fixtures. > - **Arrow Ops & Schema Handling**: > - Update `concat`, schema unification, and struct-field alignment to use tensor-type unification; improved error messages. > - Use `concat_tensor_arrays` in extension column combining. > - **Other**: > - Simplify tensor scalar extraction in Arrow block accessor. > - Tests: add thorough tensor equality/concat/zero-copy cases; set `preserve_order` in limit/split tests; adjust bazel test size for `test_consumption`. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 529de7a. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
so that logic can conditionally execute or skip for doc building fixes doc build failure introduced by ray-project#56918 Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
… be combined (ray-project#57240) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? Original [PR](ray-project#56918) had while fixing all of the infra missed to delete the line in the end relaxing this constraint: 1. Removed constraint allowing AVSTT w/ diverging `ndim`s to be merged 2. Added tests ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…ject#56918) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? While resolving that surfaced recently, more issues have come up which prompted me to review implementations of our tensor (Arrow's) extensions and address a variety of issues discovered: 1. Added missing `ArrowVariableShapedTensorType.__eq__ ` (to make sure we can concat blocks holding these) 2. Fixed concatenation of AVSTT to properly reconcile different dimensions 3. Cleaned up and abstracted common utils to unify tensor types and provided arrays 4. Replaced Python arrays w/ ndarrays wherever possible 5. Deleted a lot of dead code 6. Rebased `ExtensionArray.from_storage` w/ `ExtensionType.wrap_array` ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Refactors Ray Data’s Arrow tensor extensions with type unification and zero-copy concat, replaces legacy APIs with wrap_array, and enforces PyArrow>=9 across codepaths with updated concat/schema alignment and tests. 
> > - **Tensor Extensions (Arrow)**: > - Introduce `unify_tensor_types`, `unify_tensor_arrays`, and `concat_tensor_arrays` with zero-copy helpers (`_concat_ndarrays`, `_are_contiguous_1d_views`). > - Add robust equality/hash for `ArrowTensorType`/`V2`/`ArrowVariableShapedTensorType`; simplify `ArrowTensorScalar`. > - Replace `ExtensionArray.from_storage(...)` with `ExtensionType.wrap_array(...)`; remove `_concat_same_type` and old chunking utilities. > - Add `to_var_shaped_tensor_array` and shape-padding utilities; optimize `to_numpy`/`from_numpy` and boolean handling. > - **PyArrow Version Enforcement**: > - Add `_check_pyarrow_version` (min `9.0.0`, env override) in `ray/_private/arrow_utils.py`; integrate across Data (object/tensor extensions, util proxy). > - Update tests to validate failure on `pyarrow==8.0.0`; remove version-spoofing fixtures. > - **Arrow Ops & Schema Handling**: > - Update `concat`, schema unification, and struct-field alignment to use tensor-type unification; improved error messages. > - Use `concat_tensor_arrays` in extension column combining. > - **Other**: > - Simplify tensor scalar extraction in Arrow block accessor. > - Tests: add thorough tensor equality/concat/zero-copy cases; set `preserve_order` in limit/split tests; adjust bazel test size for `test_consumption`. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 529de7a. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: Josh Kodi <joshkodi@gmail.com>
so that logic can conditionally execute or skip for doc building fixes doc build failure introduced by ray-project#56918 Signed-off-by: Lonnie Liu <lonnie@anyscale.com> Signed-off-by: Josh Kodi <joshkodi@gmail.com>
… be combined (ray-project#57240) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? Original [PR](ray-project#56918) had while fixing all of the infra missed to delete the line in the end relaxing this constraint: 1. Removed constraint allowing AVSTT w/ diverging `ndim`s to be merged 2. Added tests ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: Josh Kodi <joshkodi@gmail.com>
…ject#56918) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? While resolving that surfaced recently, more issues have come up which prompted me to review implementations of our tensor (Arrow's) extensions and address a variety of issues discovered: 1. Added missing `ArrowVariableShapedTensorType.__eq__ ` (to make sure we can concat blocks holding these) 2. Fixed concatenation of AVSTT to properly reconcile different dimensions 3. Cleaned up and abstracted common utils to unify tensor types and provided arrays 4. Replaced Python arrays w/ ndarrays wherever possible 5. Deleted a lot of dead code 6. Rebased `ExtensionArray.from_storage` w/ `ExtensionType.wrap_array` ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Refactors Ray Data’s Arrow tensor extensions with type unification and zero-copy concat, replaces legacy APIs with wrap_array, and enforces PyArrow>=9 across codepaths with updated concat/schema alignment and tests. 
> - **Tensor Extensions (Arrow)**:
>   - Introduce `unify_tensor_types`, `unify_tensor_arrays`, and `concat_tensor_arrays` with zero-copy helpers (`_concat_ndarrays`, `_are_contiguous_1d_views`).
>   - Add robust equality/hash for `ArrowTensorType`/`V2`/`ArrowVariableShapedTensorType`; simplify `ArrowTensorScalar`.
>   - Replace `ExtensionArray.from_storage(...)` with `ExtensionType.wrap_array(...)`; remove `_concat_same_type` and old chunking utilities.
>   - Add `to_var_shaped_tensor_array` and shape-padding utilities; optimize `to_numpy`/`from_numpy` and boolean handling.
> - **PyArrow Version Enforcement**:
>   - Add `_check_pyarrow_version` (min `9.0.0`, env override) in `ray/_private/arrow_utils.py`; integrate across Data (object/tensor extensions, util proxy).
>   - Update tests to validate failure on `pyarrow==8.0.0`; remove version-spoofing fixtures.
> - **Arrow Ops & Schema Handling**:
>   - Update `concat`, schema unification, and struct-field alignment to use tensor-type unification; improve error messages.
>   - Use `concat_tensor_arrays` in extension column combining.
> - **Other**:
>   - Simplify tensor scalar extraction in Arrow block accessor.
>   - Tests: add thorough tensor equality/concat/zero-copy cases; set `preserve_order` in limit/split tests; adjust bazel test size for `test_consumption`.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 529de7a. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
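One way a zero-copy helper like `_are_contiguous_1d_views` can work is sketched below (a minimal version assuming NumPy views; Ray's implementation may differ in detail): detect whether candidate chunks are back-to-back 1-D views over a single base buffer, in which case "concatenation" needs no byte copying.

```python
import numpy as np


def are_contiguous_1d_views(arrs) -> bool:
    """True iff all arrays are C-contiguous 1-D views laid out back to back
    over the same base buffer (so concatenating them can be zero-copy)."""
    if any(a.ndim != 1 or not a.flags["C_CONTIGUOUS"] for a in arrs):
        return False
    base = arrs[0].base
    if base is None or any(a.base is not base for a in arrs):
        return False
    expected_ptr = arrs[0].__array_interface__["data"][0]
    for a in arrs:
        if a.__array_interface__["data"][0] != expected_ptr:
            return False
        expected_ptr += a.nbytes  # next chunk must start right where this ends
    return True
```

When the check fails, the caller would fall back to a copying path such as `np.concatenate`.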
…so that logic can conditionally execute or skip for doc building. Fixes the doc build failure introduced by ray-project#56918.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
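The conditional execute-or-skip logic can be sketched as follows (the env var and function names here are illustrative; Ray's actual `_check_pyarrow_version` lives in `ray/_private/arrow_utils.py`): an environment override lets lightweight environments such as doc builds bypass the minimum-version check instead of installing a matching pyarrow.

```python
import os

MIN_PYARROW_VERSION = (9, 0, 0)
SKIP_ENV_VAR = "RAY_SKIP_PYARROW_VERSION_CHECK"  # illustrative name


def parse_version(version: str) -> tuple:
    # "9.0.0.dev1" -> (9, 0, 0): keep only the leading numeric components.
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_pyarrow_version() -> None:
    if os.environ.get(SKIP_ENV_VAR) == "1":
        return  # e.g. doc builds importing modules without a pinned pyarrow
    try:
        import pyarrow
    except ImportError:
        return  # pyarrow absent; nothing to validate
    if parse_version(pyarrow.__version__) < MIN_PYARROW_VERSION:
        raise ImportError(
            f"Ray Data requires pyarrow >= 9.0.0, found {pyarrow.__version__}"
        )
```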