[Data] Remove deprecated TENSOR_COLUMN_NAME constant and associated dead code #60547
Description
Remove the TENSOR_COLUMN_NAME constant ("__value__") from Ray Data. This constant was historically used to wrap raw numpy arrays into single-column tables, but this behavior has been deprecated since Ray 2.5. The constant and its associated code paths are now dead code that should be cleaned up.
Background
TENSOR_COLUMN_NAME (defined as "__value__") was introduced to handle cases where users passed raw numpy arrays to Ray Data APIs. When a raw numpy array was encountered, it would be automatically wrapped into a single-column table with the column name "__value__".
Since Ray 2.5, passing raw numpy arrays to APIs like map_batches() raises an explicit error:
    if isinstance(batch, np.ndarray):
        raise ValueError(
            "Standalone numpy arrays are not allowed in Ray 2.5. "
            "Return a dict of field -> array, e.g., `{'data': array}` instead of `array`."
        )

See: block.py:463-469
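To make the effect of this check concrete, here is a minimal sketch that needs only numpy. `validate_batch`, `double_bad`, and `double_good` are hypothetical names for illustration; `validate_batch` mirrors the check quoted above but is not Ray's actual function.

```python
import numpy as np


def validate_batch(batch):
    """Hypothetical stand-in that mirrors the check quoted above."""
    if isinstance(batch, np.ndarray):
        raise ValueError(
            "Standalone numpy arrays are not allowed in Ray 2.5. "
            "Return a dict of field -> array, e.g., `{'data': array}` instead of `array`."
        )
    return batch


def double_bad(batch):
    # Returns a bare array, which the check rejects.
    return batch["data"] * 2


def double_good(batch):
    # Returns a dict of column name -> array, which the check accepts.
    return {"data": batch["data"] * 2}


batch = {"data": np.arange(4)}
result = validate_batch(double_good(batch))
```

A function shaped like `double_good` never produces a bare array, so the `"__value__"` wrapping path is not needed for it.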
Current public APIs already use explicit column names:
- from_numpy() uses "data" as the column name
- from_items() uses "item" as the column name
- range_tensor() uses "data" as the column name
- read_numpy() uses "data" as the column name
The remaining usages of TENSOR_COLUMN_NAME are:
- Dead code in TableBlockBuilder.add() that wraps numpy arrays (never triggered by current code paths)
- Backwards-compatibility logic in _convert_batch_type_to_numpy() that auto-unwraps single tensor columns
- Tensor detection logic in _should_convert_to_tensor() that checks column_name == TENSOR_COLUMN_NAME
- Row extraction helpers _build_tensor_row() in pandas/arrow block accessors
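One way to confirm that no usages remain after the cleanup is a small text scan. The sketch below is a hypothetical helper, not part of Ray; the sample source string is illustrative and only borrows names mentioned in this issue.

```python
import re

# Matches the constant name or its literal value, both taken from this issue.
LEGACY_PATTERN = re.compile(r"\bTENSOR_COLUMN_NAME\b|__value__")


def find_legacy_references(source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) pairs that mention the constant or its value."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if LEGACY_PATTERN.search(line):
            hits.append((lineno, line.strip()))
    return hits


# Illustrative snippet only; the real files will look different.
sample = '''\
from ray.data.constants import TENSOR_COLUMN_NAME

def _build_tensor_row(row, col_name: str = TENSOR_COLUMN_NAME):
    return row[col_name]

def unrelated():
    return "data"
'''

hits = find_legacy_references(sample)
for lineno, text in hits:
    print(f"{lineno}: {text}")
```

Running the same function over the contents of each file touched by the change gives a quick checklist of lines that still need attention.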
Implementation Boundaries & Constraints
- Target Files:
  - python/ray/data/constants.py - Remove TENSOR_COLUMN_NAME definition
  - python/ray/data/_internal/table_block.py - Remove numpy array handling in TableBlockBuilder.add() (lines 79-80)
  - python/ray/data/util/data_batch_conversion.py - Remove backwards-compat logic in _convert_batch_type_to_pandas() and _convert_batch_type_to_numpy()
  - python/ray/data/_internal/tensor_extensions/utils.py - Remove column_name == TENSOR_COLUMN_NAME check in _should_convert_to_tensor()
  - python/ray/data/_internal/pandas_block.py - Remove or update _build_tensor_row()
  - python/ray/data/_internal/arrow_block.py - Remove default parameter col_name: str = TENSOR_COLUMN_NAME from _build_tensor_row()
  - python/ray/data/tests/unit/test_data_batch_conversion.py - Update tests that reference TENSOR_COLUMN_NAME
  - python/ray/data/tests/conftest.py - Update test fixtures that use TENSOR_COLUMN_NAME
- Do Not Touch:
  - python/ray/air/constants.py - This is in the AIR module (separate cleanup)
  - python/ray/air/util/data_batch_conversion.py - This is in the AIR module (separate cleanup)
  - python/ray/train/ - Predictor classes that use TENSOR_COLUMN_NAME are in the Train module
- Breaking Change Assessment:
  - This is not a user-facing breaking change because:
    - TENSOR_COLUMN_NAME is not exported in any __init__.py
    - Current public APIs (from_numpy, etc.) already use different column names like "data"
    - Passing raw numpy arrays to map_batches() already errors since Ray 2.5
  - Users who hardcoded "__value__" in their code were relying on undocumented internal behavior
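For downstream code that did hardcode "__value__", a migration could be as small as renaming the key in dict-style batches. The helper below is a hypothetical sketch in plain Python; `rename_legacy_column` and its default target name "data" are assumptions, not an API that Ray provides.

```python
LEGACY_COLUMN = "__value__"  # value of the removed constant, per this issue


def rename_legacy_column(batch: dict, new_name: str = "data") -> dict:
    """Hypothetical helper: return a copy of `batch` with "__value__" renamed."""
    if LEGACY_COLUMN not in batch:
        return dict(batch)
    if new_name in batch:
        raise KeyError(f"column {new_name!r} already exists")
    # Rebuild the dict so that the column order is preserved.
    return {
        (new_name if key == LEGACY_COLUMN else key): value
        for key, value in batch.items()
    }


old_batch = {"__value__": [1, 2, 3], "label": [0, 1, 0]}
new_batch = rename_legacy_column(old_batch)
```

The helper raises rather than overwriting when the target column already exists, so a collision surfaces as an error.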
Contributing expectations
Please follow the Ray Data Contributing Guide for development setup and testing instructions.