[Data] Remove deprecated TENSOR_COLUMN_NAME constant and associated dead code #60547

@bveeramani

Description

Remove the TENSOR_COLUMN_NAME constant ("__value__") from Ray Data. This constant was historically used to wrap raw numpy arrays into single-column tables, but this behavior has been deprecated since Ray 2.5. The constant and its associated code paths are now dead code that should be cleaned up.

Background

TENSOR_COLUMN_NAME (defined as "__value__") was introduced to handle cases where users passed raw numpy arrays to Ray Data APIs. When a raw numpy array was encountered, it would be automatically wrapped into a single-column table with the column name "__value__".

Since Ray 2.5, passing raw numpy arrays to APIs like map_batches() raises an explicit error:

if isinstance(batch, np.ndarray):
    raise ValueError(
        "Standalone numpy arrays are not allowed in Ray 2.5. "
        "Return a dict of field -> array, e.g., `{'data': array}` instead of `array`."
    )

See: block.py:463-469
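
For UDFs that still return a bare array, the fix the error message asks for is to return a dict of field -> array. As a rough sketch of one way to migrate such a UDF without rewriting its body, the adapter below wraps a non-dict return value under an explicit column name. The helper name `as_named_column` and its `column` parameter are hypothetical and not part of Ray Data. It also assumes the wrapped UDF returns either a dict or a bare array, so UDFs that return pandas DataFrames or Arrow tables should not be passed through it.

```python
from functools import wraps
from typing import Any, Callable, Dict


def as_named_column(
    fn: Callable[[Dict[str, Any]], Any], column: str = "data"
) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Hypothetical adapter: give a legacy UDF's bare output an explicit column name.

    Assumption: `fn` returns either a dict (already valid) or a bare
    array-like value that should become a single named column.
    """

    @wraps(fn)
    def wrapper(batch: Dict[str, Any]) -> Dict[str, Any]:
        out = fn(batch)
        if isinstance(out, dict):
            # Already in the `{field: array}` form; pass it through unchanged.
            return out
        # Wrap the bare value under an explicit name instead of "__value__".
        return {column: out}

    return wrapper
```

A legacy UDF wrapped this way could then be passed to map_batches() as `as_named_column(legacy_udf)`, with downstream code reading the `"data"` column instead of relying on `"__value__"`.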

Current public APIs already use explicit column names:

  • from_numpy() uses "data" as the column name
  • from_items() uses "item" as the column name
  • range_tensor() uses "data" as the column name
  • read_numpy() uses "data" as the column name

The remaining usages of TENSOR_COLUMN_NAME are:

  1. Dead code in TableBlockBuilder.add() that wraps numpy arrays (never triggered by current code paths)
  2. Backwards-compatibility logic in _convert_batch_type_to_numpy() that auto-unwraps single tensor columns
  3. Tensor detection logic in _should_convert_to_tensor() that checks column_name == TENSOR_COLUMN_NAME
  4. Row extraction helpers _build_tensor_row() in pandas/arrow block accessors

Implementation Boundaries & Constraints

  • Target Files:

    • python/ray/data/constants.py - Remove TENSOR_COLUMN_NAME definition
    • python/ray/data/_internal/table_block.py - Remove numpy array handling in TableBlockBuilder.add() (lines 79-80)
    • python/ray/data/util/data_batch_conversion.py - Remove backwards-compat logic in _convert_batch_type_to_pandas() and _convert_batch_type_to_numpy()
    • python/ray/data/_internal/tensor_extensions/utils.py - Remove column_name == TENSOR_COLUMN_NAME check in _should_convert_to_tensor()
    • python/ray/data/_internal/pandas_block.py - Remove or update _build_tensor_row()
    • python/ray/data/_internal/arrow_block.py - Remove default parameter col_name: str = TENSOR_COLUMN_NAME from _build_tensor_row()
    • python/ray/data/tests/unit/test_data_batch_conversion.py - Update tests that reference TENSOR_COLUMN_NAME
    • python/ray/data/tests/conftest.py - Update test fixtures that use TENSOR_COLUMN_NAME
  • Do Not Touch:

    • python/ray/air/constants.py - This is in the AIR module (separate cleanup)
    • python/ray/air/util/data_batch_conversion.py - This is in the AIR module (separate cleanup)
    • python/ray/train/ - Predictor classes that use TENSOR_COLUMN_NAME are in Train module
  • Breaking Change Assessment:

    • This is not a user-facing breaking change because:
      1. TENSOR_COLUMN_NAME is not exported in any __init__.py
      2. Current public APIs (from_numpy, etc.) already use different column names like "data"
      3. Passing raw numpy arrays to map_batches() already errors since Ray 2.5
    • Users who hardcoded "__value__" in their code were relying on undocumented internal behavior
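
To check progress against the boundaries above, a small script can list the remaining references to TENSOR_COLUMN_NAME while skipping the "Do Not Touch" paths. This is only a sketch: the function name `find_references` and the default `search_dir` are assumptions made for illustration, and the excluded paths are copied from the list in this issue.

```python
import re
from pathlib import Path
from typing import Dict, List

# Paths this issue marks as out of scope, relative to the repository root.
EXCLUDED_PREFIXES = (
    "python/ray/air/constants.py",
    "python/ray/air/util/data_batch_conversion.py",
    "python/ray/train/",
)

# Match the constant as a whole identifier, not as part of a longer name.
PATTERN = re.compile(r"\bTENSOR_COLUMN_NAME\b")


def find_references(
    repo_root: str, search_dir: str = "python/ray"
) -> Dict[str, List[int]]:
    """Map each in-scope .py file to the line numbers that mention the constant."""
    root = Path(repo_root)
    hits: Dict[str, List[int]] = {}
    for path in sorted((root / search_dir).rglob("*.py")):
        rel = path.relative_to(root).as_posix()
        if rel.startswith(EXCLUDED_PREFIXES):
            continue  # Left for the separate AIR / Train cleanup.
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        matched = [n for n, line in enumerate(lines, start=1) if PATTERN.search(line)]
        if matched:
            hits[rel] = matched
    return hits
```

Running `find_references(".")` from the repository root before and after the change gives a quick view of which target files still mention the constant. An empty result would suggest that nothing in scope references it any longer. The same scan also covers `__init__.py` files, which is one way to double-check the claim that the constant is not exported.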

Contributing expectations

Please follow the Ray Data Contributing Guide for development setup and testing instructions.

Metadata

Labels

  • P2: Important issue, but not time-critical
  • data: Ray Data-related issues
