[Data] ArrowInvalid error when you backfill missing fields from map tasks #60628

@bveeramani

What happened + What you expected to happen

A possible fix is to update _backfill_missing_fields so that it casts existing fields to the unified type when their types don't match. It currently handles only nested structs and tensor types, not primitive type mismatches such as int64 -> float64.

Versions / Dependencies

1b4e9c8

Reproduction script

import ray


def generator_fn(batch):
    for i, row_id in enumerate(batch["id"]):
        if i % 2 == 0:
            # Yield struct with fields (a: int64, b: string)
            yield {"data": [{"a": 1, "b": "hello"}]}
        else:
            # Yield struct with fields (a: float64, c: int64)
            # Field 'a' has different type, field 'b' missing, field 'c' new
            yield {"data": [{"a": 1.5, "c": 100}]}


ds = ray.data.range(4, override_num_blocks=1)
ds = ds.map_batches(generator_fn, batch_size=4)
ds.materialize()

Issue Severity

High: It blocks me from completing my task.

Labels

P0, bug, data, stability
