
Remove Schema From BlockMetadata #53454

Merged
raulchen merged 20 commits into ray-project:master from iamjustinhsu:jhsu/remove-schema-from-block-metadata
Jun 12, 2025

Conversation

@iamjustinhsu
Contributor

@iamjustinhsu iamjustinhsu commented May 30, 2025

Why are these changes needed?

Currently, each block has a schema. If there are many blocks in a ref bundle, that schema is duplicated across all of them. We should attach the concept of a schema at the dataset/operator level, not the block/bundle level. This PR removes the schema from BlockMetadata and moves it to the PhysicalOperator level. This should decrease block overhead and serde runtime.

I think it would be advantageous to combine the constructs of BlockMetadata and Schema into a third class/named tuple to make them easier to pass around in certain scenarios, but for now I made it a tuple since that is easier to handle.
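For illustration, the combined construct described above might look something like the following sketch. The class and field names here are assumptions for exposition, not necessarily the ones used in the PR:

```python
from typing import List, NamedTuple, Optional

class BlockMetadata(NamedTuple):
    """Minimal stand-in for Ray Data's per-block metadata."""
    num_rows: int
    size_bytes: int

class MetadataAndSchema(NamedTuple):
    """Pairs per-block metadata with a single shared schema."""
    metadata: List[BlockMetadata]   # one entry per block
    schema: Optional[dict]          # stored once per bundle, not once per block

# The schema object is shared across all blocks in the bundle instead of
# being duplicated inside every BlockMetadata.
bundle = MetadataAndSchema(
    metadata=[BlockMetadata(num_rows=100, size_bytes=4096) for _ in range(8)],
    schema={"id": "int64"},
)
total_bytes = sum(m.size_bytes for m in bundle.metadata)
print(total_bytes)  # 32768
```

With this shape, eight blocks share one schema reference rather than serializing eight copies of it.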

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@iamjustinhsu iamjustinhsu force-pushed the jhsu/remove-schema-from-block-metadata branch 5 times, most recently from a765c6b to 0948584 Compare May 31, 2025 16:28
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/remove-schema-from-block-metadata branch from 0948584 to 0127f7d Compare May 31, 2025 17:32
@iamjustinhsu iamjustinhsu added the go add ONLY when ready to merge, run all tests label May 31, 2025
@iamjustinhsu iamjustinhsu force-pushed the jhsu/remove-schema-from-block-metadata branch 3 times, most recently from dc8da44 to fbf29fa Compare May 31, 2025 23:07
return type; addin schema=;

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/remove-schema-from-block-metadata branch from fbf29fa to 2ff27fe Compare June 1, 2025 02:37
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/remove-schema-from-block-metadata branch from 4f8a2c5 to e54f195 Compare June 1, 2025 06:07
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/remove-schema-from-block-metadata branch from 067c136 to 38db8f4 Compare June 3, 2025 22:38
```python
    meta_schema.schema,
)
bytes_read += meta.size_bytes
bytes_read += meta_schema.metadata.size_bytes
```
Contributor

Not really a fan of that double wrapping -- why not just return a tuple and unpack it above?

Contributor Author
@iamjustinhsu iamjustinhsu Jun 4, 2025

Oh, I was trying to be consistent in using MetadataAndSchema. Even though it double-unwraps, the alternative of a tuple makes the code inconsistent with how we handle metadata and schema.

Contributor

@iamjustinhsu I'm not sure I follow your point. Please elaborate.

Contributor Author

If I understand correctly, you want:

```python
metadata, schema = ray.get(...)  # Tuple[BlockMetadata, Schema]
bytes_read += metadata.size_bytes
```

but since we use MetadataAndSchema everywhere, there's no reason we shouldn't be able to do:

```python
meta_schema: MetadataAndSchema = ray.get(...)
bytes_read += meta_schema.metadata.size_bytes
```

```python
break

self._schema = schema
for _ in iter_ref_bundles:
```
Contributor

Schemas shouldn't be in the Operator hierarchy, but we can keep them inside OpState (for now).
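A rough sketch of keeping the schema in per-operator execution state rather than on the operator itself. OpState here is a simplified stand-in for the streaming executor's per-operator state, not the real class:

```python
from typing import Optional

class OpState:
    """Simplified stand-in for the streaming executor's per-operator state."""

    def __init__(self) -> None:
        self._schema: Optional[dict] = None

    def observe_output_schema(self, schema: Optional[dict]) -> None:
        # Remember the first non-empty schema seen in this operator's output
        # bundles; empty blocks may legitimately report no schema.
        if self._schema is None and schema:
            self._schema = schema

    @property
    def schema(self) -> Optional[dict]:
        return self._schema

state = OpState()
state.observe_output_schema(None)             # empty block: ignored
state.observe_output_schema({"id": "int64"})  # first real schema wins
state.observe_output_schema({"id": "int32"})  # later schemas don't overwrite
```

The operator hierarchy stays schema-free; the executor state tracks one schema per operator.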

Contributor Author
@iamjustinhsu iamjustinhsu Jun 4, 2025

To Hao's point, physical operators do contain a lot of state. In fact, the docstring says they are stateful:

Physical operators are stateful and non-serializable; they live on the driver side
of the Dataset only.

I'm a little unclear on why we can't keep schemas at the PhysicalOperator level?

Contributor

Let's consolidate this discussion in the other thread

#53454 (comment)

Contributor
@raulchen raulchen left a comment

The overall change looks good to me.
Most of the comments are about code structure and code style.


```python
# TODO(justin): split this into 2; it's not always the case
# that both schema and metadata are correlated
class GuessMetadataMixin(ABC):
```
Contributor

I think I still prefer just putting these methods in the LogicalOperator, for the following reasons:

  • guess_schema can also return None, so knowing an operator is a GuessMetadataMixin doesn't guarantee knowing the schema.
  • guess_schema/guess_metadata isn't specific to source operators.
  • The code would be cleaner: no need for the isinstance checks.

Of course, the downside is that subclasses need to remember to override the implementation. But with the mixin, you need to remember to make a subclass inherit from GuessMetadataMixin as well, so there's not much difference.

Also a couple of other suggestions:

  1. I slightly prefer infer_ over guess_ for the method names.
  2. (A new issue) It feels weird to have guess_metadata or aggregate_output_metadata, because a BlockMetadata is supposed to describe one single block, not the entire op. I checked the use cases, and we only use schema/input_files/num_rows from the metadata, so we can probably just have separate infer_schema and infer_num_rows.
     • input_files should probably also be removed from the metadata, because it should be an attribute of a source logical operator, not of a block. No need to handle this in this PR, though.
  3. SourceOperatorMixin doesn't need to be a mixin. It can just be a subclass of LogicalOperator with an input_files method.
  4. For output_data, we can introduce another subclass, ExistingDataSourceOperator.
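A minimal sketch of the base-class alternative suggested here, using the preferred infer_ naming. The real Ray Data signatures may differ; this only shows the shape of the idea:

```python
from typing import Optional

class LogicalOperator:
    """Simplified stand-in: every operator can be asked, no isinstance checks."""

    def infer_schema(self) -> Optional[dict]:
        return None  # unknown by default; subclasses override when they can

    def infer_num_rows(self) -> Optional[int]:
        return None  # unknown by default

class Range(LogicalOperator):
    """Toy source operator that knows its row count up front."""

    def __init__(self, n: int) -> None:
        self._n = n

    def infer_num_rows(self) -> Optional[int]:
        return self._n

class MapBatches(LogicalOperator):
    """Toy transform whose output schema/rows can't be known statically."""

# Callers just call the method; None means "unknown", no type checks needed.
ops = [Range(10), MapBatches()]
rows = [op.infer_num_rows() for op in ops]
print(rows)  # [10, None]
```

Callers handle the None case uniformly instead of branching on whether an operator happens to implement a mixin.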

Contributor Author
@iamjustinhsu iamjustinhsu Jun 4, 2025

OK, I removed the GuessMetadataMixin, but still kept SourceOperator (renamed from SourceOperatorMixin) to hold def output_data.

for output_data, we can introduce another subclass ExistingDataSourceOperator.

I just found it weird that we didn't have a standard way of checking whether an operator is a source operator (prior to this, we were checking is_source_op == (isinstance(op, Read) or len(op.input_dependencies) == 0)). ExistingDataSourceOperator looks good, but it doesn't apply to Read, which is a SourceOperator, so I just applied SourceOperator to Read but let it return None.

  1. Done.
  2. Yeah, I'll make a TODO.
  3. Not sure I entirely followed, because AbstractFrom and InputData don't necessarily contain input_files, but I just folded SourceOperator into one class to keep things simple.
  4. Addressed above.

Contributor

To clarify, what I proposed is this hierarchy:

  • LogicalOperator
    • SourceOperator
      • Read (input_files)
      • ExistingDataSourceOperator (output_data)
        • AbstractFrom
        • InputDataBuffer

Only ExistingDataSourceOperator has output_data.
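That proposed hierarchy could be sketched like this (a hypothetical outline for discussion, not the code that was merged):

```python
from typing import List, Optional

class LogicalOperator:
    pass

class SourceOperator(LogicalOperator):
    """Common base for operators with no upstream inputs."""

class Read(SourceOperator):
    """Reads external storage, so it reports input files but has no output_data."""

    def __init__(self, files: List[str]) -> None:
        self._files = files

    def input_files(self) -> List[str]:
        return self._files

class ExistingDataSourceOperator(SourceOperator):
    """Source backed by data that already exists; only this branch has output_data."""

    def __init__(self, data: list) -> None:
        self._data = data

    def output_data(self) -> Optional[list]:
        return self._data

class AbstractFrom(ExistingDataSourceOperator):
    pass

class InputDataBuffer(ExistingDataSourceOperator):
    pass

# One standard check replaces ad-hoc tests like
#   isinstance(op, Read) or len(op.input_dependencies) == 0
op = AbstractFrom([{"id": 1}])
is_source = isinstance(op, SourceOperator)
```

The single `isinstance(op, SourceOperator)` check is the standardization the thread is after.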

Contributor Author

Oh I see. I think that would be good, just not in this PR, as it's already large enough. The main motivation for SourceOperator was that checking for a source operator was non-standard.

Contributor

SG

@iamjustinhsu iamjustinhsu force-pushed the jhsu/remove-schema-from-block-metadata branch 3 times, most recently from a952a2e to 0ca6ef1 Compare June 4, 2025 16:56
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/remove-schema-from-block-metadata branch from 0ca6ef1 to 824f3ff Compare June 4, 2025 17:49
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/remove-schema-from-block-metadata branch 3 times, most recently from 51af022 to 6946adc Compare June 4, 2025 19:08
@iamjustinhsu iamjustinhsu removed the request for review from a team June 10, 2025 23:15
@iamjustinhsu iamjustinhsu force-pushed the jhsu/remove-schema-from-block-metadata branch 2 times, most recently from 8a15552 to 0543925 Compare June 10, 2025 23:28
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/remove-schema-from-block-metadata branch from 0543925 to 39c3fb1 Compare June 10, 2025 23:33
…u/remove-schema-from-block-metadata

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/remove-schema-from-block-metadata branch from e0e93b0 to bef52c4 Compare June 11, 2025 00:50
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/remove-schema-from-block-metadata branch from 5454b3f to c043762 Compare June 11, 2025 02:32
Contributor
@raulchen raulchen left a comment

LGTM. Let's follow up with some TODOs in different PRs.

@raulchen raulchen merged commit a1b9dc6 into ray-project:master Jun 12, 2025
5 checks passed
@iamjustinhsu iamjustinhsu deleted the jhsu/remove-schema-from-block-metadata branch June 12, 2025 21:29
raulchen pushed a commit that referenced this pull request Jun 13, 2025
Follow-up to #53454

Closes #53786

Signed-off-by: Matthew Deng <matt@anyscale.com>
@iamjustinhsu iamjustinhsu mentioned this pull request Jun 17, 2025
8 tasks
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
bveeramani pushed a commit that referenced this pull request Jun 19, 2025
## Why are these changes needed?

Consider the following code
```python
ds = ray.data.range(10)
ds = ds.repartition()
ds = ds.map_batches(lambda x : x)
it1, it2 = ds.split(2)
# repr(it2) doesn't contain schema sometimes???
```
^ This is flakey

```python
ds = ray.data.range(10)
ds = ds.repartition()
# ds = ds.map_batches(lambda x : x)
it1, it2 = ds.split(2)
# repr(it2) contains schema??? 
```
^ This isn't flakey???

Explanation:
- There are many scenarios where we produce empty blocks (e.g., in the shuffle map task, where we slice blocks and send them to reduce tasks).
- Empty blocks in pyarrow still have a schema.
- When deduping schemas, we take the FIRST schema as the source of truth by default. However, we should take the first NON-EMPTY schema, because block order is non-deterministic. That's what the code did before
#53454, with the check `schema is None` -- but it should really be `not schema`, which covers both `None` and empty schemas.
- The first block of code failed because the user mapped empty blocks to empty blocks; in our code, *with a UDF*, empty blocks have no schema.
- The second block of code secretly succeeded because the user did not use a UDF; instead, we created empty blocks with the original schema.

I ran it a gazillion times; it should not be flakey anymore.
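The difference between `schema is None` and `not schema` can be shown with a self-contained stub. A real pyarrow.Schema with zero fields should be similarly falsy via its field count, but the stub below keeps the example runnable without pyarrow:

```python
class FakeSchema:
    """Stub schema; truthiness follows the field count, like an Arrow schema."""

    def __init__(self, fields):
        self.fields = list(fields)

    def __len__(self) -> int:
        return len(self.fields)

def dedupe_schema(schemas):
    """Return the first NON-EMPTY schema, skipping None and zero-field schemas."""
    for schema in schemas:
        if not schema:  # covers both None and empty schemas
            continue
        return schema
    return None

empty = FakeSchema([])       # an empty block's schema: not None, but falsy
real = FakeSchema(["id"])

# Block order is non-deterministic, so an empty block's schema can come first.
# The old `schema is None` check would wrongly pick it as the source of truth.
picked_by_is_none = next(s for s in [empty, real] if s is not None)
picked_by_not = dedupe_schema([empty, real])
```

Here `picked_by_is_none` is the empty schema (the bug), while `picked_by_not` is the real one (the fix).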

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
snorkelopstesting3-bot pushed a commit to snorkel-marlin-repos/ray-project_ray_pr_53949_c2fd26f9-40af-409d-a8b3-21ac5455fce7 that referenced this pull request Oct 22, 2025