
[data] Fix performance degradation on iceberg data source when reading large iceberg table#49054

Merged
raulchen merged 9 commits into ray-project:master from
jimmyxie-figma:jimmyxie/fix-iceberg-read-performance-issue-when-reading-large-table
Dec 7, 2024

Conversation

@jimmyxie-figma
Contributor

@jimmyxie-figma jimmyxie-figma commented Dec 4, 2024

Why are these changes needed?

When reading a large Iceberg table, the Iceberg data source hangs after creating the read tasks. The relevant console log for this issue is shown below. The threshold for the read function is 1 MB, and the actual pickled function shouldn't be bigger than a couple of KBs.

The serialized size of your read function named '<lambda>' is 6.3MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in '<lambda>'.

This PR fixes two issues:

  • `_get_read_task` references `self`, which makes the lambda function large when pickled/spilled to disk. This in turn makes the Iceberg data source extremely slow when reading large tables. This PR removes all `self` references from `_get_read_task`.

  • The `_get_read_task` lambda excessively hits the metastore (on every task read) because `Table` is not pickle-able. Neither `Catalog` nor `Table` is pickle-able, but the task reader doesn't need either of them. It needs `FileIO` and `TableMetadata` instead, both of which happen to be pickle-able, so we pass them to the function explicitly.
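The effect of the fix can be sketched outside Ray with plain `pickle` and `functools.partial`. The class and function names below are hypothetical stand-ins, not Ray's actual Iceberg datasource API; the point is only that a bound method drags `self` (and any large state on it) into every serialized task, while a `partial` over a module-level function carries just the small picklable properties:

```python
import pickle
from functools import partial


def _get_read_task(task_id, limit):
    # Module-level function: picklable by reference, captures no `self`.
    return ("read", task_id, limit)


class IcebergDatasource:
    """Toy stand-in for the datasource (hypothetical names, not Ray's API)."""

    def __init__(self):
        # Large planning state that must NOT leak into per-task functions.
        self._big_cache = list(range(200_000))
        self._limit = 10

    def _get_read_task_bound(self, task_id):
        # Before the fix: serializing this bound method drags `self`
        # (including `_big_cache`) into every read task.
        return ("read", task_id, self._limit)

    def make_task_fn(self, task_id):
        # After the fix: pre-apply only the small picklable properties,
        # so the serialized task function carries no `self` reference.
        return partial(_get_read_task, task_id=task_id, limit=self._limit)


ds = IcebergDatasource()
small = len(pickle.dumps(ds.make_task_fn(1)))
large = len(pickle.dumps(partial(ds._get_read_task_bound, 1)))
print(small, large)  # `small` is a few hundred bytes; `large` is ~1 MB
```

The same idea applies to `FileIO` and `TableMetadata`: because they are picklable and small relative to the full table state, pre-applying them through `partial` keeps each serialized read task tiny.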

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jimmyxie-figma jimmyxie-figma requested a review from a team as a code owner December 4, 2024 00:18
… iceberg table

Signed-off-by: Jimmy Xie <rxie@figma.com>
Signed-off-by: Jimmy Xie <rxie@figma.com>
@jimmyxie-figma jimmyxie-figma force-pushed the jimmyxie/fix-iceberg-read-performance-issue-when-reading-large-table branch from 7d46968 to d332431 Compare December 4, 2024 00:20
@jimmyxie-figma jimmyxie-figma changed the title Fix performance degradation on iceberg data source when reading large iceberg table [Data] Fix performance degradation on iceberg data source when reading large iceberg table Dec 4, 2024
Signed-off-by: Jimmy Xie <rxie@figma.com>
Signed-off-by: Jimmy Xie <rxie@figma.com>
Signed-off-by: Jimmy Xie <rxie@figma.com>
@raulchen raulchen self-assigned this Dec 5, 2024
@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Dec 5, 2024
Contributor

@alexeykudinkin alexeykudinkin left a comment


LGTM, minor comments

Signed-off-by: Jimmy Xie <rxie@figma.com>
Signed-off-by: Jimmy Xie <rxie@figma.com>
@jimmyxie-figma jimmyxie-figma changed the title [Data] Fix performance degradation on iceberg data source when reading large iceberg table [data] Fix performance degradation on iceberg data source when reading large iceberg table Dec 5, 2024
Comment on lines +205 to +208
# Get required properties for reading tasks - table IO, table metadata,
# row filter, case sensitivity,limit and projected schema. pre-apply
# them to `_get_read_task` through partial to avoid `self` reference
# which causes perfromance degradation during serialization
Contributor


Suggested change
# Get required properties for reading tasks - table IO, table metadata,
# row filter, case sensitivity,limit and projected schema. pre-apply
# them to `_get_read_task` through partial to avoid `self` reference
# which causes perfromance degradation during serialization
# Get required properties for reading tasks - table IO, table metadata,
# row filter, case sensitivity,limit and projected schema to pass
# them directly to `_get_read_task` to avoid capture of `self` reference
# within the closure carrying substantial overhead invoking these tasks
#
# See XXX for more context

Contributor


@jimmyxie-figma can you please also file a ticket outlining details of this issue (that you already capture in the description) and link it here for future code reader reference

Contributor Author


@alexeykudinkin added a ticket to the comment

Signed-off-by: Jimmy Xie <rxie@figma.com>
Contributor

@raulchen raulchen left a comment


thanks!

@raulchen raulchen merged commit 881a45d into ray-project:master Dec 7, 2024
@jimmyxie-figma jimmyxie-figma deleted the jimmyxie/fix-iceberg-read-performance-issue-when-reading-large-table branch December 9, 2024 14:24
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Dec 17, 2024
…g large iceberg table (ray-project#49054)

Signed-off-by: Jimmy Xie <rxie@figma.com>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
alexeykudinkin pushed a commit that referenced this pull request Sep 4, 2025
…urce when using a large number of files (#55978)

## Why are these changes needed?

Using `FileBasedDatasource` or `ParquetDatasource` with a very large
number of files causes OOM when creating read tasks. The full list of
file paths is stored in `self`, causing it to persist to every read
task, leading to this warning:
```
The serialized size of your read function named 'read_task_fn' is 49.8MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'.
```

When using a small number of blocks, OOM does not occur because the
large file list is not repeated so many times. But when setting high
parallelism with `override_num_blocks`, OOM occurs.

This is because the full list of paths is added to
`self._unresolved_paths`. This attribute isn't currently used anywhere
in Ray. This PR removes `self._unresolved_paths` to alleviate
unexpectedly high memory usage with very large numbers of files.

## Related issue number

Similar to this issue with Iceberg: #49054 

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Jack Gammack <jgammack@etsy.com>
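The symptom described in that commit can be reproduced in miniature with stdlib `pickle`. The names below are hypothetical, not Ray's actual classes: a task function bound to a datasource that keeps the full path list on `self` serializes at roughly the size of that list, while a free function taking only its own slice of paths stays small:

```python
import pickle
from functools import partial


def read_task(task_paths):
    # Free function: each serialized task carries only its own paths.
    return len(task_paths)


class FileDatasource:
    """Toy reproduction of the symptom (hypothetical, not Ray's class)."""

    def __init__(self, paths):
        self._unresolved_paths = list(paths)  # large state retained on self

    def read_task_fn(self, task_paths):
        # Bound method: serializing it drags the whole datasource,
        # including the full path list, into every read task.
        return len(task_paths)


paths = [f"s3://bucket/part-{i:06d}.parquet" for i in range(50_000)]
ds = FileDatasource(paths)
chunk = paths[:10]  # the slice one task actually reads

per_task_bound = len(pickle.dumps(partial(ds.read_task_fn, chunk)))
per_task_free = len(pickle.dumps(partial(read_task, chunk)))
print(per_task_free < per_task_bound)  # True: dropping `self` shrinks each task
```

With high parallelism (many read tasks), the bound-method cost is paid once per task, which is why `override_num_blocks` amplifies the memory blow-up described above.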
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
…urce when using a large number of files (ray-project#55978)

Signed-off-by: sampan <sampan@anyscale.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
…urce when using a large number of files (ray-project#55978)

Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
…urce when using a large number of files (ray-project#55978)

Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
alexwang177 pushed a commit to pinterest/ray that referenced this pull request Sep 17, 2025
…urce when using a large number of files (ray-project#55978)

dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
…urce when using a large number of files (ray-project#55978)

Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…urce when using a large number of files (ray-project#55978)


Labels

community-backlog, go (add ONLY when ready to merge, run all tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants