
[data] Fix performance degradation on iceberg data source when reading large iceberg table#49054

Merged
raulchen merged 9 commits into ray-project:master from
jimmyxie-figma:jimmyxie/fix-iceberg-read-performance-issue-when-reading-large-table
Dec 7, 2024

Conversation

@jimmyxie-figma
Contributor

@jimmyxie-figma jimmyxie-figma commented Dec 4, 2024

Why are these changes needed?

When reading a large Iceberg table, the Iceberg data source hangs after creating the read tasks. The relevant console log for this issue is shown below. The threshold for the read function is 1 MB, and the actual pickled function shouldn't be bigger than a couple of KBs.

The serialized size of your read function named '<lambda>' is 6.3MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in '<lambda>'.

This PR fixes two issues:

  • `_get_read_task` references `self`, which makes the lambda function large when pickled/spilled to disk. This in turn makes the Iceberg data source extremely slow when reading large tables. This PR removes all `self` references from `_get_read_task`.

  • The `_get_read_task` lambda excessively hits the metastore (on every task read) because `Table` is not pickle-able. Neither `Catalog` nor `Table` is pickle-able, but the task reader doesn't need either of them. It needs `FileIO` and `TableMetadata` instead, both of which happen to be pickle-able, so we pass them to the function explicitly.
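The effect of the fix can be sketched outside Ray with plain `pickle` and `functools.partial`. The class and function names below are hypothetical stand-ins, not Ray's actual Iceberg datasource API; the point is only that a bound method drags `self` (and any large state on it) into every serialized task, while a `partial` over a module-level function carries just the small picklable properties:

```python
import pickle
from functools import partial


def _get_read_task(task_id, limit):
    # Module-level function: picklable by reference, captures no `self`.
    return ("read", task_id, limit)


class IcebergDatasource:
    """Toy stand-in for the datasource (hypothetical names, not Ray's API)."""

    def __init__(self):
        # Large planning state that must NOT leak into per-task functions.
        self._big_cache = list(range(200_000))
        self._limit = 10

    def _get_read_task_bound(self, task_id):
        # Before the fix: serializing this bound method drags `self`
        # (including `_big_cache`) into every read task.
        return ("read", task_id, self._limit)

    def make_task_fn(self, task_id):
        # After the fix: pre-apply only the small picklable properties,
        # so the serialized task function carries no `self` reference.
        return partial(_get_read_task, task_id=task_id, limit=self._limit)


ds = IcebergDatasource()
small = len(pickle.dumps(ds.make_task_fn(1)))
large = len(pickle.dumps(partial(ds._get_read_task_bound, 1)))
print(small, large)  # `small` is a few hundred bytes; `large` is ~1 MB
```

The same idea applies to `FileIO` and `TableMetadata`: because they are picklable and small relative to the full table state, pre-applying them through `partial` keeps each serialized read task tiny.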

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jimmyxie-figma jimmyxie-figma requested a review from a team as a code owner December 4, 2024 00:18
… iceberg table

Signed-off-by: Jimmy Xie <rxie@figma.com>
Signed-off-by: Jimmy Xie <rxie@figma.com>
@jimmyxie-figma jimmyxie-figma force-pushed the jimmyxie/fix-iceberg-read-performance-issue-when-reading-large-table branch from 7d46968 to d332431 Compare December 4, 2024 00:20
@jimmyxie-figma jimmyxie-figma changed the title Fix performance degradation on iceberg data source when reading large iceberg table [Data] Fix performance degradation on iceberg data source when reading large iceberg table Dec 4, 2024
Signed-off-by: Jimmy Xie <rxie@figma.com>
Signed-off-by: Jimmy Xie <rxie@figma.com>
Signed-off-by: Jimmy Xie <rxie@figma.com>
@raulchen raulchen self-assigned this Dec 5, 2024
@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Dec 5, 2024
Contributor

@alexeykudinkin alexeykudinkin left a comment


LGTM, minor comments

Signed-off-by: Jimmy Xie <rxie@figma.com>
Signed-off-by: Jimmy Xie <rxie@figma.com>
@jimmyxie-figma jimmyxie-figma changed the title [Data] Fix performance degradation on iceberg data source when reading large iceberg table [data] Fix performance degradation on iceberg data source when reading large iceberg table Dec 5, 2024
Comment on lines +205 to +208
# Get required properties for reading tasks - table IO, table metadata,
# row filter, case sensitivity,limit and projected schema. pre-apply
# them to `_get_read_task` through partial to avoid `self` reference
# which causes perfromance degradation during serialization
Contributor


Suggested change
# Get required properties for reading tasks - table IO, table metadata,
# row filter, case sensitivity,limit and projected schema. pre-apply
# them to `_get_read_task` through partial to avoid `self` reference
# which causes perfromance degradation during serialization
# Get required properties for reading tasks - table IO, table metadata,
# row filter, case sensitivity,limit and projected schema to pass
# them directly to `_get_read_task` to avoid capture of `self` reference
# within the closure carrying substantial overhead invoking these tasks
#
# See XXX for more context

Contributor


@jimmyxie-figma can you please also file a ticket outlining details of this issue (that you already capture in the description) and link it here for future code reader reference

Contributor Author


@alexeykudinkin added a ticket to the comment

Signed-off-by: Jimmy Xie <rxie@figma.com>
Contributor

@raulchen raulchen left a comment


thanks!

@raulchen raulchen merged commit 881a45d into ray-project:master Dec 7, 2024
@jimmyxie-figma jimmyxie-figma deleted the jimmyxie/fix-iceberg-read-performance-issue-when-reading-large-table branch December 9, 2024 14:24
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Dec 17, 2024
…g large iceberg table (ray-project#49054)

Signed-off-by: Jimmy Xie <rxie@figma.com>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
alexeykudinkin pushed a commit that referenced this pull request Sep 4, 2025
…urce when using a large number of files (#55978)

## Why are these changes needed?

Using `FileBasedDatasource` or `ParquetDatasource` with a very large
number of files causes OOM when creating read tasks. The full list of
file paths is stored in `self`, causing it to persist to every read
task, leading to this warning:
```
The serialized size of your read function named 'read_task_fn' is 49.8MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'.
```

When using a small number of blocks, OOM does not occur because the
large file list is not repeated so many times. But when setting high
parallelism with `override_num_blocks`, OOM occurs.

This is because the full list of paths is added to
`self._unresolved_paths`. This attribute isn't currently used anywhere
in Ray. This PR removes `self._unresolved_paths` to alleviate
unexpectedly high memory usage with very large numbers of files.

## Related issue number

Similar to this issue with Iceberg: #49054 

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Jack Gammack <jgammack@etsy.com>
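The symptom described in that commit can be reproduced in miniature with stdlib `pickle`. The names below are hypothetical, not Ray's actual classes: a task function bound to a datasource that keeps the full path list on `self` serializes at roughly the size of that list, while a free function taking only its own slice of paths stays small:

```python
import pickle
from functools import partial


def read_task(task_paths):
    # Free function: each serialized task carries only its own paths.
    return len(task_paths)


class FileDatasource:
    """Toy reproduction of the symptom (hypothetical, not Ray's class)."""

    def __init__(self, paths):
        self._unresolved_paths = list(paths)  # large state retained on self

    def read_task_fn(self, task_paths):
        # Bound method: serializing it drags the whole datasource,
        # including the full path list, into every read task.
        return len(task_paths)


paths = [f"s3://bucket/part-{i:06d}.parquet" for i in range(50_000)]
ds = FileDatasource(paths)
chunk = paths[:10]  # the slice one task actually reads

per_task_bound = len(pickle.dumps(partial(ds.read_task_fn, chunk)))
per_task_free = len(pickle.dumps(partial(read_task, chunk)))
print(per_task_free < per_task_bound)  # True: dropping `self` shrinks each task
```

With high parallelism (many read tasks), the bound-method cost is paid once per task, which is why `override_num_blocks` amplifies the memory blow-up described above.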
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
…urce when using a large number of files (ray-project#55978)

Signed-off-by: sampan <sampan@anyscale.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
…urce when using a large number of files (ray-project#55978)

Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
…urce when using a large number of files (ray-project#55978)

Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
alexwang177 pushed a commit to pinterest/ray that referenced this pull request Sep 17, 2025
…urce when using a large number of files (ray-project#55978)

dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
…urce when using a large number of files (ray-project#55978)

Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…urce when using a large number of files (ray-project#55978)


Labels

community-backlog, go (add ONLY when ready to merge, run all tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants