[data] Fix reading from zipped json#58214
Merged
bveeramani merged 8 commits intoray-project:masterfrom Oct 29, 2025
Merged
Conversation
Contributor
There was a problem hiding this comment.
Code Review
This pull request addresses the issue of reading zipped JSON files in Ray Data, which was broken due to the assumption that all input files are seekable. The changes include refactoring reused code, defaulting to a chunk size of 10000 for zipped files, and handling snappy compression. The review focuses on correctness and maintainability, with suggestions for code improvements and clarity.
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
e898f51 to
33ce580
Compare
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…/jsonl-lines-handle-compressed
bveeramani
approved these changes
Oct 29, 2025
YoussefEssDS
pushed a commit
to YoussefEssDS/ray
that referenced
this pull request
Nov 8, 2025
## Description ### Status Quo This PR ray-project#54667 addressed issues of OOM by sampling a few lines of the file. However, this code always assumes the input file is seekable(ie, not compressed). This means zipped files are broken like this issue: ray-project#55356 ### Potential Workaround - Refractor reused code between JsonDatasource and FileDatasource - default to 10000 if zipped file found ## Related issues ray-project#55356 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
elliot-barn
pushed a commit
that referenced
this pull request
Nov 14, 2025
## Description ### Status Quo This PR #54667 addressed issues of OOM by sampling a few lines of the file. However, this code always assumes the input file is seekable(ie, not compressed). This means zipped files are broken like this issue: #55356 ### Potential Workaround - Refractor reused code between JsonDatasource and FileDatasource - default to 10000 if zipped file found ## Related issues #55356 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter
pushed a commit
to landscapepainter/ray
that referenced
this pull request
Nov 17, 2025
## Description ### Status Quo This PR ray-project#54667 addressed issues of OOM by sampling a few lines of the file. However, this code always assumes the input file is seekable(ie, not compressed). This means zipped files are broken like this issue: ray-project#55356 ### Potential Workaround - Refractor reused code between JsonDatasource and FileDatasource - default to 10000 if zipped file found ## Related issues ray-project#55356 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Aydin-ab
pushed a commit
to Aydin-ab/ray-aydin
that referenced
this pull request
Nov 19, 2025
## Description ### Status Quo This PR ray-project#54667 addressed issues of OOM by sampling a few lines of the file. However, this code always assumes the input file is seekable(ie, not compressed). This means zipped files are broken like this issue: ray-project#55356 ### Potential Workaround - Refractor reused code between JsonDatasource and FileDatasource - default to 10000 if zipped file found ## Related issues ray-project#55356 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier
pushed a commit
to Future-Outlier/ray
that referenced
this pull request
Dec 7, 2025
## Description ### Status Quo This PR ray-project#54667 addressed issues of OOM by sampling a few lines of the file. However, this code always assumes the input file is seekable(ie, not compressed). This means zipped files are broken like this issue: ray-project#55356 ### Potential Workaround - Refractor reused code between JsonDatasource and FileDatasource - default to 10000 if zipped file found ## Related issues ray-project#55356 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
peterxcli
pushed a commit
to peterxcli/ray
that referenced
this pull request
Feb 25, 2026
## Description ### Status Quo This PR ray-project#54667 addressed issues of OOM by sampling a few lines of the file. However, this code always assumes the input file is seekable(ie, not compressed). This means zipped files are broken like this issue: ray-project#55356 ### Potential Workaround - Refractor reused code between JsonDatasource and FileDatasource - default to 10000 if zipped file found ## Related issues ray-project#55356 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Status Quo
This PR #54667 addressed issues of OOM by sampling a few lines of the file. However, this code always assumes the input file is seekable(ie, not compressed). This means zipped files are broken like this issue: #55356
Potential Workaround
Related issues
#55356
Additional information