[data] Fix reading from zipped json #58214

Merged
bveeramani merged 8 commits into ray-project:master from iamjustinhsu:jhsu/jsonl-lines-handle-compressed
Oct 29, 2025
@iamjustinhsu
Contributor

Description

Status Quo

PR #54667 addressed OOM issues by sampling a few lines of the file. However, that code always assumes the input file is seekable (i.e., not compressed). As a result, reading zipped files is broken, as reported in issue #55356.

Potential Workaround

  • Refactor the code shared between JsonDatasource and FileDatasource
  • Default to 10000 if a zipped file is found
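
The fallback described above can be illustrated with a minimal sketch. This is not the code from this PR: `estimate_rows_per_chunk`, `NonSeekableStream`, `DEFAULT_FALLBACK`, and the `target_bytes` / `sample_lines` parameters are hypothetical names chosen for illustration, and treating 10000 as a rows-per-chunk value is an assumption. The idea is only that sampling needs to rewind the stream, so a non-seekable (compressed) stream skips sampling and uses the default.

```python
import io

# Hypothetical constant: the PR only says "default to 10000" for zipped files.
DEFAULT_FALLBACK = 10_000


class NonSeekableStream(io.RawIOBase):
    """Stand-in for a decompressing stream that cannot seek (assumption for this sketch)."""

    def __init__(self, data: bytes):
        self._inner = io.BytesIO(data)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, b) -> int:
        return self._inner.readinto(b)


def estimate_rows_per_chunk(stream, target_bytes=1 << 20, sample_lines=5):
    """Sample a few lines to size a chunk, or fall back if the stream cannot seek."""
    if not stream.seekable():
        # Sampling would consume bytes we cannot rewind over, so do not sample.
        return DEFAULT_FALLBACK

    start = stream.tell()
    sizes = []
    for _ in range(sample_lines):
        line = stream.readline()
        if not line:
            break
        sizes.append(len(line))
    stream.seek(start)  # rewind so the real read starts where it began

    if not sizes:
        return DEFAULT_FALLBACK
    avg_line_bytes = sum(sizes) / len(sizes)
    return max(1, int(target_bytes // avg_line_bytes))
```

With a seekable `io.BytesIO`, the function samples and rewinds; with `NonSeekableStream`, it returns the default without reading anything, so no data is lost before the actual read.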

Related issues

#55356


@iamjustinhsu iamjustinhsu requested a review from a team as a code owner October 27, 2025 16:50
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses the issue of reading zipped JSON files in Ray Data, which was broken due to the assumption that all input files are seekable. The changes include refactoring reused code, defaulting to a chunk size of 10000 for zipped files, and handling snappy compression. The review focuses on correctness and maintainability, with suggestions for code improvements and clarity.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/jsonl-lines-handle-compressed branch from e898f51 to 33ce580 Compare October 27, 2025 16:58
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@ray-gardener ray-gardener bot added the data Ray Data-related issues label Oct 27, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu added the go add ONLY when ready to merge, run all tests label Oct 29, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@bveeramani bveeramani merged commit 6dd3776 into ray-project:master Oct 29, 2025
6 checks passed
@iamjustinhsu iamjustinhsu deleted the jhsu/jsonl-lines-handle-compressed branch October 29, 2025 18:38
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Nov 14, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>