[Data] read_json _estimate_chunksize raises an error #55356

@codingl2k1

What happened + What you expected to happen

Prepare the test data: curl --output yahoo_answers_title_answer.jsonl.gz -L https://huggingface.co/datasets/sentence-transformers/embedding-training-data/resolve/main/yahoo_answers_title_answer.jsonl.gz

pandas.read_json reads this file without problems. However, ray.data.read_json raises an error on the same file:

ray.exceptions.RayTaskError(UnicodeDecodeError): ray::ReadPandasJSON->SplitBlocks(100)() (pid=38670, ip=127.0.0.1)
  File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 557, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 601, in __call__
    for block in blocks:
  File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 532, in __call__
    for data in iter:
  File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 377, in __call__
    yield from self._block_fn(input, ctx)
  File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/planner/plan_read_op.py", line 103, in do_read
    yield from read_task()
  File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/datasource/datasource.py", line 179, in __call__
    yield from result
  File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/datasource/file_based_datasource.py", line 291, in read_task_fn
    yield from read_files(read_paths)
  File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/datasource/file_based_datasource.py", line 254, in read_files
    for block in iterate_with_retry(
  File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1486, in iterate_with_retry
    raise e from None
  File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1467, in iterate_with_retry
    for item_index, item in enumerate(iterable):
  File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/datasource/json_datasource.py", line 182, in _read_stream
    chunksize = self._estimate_chunksize(f)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/datasource/json_datasource.py", line 198, in _estimate_chunksize
    df = _cast_range_index_to_string(next(reader))
                                     ^^^^^^^^^^^^
  File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1079, in __next__
    lines = list(islice(self.data, self.chunksize))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
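A possible reading of the traceback, not confirmed by the maintainers: byte 0x8b at position 1 matches the second byte of the gzip magic number (0x1f 0x8b), which suggests that the chunk size estimation is decoding the still-compressed stream as UTF-8. The sketch below uses only the standard library and made-up rows to show that decoding gzip bytes directly produces the same message, while decompressing first does not. It does not call Ray or pandas.

```python
import gzip
import json

# Tiny illustrative JSON Lines payload (made-up rows, not the real dataset).
rows = [["question 1", "answer 1"], ["question 2", "answer 2"]]
raw = "\n".join(json.dumps(r) for r in rows).encode("utf-8")
compressed = gzip.compress(raw)

# Every gzip stream starts with the magic bytes 0x1f 0x8b.
magic = compressed[:2]

# Decoding the compressed bytes as UTF-8 fails on the second byte (0x8b).
try:
    compressed.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = str(exc)

# Decompressing first yields valid UTF-8 JSON Lines.
text = gzip.decompress(compressed).decode("utf-8")
decoded_rows = [json.loads(line) for line in text.splitlines()]

print(magic.hex())
print(decode_error)
print(decoded_rows)
```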

Versions / Dependencies

Ray 2.48.0

Reproduction script

import pandas as pd
import os
df = pd.read_json(os.path.abspath("yahoo_answers_title_answer.jsonl.gz"), lines=True)
df.head(5)

Output

                                                   0                                                  1
0  why doesn't an optical mouse work on a glass t...  Optical mice use an LED and a camera to rapidl...
1       What is the best off-road motorcycle trail ?  i hear that the mojave road is amazing!<br />\...
2             What is Trans Fat? How to reduce that?  Trans fats occur in manufactured foods during ...
3                         How many planes Fedex has?  according to the www.fedex.com web site:\nAir ...
4  In the san francisco bay area, does it make se...  renting vs buying depends on your goals. <br /...
import ray.data
ds = ray.data.read_json(os.path.abspath("yahoo_answers_title_answer.jsonl.gz"), lines=True)
ds.materialize()  # <--------- error
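Until this is fixed, one possible workaround is to decompress the file before handing it to Ray. The helper below is a minimal sketch: `gunzip_to_plain_file` is a hypothetical name, not part of Ray or pandas, and I have not verified that ray.data.read_json then reads the plain file successfully.

```python
import gzip
import shutil


def gunzip_to_plain_file(src: str, dst: str) -> str:
    """Stream-decompress the gzip file `src` into `dst` and return `dst`.

    Hypothetical helper for illustration; copies in chunks so the whole
    file is never held in memory.
    """
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


# Intended usage with the file from this report (untested assumption):
# plain = gunzip_to_plain_file(
#     "yahoo_answers_title_answer.jsonl.gz",
#     "yahoo_answers_title_answer.jsonl",
# )
# ds = ray.data.read_json(plain, lines=True)
```

This costs extra disk space for the decompressed copy, so it is a stopgap rather than a fix for the reader itself.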

Issue Severity

None

Labels

P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working; but isn't
community-backlog
data: Ray Data-related issues
good-first-issue: Great starter issue for someone just starting to contribute to Ray
stability
