Closed
Labels
P1 (issue that should be fixed within a few weeks), bug (something that is supposed to be working, but isn't), community-backlog, data (Ray Data-related issues), good-first-issue (great starter issue for someone just starting to contribute to Ray), stability
Description
What happened + What you expected to happen
Prepare the test data:
curl --output yahoo_answers_title_answer.jsonl.gz -L https://huggingface.co/datasets/sentence-transformers/embedding-training-data/resolve/main/yahoo_answers_title_answer.jsonl.gz
When I use pandas.read_json, it works fine. However, when I use ray.data.read_json on the same file, it raises an error:
ray.exceptions.RayTaskError(UnicodeDecodeError): ray::ReadPandasJSON->SplitBlocks(100)() (pid=38670, ip=127.0.0.1)
File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 557, in _map_task
for b_out in map_transformer.apply_transform(iter(blocks), ctx):
File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 601, in __call__
for block in blocks:
File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 532, in __call__
for data in iter:
File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 377, in __call__
yield from self._block_fn(input, ctx)
File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/planner/plan_read_op.py", line 103, in do_read
yield from read_task()
File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/datasource/datasource.py", line 179, in __call__
yield from result
File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/datasource/file_based_datasource.py", line 291, in read_task_fn
yield from read_files(read_paths)
File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/datasource/file_based_datasource.py", line 254, in read_files
for block in iterate_with_retry(
File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1486, in iterate_with_retry
raise e from None
File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1467, in iterate_with_retry
for item_index, item in enumerate(iterable):
File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/datasource/json_datasource.py", line 182, in _read_stream
chunksize = self._estimate_chunksize(f)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/ray/data/_internal/datasource/json_datasource.py", line 198, in _estimate_chunksize
df = _cast_range_index_to_string(next(reader))
^^^^^^^^^^^^
File "/Users/admin/Work/ray/venv/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1079, in __next__
lines = list(islice(self.data, self.chunksize))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Versions / Dependencies
Ray 2.48.0
Reproduction script
import pandas as pd
import os
df = pd.read_json(os.path.abspath("yahoo_answers_title_answer.jsonl.gz"), lines=True)
df.head(5)
Output
0 1
0 why doesn't an optical mouse work on a glass t... Optical mice use an LED and a camera to rapidl...
1 What is the best off-road motorcycle trail ? i hear that the mojave road is amazing!<br />\...
2 What is Trans Fat? How to reduce that? Trans fats occur in manufactured foods during ...
3 How many planes Fedex has? according to the www.fedex.com web site:\nAir ...
4 In the san francisco bay area, does it make se... renting vs buying depends on your goals. <br /...
import ray.data
ds = ray.data.read_json(os.path.abspath("yahoo_answers_title_answer.jsonl.gz"), lines=True)
ds.materialize()  # <--------- error
Issue Severity
None
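Note on the likely cause: the byte 0x8b at position 1 in the traceback matches the second byte of the gzip magic number (0x1f 0x8b), which suggests the file is being decoded as UTF-8 text without being decompressed first. The sketch below reproduces the same error message using only the standard library, with no Ray or pandas involved. The two-row sample data is a hypothetical stand-in for the real file, and this illustrates the symptom only; it does not confirm where in Ray's JSON datasource the decompression step is skipped.

```python
import gzip
import io
import json

# Hypothetical two-row stand-in for yahoo_answers_title_answer.jsonl.gz.
rows = [["question one", "answer one"], ["question two", "answer two"]]
raw = "\n".join(json.dumps(row) for row in rows).encode("utf-8")
compressed = gzip.compress(raw)

# Every gzip stream starts with the magic bytes 0x1f 0x8b.
magic = compressed[:2]

# Decoding the still-compressed bytes as UTF-8 fails on the second magic
# byte: 0x1f is valid ASCII, but 0x8b cannot start a UTF-8 sequence.
try:
    compressed.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(decode_error)

# Decompressing before decoding yields the original JSON lines.
with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f]

print(parsed)
```

pandas.read_json infers compression from the .gz extension by default, which would explain why the pandas call succeeds. A possible interim workaround, not verified against Ray 2.48.0, is to load the file with pandas.read_json and hand the resulting frame to ray.data.from_pandas.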