[Data] semicolons (param segments) in paths get dropped #57226
Closed
Labels
P2 (Important issue, but not time-critical), bug (Something that is supposed to be working; but isn't), community-backlog, data (Ray Data-related issues), stability
Description
What happened + What you expected to happen
I have a directory with a semicolon in its name: conversations/American Refining Group; and as -- hash/<files>. When I try to read from it with read_binary_files, everything after the semicolon gets dropped, the truncated directory doesn't exist, and I get an error:
  File "/Users/hmlin/Aryn/managed-service/aryn/lib/sycamore/lib/sycamore/sycamore/materialize.py", line 430, in execute
    files = read_binary_files(
            ^^^^^^^^^^^^^^^^^^
  File "/Users/hmlin/Aryn/managed-service/.venv/lib/python3.12/site-packages/ray/data/read_api.py", line 2274, in read_binary_files
    datasource = BinaryDatasource(
                 ^^^^^^^^^^^^^^^^^
  File "/Users/hmlin/Aryn/managed-service/.venv/lib/python3.12/site-packages/ray/data/datasource/file_based_datasource.py", line 153, in __init__
    zip(
  File "/Users/hmlin/Aryn/managed-service/.venv/lib/python3.12/site-packages/ray/data/datasource/file_meta_provider.py", line 179, in expand_paths
    yield from _expand_paths(paths, filesystem, partitioning, ignore_missing_paths)
  File "/Users/hmlin/Aryn/managed-service/.venv/lib/python3.12/site-packages/ray/data/datasource/file_meta_provider.py", line 286, in _expand_paths
    yield from _get_file_infos_serial(paths, filesystem, ignore_missing_paths)
  File "/Users/hmlin/Aryn/managed-service/.venv/lib/python3.12/site-packages/ray/data/datasource/file_meta_provider.py", line 312, in _get_file_infos_serial
    yield from _get_file_infos(path, filesystem, ignore_missing_paths)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hmlin/Aryn/managed-service/.venv/lib/python3.12/site-packages/ray/data/datasource/file_meta_provider.py", line 427, in _get_file_infos
    raise FileNotFoundError(path)
FileNotFoundError: /Users/hmlin/Aryn/managed-service/nvcserver/conversations/American Refining Group
Expected behavior: read the filesystem without dropping part of the path.
I've narrowed it down to here:
ray/python/ray/data/datasource/path_util.py, line 177 in bc9723a:
parsed = urllib.parse.urlparse(path, allow_fragments=False)  # support '#' in path
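For anyone who wants to see the behavior in isolation, here is a minimal sketch using only the standard library (no Ray involved). It assumes the path has no scheme, as in the reproduction below. In that case urlparse treats everything after the first ';' in the last path segment as URL "params" and removes it from .path:

```python
from urllib.parse import urlparse

# Same call as the quoted line, applied to the reproduction path.
parsed = urlparse("abc;def", allow_fragments=False)

print(repr(parsed.path))    # 'abc'
print(repr(parsed.params))  # 'def'

# allow_fragments=False keeps '#' in the path, but ';' is still split off.
hashed = urlparse("abc#def", allow_fragments=False)
print(repr(hashed.path))    # 'abc#def'
```

So any code that rebuilds the path from .path and .query alone loses the ';def' part.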
The fix I want is:
# python/ray/data/datasource/path_util.py
# line 178
+ params = ";" + parsed.params if parsed.params else "" # support ';' in path
...
# line 198
- return netloc + parsed_path + query
+ return netloc + parsed_path + params + query
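As a rough standalone sketch of the same idea: rebuild_local_path below is a hypothetical helper written for illustration, not Ray's actual function. It assumes the path is reassembled from netloc, path, and query, as the return line in the diff suggests, and re-attaches the params segment in the same way:

```python
from urllib.parse import urlparse


def rebuild_local_path(path: str) -> str:
    """Hypothetical helper: reassemble a scheme-less path after urlparse."""
    parsed = urlparse(path, allow_fragments=False)  # keep '#' in the path
    params = ";" + parsed.params if parsed.params else ""  # restore ';' segment
    query = "?" + parsed.query if parsed.query else ""  # restore '?' segment
    return parsed.netloc + parsed.path + params + query


paths = [
    "some/file",
    "some/file;semicolon",
    "some/file?questionmark",
    "some/file#hash",
    "some/file;all?of the#above",
]
for p in paths:
    # Each path should come back unchanged.
    assert rebuild_local_path(p) == p
```

One limitation of this approach: a path that ends in a bare ';' (for example 'abc;') parses to an empty params string, so the trailing semicolon would still be dropped.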
I also wrote tests:
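# python/ray/data/tests/test_path_util.py
import pytest

from ray.data.datasource.path_util import _resolve_paths_and_filesystem


@pytest.mark.parametrize(
    "path",
    [
        "some/file",
        "some/file;semicolon",
        "some/file?questionmark",
        "some/file#hash",
        "some/file;all?of the#above",
    ],
)
def test_weird_local_paths(path):
    # The resolved path should come back unchanged, special characters included.
    resolved_paths, _ = _resolve_paths_and_filesystem(path)
    assert resolved_paths[0] == path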
but I couldn't get my venv set up and working, so I didn't want to open a PR (and I haven't verified that the tests pass).
Versions / Dependencies
Found in
ray: 2.47.1
python: 3.12.11
macos: 26.0.1
Reproduction script
mkdir "abc;def"
python -c "from ray.data import read_binary_files; read_binary_files('abc;def')"
Issue Severity
High: It blocks me from completing my task.