[Data] semicolons (param segments) in paths get dropped #57226

@HenryL27

What happened + What you expected to happen

I have a directory with a semicolon in its name: conversations/American Refining Group; and as -- hash/<files>. When I try to read from it with read_binary_files, everything after the semicolon is dropped, so the resulting directory doesn't exist and I get an error:

 File "/Users/hmlin/Aryn/managed-service/aryn/lib/sycamore/lib/sycamore/sycamore/materialize.py", line 430, in execute
    files = read_binary_files(
            ^^^^^^^^^^^^^^^^^^
  File "/Users/hmlin/Aryn/managed-service/.venv/lib/python3.12/site-packages/ray/data/read_api.py", line 2274, in read_binary_files
    datasource = BinaryDatasource(
                 ^^^^^^^^^^^^^^^^^
  File "/Users/hmlin/Aryn/managed-service/.venv/lib/python3.12/site-packages/ray/data/datasource/file_based_datasource.py", line 153, in __init__
    zip(
  File "/Users/hmlin/Aryn/managed-service/.venv/lib/python3.12/site-packages/ray/data/datasource/file_meta_provider.py", line 179, in expand_paths
    yield from _expand_paths(paths, filesystem, partitioning, ignore_missing_paths)
  File "/Users/hmlin/Aryn/managed-service/.venv/lib/python3.12/site-packages/ray/data/datasource/file_meta_provider.py", line 286, in _expand_paths
    yield from _get_file_infos_serial(paths, filesystem, ignore_missing_paths)
  File "/Users/hmlin/Aryn/managed-service/.venv/lib/python3.12/site-packages/ray/data/datasource/file_meta_provider.py", line 312, in _get_file_infos_serial
    yield from _get_file_infos(path, filesystem, ignore_missing_paths)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hmlin/Aryn/managed-service/.venv/lib/python3.12/site-packages/ray/data/datasource/file_meta_provider.py", line 427, in _get_file_infos
    raise FileNotFoundError(path)
FileNotFoundError: /Users/hmlin/Aryn/managed-service/nvcserver/conversations/American Refining Group

Expected behavior: read the filesystem without dropping part of the path.

I've narrowed it down to here:

parsed = urllib.parse.urlparse(path, allow_fragments=False) # support '#' in path
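For context, here is a minimal standalone sketch (plain urllib, no Ray involved) of why the suffix disappears. For a path with no scheme, urlparse splits a ';' in the last path segment into a separate params field, so anything that only reads parsed.path loses it. The example path mirrors the one in the reproduction script.

```python
from urllib.parse import urlparse

# Same call shape as the line above: '#' stays in the path because
# allow_fragments=False, but ';' is still split off into `params`.
parsed = urlparse("abc;def", allow_fragments=False)

print(repr(parsed.path))    # 'abc'
print(repr(parsed.params))  # 'def'
```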
The fix I want is:

# python/ray/data/datasource/path_util.py
# line 178
+     params = ";" + parsed.params if parsed.params else ""  # support ';' in path
...
# line 198
-     return netloc + parsed_path + query
+     return netloc + parsed_path + params + query
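The same idea as a self-contained sketch outside of Ray. The helper name rebuild_path is hypothetical, and this is not the actual code in path_util.py (which also deals with netloc and other cases); it only shows the params and query pieces being appended back onto the parsed path.

```python
from urllib.parse import urlparse


def rebuild_path(path: str) -> str:
    """Hypothetical helper: parse a scheme-less path and put it back together."""
    parsed = urlparse(path, allow_fragments=False)  # keep '#' in the path
    params = ";" + parsed.params if parsed.params else ""  # re-attach ';' suffix
    query = "?" + parsed.query if parsed.query else ""  # re-attach '?' suffix
    return parsed.path + params + query


# Round-trip check on a few awkward local paths.
for p in ["some/file", "some/file;semicolon", "some/file;all?of the#above"]:
    assert rebuild_path(p) == p
```

One limitation of this approach: a path ending in a bare ';' (for example 'abc;') parses with empty params, so the trailing semicolon would still be dropped.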

I also wrote tests:

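# python/ray/data/tests/test_path_util.py
import pytest

from ray.data.datasource.path_util import _resolve_paths_and_filesystem


@pytest.mark.parametrize(
    "path",
    [
        "some/file",
        "some/file;semicolon",
        "some/file?questionmark",
        "some/file#hash",
        "some/file;all?of the#above",
    ],
)
def test_weird_local_paths(path):
    # Paths containing ';', '?' and '#' should come back from resolution unchanged.
    resolved_paths, _ = _resolve_paths_and_filesystem(path)
    assert resolved_paths[0] == path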

However, I couldn't get my venv set up properly, so I didn't want to open a PR, and I haven't verified that the tests pass.

Versions / Dependencies

Found in
ray: 2.47.1
python: 3.12.11
macos: 26.0.1

Reproduction script

mkdir "abc;def"
python -c "from ray.data import read_binary_files; read_binary_files('abc;def')"

Issue Severity

High: It blocks me from completing my task.

Metadata

    Labels

    P2: Important issue, but not time-critical
    bug: Something that is supposed to be working; but isn't
    community-backlog
    data: Ray Data-related issues
    stability
