Deepcopy of HfFileSystem fails due to non-picklable HfHubHTTPError response arg #3576

@owenowenisme

Describe the bug

Deep-copying an HfFileSystem that has an error cached in _repo_and_revision_exists_cache fails, because the cached HfHubHTTPError does not implement __reduce_ex__ correctly.

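deepcopy rebuilds the exception by calling the class with only the positional arguments from the reduce tuple, i.e. HfHubHTTPError(str(self)). The dict in the third position of that tuple is treated as state to apply after construction, not as keyword arguments. Since response is a required keyword-only argument with no default value, it is never passed to the constructor, and the call raises:
TypeError: HfHubHTTPError.__init__() missing 1 required keyword-only argument: 'response'

This is fatal whenever we want to serialize an HfFileSystem instance.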
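
The failure can be reproduced without huggingface_hub. The following is a minimal sketch, assuming a hypothetical DemoError class that mirrors the keyword-only constructor and the three-element reduce tuple of HfHubHTTPError; a plain dict stands in for the Response object.

```python
import copy


class DemoError(Exception):
    """Hypothetical stand-in for HfHubHTTPError (not the library class)."""

    def __init__(self, message, *, response):
        self.response = response
        super().__init__(message)

    def __reduce_ex__(self, protocol):
        # Same shape as the reduce tuple in errors.py: (callable, args, state).
        # copy/pickle call callable(*args) first and only then apply the state,
        # so "response" never reaches __init__.
        return (self.__class__, (str(self),), {"response": self.response})


err = DemoError("boom", response={"status": 404})  # dict stands in for Response

try:
    copy.deepcopy(err)
    failure = None
except TypeError as exc:
    failure = str(exc)

print(failure)
```

The TypeError is raised at the constructor call, before the state dict is ever looked at, which matches the last frame of the traceback in the Logs section.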

  • errors.py
class HfHubHTTPError(HTTPError, OSError):

    def __init__(
        self,
        message: str,
        *,
        response: Response,
        server_message: Optional[str] = None,
    ):
        self.request_id = response.headers.get("x-request-id") or response.headers.get("X-Amzn-Trace-Id")
        self.server_message = server_message
        self.response = response
        self.request = response.request
        super().__init__(message)

    def __reduce_ex__(self, protocol):
        """Fix pickling of Exception subclass with kwargs. We need to override __reduce_ex__ of the parent class"""
        return (self.__class__, (str(self),), {"response": self.response, "server_message": self.server_message})
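
One possible direction for a fix, given here as a sketch and not as the library's actual patch: return a module-level helper as the callable in the reduce tuple and let that helper forward the keyword arguments to the constructor. DemoError and _rebuild_error are hypothetical names introduced for illustration, and a plain dict stands in for the Response object.

```python
import copy


def _rebuild_error(cls, message, kwargs):
    """Hypothetical helper: recreate the exception with its keyword-only args."""
    return cls(message, **kwargs)


class DemoError(Exception):
    """Hypothetical stand-in for HfHubHTTPError (not the library class)."""

    def __init__(self, message, *, response, server_message=None):
        self.response = response
        self.server_message = server_message
        super().__init__(message)

    def __reduce_ex__(self, protocol):
        # Two-element reduce tuple: (callable, args). The keyword arguments
        # travel inside args, so they are deep-copied and then passed to __init__.
        kwargs = {"response": self.response, "server_message": self.server_message}
        return (_rebuild_error, (self.__class__, str(self), kwargs))


err = DemoError("boom", response={"status": 404}, server_message="Not Found")
clone = copy.deepcopy(err)
```

Passing the keyword arguments through args, rather than binding them into the callable, means deepcopy copies the response as well instead of sharing it with the original. Whether a real Response should be copied or shared is a design decision this sketch does not settle.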

Reproduction

To keep the repro script minimal, we deepcopy only _repo_and_revision_exists_cache, the same way HfFileSystem does.

# test_hf_cloudpickle_bug.py
from copy import deepcopy
from huggingface_hub import HfFileSystem
from huggingface_hub.utils import RepositoryNotFoundError
from requests import Response, Request

# Mock an error
resp = Response()
resp.status_code = 404
resp.url = "https://huggingface.co/api/datasets/rotten_tomatoes/test.parquet"
resp.request = Request("GET", "https://huggingface.co/api/datasets/rotten_tomatoes/test.parquet")
resp._content = b'{"error": "Repository Not Found"}'

err = RepositoryNotFoundError(
    "404 Client Error. Repository Not Found.",
    response=resp,
    server_message="Repository Not Found",
)

fs = HfFileSystem()
# Simulate the error in cache
fs._repo_and_revision_exists_cache = {
    ("dataset", "rotten_tomatoes/test.parquet", None): (False, err),
}

# Now try to deepcopy the cache: this is exactly what _get_instance_state does.
cache_copy = deepcopy(fs._repo_and_revision_exists_cache)  # <- expected to fail on buggy behavior

Logs

❯ python test_hf.py
Traceback (most recent call last):
  File "/Users/youchenglin/ray/test_hf.py", line 30, in <module>
    cache_copy = deepcopy(fs._repo_and_revision_exists_cache)  # <- expected to fail on buggy behavior
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 211, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 211, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 265, in _reconstruct
    y = func(*args)
TypeError: HfHubHTTPError.__init__() missing 1 required keyword-only argument: 'response'

System info

- huggingface_hub version: 1.1.5
- python 3.10.19

Labels

bug (Something isn't working)