HF_DATASETS_CACHE ignored? #7480

@stephenroller

Description

Describe the bug

I'm struggling to get datasets to respect HF_DATASETS_CACHE.

Rationale: I'm on a system where the home directory is on NFS, so downloading there is expensive, slow, and wastes valuable quota compared to local disk. Despite setting HF_DATASETS_CACHE, datasets seems to rely mostly on HF_HUB_CACHE (under ~/.cache/huggingface/hub).

Current version: 3.2.1dev. I'm in the process of testing 3.4.0.

Steps to reproduce the bug

[Currently writing using datasets 3.2.1dev. Will follow up with 3.4.0 results]

dump.py:

from datasets import load_dataset
dataset = load_dataset("HuggingFaceFW/fineweb", name="sample-100BT", split="train")

Repro steps

# ensure no cache
$ mv ~/.cache/huggingface ~/.cache/huggingface.bak

$ export HF_DATASETS_CACHE=/tmp/roller/datasets
$ rm -rf ${HF_DATASETS_CACHE}
$ env | grep HF | grep -v TOKEN
HF_DATASETS_CACHE=/tmp/roller/datasets

$ python dump.py
# (omitted for brevity)

# (while downloading) 
$ du -hcs ~/.cache/huggingface/hub
18G     hub
18G     total

# (after downloading)
$ du -hcs ~/.cache/huggingface/hub

It's a shame because datasets supports S3 (which I could really use right now) but huggingface_hub does not.
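For context, here is a minimal sketch of how the two cache locations appear to resolve (this approximates the documented default layout; the helper function is mine, not a real API). The point is that HF_DATASETS_CACHE only moves the datasets subdirectory, while raw downloads performed through huggingface_hub still fall back to the default hub cache, which would explain the du output above:

```python
import os

def resolve_cache_dirs(env):
    """Approximate the documented HF cache layout (hypothetical helper).

    HF_HOME is the root; HF_HUB_CACHE and HF_DATASETS_CACHE each override
    only their own subdirectory.
    """
    hf_home = env.get("HF_HOME", os.path.join("~", ".cache", "huggingface"))
    return {
        "hub": env.get("HF_HUB_CACHE", os.path.join(hf_home, "hub")),
        "datasets": env.get("HF_DATASETS_CACHE", os.path.join(hf_home, "datasets")),
    }

# Setting only HF_DATASETS_CACHE leaves the hub cache at its default,
# so raw downloads still land under ~/.cache/huggingface/hub.
dirs = resolve_cache_dirs({"HF_DATASETS_CACHE": "/tmp/roller/datasets"})
print(dirs["hub"])       # ~/.cache/huggingface/hub
print(dirs["datasets"])  # /tmp/roller/datasets
```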

Expected behavior

  • ~/.cache/huggingface/hub stays empty
  • /tmp/roller/datasets becomes full of stuff
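If the immediate goal is just to keep everything off NFS, one workaround sketch (assuming the standard HF_HOME fallback behaviour; the path is illustrative) is to move the whole cache root rather than only the datasets subdirectory:

```shell
# Workaround sketch: relocate the entire HF cache root (hub + datasets)
# to local disk. /tmp/roller/hf is an illustrative path.
export HF_HOME=/tmp/roller/hf
unset HF_HUB_CACHE HF_DATASETS_CACHE

# huggingface_hub falls back to $HF_HOME/hub for raw downloads:
echo "${HF_HUB_CACHE:-$HF_HOME/hub}"

# then run the repro script:
# python dump.py
```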

Environment info

[Currently writing using datasets 3.2.1dev. Will follow up with 3.4.0 results]
