Describe the bug
The revision semantics of datasets.load_dataset are inconsistent when the dataset is not found on the Hugging Face Hub. When load_dataset falls back to the latest cached version of the dataset, the revision argument is ignored, as long as any cached version of the dataset already exists in the HF cache.
Steps to reproduce the bug
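import datasets

# First call: a valid revision, which populates the HF cache.
datasets.load_dataset(
    "sentientfutures/ahb",
    "dimensions",
    split="train",
    revision="main",
)

# Second call: would expect some error to raise here,
# but the revision is ignored and the cached version is used.
datasets.load_dataset(
    "sentientfutures/ahb",
    "dimensions",
    split="train",
    revision="invalid_revision",
)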
Expected behavior
The second call to datasets.load_dataset in the 'steps to reproduce the bug' example should raise something like:
raise DatasetNotFoundError(
datasets.exceptions.DatasetNotFoundError: Revision 'invalid_revision' doesn't exist for dataset 'sentientfutures/ahb' on the Hub.
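Until this is resolved, one possible workaround is to validate the revision against the Hub before loading. The sketch below is only an illustration: load_dataset_strict and its revision_checker and loader parameters are hypothetical names, not part of datasets. It assumes that huggingface_hub.HfApi().dataset_info raises an error (typically RevisionNotFoundError) for an unknown revision. It also fails when the Hub is unreachable, so it is stricter than the cache fallback.

```python
from typing import Any, Callable, Optional


def load_dataset_strict(
    path: str,
    *args: Any,
    revision: Optional[str] = None,
    revision_checker: Optional[Callable[..., Any]] = None,
    loader: Optional[Callable[..., Any]] = None,
    **kwargs: Any,
) -> Any:
    """Check that `revision` resolves on the Hub, then load the dataset.

    Hypothetical helper, not part of `datasets`. `revision_checker` and
    `loader` are injectable so the logic can be tried without network access.
    """
    if revision_checker is None:
        # Assumption: dataset_info raises for a revision that does not exist.
        from huggingface_hub import HfApi

        revision_checker = HfApi().dataset_info
    if loader is None:
        import datasets

        loader = datasets.load_dataset

    # Raises before load_dataset gets a chance to fall back to the cache.
    revision_checker(path, revision=revision)
    return loader(path, *args, revision=revision, **kwargs)
```

With this helper, load_dataset_strict("sentientfutures/ahb", "dimensions", split="train", revision="invalid_revision") should fail at the revision check instead of silently returning a cached dataset.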
Environment info
- datasets version: 4.4.1
- Platform: Linux-6.2.0-39-generic-x86_64-with-glibc2.37
- Python version: 3.12.12
- huggingface_hub version: 0.36.0
- PyArrow version: 22.0.0
- Pandas version: 2.2.3
- fsspec version: 2025.9.0