[URIs] Parse web URLs in parse_hf_uri + add HfUri.to_url #4296

Merged
Wauplin merged 10 commits into main from parse-hf-uri-from-url
Jun 5, 2026

@Wauplin Wauplin commented May 29, 2026

Collaborator

Now that we have a single centralized parser for hf:// URIs (parse_hf_uri), this PR teaches it to also understand the web URLs you copy-paste from the website, so pasting a link straight into the CLI or the library "just works". It also adds the reverse direction: HfUri.to_url().

Allows using URLs in the CLI:

$ hf cp https://huggingface.co/nvidia/LocateAnything-3B/blob/main/config.json - | jq '.architectures'
[
  "LocateAnythingForConditionalGeneration"
]

Summary

  • parse_hf_uri(...) now accepts Hugging Face web URLs (http://, https://, or no scheme at all) and normalizes them to the canonical hf:// form before reusing the exact same parsing/validation path.
  • Recognized hosts: huggingface.co, hf.co, the staging host, and the host of a custom HF_ENDPOINT. Query strings (?download=true) and fragments (#L10) are dropped.
  • Added HfUri.to_url(endpoint=None) — renders the browsable web URL (the inverse of parsing one).
  • is_hf_uri(...) picks this up for free: it returns True for recognized URLs and still returns False for local paths.
  • The parser stays strict: ambiguous or non-location URLs are rejected, never guessed.
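As a rough illustration of the host check and the query/fragment handling described in the summary, here is a minimal standard-library sketch. `KNOWN_HOSTS` and `split_hub_url` are hypothetical names, not the PR's code; the real host list lives in HF_URL_HOSTS and also covers the staging host and a custom HF_ENDPOINT.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical host list for illustration only (the PR keeps the real one in HF_URL_HOSTS).
KNOWN_HOSTS = {"huggingface.co", "hf.co"}


def split_hub_url(url: str) -> Optional[str]:
    """Return the path of a recognized Hub web URL, or None if it is not one."""
    # Scheme-less input such as "huggingface.co/..." has no netloc, so add a scheme first.
    if "://" not in url:
        url = "https://" + url
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return None
    if parts.hostname not in KNOWN_HOSTS:
        return None
    # parts.path excludes the query string and the fragment, so both are dropped here.
    return parts.path


print(split_hub_url("https://huggingface.co/openai/gpt-oss-20b?download=true#L10"))
print(split_hub_url("hf.co/datasets/foo/bar"))
print(split_hub_url("https://example.com/foo/bar"))
```

Returning None for unknown hosts, instead of guessing, mirrors the strictness the summary asks for; the real parser may report the failure differently.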

Supported URL formats

Only the URL path is shown (the part after the host, e.g. huggingface.co); the recognized host is implied.

| Points at | URL path |
| --- | --- |
| Model repo | /<ns>/<name> |
| That repo type (also spaces/, kernels/, models/) | /datasets/<ns>/<name> |
| Folder at <rev> | /<ns>/<name>/tree/<rev>[/<path>] |
| File (viewer) | /<ns>/<name>/blob/<rev>/<path> |
| File (download) | /<ns>/<name>/resolve/<rev>/<path> |
| File (raw) | /<ns>/<name>/raw/<rev>/<path> |
| File (blame) | /<ns>/<name>/blame/<rev>/<path> |
| Bucket | /buckets/<ns>/<name> |
| Bucket file (no revision) | /buckets/<ns>/<name>/resolve/<path> |
| Bucket folder | /buckets/<ns>/<name>/tree/<path> |

Special refs (refs/pr/N, refs/convert/...) are matched eagerly even though they contain /; other branch/tag names with / must be URL-encoded (feature%2Ffoo).
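The revision/path split could look roughly like the sketch below. `SPECIAL_REF` and `split_revision` are hypothetical names, and the pattern only covers the two special-ref shapes named above (refs/pr/N and refs/convert/<name>); the real parser may recognize more.

```python
import re
from typing import Tuple
from urllib.parse import unquote

# Assumed pattern: "refs/pr/<N>" or "refs/convert/<name>", optionally followed by a path.
SPECIAL_REF = re.compile(r"^(refs/(?:pr/\d+|convert/[^/]+))(?:/(.*))?$")


def split_revision(rest: str) -> Tuple[str, str]:
    """Split '<rev>/<path>' (the part after blob/, resolve/, ...) into (revision, path)."""
    match = SPECIAL_REF.match(rest)
    if match:
        # Special refs are taken whole even though they contain "/".
        return match.group(1), unquote(match.group(2) or "")
    # Any other revision is a single segment, so a "/" inside it must arrive as %2F.
    revision, _, path = rest.partition("/")
    return unquote(revision), unquote(path)


print(split_revision("refs/pr/3/data/train.csv"))
print(split_revision("feature%2Ffoo/README.md"))
```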

Deliberately rejected (ambiguous / not a Hub location)

Single-segment URLs (user/org pages, canonical repos like /gpt2), listing pages (/datasets), commit/commits/discussions/settings routes, /collections/..., and non-HF hosts. As soon as a URL is ambiguous, we don't parse it.

Examples

>>> from huggingface_hub import parse_hf_uri

>>> parse_hf_uri("https://huggingface.co/openai/gpt-oss-20b")
HfUri(type='model', id='openai/gpt-oss-20b', revision=None, path_in_repo='')

>>> parse_hf_uri("https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/README.md")
HfUri(type='dataset', id='HuggingFaceFW/fineweb', revision='main', path_in_repo='README.md')

>>> parse_hf_uri("https://huggingface.co/datasets/foo/bar/resolve/refs/pr/3/data/train.csv")
HfUri(type='dataset', id='foo/bar', revision='refs/pr/3', path_in_repo='data/train.csv')

>>> parse_hf_uri("https://huggingface.co/buckets/my-org/my-bucket/resolve/checkpoints/model.safetensors")
HfUri(type='bucket', id='my-org/my-bucket', revision=None, path_in_repo='checkpoints/model.safetensors')

And the reverse direction:

>>> parse_hf_uri("hf://datasets/my-org/my-dataset@v1/train.csv").to_url()
'https://huggingface.co/datasets/my-org/my-dataset/blob/v1/train.csv'

to_url() points at the repo/bucket landing page when there's no path or revision, the folder viewer (/tree/<rev>) when only a revision is set, the file viewer (/blob/<rev>/<path>, revision defaulting to main) for repo files, and the tree route (/tree/<path>) for bucket files (buckets are not versioned).
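The routing rules in the paragraph above can be sketched as a standalone function. This is not the library's implementation: `render_url` and `URL_PREFIXES` are hypothetical, the 'space' entry is an assumption, and percent-encoding is left out to keep the branching visible.

```python
from typing import Optional

# Assumed mapping from type names to URL prefixes (model repos have no prefix).
URL_PREFIXES = {"model": "", "dataset": "datasets/", "space": "spaces/", "bucket": "buckets/"}


def render_url(
    repo_type: str,
    repo_id: str,
    revision: Optional[str] = None,
    path: str = "",
    endpoint: str = "https://huggingface.co",
) -> str:
    base = f"{endpoint}/{URL_PREFIXES[repo_type]}{repo_id}"
    if repo_type == "bucket":
        # Buckets are not versioned: landing page, or the tree route for a path.
        return f"{base}/tree/{path}" if path else base
    if path:
        # Repo file: file viewer, with the revision defaulting to "main".
        return f"{base}/blob/{revision or 'main'}/{path}"
    if revision:
        # Revision only: folder viewer at that revision.
        return f"{base}/tree/{revision}"
    # Neither path nor revision: repo landing page.
    return base


print(render_url("dataset", "my-org/my-dataset", revision="v1", path="train.csv"))
```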

Notes / decisions

  • I implemented URL support by normalizing URLs into the canonical hf:// body and feeding it back through the existing parser, so all validation (repo id, revision splitting, empty-segment checks, special refs) is shared between the two entry points — no duplicated logic.
  • URL formats were cross-checked against the Hub server router (moon-landing server.ts).
  • Edge case: a canonical-repo file URL like .../datasets/squad/blob/main/file.json is read with squad/blob as the id and then rejected ("unsupported route"). Rejection is still the right outcome, since canonical repos aren't supported anyway.
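To make the "normalize, then reuse the existing parser" idea concrete, here is a simplified sketch of turning a URL path into an hf:// string. `url_path_to_hf_uri` and the three constants are hypothetical; writing model repos without a prefix in the hf:// form is an assumption, and the real normalizer checks more routes and reserved names than this one does.

```python
TYPE_PREFIXES = ("datasets", "spaces", "kernels", "models", "buckets")
FILE_ROUTES = ("blob", "resolve", "raw", "tree", "blame")
RESERVED = ("collections",)  # assumed subset of non-repo namespaces


def url_path_to_hf_uri(path: str) -> str:
    """Map '/datasets/<ns>/<name>/blob/<rev>/<path>' to 'hf://datasets/<ns>/<name>@<rev>/<path>'."""
    segments = path.strip("/").split("/")
    prefix = ""
    if segments[0] in TYPE_PREFIXES:
        prefix = segments.pop(0) + "/"
    # Single-segment URLs, listing pages and empty segments are rejected, never guessed.
    if len(segments) < 2 or "" in segments or segments[0] in RESERVED:
        raise ValueError(f"Ambiguous or unsupported URL path: {path!r}")
    repo_id = "/".join(segments[:2])
    rest = segments[2:]
    if not rest:
        return f"hf://{prefix}{repo_id}"
    route, tail = rest[0], "/".join(rest[1:])
    if route not in FILE_ROUTES or not tail:
        raise ValueError(f"Unsupported route in URL path: {path!r}")
    if prefix == "buckets/":
        # Buckets have no revision, so the route segment is simply dropped.
        return f"hf://{prefix}{repo_id}/{tail}"
    # '<rev>/<path>' is passed through after '@'; the shared parser splits it later.
    return f"hf://{prefix}{repo_id}@{tail}"


print(url_path_to_hf_uri("/datasets/my-org/my-dataset/blob/v1/train.csv"))
```

With this shape, the canonical-repo edge case falls out naturally: /datasets/squad/blob/main/file.json yields the id squad/blob, and "main" is then not a known route, so the call raises.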

Tests & docs

  • Added URL_SUCCESS_CASES, URL_FAILURE_CASES and TO_URL_CASES (+ tests) to tests/test_utils_hf_uris.py.
  • Updated docs/source/en/package_reference/hf_uris.md with "Web URLs" (supported + rejected tables) and "Rendering a web URL" sections.

🤖 Generated with Claude Code


Note

Low Risk
Pure string parsing with no network or auth changes; ambiguous URLs are explicitly rejected, though any code that relied on web URLs failing parse_hf_uri will now accept them.

Overview
Extends parse_hf_uri so copy-pasted Hugging Face web URLs (and scheme-less huggingface.co/... links) normalize into the same hf:// parsing path as before, covering repo/bucket landing pages and blob/resolve/raw/tree/blame routes while rejecting ambiguous or non-location URLs (profiles, settings, collections, wrong hosts).

Adds HfUri.to_url() to render browsable Hub URLs (inverse of URL parsing), HF_URL_HOSTS in constants for huggingface.co, hf.co, staging, and custom HF_ENDPOINT, and updates is_hf_uri to treat recognized web URLs as valid. Docs and parametrized tests cover URL success/failure cases and to_url round-trips.

Reviewed by Cursor Bugbot for commit be42b0c. Bugbot is set up for automated code reviews on this repo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread docs/source/en/package_reference/hf_uris.md Outdated
@bot-ci-comment

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/huggingface_hub/utils/_hf_uris.py Outdated

@julien-c julien-c left a comment

Member

(cc @mishig25 for visibility as we discussed it)

…ckets as /tree

- Percent-decode path segments when parsing web URLs so file names with spaces,
  '#', ... resolve correctly (revision was already decoded, path was not).
- to_url() now percent-encodes the path (inverse of parsing), keeping '/' as separator.
- Remove 'media' from recognized URL routes (not a Hub location we want to parse).
- to_url() renders bucket files via the '/tree/<path>' route instead of '/resolve/<path>'.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
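The percent-encoding round trip described in this commit can be illustrated with the standard library alone. `encode_path` and `decode_path` are hypothetical helpers and only show the idea of keeping '/' as a separator; they are not the functions used in _hf_uris.py.

```python
from urllib.parse import quote, unquote


def encode_path(path_in_repo: str) -> str:
    # quote() leaves "/" unescaped by default (safe="/"), so it stays a separator.
    return quote(path_in_repo)


def decode_path(url_path: str) -> str:
    # Inverse operation: turn %20, %23, ... back into the original characters.
    return unquote(url_path)


encoded = encode_path("my folder/file #1.txt")
print(encoded)
print(decode_path(encoded))
```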
Comment thread src/huggingface_hub/utils/_hf_uris.py

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 73dc917.

url = f"{base}/buckets/{self.id}"
if path:
    url += f"/tree/{path}"
return url

Bucket to_url uses /tree/ instead of /resolve/ for files

Low Severity

to_url() for buckets always generates the /tree/ route for paths, but the documentation table (lines 120–122) distinguishes /resolve/<path> for bucket files from /tree/<path> for bucket folders. The PR description also explicitly states to_url() should use "the download route (/resolve/<path>) for bucket files." Using /tree/ for a file like data.bin means the browser link points to the folder viewer rather than the download/raw-content route, which is semantically incorrect for file paths.

Collaborator Author

This is intentional: bucket to_url() renders the /tree/<path> route, not /resolve/<path>. The repo docs (lines 148-149) and the to_url tests already reflect /tree/ for bucket files. The /resolve/ entry in the parsing table is only about accepted input URLs (the parser accepts both /resolve/ and /tree/ for buckets), not about what to_url() emits. Fixed the stale /resolve/<path> mention in the PR description.

@Wauplin Wauplin marked this pull request as ready for review June 2, 2026 10:10
@Wauplin Wauplin requested a review from hanouticelina June 2, 2026 10:10
@Wauplin

Wauplin commented Jun 2, 2026

Collaborator Author

Note: I'll open a follow-up PR to deprecate repo_type_and_id_from_hf_id in favor of parse_hf_uri now that it also handles URLs. In practice, we will stop using repo_type_and_id_from_hf_id inside huggingface_hub but keep it publicly exposed, as some other libraries use it.

@hanouticelina hanouticelina left a comment

Collaborator

very tiny nits otherwise looks good to me! 👌

Comment thread src/huggingface_hub/utils/_hf_uris.py Outdated
Comment thread src/huggingface_hub/utils/_hf_uris.py Outdated
Wauplin and others added 2 commits June 5, 2026 10:54
Co-authored-by: célina <hanouticelina@gmail.com>
@Wauplin Wauplin merged commit d04c3b2 into main Jun 5, 2026
24 of 26 checks passed
@Wauplin Wauplin deleted the parse-hf-uri-from-url branch June 5, 2026 08:57
@huggingface-hub-bot

Contributor

This PR has been shipped as part of the v1.18.0 release.

Wauplin added a commit that referenced this pull request Jun 9, 2026
Make _parse_bucket_uri accept Hugging Face web URLs (e.g.
https://huggingface.co/buckets/namespace/name) in addition to hf://
URIs and plain namespace/name paths, leveraging URL parsing from #4296.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>