[URIs] Parse web URLs in parse_hf_uri + add HfUri.to_url #4296
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
…ckets as /tree
- Percent-decode path segments when parsing web URLs so file names with spaces, '#', ... resolve correctly (revision was already decoded, path was not).
- to_url() now percent-encodes the path (inverse of parsing), keeping '/' as separator.
- Remove 'media' from recognized URL routes (not a Hub location we want to parse).
- to_url() renders bucket files via the '/tree/<path>' route instead of '/resolve/<path>'.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 73dc917. Configure here.
    url = f"{base}/buckets/{self.id}"
    if path:
        url += f"/tree/{path}"
    return url
Bucket to_url uses /tree/ instead of /resolve/ for files
Low Severity
to_url() for buckets always generates the /tree/ route for paths, but the documentation table (lines 120–122) distinguishes /resolve/<path> for bucket files from /tree/<path> for bucket folders. The PR description also explicitly states to_url() should use "the download route (/resolve/<path>) for bucket files." Using /tree/ for a file like data.bin means the browser link points to the folder viewer rather than the download/raw-content route, which is semantically incorrect for file paths.
This is intentional: bucket to_url() renders the /tree/<path> route, not /resolve/<path>. The repo docs (lines 148-149) and the to_url tests already reflect /tree/ for bucket files. The /resolve/ entry in the parsing table is only about accepted input URLs (the parser accepts both /resolve/ and /tree/ for buckets), not about what to_url() emits. Fixed the stale /resolve/<path> mention in the PR description.
Note: I'll open a follow-up PR to deprecate
hanouticelina left a comment
very tiny nits otherwise looks good to me! 👌
Co-authored-by: célina <hanouticelina@gmail.com>
This PR has been shipped as part of the v1.18.0 release.
Make _parse_bucket_uri accept Hugging Face web URLs (e.g. https://huggingface.co/buckets/namespace/name) in addition to hf:// URIs and plain namespace/name paths, leveraging URL parsing from #4296. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


Now that we have a single centralized parser for hf:// URIs (parse_hf_uri), this PR teaches it to also understand the web URLs you copy-paste from the website, so pasting a link straight into the CLI or the library "just works". It also adds the reverse direction: HfUri.to_url().

This also allows using URLs in the CLI.
Summary
- parse_hf_uri(...) now accepts Hugging Face web URLs (with or without a scheme; both http:// and https://) and normalizes them to the canonical hf:// form before reusing the exact same parsing/validation path.
- Recognized hosts: huggingface.co, hf.co, the staging host, and the host of a custom HF_ENDPOINT. Query strings (?download=true) and fragments (#L10) are dropped.
- HfUri.to_url(endpoint=None): renders the browsable web URL (the inverse of parsing one).
- is_hf_uri(...) benefits for free (returns True for recognized URLs, still False for local paths).

Supported URL formats
Only the URL path is shown (the part after the host, e.g. huggingface.co); the recognized host is implied.

- /<ns>/<name> (optionally prefixed with spaces/, kernels/, or models/)
- /datasets/<ns>/<name>
- /<ns>/<name>/tree/<rev>[/<path>]
- /<ns>/<name>/blob/<rev>/<path>
- /<ns>/<name>/resolve/<rev>/<path>
- /<ns>/<name>/raw/<rev>/<path>
- /<ns>/<name>/blame/<rev>/<path>
- /buckets/<ns>/<name>
- /buckets/<ns>/<name>/resolve/<path>
- /buckets/<ns>/<name>/tree/<path>

Special refs (refs/pr/N, refs/convert/...) are matched eagerly even though they contain /; other branch/tag names with / must be URL-encoded (feature%2Ffoo).

Deliberately rejected (ambiguous / not a Hub location)
Single-segment URLs (user/org pages, canonical repos like /gpt2), listing pages (/datasets), commit/commits/discussions/settings routes, /collections/..., and non-HF hosts. As soon as a URL is ambiguous, we don't parse it.

Examples
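A minimal, self-contained sketch of the parsing rules described above may help when reading the format list. It does not call huggingface_hub and is not the PR's implementation: `sketch_parse_web_url`, `ParsedUrl`, and the constant names are hypothetical, only the two public hosts are recognized, and the result is a plain dataclass rather than the library's `HfUri`.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import unquote, urlsplit

# Assumption: staging and custom HF_ENDPOINT hosts are left out of this sketch.
RECOGNIZED_HOSTS = {"huggingface.co", "hf.co"}
TYPE_PREFIXES = {"datasets": "dataset", "spaces": "space", "kernels": "kernel", "models": "model"}
REPO_ROUTES = {"tree", "blob", "resolve", "raw", "blame"}
BUCKET_ROUTES = {"tree", "resolve"}


@dataclass(frozen=True)
class ParsedUrl:  # hypothetical stand-in, not the library's HfUri
    kind: str
    id: str
    revision: Optional[str] = None
    path: Optional[str] = None


def _join_decoded(segments):
    """Percent-decode each segment, then join with '/'; None when empty."""
    return "/".join(unquote(s) for s in segments) or None


def sketch_parse_web_url(url: str) -> ParsedUrl:
    if "://" not in url:
        url = "https://" + url  # accept scheme-less links
    parts = urlsplit(url)  # query string and fragment are split off and ignored
    if parts.scheme not in ("http", "https") or parts.hostname not in RECOGNIZED_HOSTS:
        raise ValueError(f"not a recognized Hub URL: {url!r}")
    # Split before decoding so an encoded '/' (%2F) stays inside its segment.
    segments = parts.path.strip("/").split("/")
    if "" in segments:
        raise ValueError("empty path segment")
    head = segments[0]
    if head == "collections":
        raise ValueError("unsupported route")
    if head == "buckets":
        kind, rest = "bucket", segments[1:]
    elif head in TYPE_PREFIXES:
        kind, rest = TYPE_PREFIXES[head], segments[1:]
    else:
        kind, rest = "model", segments
    if len(rest) < 2:  # user/org pages, canonical repos, listing pages
        raise ValueError("ambiguous URL: expected <ns>/<name>")
    repo_id, tail = f"{rest[0]}/{rest[1]}", rest[2:]
    if not tail:
        return ParsedUrl(kind, repo_id)  # landing page
    route, tail = tail[0], tail[1:]
    if kind == "bucket":  # buckets are not versioned: no revision segment
        if route not in BUCKET_ROUTES:
            raise ValueError("unsupported route")
        return ParsedUrl(kind, repo_id, path=_join_decoded(tail))
    if route not in REPO_ROUTES or not tail:
        raise ValueError("unsupported route")
    if tail[0] == "refs" and len(tail) >= 3 and tail[1] in ("pr", "convert"):
        # Special refs span three segments and are matched eagerly.
        revision, tail = "/".join(tail[:3]), tail[3:]
    else:
        revision, tail = unquote(tail[0]), tail[1:]
    if route != "tree" and not tail:
        raise ValueError("this route requires a file path")
    return ParsedUrl(kind, repo_id, revision, _join_decoded(tail))
```

With this sketch, `sketch_parse_web_url("hf.co/ns/name/blob/refs/pr/3/README.md")` yields revision `refs/pr/3` and path `README.md`, while `/gpt2`, `/datasets`, and `/datasets/squad/blob/main/file.json` all raise `ValueError`.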
And the reverse direction:
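As a rough illustration of the rendering rules, here is a standalone sketch that builds the URL from explicit parts. It is not the library's `HfUri.to_url()`: the function name `sketch_to_url`, its parameters, and the `URL_PREFIXES` table are hypothetical, and the way revisions are encoded is an assumption of this sketch.

```python
from typing import Optional
from urllib.parse import quote

# Hypothetical mapping from repo kind to URL prefix (models have none).
URL_PREFIXES = {"model": "", "dataset": "datasets/", "space": "spaces/", "kernel": "kernels/"}


def sketch_to_url(
    kind: str,
    repo_id: str,
    revision: Optional[str] = None,
    path: Optional[str] = None,
    endpoint: str = "https://huggingface.co",
) -> str:
    base = endpoint.rstrip("/")
    # Percent-encode the path but keep '/' as the separator.
    encoded_path = quote(path, safe="/") if path else None
    if kind == "bucket":
        # Buckets are not versioned: landing page, or the /tree/<path> route.
        url = f"{base}/buckets/{repo_id}"
        return f"{url}/tree/{encoded_path}" if encoded_path else url
    url = f"{base}/{URL_PREFIXES[kind]}{repo_id}"
    if not revision and not encoded_path:
        return url  # repo landing page
    rev = revision or "main"  # revision defaults to main for repo files
    # Assumption: special refs stay readable; '/' in other names is encoded.
    encoded_rev = rev if rev.startswith("refs/") else quote(rev, safe="")
    if not encoded_path:
        return f"{url}/tree/{encoded_rev}"  # folder viewer
    return f"{url}/blob/{encoded_rev}/{encoded_path}"  # file viewer
```

For example, `sketch_to_url("bucket", "ns/name", path="data.bin")` gives `https://huggingface.co/buckets/ns/name/tree/data.bin`, matching the `/tree/` choice discussed in the review thread.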
to_url() points at the repo/bucket landing page when there's no path or revision, the folder viewer (/tree/<rev>) when only a revision is set, the file viewer (/blob/<rev>/<path>, revision defaulting to main) for repo files, and the tree route (/tree/<path>) for bucket files (buckets are not versioned).

Notes / decisions
- URL parsing works by building an hf:// body and feeding it back through the existing parser, so all validation (repo id, revision splitting, empty-segment checks, special refs) is shared between the two entry points, with no duplicated logic.
- … server.ts).
- .../datasets/squad/blob/main/file.json reads squad/blob as the id and is then rejected ("unsupported route"). Still rejected: canonical repos aren't supported anyway.

Tests & docs
- Added URL_SUCCESS_CASES, URL_FAILURE_CASES and TO_URL_CASES (+ tests) to tests/test_utils_hf_uris.py.
- Updated docs/source/en/package_reference/hf_uris.md with "Web URLs" (supported + rejected tables) and "Rendering a web URL" sections.

🤖 Generated with Claude Code
Note
Low Risk
Pure string parsing with no network or auth changes; ambiguous URLs are explicitly rejected, though any code that relied on parse_hf_uri rejecting web URLs will now see them accepted.

Overview
Extends parse_hf_uri so copy-pasted Hugging Face web URLs (and scheme-less huggingface.co/... links) normalize into the same hf:// parsing path as before, covering repo/bucket landing pages and blob/resolve/raw/tree/blame routes while rejecting ambiguous or non-location URLs (profiles, settings, collections, wrong hosts).

Adds HfUri.to_url() to render browsable Hub URLs (inverse of URL parsing), HF_URL_HOSTS in constants for huggingface.co, hf.co, staging, and custom HF_ENDPOINT, and updates is_hf_uri to treat recognized web URLs as valid. Docs and parametrized tests cover URL success/failure cases and to_url round-trips.

Reviewed by Cursor Bugbot for commit be42b0c. Bugbot is set up for automated code reviews on this repo. Configure here.