Code reorganization towards release of xet cargo package #693

Merged
hoytak merged 2 commits into main from hoytak/260307-xet-package on Mar 11, 2026

@hoytak (Collaborator) commented Mar 10, 2026

This PR is a massive rearrangement of the code base into 5 packages intended for release on cargo. The directories and corresponding packages are:

  1. xet_runtime/: compiles into the xet-runtime package. Contains the runtime, config, and logging management.
  2. xet_core_structures/: compiles into the xet-core-structures package. Contains core data structures for hashing, shards, and xorbs, as well as internal data structures that depend on these.
  3. xet_client/: compiles into the xet-client package. Contains client code for connecting remotely to the Hugging Face servers.
  4. xet_data/: compiles into the xet-data package. Contains the data processing pipeline: chunking/deduplication, file reconstruction, clean/smudge operations, and progress tracking.
  5. xet_pkg/: compiles into the hf-xet package. Provides the top-level session-based API for file upload and download, with user-facing error categorization. This is the primary package that downstream dependencies would use. It also contains a single summary error type, XetError, that translates cleanly into Python error types.
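
For readers unfamiliar with the "single summary error type" pattern, here is a minimal sketch of how a top-level error can fold lower-layer errors into a few user-facing categories, each paired with a built-in Python exception name. Everything in it is an assumption made for illustration: the variant names, the `LowerLayerError` type, the `python_exception` helper, and the chosen exception names are hypothetical, and the actual XetError in xet_pkg/ may look quite different.

```rust
use std::fmt;
use std::io;

/// Hypothetical error from a lower layer (stands in for, e.g., a client error).
#[derive(Debug)]
pub enum LowerLayerError {
    Unauthorized,
    ConnectionReset,
    Io(io::Error),
}

/// Hypothetical summary error; the real XetError variants may differ.
#[derive(Debug)]
pub enum XetError {
    Authentication(String),
    Network(String),
    Io(String),
}

impl XetError {
    /// Name of the built-in Python exception a binding layer could raise.
    pub fn python_exception(&self) -> &'static str {
        match self {
            XetError::Authentication(_) => "PermissionError",
            XetError::Network(_) => "ConnectionError",
            XetError::Io(_) => "OSError",
        }
    }
}

impl fmt::Display for XetError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            XetError::Authentication(m) => write!(f, "authentication failed: {m}"),
            XetError::Network(m) => write!(f, "network error: {m}"),
            XetError::Io(m) => write!(f, "I/O error: {m}"),
        }
    }
}

impl std::error::Error for XetError {}

/// Collapse a detailed lower-layer error into one user-facing category.
impl From<LowerLayerError> for XetError {
    fn from(err: LowerLayerError) -> Self {
        match err {
            LowerLayerError::Unauthorized => XetError::Authentication("token rejected".into()),
            LowerLayerError::ConnectionReset => XetError::Network("connection reset".into()),
            LowerLayerError::Io(e) => XetError::Io(e.to_string()),
        }
    }
}
```

With a `From` impl like this, code in the top-level package can use `?` on lower-layer results, and a Python binding only needs to look at the category to pick an exception class.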

In addition, the other tools are:

  • git_xet/: the git_xet CLI binary crate (location preserved).
  • hf_xet/: the hf_xet Python package (location preserved).
  • simulation/: the simulation crate for upload scenario benchmarking.
  • wasm/: the wasm objects.

The full description, along with information an AI agent can use to update downstream dependencies, is at api_changes/update_260309_package_restructure.md.

Summary of moves:

  • xet_runtime: became xet_runtime::core inside xet_runtime/.

  • utils: became xet_runtime::utils inside xet_runtime/.

  • xet_config: became xet_runtime::config inside xet_runtime/.

  • xet_logging: became xet_runtime::logging inside xet_runtime/.

  • error_printer: became xet_runtime::error_printer inside xet_runtime/.

  • file_utils: became xet_runtime::file_utils inside xet_runtime/.

  • merklehash: became xet_core_structures::merklehash inside xet_core_structures/.

  • mdb_shard: became xet_core_structures::metadata_shard inside xet_core_structures/.

  • xorb_object: became xet_core_structures::xorb_object inside xet_core_structures/.

  • cas_client: became xet_client::cas_client inside xet_client/.

  • hub_client: became xet_client::hub_client inside xet_client/.

  • cas_types: became xet_client::cas_types inside xet_client/.

  • chunk_cache: became xet_client::chunk_cache inside xet_client/.

  • data: became xet_data::processing inside xet_data/.

  • deduplication: became xet_data::deduplication inside xet_data/.

  • file_reconstruction: became xet_data::file_reconstruction inside xet_data/.

  • progress_tracking: became xet_data::progress_tracking inside xet_data/.

  • xet_session: became xet::xet_session inside xet_pkg/.

  • Wasm packages (hf_xet_wasm, hf_xet_thin_wasm): moved from top-level into wasm/; internal imports updated, public APIs unchanged.
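
The crate moves listed above can be read as a lookup table from old crate name to new module path. The sketch below encodes that table and rewrites the leading segment of an old import path. The `MOVES` constant and the `migrate_path` function are hypothetical helpers written for this illustration; they are not part of the PR or of the migration guide.

```rust
/// Old crate name -> new module path, copied from the summary of moves.
const MOVES: &[(&str, &str)] = &[
    ("xet_runtime", "xet_runtime::core"),
    ("utils", "xet_runtime::utils"),
    ("xet_config", "xet_runtime::config"),
    ("xet_logging", "xet_runtime::logging"),
    ("error_printer", "xet_runtime::error_printer"),
    ("file_utils", "xet_runtime::file_utils"),
    ("merklehash", "xet_core_structures::merklehash"),
    ("mdb_shard", "xet_core_structures::metadata_shard"),
    ("xorb_object", "xet_core_structures::xorb_object"),
    ("cas_client", "xet_client::cas_client"),
    ("hub_client", "xet_client::hub_client"),
    ("cas_types", "xet_client::cas_types"),
    ("chunk_cache", "xet_client::chunk_cache"),
    ("data", "xet_data::processing"),
    ("deduplication", "xet_data::deduplication"),
    ("file_reconstruction", "xet_data::file_reconstruction"),
    ("progress_tracking", "xet_data::progress_tracking"),
    ("xet_session", "xet::xet_session"),
];

/// Rewrite the first segment of an old `a::b::c` path to its new location.
/// Returns `None` when the leading crate is not one of the moved crates.
pub fn migrate_path(old_path: &str) -> Option<String> {
    // Split off the crate name; a bare crate name has no remainder.
    let (head, rest) = match old_path.split_once("::") {
        Some((head, rest)) => (head, Some(rest)),
        None => (old_path, None),
    };
    let new_head = MOVES
        .iter()
        .find(|(old, _)| *old == head)
        .map(|(_, new)| *new)?;
    Some(match rest {
        Some(rest) => format!("{new_head}::{rest}"),
        None => new_head.to_string(),
    })
}
```

Note that the rewrite is not idempotent: because the old xet_runtime crate became xet_runtime::core, running it twice on the same path would produce xet_runtime::core::core, so it should only be applied to paths that have not been migrated yet.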

@rajatarya (Collaborator) left a comment

Review: Code reorganization towards release of xet cargo package

This is a well-structured consolidation of ~20 workspace crates into 5 publishable packages with a clean layered dependency graph (runtime → core_structures → client → data → xet). The api_changes/update_260309_package_restructure.md migration guide is thorough. The error type hierarchy is well designed, with XetError providing meaningful user-facing categories that map cleanly to Python exception types. Keeping backward-compatibility aliases at the old module paths is the right approach for a staged migration.
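
As an illustration of what a backward-compatibility alias at an old module path can look like, the sketch below re-exports the contents of a new module under its old name with `pub use`. The crates are modeled as modules in a single file for brevity, and the `MerkleHash` layout and the `EXAMPLE_SHARD_VERSION` constant are hypothetical stand-ins; the PR's actual aliases may be arranged differently.

```rust
/// Stand-in for the new xet_core_structures crate (modules in one file for brevity).
pub mod xet_core_structures {
    pub mod merklehash {
        /// Hypothetical stand-in type; the real hash type's layout may differ.
        #[derive(Debug, Clone, Copy, PartialEq, Eq)]
        pub struct MerkleHash(pub [u64; 4]);
    }

    pub mod metadata_shard {
        /// Hypothetical constant, present only to have something to re-export.
        pub const EXAMPLE_SHARD_VERSION: u32 = 1;
    }
}

/// Old path `merklehash::...`, kept alive as an alias to the new location.
pub mod merklehash {
    pub use super::xet_core_structures::merklehash::*;
}

/// Old path `mdb_shard::...`, whose module was renamed to `metadata_shard`.
pub mod mdb_shard {
    pub use super::xet_core_structures::metadata_shard::*;
}
```

Because a re-export names the same item rather than a copy, a value built through `merklehash::MerkleHash` can be passed where `xet_core_structures::merklehash::MerkleHash` is expected, which is what lets old and new import paths coexist during a staged migration.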

CI workflows were updated correctly in ci.yml, git-xet-release.yml, and hf-xet-tests.yml.

Critical

1. release.yml: All 5 maturin working-directory references still point to hf_xet instead of bindings/hf_xet

hf_xet/ no longer exists at the repo root — it was moved to bindings/hf_xet/. This will break every release build (linux, musllinux, windows, macos, sdist). The sed/cp/pushd paths in the same file were updated correctly — only working-directory was missed. See inline comment for locations.

Non-blocking observations (for follow-up)

2. Wrong repository URL in libxet/Cargo.toml — set to https://github.com/xetdata/xet-core but actual repo is https://github.com/huggingface/xet-core.

3. Missing crates.io publish metadata — Only libxet/Cargo.toml has description, license, repository. The other 4 packages are missing these (cargo publish requires license and description). Fine if publishing is a later step.

4. mockall as a non-dev dependency in client/Cargo.toml — pulls a proc-macro framework into every consumer's build. Should likely be [dev-dependencies] unless mock types are part of the public API for downstream test code.

5. tempfile as a non-dev dependency in core_structures/Cargo.toml — worth auditing whether it's used in library code or only tests/benchmarks.

6. Inconsistent type paths in client/src/error.rs:129-130: the From<SingleflightError> impl uses utils::errors::SingleflightError in the trait but utils::singleflight::SingleflightError in the parameter. This compiles today via a re-export, but it is fragile and confusing to readers.
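
To make observation 6 concrete, here is a small sketch of why mixed paths compile and what the consistent form looks like. The module layout mirrors the two paths named in the review, but the shape of `SingleflightError` and the `CasClientError` enum are assumptions for illustration, not the code in client/src/error.rs.

```rust
pub mod utils {
    pub mod singleflight {
        /// Hypothetical shape; the real SingleflightError may differ.
        #[derive(Debug)]
        pub struct SingleflightError(pub String);
    }

    pub mod errors {
        // Re-export: `utils::errors::SingleflightError` is the very same type.
        pub use super::singleflight::SingleflightError;
    }
}

/// Hypothetical client-side error used only for this sketch.
#[derive(Debug, PartialEq)]
pub enum CasClientError {
    Singleflight(String),
}

// Consistent form: the trait parameter and the function argument use one path.
// Writing `From<utils::errors::SingleflightError>` here with the argument typed
// as `utils::singleflight::SingleflightError` would also compile, but only for
// as long as the re-export above exists.
impl From<utils::singleflight::SingleflightError> for CasClientError {
    fn from(err: utils::singleflight::SingleflightError) -> Self {
        CasClientError::Singleflight(err.0)
    }
}
```

A value constructed through either path converts with `.into()`, which shows that the two names refer to one type; using a single path in the impl keeps it valid if the re-export is ever removed.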

@rajatarya (Collaborator) commented

Not sure there's any way to avoid this, but moving hf_xet from the root to bindings/hf_xet will break all the *nix distro build scripts (Arch/Fedora/etc.): they all expect $repo_root/hf_xet to be the Python package root to build from.

Maybe we can use git symlinks to keep them alive for one release and mark them deprecated? Not sure it is worth it since they will likely not notice until it breaks.

@hoytak force-pushed the hoytak/260307-xet-package branch from 89cda78 to ea34cdd on March 10, 2026 17:31
@hoytak (Collaborator, Author) commented Mar 10, 2026

> Not sure there's any way to avoid this, but moving hf_xet from the root to bindings/hf_xet will break all the *nix distro build scripts (Arch/Fedora/etc.): they all expect $repo_root/hf_xet to be the Python package root to build from.
>
> Maybe we can use git symlinks to keep them alive for one release and mark them deprecated? Not sure it is worth it since they will likely not notice until it breaks.

We can move them back for now... That's less critical.

@seanses (Collaborator) commented Mar 10, 2026

IIUC, to publish libxet we still need to publish runtime, core_structures, client, and data. Just want to double-check whether they are published under the directory name (in which case the crate name may already be taken) or under some other name.

@seanses (Collaborator) commented Mar 10, 2026

> Not sure there's any way to avoid this, but moving hf_xet from the root to bindings/hf_xet will break all the *nix distro build scripts (Arch/Fedora/etc.): they all expect $repo_root/hf_xet to be the Python package root to build from.
>
> Maybe we can use git symlinks to keep them alive for one release and mark them deprecated? Not sure it is worth it since they will likely not notice until it breaks.

If possible, let's also keep git_xet under the repo root for now. A lot of Hub docs link here, and a moon-landing proxy rule points to this location.

@rajatarya (Collaborator) left a comment

Don't block on me for this - but the git-xet comment from @seanses is worth doing ahead of time.

@hoytak (Collaborator, Author) commented Mar 11, 2026

> IIUC, to publish libxet we still need to publish runtime, core_structures, client, and data. Just want to double-check whether they are published under the directory name (in which case the crate name may already be taken) or under some other name.

The packages published are hf-xet, xet-runtime, xet-core-structures, xet-client, and xet-data.

@hoytak force-pushed the hoytak/260307-xet-package branch from f7a13a9 to 7bf45f9 on March 11, 2026 18:07
@hoytak force-pushed the hoytak/260307-xet-package branch from 7bf45f9 to 4e88dd1 on March 11, 2026 18:35
@hoytak merged commit 45d38a1 into main on Mar 11, 2026
7 checks passed
@hoytak deleted the hoytak/260307-xet-package branch on March 11, 2026 19:02