Move model weights validation into weight utils. by hnyls2002 · Pull Request #13660 · sgl-project/sglang

hnyls2002 · 2025-11-20T14:13:28Z

No description provided.

gemini-code-assist · 2025-11-20T14:13:44Z

Summary of Changes

Hello @hnyls2002, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the handling of HuggingFace model weight validation. By moving the logic for checking model integrity and download completeness from a standalone CI script into the weight_utils.py module, the changes centralize this critical functionality. This improves code maintainability, ensures consistent validation across different model loading pathways, and simplifies the CI workflow by removing redundant external scripts.

Highlights

Centralized Model Weight Validation: The logic for validating model weights, including checks for incomplete downloads and corrupted safetensors files, has been moved into python/sglang/srt/model_loader/weight_utils.py.
New Utility Functions: Three new private functions (_list_snapshot_incomplete_files, _validate_safetensors_header, _validate_snapshot_weight_files) were introduced to handle specific aspects of model weight validation.
Refactored Local Snapshot Discovery: The find_local_hf_snapshot_dir function now leverages these new utility functions for more robust validation of locally cached HuggingFace model snapshots.
CI Script Removal: The dedicated CI script scripts/ci/validate_and_download_models.py has been removed, as its functionality is now integrated into the core library.
CI Workflow Simplification: The scripts/ci/prepare_runner.sh script has been updated to reflect the removal of the external validation script, streamlining the CI runner preparation process.
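
For readers unfamiliar with the two checks named in the highlights, here is a minimal sketch of what they could look like. It is not the PR's code: the function names are hypothetical, and it assumes that partial downloads are left behind with an `.incomplete` suffix and that a safetensors file begins with an 8-byte little-endian header length followed by that many bytes of JSON.

```python
import json
import struct
from pathlib import Path


def list_incomplete_files(cache_dir: Path) -> list[Path]:
    """Return leftover partial downloads under cache_dir.

    Assumption: partial downloads carry an ``.incomplete`` suffix.
    """
    return sorted(p for p in cache_dir.rglob("*.incomplete") if p.is_file())


def is_valid_safetensors_header(path: Path) -> bool:
    """Check that the file starts with a well-formed safetensors header.

    Layout: 8 bytes (little-endian uint64) giving the header size N,
    followed by N bytes of UTF-8 JSON.
    """
    try:
        file_size = path.stat().st_size
        with path.open("rb") as f:
            prefix = f.read(8)
            if len(prefix) < 8:
                return False  # truncated before the length field
            (header_len,) = struct.unpack("<Q", prefix)
            if header_len == 0 or 8 + header_len > file_size:
                return False  # header claims more bytes than the file holds
            header = json.loads(f.read(header_len))
    except (OSError, ValueError):
        # ValueError covers both invalid JSON and invalid UTF-8.
        return False
    return isinstance(header, dict)
```

This only reads the header, so it is cheap even for multi-gigabyte shards, but it cannot detect corruption inside the tensor data itself.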


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

hnyls2002 · 2025-11-20T14:14:47Z

/tag-and-rerun-ci

gemini-code-assist

Code Review

This pull request refactors the model weight validation logic, moving it from a standalone CI script into the core weight_utils.py module. This is a significant improvement, as it integrates validation into the model loading process, ensuring checks are performed consistently. The new implementation is cleaner, more robust, and utilizes modern pathlib features, replacing older, more brittle logic. The validation for sharded models is now more comprehensive, correctly handling various model structures and multiple shard groups.
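
To make the sharded-model check concrete, the sketch below groups shard files by prefix and declared total, then reports any missing indices per group. It is an illustration only: the function name, the regular expression, and the `<prefix>-<index>-of-<total>.safetensors` naming scheme are assumptions, not taken from the PR.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme, e.g. "model-00001-of-00005.safetensors".
_SHARD_RE = re.compile(
    r"^(?P<prefix>.+)-(?P<index>\d+)-of-(?P<total>\d+)\.safetensors$"
)


def find_missing_shards(snapshot_dir: Path) -> dict[str, list[int]]:
    """Map each shard group to the 1-based shard indices that are absent."""
    present: dict[tuple[str, int], set[int]] = defaultdict(set)
    for path in snapshot_dir.glob("*.safetensors"):
        match = _SHARD_RE.match(path.name)
        if match:  # unsharded files such as "model.safetensors" are skipped
            key = (match["prefix"], int(match["total"]))
            present[key].add(int(match["index"]))

    missing: dict[str, list[int]] = {}
    for (prefix, total), seen in present.items():
        gaps = sorted(set(range(1, total + 1)) - seen)
        if gaps:
            missing[f"{prefix} (of {total})"] = gaps
    return missing
```

Keying on both prefix and total keeps separate shard groups in one directory from being merged into a single count.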

I've identified a couple of minor areas where exception handling could be more specific to enhance debuggability. Overall, this is an excellent refactoring that improves the reliability and maintainability of the model loading process.

gemini-code-assist · 2025-11-20T14:21:49Z

+    except Exception as e:
+        logger.warning(
+            "Failed to scan snapshot %s for incomplete files: %s",
+            snapshot_path,
+            e,
+        )


The except Exception as e: block is overly broad. It can mask unexpected programming errors during the file scan, potentially leading to an incomplete snapshot being incorrectly treated as valid. It's better to catch more specific exceptions, like OSError, to handle filesystem-related issues (e.g., permission errors) while allowing other unexpected exceptions to propagate for easier debugging.

Suggested change

-    except Exception as e:
-        logger.warning(
-            "Failed to scan snapshot %s for incomplete files: %s",
-            snapshot_path,
-            e,
-        )
+    except OSError as e:
+        logger.warning(
+            "Failed to scan snapshot %s for incomplete files: %s",
+            snapshot_path,
+            e,
+        )
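
The effect of narrowing the handler can be seen in a small self-contained sketch. The function below is hypothetical and only mirrors the shape of the reviewed code; returning an empty list on failure is a placeholder choice, not the PR's behavior.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def scan_incomplete(snapshot_path: Path) -> list[str]:
    """Hypothetical scan that tolerates filesystem errors only."""
    try:
        return sorted(
            p.name
            for p in snapshot_path.iterdir()
            if p.name.endswith(".incomplete")
        )
    except OSError as e:
        # FileNotFoundError and PermissionError are OSError subclasses.
        logger.warning(
            "Failed to scan snapshot %s for incomplete files: %s",
            snapshot_path,
            e,
        )
        return []
```

A missing directory is logged and tolerated, while a programming error such as passing `None` raises `AttributeError` and surfaces immediately instead of being swallowed.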

Fixes the model weights validation issue by adding proper detection and cleanup of corrupted/incomplete model cache files.

Key changes:
- Add validation helpers for safetensors files and sharded models
- Add cleanup function to remove corrupted cache directories
- Update find_local_hf_snapshot_dir() to validate and cleanup when needed

When corruption or incomplete downloads are detected, the entire model cache directory is removed (using shutil.rmtree) to force a clean re-download. This fixes the issue where deleting individual files didn't work due to HuggingFace's symlink-based blob storage.

Related: #13660
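
As a rough illustration of the whole-directory cleanup described in this commit message, the sketch below removes the repository-level cache directory that owns a snapshot. The function name is hypothetical, and it assumes the cache layout `models--<org>--<name>/snapshots/<revision>/`, with snapshot files being symlinks into a sibling `blobs/` directory.

```python
import shutil
from pathlib import Path


def remove_model_cache(snapshot_dir: Path) -> Path:
    """Delete the whole ``models--<org>--<name>`` directory owning snapshot_dir.

    Assumed layout: <repo_dir>/snapshots/<revision>, where snapshot files
    are symlinks into <repo_dir>/blobs.
    """
    repo_dir = snapshot_dir.parent.parent
    if (
        snapshot_dir.parent.name != "snapshots"
        or not repo_dir.name.startswith("models--")
    ):
        # Refuse to delete anything that does not look like a cache entry.
        raise ValueError(f"Not a HuggingFace cache snapshot: {snapshot_dir}")
    shutil.rmtree(repo_dir, ignore_errors=True)
    return repo_dir
```

Deleting only a snapshot file would remove the symlink and leave the underlying blob in place, which is why the sketch removes the entire repository directory. The layout guard is there so that a wrong path cannot trigger a recursive delete of an unrelated directory.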

Removes the standalone validation script and integrates all validation into weight_utils.py. This completes the centralization started in #13660.

Changes:
- Remove scripts/ci/validate_and_download_models.py
- Update scripts/ci/prepare_runner.sh to remove validation step

Validation now happens automatically during model loading via the find_local_hf_snapshot_dir() function in weight_utils.py, which detects and cleans up corrupted/incomplete caches.

hnyls2002 added 3 commits November 20, 2025 22:09

fix weight utils

63cb9d5

update

f637a9e

Merge branch 'main' into lsyin/fix-model-weights-validation

ddb42cf

github-actions Bot added the run-ci label Nov 20, 2025

gemini-code-assist Bot reviewed Nov 20, 2025


hnyls2002 closed this Nov 20, 2025

hnyls2002 deleted the lsyin/fix-model-weights-validation branch November 20, 2025 15:45

alisonshao mentioned this pull request Nov 21, 2025

Fix model weights validation with automatic cache cleanup #13729

Merged

hnyls2002 wants to merge 3 commits into main from lsyin/fix-model-weights-validation
