Move model weights validation into weight utils. #13660

Closed
hnyls2002 wants to merge 3 commits into main from lsyin/fix-model-weights-validation

Conversation

@hnyls2002
Collaborator

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @hnyls2002, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the handling of HuggingFace model weight validation. By moving the logic for checking model integrity and download completeness from a standalone CI script into the weight_utils.py module, the changes centralize this critical functionality. This improves code maintainability, ensures consistent validation across different model loading pathways, and simplifies the CI workflow by removing redundant external scripts.

Highlights

  • Centralized Model Weight Validation: The logic for validating model weights, including checks for incomplete downloads and corrupted safetensors files, has been moved into python/sglang/srt/model_loader/weight_utils.py.
  • New Utility Functions: Three new private functions (_list_snapshot_incomplete_files, _validate_safetensors_header, _validate_snapshot_weight_files) were introduced to handle specific aspects of model weight validation.
  • Refactored Local Snapshot Discovery: The find_local_hf_snapshot_dir function now leverages these new utility functions for more robust validation of locally cached HuggingFace model snapshots.
  • CI Script Removal: The dedicated CI script scripts/ci/validate_and_download_models.py has been removed, as its functionality is now integrated into the core library.
  • CI Workflow Simplification: The scripts/ci/prepare_runner.sh script has been updated to reflect the removal of the external validation script, streamlining the CI runner preparation process.
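
For readers unfamiliar with what a safetensors header check involves, here is a minimal sketch. It is not the PR's `_validate_safetensors_header`; the name `check_safetensors_header` is hypothetical. It relies only on the published safetensors layout (an 8-byte little-endian header length, a JSON header, then the tensor data) and treats a file as invalid if the header cannot be parsed or the file is too short to hold the tensors the header declares.

```python
import json
import struct
from pathlib import Path


def check_safetensors_header(path: Path) -> bool:
    """Hypothetical helper: True if the header parses and the file is long
    enough to contain every tensor the header declares."""
    try:
        file_size = path.stat().st_size
        with path.open("rb") as f:
            prefix = f.read(8)
            if len(prefix) < 8:
                return False  # too short to even hold the length field
            (header_len,) = struct.unpack("<Q", prefix)
            if header_len > file_size - 8:
                return False  # declared header runs past end of file
            header = json.loads(f.read(header_len))
    except (OSError, ValueError):
        # OSError: unreadable file; ValueError: bad UTF-8 or bad JSON
        return False

    if not isinstance(header, dict):
        return False

    try:
        # Largest end offset of any tensor, relative to the data section.
        data_end = max(
            (info["data_offsets"][1]
             for name, info in header.items()
             if name != "__metadata__"),
            default=0,
        )
    except (KeyError, TypeError, IndexError):
        return False  # header entries do not have the expected shape

    return 8 + header_len + data_end <= file_size
```

A check like this only reads the header, so it stays cheap even for multi-gigabyte shards. It does not verify tensor contents.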

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@hnyls2002
Collaborator Author

/tag-and-rerun-ci

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request refactors the model weight validation logic, moving it from a standalone CI script into the core weight_utils.py module. This is a significant improvement, as it integrates validation into the model loading process, ensuring checks are performed consistently. The new implementation is cleaner, more robust, and utilizes modern pathlib features, replacing older, more brittle logic. The validation for sharded models is now more comprehensive, correctly handling various model structures and multiple shard groups.

I've identified a couple of minor areas where exception handling could be more specific to enhance debuggability. Overall, this is an excellent refactoring that improves the reliability and maintainability of the model loading process.

Comment on lines +275 to +280
    except Exception as e:
        logger.warning(
            "Failed to scan snapshot %s for incomplete files: %s",
            snapshot_path,
            e,
        )

medium

The except Exception as e: block is overly broad. It can mask unexpected programming errors during the file scan, potentially leading to an incomplete snapshot being incorrectly treated as valid. It's better to catch more specific exceptions, like OSError, to handle filesystem-related issues (e.g., permission errors) while allowing other unexpected exceptions to propagate for easier debugging.

Suggested change
-    except Exception as e:
-        logger.warning(
-            "Failed to scan snapshot %s for incomplete files: %s",
-            snapshot_path,
-            e,
-        )
+    except OSError as e:
+        logger.warning(
+            "Failed to scan snapshot %s for incomplete files: %s",
+            snapshot_path,
+            e,
+        )
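
To show the suggestion in context, here is a rough, self-contained sketch of a snapshot scan that catches only `OSError`. The function name `list_incomplete_files` is hypothetical, and the two signals it looks for (a `.incomplete` suffix and a symlink whose target is missing) are assumptions about what "incomplete" means, not the logic from this PR.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def list_incomplete_files(snapshot_path: Path) -> list[str]:
    """Hypothetical helper: paths under snapshot_path that look unfinished."""
    incomplete: list[str] = []
    try:
        for entry in snapshot_path.rglob("*"):
            if entry.name.endswith(".incomplete"):
                # Assumed marker for a partially downloaded file.
                incomplete.append(str(entry))
            elif entry.is_symlink() and not entry.exists():
                # Symlink whose target is gone (exists() follows links).
                incomplete.append(str(entry))
    except OSError as e:
        # Filesystem problems only; programming errors still propagate.
        logger.warning(
            "Failed to scan snapshot %s for incomplete files: %s",
            snapshot_path,
            e,
        )
    return incomplete
```

Note that when the scan fails partway, this sketch still returns whatever it found so far. A stricter caller might prefer to treat a failed scan as "not validated" rather than as "nothing incomplete".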

@hnyls2002 hnyls2002 closed this Nov 20, 2025
@hnyls2002 hnyls2002 deleted the lsyin/fix-model-weights-validation branch November 20, 2025 15:45
alisonshao added a commit that referenced this pull request Nov 21, 2025
Fixes the model weights validation issue by adding proper detection and
cleanup of corrupted/incomplete model cache files.

Key changes:
- Add validation helpers for safetensors files and sharded models
- Add cleanup function to remove corrupted cache directories
- Update find_local_hf_snapshot_dir() to validate and cleanup when needed

When corruption or incomplete downloads are detected, the entire model
cache directory is removed (using shutil.rmtree) to force a clean
re-download. This fixes the issue where deleting individual files didn't
work due to HuggingFace's symlink-based blob storage.

Related: #13660
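
The whole-directory cleanup described in this commit message can be illustrated with a short sketch. It assumes the standard Hugging Face cache layout, `models--<org>--<name>/snapshots/<revision>/`, where snapshot entries are symlinks into a sibling `blobs/` directory. The function name `remove_model_cache` is hypothetical and is not taken from the commit.

```python
import shutil
from pathlib import Path


def remove_model_cache(snapshot_dir: Path) -> bool:
    """Hypothetical helper: delete the whole models--* directory that owns
    snapshot_dir. Returns True only if that directory is gone afterwards."""
    snapshots_dir = snapshot_dir.parent
    model_root = snapshots_dir.parent

    # Refuse to delete anything that does not look like a cache entry.
    if snapshots_dir.name != "snapshots" or not model_root.name.startswith("models--"):
        return False

    # Removing the root drops blobs, refs, and snapshots together, so no
    # stale blob survives behind a deleted symlink.
    shutil.rmtree(model_root, ignore_errors=True)
    return not model_root.exists()
```

The name check is a deliberate guard: since `shutil.rmtree` is destructive, the sketch declines to act on paths that do not match the expected layout.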
alisonshao added a commit that referenced this pull request Nov 21, 2025
Removes the standalone validation script and integrates all validation
into weight_utils.py. This completes the centralization started in #13660.

Changes:
- Remove scripts/ci/validate_and_download_models.py
- Update scripts/ci/prepare_runner.sh to remove validation step

Validation now happens automatically during model loading via the
find_local_hf_snapshot_dir() function in weight_utils.py, which detects
and cleans up corrupted/incomplete caches.