Move model weights validation into weight utils.#13660
Move model weights validation into weight utils.#13660
Conversation
Summary of ChangesHello @hnyls2002, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refactors the handling of HuggingFace model weight validation. By moving the logic for checking model integrity and download completeness from a standalone CI script into the weight_utils.py module, the changes centralize this critical functionality. This improves code maintainability, ensures consistent validation across different model loading pathways, and simplifies the CI workflow by removing redundant external scripts. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
/tag-and-rerun-ci |
There was a problem hiding this comment.
Code Review
This pull request refactors the model weight validation logic, moving it from a standalone CI script into the core weight_utils.py module. This is a significant improvement, as it integrates validation into the model loading process, ensuring checks are performed consistently. The new implementation is cleaner, more robust, and utilizes modern pathlib features, replacing older, more brittle logic. The validation for sharded models is now more comprehensive, correctly handling various model structures and multiple shard groups.
I've identified a couple of minor areas where exception handling could be more specific to enhance debuggability. Overall, this is an excellent refactoring that improves the reliability and maintainability of the model loading process.
| except Exception as e: | ||
| logger.warning( | ||
| "Failed to scan snapshot %s for incomplete files: %s", | ||
| snapshot_path, | ||
| e, | ||
| ) |
There was a problem hiding this comment.
The except Exception as e: block is overly broad. It can mask unexpected programming errors during the file scan, potentially leading to an incomplete snapshot being incorrectly treated as valid. It's better to catch more specific exceptions, like OSError, to handle filesystem-related issues (e.g., permission errors) while allowing other unexpected exceptions to propagate for easier debugging.
| except Exception as e: | |
| logger.warning( | |
| "Failed to scan snapshot %s for incomplete files: %s", | |
| snapshot_path, | |
| e, | |
| ) | |
| except OSError as e: | |
| logger.warning( | |
| "Failed to scan snapshot %s for incomplete files: %s", | |
| snapshot_path, | |
| e, | |
| ) |
Fixes the model weights validation issue by adding proper detection and cleanup of corrupted/incomplete model cache files. Key changes: - Add validation helpers for safetensors files and sharded models - Add cleanup function to remove corrupted cache directories - Update find_local_hf_snapshot_dir() to validate and cleanup when needed When corruption or incomplete downloads are detected, the entire model cache directory is removed (using shutil.rmtree) to force a clean re-download. This fixes the issue where deleting individual files didn't work due to HuggingFace's symlink-based blob storage. Related: #13660
Removes the standalone validation script and integrates all validation into weight_utils.py. This completes the centralization started in #13660. Changes: - Remove scripts/ci/validate_and_download_models.py - Update scripts/ci/prepare_runner.sh to remove validation step Validation now happens automatically during model loading via the find_local_hf_snapshot_dir() function in weight_utils.py, which detects and cleans up corrupted/incomplete caches.
No description provided.