
feat: add coordinated checkpoint prefetch for network filesystem loading #20843

Merged
Fridge003 merged 8 commits into sgl-project:main from janbernloehr:jbernloehr/prefetch-checkpoints on Apr 17, 2026

Conversation

@janbernloehr (Contributor) commented on Mar 18, 2026

Closes #20842

Motivation

When multiple DP ranks on the same node load the same checkpoint via mmap (e.g. DP-attention), each rank independently page-faults all safetensors files over NFS/Lustre. With N ranks this causes N × checkpoint_size bytes of redundant network I/O.

Note: this is orthogonal to #20332, which addresses the per-rank access pattern problem with TP (striped reads within safetensors). This PR addresses the cross-rank duplication problem with DP (N ranks redundantly reading the same files over NFS). The two optimizations are complementary.

Modifications

  • weight_utils.py: Added _PREFETCH_BLOCK_SIZE module constant, _prefetch_checkpoint_file() (reads a file in blocks to warm page cache) and _prefetch_all_checkpoints() (distributes files across ranks via sorted_files[rank::world_size], runs in a background daemon thread with configurable thread count). Added prefetch and prefetch_num_threads parameters to safetensors_weights_iterator() and buffered_multi_thread_safetensors_weights_iterator().
  • server_args.py: Added --weight-loader-prefetch-checkpoints flag and --weight-loader-prefetch-num-threads (default: 4).
  • loader.py: Wired the new flags through to both weight iterator functions.
  • test_prefetch_checkpoints.py: Unit tests verifying byte-level read completeness, correct rank-based file partitioning, and bit-identical weights with/without prefetch.

The prefetch runs in a background daemon thread (inspired by vllm-project/vllm#36012) that pipelines I/O with loading: loading starts immediately while the prefetch thread reads ahead into the OS page cache. Each rank prefetches 1/Nth of the checkpoint shards, reducing total NFS I/O from N × checkpoint_size to 1 × checkpoint_size.
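
To make the mechanism concrete, here is a minimal sketch of the shape of these helpers, assuming `rank` and `world_size` describe the ranks sharing one node's page cache. The names, the block-read loop, and the `rank::world_size` striping follow the PR description; the bodies are illustrative, not the merged code.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Block size is configurable via SGLANG_PREFETCH_BLOCK_SIZE_MB (default 16 MB).
_PREFETCH_BLOCK_SIZE = (
    int(os.environ.get("SGLANG_PREFETCH_BLOCK_SIZE_MB", "16")) * 1024 * 1024
)


def _prefetch_checkpoint_file(path: str) -> None:
    """Sequentially read a file in blocks so its pages land in the OS page cache."""
    with open(path, "rb", buffering=0) as f:
        while f.read(_PREFETCH_BLOCK_SIZE):
            pass


def _prefetch_all_checkpoints(
    files: List[str], rank: int, world_size: int, num_threads: int = 4
) -> None:
    """Stripe the sorted file list across ranks and warm this rank's share
    concurrently, in a background daemon thread so loading can start at once."""
    my_files = sorted(files)[rank::world_size]

    def _run() -> None:
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            # list() drains the map so exceptions inside workers are surfaced.
            list(pool.map(_prefetch_checkpoint_file, my_files))

    threading.Thread(target=_run, daemon=True, name="ckpt-prefetch").start()
```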

Accuracy Tests

No changes to model forward code or kernels. The prefetch only affects the I/O path during weight loading — tensors loaded are identical (verified by unit test test_weights_match_with_and_without_prefetch).

Benchmarking and Profiling

DeepSeek R1-671B FP8 on Lustre, multiple platforms and configurations:

| Platform | Baseline | + Background prefetch | Improvement |
|----------|----------|-----------------------|-------------|
| DGX B200 (8-GPU, dpa-on) | 1677s (28 min) | 167s (2.8 min) | 10x |
| DGX B300 (8-GPU, dpa-on) | 2390s (40 min) | 280s (4.7 min) | 8.5x |
| GB200 NVL (2-node, dpa-on) | 263s (4.4 min) | 131s (2.2 min) | 2x |
| GB300 NVL (2-node, dpa-on) | 279s (4.6 min) | 152s (2.5 min) | 1.8x |
| DGX H100 (2-node TP=16, dpa-off) | 2571s (42.9 min) | 793s (13.2 min) | 3.2x |
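
For reference, the feature is enabled purely through the new server flags; a hypothetical launch line (model path and parallelism settings are placeholders):

```bash
python -m sglang.launch_server \
    --model-path /lustre/checkpoints/DeepSeek-R1 \
    --enable-dp-attention \
    --weight-loader-prefetch-checkpoints \
    --weight-loader-prefetch-num-threads 4
```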

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial performance enhancement for loading large language model weights, especially in distributed environments utilizing network filesystems. By intelligently prefetching checkpoint files into the OS page cache, it eliminates the substantial overhead caused by multiple processes independently accessing the same data over the network. This optimization dramatically accelerates model initialization, making the system more robust and efficient for high-performance computing setups.

Highlights

  • Performance Optimization: Implemented coordinated checkpoint prefetching to significantly reduce weight loading times (up to 10x faster) for distributed training on network filesystems like NFS/Lustre.
  • Reduced Network I/O: Addressed redundant network I/O by distributing checkpoint file prefetching across ranks, ensuring each rank reads only a fraction of the shards, thereby reducing total network I/O from N * checkpoint_size to 1 * checkpoint_size.
  • New Functionality: Introduced _prefetch_checkpoint_file() and _prefetch_all_checkpoints() in weight_utils.py to manage the prefetching process, reading files in 16 MB blocks to warm the OS page cache.
  • User Control: Added a new command-line flag --weight-loader-prefetch-checkpoints in server_args.py to enable or disable this feature, which is then wired through loader.py.
  • Comprehensive Testing: Included new unit tests in test_prefetch_checkpoints.py to verify byte-level read completeness, correct rank-based file partitioning, and ensure bit-identical weights with and without prefetching.


@gemini-code-assist Bot left a comment

Code Review

The pull request introduces a coordinated checkpoint prefetch mechanism to significantly improve weight loading times, especially on network filesystems. The implementation correctly distributes file prefetching across ranks and includes detailed logging and performance metrics. Unit tests have been added to verify the correctness of the prefetching logic and ensure bit-identical weights with and without prefetch. The changes are well-structured and address a critical performance bottleneck.

@nvpohanh (Collaborator)

@Fridge003 Could you assign to the PICs of this part? Thanks!

@Fridge003 (Collaborator)

Wonderful feature!

When multiple DP ranks on the same node load the same checkpoint via
mmap (e.g. DP-attention), each rank independently page-faults all
files over NFS/Lustre. With N ranks this causes N * checkpoint_size
bytes of redundant network I/O.

This adds a `--weight-loader-prefetch-checkpoints` flag that, before
loading, distributes a sequential read of the checkpoint files across
all ranks (each reads 1/Nth of the shards into the shared OS page
cache). Subsequent mmap accesses then hit warm page cache instead of
the network filesystem.

Measured on DeepSeek R1-671B FP8 with 8 DP-attention ranks:

| Setup                    | B300 weight load time |
|--------------------------|-----------------------|
| Baseline (mmap)          | deadlock / 40 min     |
| max_workers=4            | 2390s (40 min)        |
| max_workers=4 + prefetch | 236s (3.9 min)        |

10x improvement on Lustre, with no infrastructure changes required.

The implementation:
- `_prefetch_checkpoint_file()`: reads a file in 16 MB blocks to warm
  the page cache
- `_prefetch_all_checkpoints()`: distributes files across ranks with a
  barrier, uses 4 threads per rank for concurrent prefetch
- Wired through both `safetensors_weights_iterator` and
  `buffered_multi_thread_safetensors_weights_iterator`
Tests verify:
- _prefetch_checkpoint_file reads every byte of the file
- _prefetch_all_checkpoints only reads the files assigned to the
  current rank (rank::world_size partitioning)
- Single-rank mode prefetches all files
- Weights loaded with prefetch=True are bit-identical to prefetch=False
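
For illustration, a bit-identity check along these lines could look as follows; the import path matches the PR's file layout, but the exact signature of `safetensors_weights_iterator` beyond the new `prefetch` parameter is an assumption:

```python
import torch
from sglang.srt.model_loader.weight_utils import safetensors_weights_iterator


def assert_weights_identical(files) -> None:
    # Load the same shards twice, once with prefetch, and compare bit-for-bit.
    baseline = dict(safetensors_weights_iterator(files, prefetch=False))
    prefetched = dict(safetensors_weights_iterator(files, prefetch=True))
    assert baseline.keys() == prefetched.keys()
    for name, tensor in baseline.items():
        assert torch.equal(tensor, prefetched[name]), f"mismatch in {name}"
```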
- Extract block_size as module-level constant _PREFETCH_BLOCK_SIZE
- Make prefetch num_threads configurable via --weight-loader-prefetch-num-threads
  (default: 4), wired through server_args -> loader -> weight_utils
Adopt vLLM's approach (vllm-project/vllm#36012): run prefetch in a
background daemon thread instead of blocking with a barrier. This
pipelines I/O with loading — the loader benefits from pages already
cached while the prefetch thread continues reading ahead.

Advantages over the blocking approach:
- Works well even when checkpoint > available RAM (sliding window)
- No barrier overhead between ranks
- Loading starts immediately instead of waiting for full prefetch
- Make _PREFETCH_BLOCK_SIZE configurable via SGLANG_PREFETCH_BLOCK_SIZE_MB env var
- Print prefetch log messages on every rank (not just rank 0)
- Add docs for --weight-loader-prefetch-checkpoints and --weight-loader-prefetch-num-threads
  to docs/advanced_features/server_arguments.md
- Simplify tests: keep only TestPrefetchWeightsIdentical, register with
  register_cpu_ci(est_time=5, suite="stage-a-test-cpu")
- Remove TestPrefetchReadsAllBytes and TestPrefetchDistributedOnlyReadsSubset
  per reviewer request (multi-GPU integration test needed separately)
@janbernloehr force-pushed the jbernloehr/prefetch-checkpoints branch from f5a9b3b to 2d51eea on April 9, 2026 09:00
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Apr 9, 2026
Two fixes:
- Use get_world_group().local_rank for prefetch file distribution
  instead of global rank. Page cache is per-node, so each node must
  independently prefetch the full checkpoint. With global rank, multi-
  node setups would split files across nodes whose page caches are not
  shared.
- Fix buffered_multi_thread_safetensors_weights_iterator to iterate
  files in sorted order (matching prefetch order), not original order.
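
A small example of why the striping must use the node-local rank; the helper name is hypothetical and the striping expression follows the commit above:

```python
def partition_for_local_rank(files, local_rank: int, local_world_size: int):
    """Ranks on one node split the file list among themselves; every node still
    covers the whole checkpoint, because the page cache is not shared across nodes."""
    return sorted(files)[local_rank::local_world_size]


# 4 ranks per node, 8 shards: rank 0 warms shards 0 and 4, rank 1 warms 1 and 5, ...
shards = [f"model-{i:05d}-of-00008.safetensors" for i in range(8)]
print(partition_for_local_rank(shards, 0, 4))
```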
Launches a small MoE model (Qwen1.5-MoE-A2.7B-Chat) with DP-attention
and --weight-loader-prefetch-checkpoints on 4 GPUs. Verifies the server
starts and produces valid output.

Registered as nightly-4-gpu with est_time=300.
@nvpohanh (Collaborator)

@janbernloehr could you let us know when this PR is ready for another round of review?

@janbernloehr (Contributor, Author)

@nvpohanh Thanks for the ping, I addressed everything. If there is anything else I can do please let me know.

@nvpohanh (Collaborator)

@Fridge003 Could you review again? Thanks!

@Fridge003 (Collaborator)

/tag-and-rerun-ci

@Fridge003 Fridge003 merged commit 04a5395 into sgl-project:main Apr 17, 2026
227 of 272 checks passed