feat: add coordinated checkpoint prefetch for network filesystem loading #20843
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a performance enhancement for loading large language model weights, especially in distributed environments using network filesystems. By prefetching checkpoint files into the OS page cache, it eliminates the substantial overhead caused by multiple processes independently accessing the same data over the network. This optimization significantly accelerates model initialization for high-performance computing setups.
Code Review
The pull request introduces a coordinated checkpoint prefetch mechanism to significantly improve weight loading times, especially on network filesystems. The implementation correctly distributes file prefetching across ranks and includes detailed logging and performance metrics. Unit tests have been added to verify the correctness of the prefetching logic and ensure bit-identical weights with and without prefetch. The changes are well-structured and address a critical performance bottleneck.
@Fridge003 Could you assign this to the PICs for this part? Thanks!
Wonderful feature! |
When multiple DP ranks on the same node load the same checkpoint via mmap (e.g. DP-attention), each rank independently page-faults all files over NFS/Lustre. With N ranks this causes N × checkpoint_size bytes of redundant network I/O.

This adds a `--weight-loader-prefetch-checkpoints` flag that, before loading, distributes a sequential read of the checkpoint files across all ranks (each reads 1/Nth of the shards into the shared OS page cache). Subsequent mmap accesses then hit warm page cache instead of the network filesystem.

Measured on DeepSeek R1-671B FP8 with 8 DP-attention ranks:

| Setup                    | B300 weight load time |
|--------------------------|-----------------------|
| Baseline (mmap)          | deadlock / 40 min     |
| max_workers=4            | 2390s (40 min)        |
| max_workers=4 + prefetch | 236s (3.9 min)        |

10x improvement on Lustre, with no infrastructure changes required.

The implementation (a minimal sketch follows below):
- `_prefetch_checkpoint_file()`: reads a file in 16 MB blocks to warm the page cache
- `_prefetch_all_checkpoints()`: distributes files across ranks with a barrier, uses 4 threads per rank for concurrent prefetch
- Wired through both `safetensors_weights_iterator` and `buffered_multi_thread_safetensors_weights_iterator`
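A minimal sketch of the block-read warm-up described above; the function and constant names follow the PR description, but the body is illustrative rather than the actual diff:

```python
# Sketch only: sequentially read a checkpoint shard in fixed-size blocks so its
# pages land in the shared OS page cache before mmap-based loading touches them.
_PREFETCH_BLOCK_SIZE = 16 * 1024 * 1024  # 16 MB, matching the default described above


def _prefetch_checkpoint_file(path: str) -> None:
    """Warm the page cache for one safetensors shard by reading it end to end."""
    with open(path, "rb", buffering=0) as f:
        while f.read(_PREFETCH_BLOCK_SIZE):
            pass
```

The data read is discarded; the only purpose is the side effect of populating the page cache so later mmap accesses avoid the network filesystem.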
Tests verify:
- `_prefetch_checkpoint_file` reads every byte of the file
- `_prefetch_all_checkpoints` only reads the files assigned to the current rank (rank::world_size partitioning)
- Single-rank mode prefetches all files
- Weights loaded with prefetch=True are bit-identical to prefetch=False
- Extract block_size as module-level constant `_PREFETCH_BLOCK_SIZE`
- Make prefetch num_threads configurable via `--weight-loader-prefetch-num-threads` (default: 4), wired through server_args -> loader -> weight_utils
Adopt vLLM's approach (vllm-project/vllm#36012): run prefetch in a background daemon thread instead of blocking with a barrier (sketched below). This pipelines I/O with loading — the loader benefits from pages already cached while the prefetch thread continues reading ahead.

Advantages over the blocking approach:
- Works well even when checkpoint > available RAM (sliding window)
- No barrier overhead between ranks
- Loading starts immediately instead of waiting for full prefetch
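A hedged sketch of how such a background prefetch thread can be started; `_prefetch_checkpoint_file` is assumed from the earlier sketch, and the helper name `_start_background_prefetch` is illustrative, not taken from the PR:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def _start_background_prefetch(files: list[str], num_threads: int = 4) -> threading.Thread:
    """Warm the page cache for the given shards without blocking weight loading."""

    def _worker() -> None:
        # A small thread pool reads this rank's shards concurrently; the loader
        # proceeds in the main thread and simply finds more and more pages cached.
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            list(pool.map(_prefetch_checkpoint_file, files))

    t = threading.Thread(target=_worker, name="checkpoint-prefetch", daemon=True)
    t.start()
    return t
```

Because the thread is a daemon, it never blocks process shutdown if loading finishes (or fails) before the prefetch completes.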
- Make `_PREFETCH_BLOCK_SIZE` configurable via `SGLANG_PREFETCH_BLOCK_SIZE_MB` env var (see the snippet below)
- Print prefetch log messages on every rank (not just rank 0)
- Add docs for `--weight-loader-prefetch-checkpoints` and `--weight-loader-prefetch-num-threads` to docs/advanced_features/server_arguments.md
- Simplify tests: keep only TestPrefetchWeightsIdentical, register with register_cpu_ci(est_time=5, suite="stage-a-test-cpu")
- Remove TestPrefetchReadsAllBytes and TestPrefetchDistributedOnlyReadsSubset per reviewer request (multi-GPU integration test needed separately)
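One plausible way the env-var override could be resolved; the variable name comes from the commit message above, while the lookup logic and the 16 MB default are assumptions for illustration:

```python
import os

# Assumed resolution of the documented SGLANG_PREFETCH_BLOCK_SIZE_MB override;
# the exact lookup in the PR may differ.
_PREFETCH_BLOCK_SIZE = int(os.environ.get("SGLANG_PREFETCH_BLOCK_SIZE_MB", "16")) * 1024 * 1024
```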
Two fixes:
- Use get_world_group().local_rank for prefetch file distribution instead of global rank (see the sketch below). Page cache is per-node, so each node must independently prefetch the full checkpoint. With global rank, multi-node setups would split files across nodes whose page caches are not shared.
- Fix buffered_multi_thread_safetensors_weights_iterator to iterate files in sorted order (matching prefetch order), not original order.
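A sketch of the per-node partitioning described in the first fix; the helper name and signature are illustrative, but the slicing mirrors the `sorted_files[rank::world_size]` scheme from the PR, applied to the node-local rank:

```python
def _files_for_local_rank(files: list[str], local_rank: int, local_world_size: int) -> list[str]:
    """Return the shards this rank should prefetch.

    The page cache is shared only within a node, so the partition uses the
    node-local rank: across one node's ranks the slices cover every shard,
    and every node independently warms the full checkpoint.
    """
    sorted_files = sorted(files)
    return sorted_files[local_rank::local_world_size]
```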
Launches a small MoE model (Qwen1.5-MoE-A2.7B-Chat) with DP-attention and --weight-loader-prefetch-checkpoints on 4 GPUs. Verifies the server starts and produces valid output. Registered as nightly-4-gpu with est_time=300.
@janbernloehr could you let us know when this PR is ready for another round of review?
@nvpohanh Thanks for the ping; I addressed everything. If there is anything else I can do, please let me know.
@Fridge003 Could you review again? Thanks!
/tag-and-rerun-ci
Closes #20842
Motivation
When multiple DP ranks on the same node load the same checkpoint via mmap (e.g. DP-attention), each rank independently page-faults all safetensors files over NFS/Lustre. With N ranks this causes N × checkpoint_size bytes of redundant network I/O.
Note: this is orthogonal to #20332, which addresses the per-rank access pattern problem with TP (striped reads within safetensors). This PR addresses the cross-rank duplication problem with DP (N ranks redundantly reading the same files over NFS). The two optimizations are complementary.
Modifications
- weight_utils.py: Added `_PREFETCH_BLOCK_SIZE` module constant, `_prefetch_checkpoint_file()` (reads a file in blocks to warm page cache), and `_prefetch_all_checkpoints()` (distributes files across ranks via `sorted_files[rank::world_size]`, runs in a background daemon thread with configurable thread count). Added `prefetch` and `prefetch_num_threads` parameters to `safetensors_weights_iterator()` and `buffered_multi_thread_safetensors_weights_iterator()` (a usage sketch follows below).
- server_args.py: Added `--weight-loader-prefetch-checkpoints` flag and `--weight-loader-prefetch-num-threads` (default: 4).
- loader.py: Wired the new flags through to both weight iterator functions.
- test_prefetch_checkpoints.py: Unit tests verifying byte-level read completeness, correct rank-based file partitioning, and bit-identical weights with/without prefetch.

The prefetch runs in a background daemon thread (inspired by vllm-project/vllm#36012) that pipelines I/O with loading — loading starts immediately while the prefetch thread reads ahead into the OS page cache. Each rank prefetches 1/Nth of the checkpoint shards, reducing total NFS I/O from N × checkpoint_size to 1 × checkpoint_size.
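A hedged usage sketch of the extended iterator described above; the new parameter names come from the Modifications list, while the import path, positional argument, and file paths are assumptions for illustration (the real call sites live in loader.py):

```python
# Illustrative only: enable the prefetch path when iterating safetensors shards.
from sglang.srt.model_loader.weight_utils import safetensors_weights_iterator

hf_weights_files = [
    "/models/ckpt/model-00001-of-00002.safetensors",
    "/models/ckpt/model-00002-of-00002.safetensors",
]

for name, tensor in safetensors_weights_iterator(
    hf_weights_files,
    prefetch=True,           # from --weight-loader-prefetch-checkpoints
    prefetch_num_threads=4,  # from --weight-loader-prefetch-num-threads (default: 4)
):
    ...  # hand each (name, tensor) pair to the model's load_weights path
```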
Accuracy Tests
No changes to model forward code or kernels. The prefetch only affects the I/O path during weight loading — the tensors loaded are identical (verified by the unit test `test_weights_match_with_and_without_prefetch`).

Benchmarking and Profiling
DeepSeek R1-671B FP8 on Lustre, multiple platforms and configurations:
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci