
Add NCCL/RCCL pre-warming to reduce P99 TTFT cold-start latency (#20477)

Merged
HaiShaw merged 4 commits into sgl-project:main from hubertlu-tw:pre_warm_nccl
Mar 17, 2026
Conversation

@hubertlu-tw (Collaborator)

Motivation

When using multi-GPU tensor parallelism (TP > 1), the first collective communication operation triggers NCCL/RCCL communicator initialization, causing severe P99 TTFT degradation (up to 1400ms) for the first 2-3 requests.

This PR implements NCCL/RCCL pre-warming during server startup to eliminate cold-start latency, inspired by InstantTensor's implementation.

Measured Impact on AMD MI355X:

  • P99 TTFT improvement: 74.9% (1413ms → 357ms)
  • Latency stability: 87.8% lower std dev (327ms → 40ms)
  • Warmup overhead: 4.7s one-time cost

Default Behavior:

  • Enabled by default for AMD/HIP (RCCL) - validated on MI355X
  • Disabled by default for NVIDIA/CUDA (NCCL) - pending validation

Modifications

Server Arguments (server_args.py)

Added pre_warm_nccl field with platform-aware default:

pre_warm_nccl: bool = dataclasses.field(
    default_factory=lambda: is_hip()
)  # Default: True for AMD/HIP, False for NVIDIA/CUDA
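The `default_factory` pattern above can be illustrated in isolation. The sketch below uses a hypothetical stand-in for sglang's `is_hip()` probe (hard-coded to a CUDA host for the example); the point is that the factory defers the platform check to instantiation time rather than evaluating it once when the class is defined:

```python
import dataclasses

def is_hip():
    # Hypothetical stand-in for sglang's real platform probe; assume a
    # non-HIP (CUDA) host here, so the default resolves to False.
    return False

@dataclasses.dataclass
class ServerArgs:
    # default_factory defers the platform check to instantiation time
    # instead of baking in a value at class-definition time.
    pre_warm_nccl: bool = dataclasses.field(default_factory=lambda: is_hip())

print(ServerArgs().pre_warm_nccl)  # → False on this assumed CUDA host
```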

Added CLI argument:

parser.add_argument(
    "--pre-warm-nccl",
    action="store_true",
    help="Pre-warm NCCL/RCCL communicators during startup to reduce P99 TTFT cold-start latency. Default: enabled for AMD/HIP (RCCL), disabled for NVIDIA/CUDA (NCCL).",
)

Model Runner (model_runner.py)

Added warmup logic during initialization:

if self.server_args.pre_warm_nccl and (self.tp_size > 1 or self.pp_size > 1 or self.moe_ep_size > 1):
    warmup_start = time.perf_counter()
    tp_group_handle = get_tp_group().device_group

    # Single warmup all_reduce to initialize NCCL/RCCL communicator
    warmup_tensor = torch.zeros(1, device=torch.cuda.current_device())
    dist.all_reduce(warmup_tensor, group=tp_group_handle)
    torch.cuda.synchronize()

    warmup_elapsed = time.perf_counter() - warmup_start
    logger.info(f"NCCL/RCCL warmup completed in {warmup_elapsed:.3f}s ...")
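The warmup pattern itself (first collective initializes the communicator, then synchronize and measure) can be exercised without GPUs. The sketch below is a single-process illustration using the CPU `gloo` backend, not the NCCL/RCCL backend the PR targets; the structure mirrors the snippet above:

```python
import os
import time
import torch
import torch.distributed as dist

# Single-process illustration; MASTER_ADDR/PORT are arbitrary local values.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29512")
dist.init_process_group("gloo", rank=0, world_size=1)

start = time.perf_counter()
warmup_tensor = torch.zeros(1)
dist.all_reduce(warmup_tensor)  # first collective sets up the communicator
elapsed = time.perf_counter() - start
print(f"warmup all_reduce finished in {elapsed:.3f}s")

dist.destroy_process_group()
```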

Accuracy Tests

No accuracy impact expected: this is a latency-only optimization and does not change model outputs.

Validated with GSM8K (100 questions):

  • Without pre-warm: 97.0%
  • With pre-warm: 98.0% (the 1-point difference is within run-to-run noise)

Benchmarking and Profiling

Test Environment

  • Platform: AMD MI355X (8 GPUs)
  • Model: DeepSeek-R1-MXFP4-Preview
  • Configuration: TP=8, 128-token prompts, 16 output tokens

Results

| Configuration    | P99 TTFT | P95 TTFT | Median TTFT | Std Dev | Improvement      |
|------------------|----------|----------|-------------|---------|------------------|
| Without pre-warm | 1413 ms  | 1413 ms  | 210 ms      | 327 ms  | Baseline         |
| With pre-warm    | 357 ms   | 354 ms   | 207 ms      | 40 ms   | 74.9% faster P99 |

Key Findings:

  • P99 TTFT: 74.9% improvement (1413ms → 357ms)
  • Latency stability: 87.8% lower std dev (327ms → 40ms)
  • Warmup overhead: 4.7s one-time cost
  • ROI: Warmup pays for itself after 4-5 requests
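The break-even claim can be checked with quick arithmetic from the numbers above. This treats the P99 delta as the per-request saving for the affected cold requests, which is a simplification:

```python
p99_cold_ms = 1413    # P99 TTFT without pre-warm (from the table above)
p99_warm_ms = 357     # P99 TTFT with pre-warm
warmup_cost_s = 4.7   # one-time warmup overhead

saving_per_request_s = (p99_cold_ms - p99_warm_ms) / 1000  # 1.056 s saved
break_even_requests = warmup_cost_s / saving_per_request_s
print(f"warmup pays for itself after ~{break_even_requests:.1f} requests")
# → ~4.5, consistent with the "4-5 requests" figure
```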

Reproduction Commands

Click to expand

Test without pre-warming:

# Start server (disable via Python API: ServerArgs(pre_warm_nccl=False))
python3 -m sglang.launch_server \
  --model-path /data/DeepSeek-R1-MXFP4-Preview \
  --tp-size 8

# Send requests - first 2-3 will be slow (~1400ms)
for i in {1..20}; do
  time curl -X POST http://127.0.0.1:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 16}}'
done

Test with pre-warming (default for AMD):

# Start server (pre-warm enabled by default on AMD)
python3 -m sglang.launch_server \
  --model-path /data/DeepSeek-R1-MXFP4-Preview \
  --tp-size 8

# Expected log: "NCCL/RCCL warmup completed in 4.561s"

# Send requests - all fast (~300ms)
for i in {1..20}; do
  time curl -X POST http://127.0.0.1:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 16}}'
done

NVIDIA users (pre-warm disabled by default):

# Enable explicitly with --pre-warm-nccl
python3 -m sglang.launch_server \
  --model-path /data/model \
  --tp-size 8 \
  --pre-warm-nccl

Checklist

  • Accuracy validation: No impact on model outputs (GSM8K: 97.0% vs 98.0%)
  • Performance benchmarks: 74.9% P99 TTFT improvement on AMD MI355X
  • Code style: Follows SGLang conventions
  • Unit tests: TODO (warmup is tested via integration)
  • Documentation: Inline comments added
  • CI tests: Pending

Review Process

  1. Ping Merge Oncalls to start PR flow
  2. Get approvals from CODEOWNERS
  3. Trigger CI tests: /tag-run-ci-label, /rerun-failed-ci
  4. After green CI + approvals, merge

Implements NCCL/RCCL communicator pre-warming during server startup to
eliminate cold-start latency (up to 1400ms) for first requests when
using multi-GPU tensor parallelism.

Measured on AMD MI355X:
- P99 TTFT improvement: 74.9% (1413ms → 357ms)
- Latency stability: 87.8% lower std dev (327ms → 40ms)
- Warmup overhead: 4.7s one-time cost (5.2% of model loading)

Changes:
- server_args.py: Add pre_warm_nccl field with platform-aware default
  (enabled for AMD/HIP, disabled for NVIDIA/CUDA until validation)
- server_args.py: Add --pre-warm-nccl CLI argument
- model_runner.py: Implement warmup via single all_reduce operation
  during ModelRunner initialization

Default behavior:
- AMD/HIP: Enabled (validated 74.9% improvement)
- NVIDIA/CUDA: Disabled (pending validation)

Inspired by InstantTensor's implementation which achieved 71%
improvement on NVIDIA GPUs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@HaiShaw (Collaborator) left a comment


@hubertlu-tw
Can you make the change rocm/hip specific (nccl->rccl), or cuda&hip specific (to avoid regression to other platforms).

@hubertlu-tw (Collaborator, Author)

> @hubertlu-tw Can you make the change rocm/hip specific (nccl->rccl), or cuda&hip specific (to avoid regression to other platforms).

@HaiShaw I have modified server_args.py so that --pre-warm-nccl is only applicable to CUDA and HIP, and it defaults to True only on AMD GPUs.

@HaiShaw HaiShaw merged commit 943f34f into sgl-project:main Mar 17, 2026
71 of 91 checks passed
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026