Release SGLang Server AL2023 DLC#6179
Merged
Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>
FP8 models (qwen3.5-35b-a3b-fp8, qwen3-coder-next-fp8) require fp8e4nv which is only supported on Hopper (sm_90+). The gpu-l40s-4gpu-runners label doesn't exist, causing fallback to gpu-efa-runners (A100 sm_80). LLaMA 3.3 70B OOMs on A100 runners. Move all three to gpu-h100-8gpu-runners with tp=8 and appropriate memory settings.

Add CVE-2026-42504 to security allowlist: go/stdlib MIME header CPU exhaustion in mooncake libetcd_wrapper.so, same root cause as existing Go stdlib entries.
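The hardware constraint behind the runner move can be sketched as a small capability check. This is only an illustration: the sm_90 threshold is taken from the commit message above, and `supports_fp8e4nv` and `KNOWN_GPUS` are hypothetical names, not part of this repository. On a real host the capability tuple could come from `torch.cuda.get_device_capability()`.

```python
from typing import Dict, Tuple

# Assumption from the commit message: fp8e4nv requires Hopper (sm_90) or newer.
MIN_FP8_CAPABILITY: Tuple[int, int] = (9, 0)


def supports_fp8e4nv(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability meets the threshold."""
    return tuple(capability) >= MIN_FP8_CAPABILITY


# Hypothetical lookup table for the two GPU types named in the commit message.
KNOWN_GPUS: Dict[str, Tuple[int, int]] = {
    "A100": (8, 0),  # sm_80, the gpu-efa-runners fallback
    "H100": (9, 0),  # sm_90, the gpu-h100-8gpu-runners target
}

for name, cap in KNOWN_GPUS.items():
    print(f"{name} (sm_{cap[0]}{cap[1]}): fp8e4nv ok = {supports_fp8e4nv(cap)}")
```

A check like this could run before server launch so that an FP8 job fails fast on an unsuitable runner instead of failing during model load.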
PyTorch 2.11.0+cu130 bundles an older nvidia-cutlass-dsl that has incompatible MLIR bindings with FlashInfer 0.6.11.post1's rmsnorm_cute kernel. Force-reinstall cutlass-dsl>=4.5.2 after torch re-pin to ensure compatible GPUModuleOp API during CUDA graph capture. Upstream SGLang applies the same fix (sgl-project/sglang#25958).
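A post-install sanity check for the pin described above might look like the following sketch. It assumes the distribution is published as `nvidia-cutlass-dsl` (the name used in the commit message) and uses only the standard library; `release_tuple`, `meets_minimum`, and `installed_version` are hypothetical helpers. The comparison looks only at the leading numeric release components, so pre-release and dev suffixes are ignored.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Extract leading numeric components, e.g. '4.5.2.post1' -> (4, 5, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release tuples, padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("nvidia-cutlass-dsl")
if found is None:
    print("nvidia-cutlass-dsl is not installed")
else:
    print(f"nvidia-cutlass-dsl {found}: ok = {meets_minimum(found, '4.5.2')}")
```

Running such a check after the torch re-pin would reveal whether a later install step had silently downgraded the package again.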
Benchmark run 27228675384 surfaced three distinct failures:
- qwen3.5-35b-a3b-fp8 / qwen3-coder-next-fp8: tp=8 shards the FP8 MoE
gate/up output_size to 64, which is not divisible by block_n=128
("output_size ... not divisible by weight quantization block_n=128").
Revert to tp=4 — the intended sharding for these FP8 models.
- qwen3-32b: shared gpu-efa-runners pod had a leftover process holding
port 8000 ("address already in use" -> warmup timeout). Move to a
dedicated gpu-h100-8gpu-runners pod to avoid the collision.
llama-3.3-70b stays at tp=8 (dense model, no block-quant constraint,
needs the memory headroom).
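The divisibility constraint in the first failure can be illustrated with a few lines of arithmetic. This is a sketch, not SGLang's actual validation code: the unsharded size of 512 is inferred from the message (64 per rank at tp=8) and is not stated in the source, and `shard_size` and `is_block_aligned` are hypothetical names.

```python
def shard_size(output_size: int, tp: int) -> int:
    """Per-rank output size when a dimension is split evenly across tp ranks."""
    if output_size % tp != 0:
        raise ValueError(f"output_size {output_size} not divisible by tp={tp}")
    return output_size // tp


def is_block_aligned(output_size: int, tp: int, block_n: int = 128) -> bool:
    """True if each rank's shard is a whole number of quantization blocks."""
    return shard_size(output_size, tp) % block_n == 0


# Inferred, not from the source: 64 per rank * tp=8 gives 512 unsharded.
FULL_OUTPUT_SIZE = 512

for tp in (4, 8):
    per_rank = shard_size(FULL_OUTPUT_SIZE, tp)
    ok = is_block_aligned(FULL_OUTPUT_SIZE, tp)
    print(f"tp={tp}: per-rank size {per_rank}, aligned to block_n=128: {ok}")
```

Under that inferred size, tp=4 yields a per-rank size of 128, which matches one quantization block, while tp=8 yields 64 and fails the check.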
… CUDA graph

All gpu-h100-8gpu-runners benchmark jobs failed at server startup with '[Errno 98] address already in use' on port 8000; port 8000 is occupied on those pods. Remove the SGLANG_PORT=8000 override from the five GPU models so they use the SGLang default (30000), matching the x86 jobs that already pass.

Also add --disable-piecewise-cuda-graph to qwen3-32b: it crashed during warmup_compile with 'FusedAddRMSNorm ... illegal memory access' while capturing the experimental piecewise CUDA graph (same workaround as llama-3.3-70b).
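The port collision described above could be detected before launch with a bind probe. The sketch below uses only the standard `socket` module; `port_is_free` and `pick_port` are hypothetical helpers, and the candidate ports 8000 and 30000 are the two values mentioned in the commit message. A probe like this is inherently racy, since another process can take the port between the check and the server start.

```python
import socket
from typing import Iterable, Optional


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Try to bind the port; a failed bind means something already holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def pick_port(candidates: Iterable[int], host: str = "127.0.0.1") -> Optional[int]:
    """Return the first candidate port that can be bound, or None."""
    for port in candidates:
        if port_is_free(port, host):
            return port
    return None


# Ports taken from the commit message: the old override and the SGLang default.
print("first free port:", pick_port([8000, 30000]))
```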
Jyothirmaikottu approved these changes on Jun 11, 2026
Purpose
Test Plan
Test Result
Toggle if you are merging into master Branch
By default, docker image builds and tests are disabled. Two ways to run builds and tests:
How to use the helper utility for updating dlc_developer_config.toml
Assuming your remote is called origin (you can find out more with git remote -v)...
python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -cp origin
python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -t sanity_tests -cp origin
python src/prepare_dlc_dev_environment.py -rcp origin
NOTE: If you are creating a PR for a new framework version, please ensure success of the local, standard, rc, and efa sagemaker tests by updating the dlc_developer_config.toml file:
sagemaker_remote_tests = true
sagemaker_efa_tests = true
sagemaker_rc_tests = true
sagemaker_local_tests = true
How to use PR description
Use the code block below to uncomment commands and run the PR CodeBuild jobs. There are two commands available:
# /buildspec <buildspec_path>
# /buildspec pytorch/training/buildspec.yml
# /tests <test_list>
# /tests sanity security ec2
sanity, security, ec2, ecs, eks, sagemaker, sagemaker-local.
Toggle if you are merging into main Branch
PR Checklist
Run pre-commit run --all-files locally before creating this PR. (Read DEVELOPMENT.md for details).