[AMD] Clear pre-built AITER kernels and warmup to prevent segfaults and test timeouts #15318

Merged
HaiShaw merged 7 commits into sgl-project:main from sunxxuns:test-amd-ci
Dec 18, 2025

@sunxxuns
Collaborator

@sunxxuns sunxxuns commented Dec 17, 2025

Summary

This PR fixes multiple AMD CI issues related to AITER kernel compatibility:

1. AITER Kernel Segfault Fix

  • Clear pre-built AITER kernels from the Docker image that may be incompatible with the current environment
  • These stale kernels were causing segfaults when imported
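The clearing step can be pictured roughly as follows. This is a minimal sketch, not the code in the PR: the directory layout (compiled `*.so` modules plus a `build` subdirectory) and the function name `clear_prebuilt_kernels` are assumptions made for illustration, and the demo runs against a temporary directory rather than a real AITER install.

```python
import shutil
import tempfile
from pathlib import Path


def clear_prebuilt_kernels(jit_dir: Path) -> list[str]:
    """Delete compiled kernel artifacts under jit_dir and return what was removed."""
    removed = []
    if not jit_dir.is_dir():
        return removed
    # Compiled extension modules (assumed to be *.so files in this sketch).
    for so_file in sorted(jit_dir.glob("*.so")):
        so_file.unlink()
        removed.append(so_file.name)
    # Intermediate build tree (assumed to live in a "build" subdirectory).
    build_dir = jit_dir / "build"
    if build_dir.is_dir():
        shutil.rmtree(build_dir)
        removed.append("build/")
    return removed


if __name__ == "__main__":
    # Demo on a throwaway directory; the path of a real install is not assumed here.
    with tempfile.TemporaryDirectory() as tmp:
        jit_dir = Path(tmp)
        (jit_dir / "module_rmsnorm.so").write_bytes(b"")
        (jit_dir / "build").mkdir()
        print(clear_prebuilt_kernels(jit_dir))
```

Removing only build artifacts, and leaving source files in place, lets the JIT rebuild each kernel against the environment the tests actually run in.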

2. AITER Kernel Timeout Fix

  • Add warmup script (scripts/ci/amd_ci_warmup_aiter.py) to pre-build commonly used AITER JIT kernels (RMSNorm, rotary embedding, activation)
  • Run the warmup in amd_ci_install_dependency.sh before tests so they do not time out waiting for JIT compilation (~156s for module_rmsnorm alone)
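The idea behind the warmup can be sketched as a small harness that calls each kernel once and records how long the first call took. This is an illustrative sketch only: the `warmup` function and the placeholder triggers are hypothetical, and the actual contents of amd_ci_warmup_aiter.py are not shown in this thread.

```python
import time
from typing import Callable, Dict


def warmup(kernels: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Invoke each kernel once so JIT compilation happens up front.

    Returns elapsed seconds per kernel, or -1.0 if the trigger raised.
    """
    timings = {}
    for name, trigger in kernels.items():
        start = time.perf_counter()
        try:
            trigger()
            timings[name] = time.perf_counter() - start
            print(f"[warmup] {name}: {timings[name]:.1f}s")
        except Exception as exc:
            # Keep going so one failing kernel does not block the others.
            timings[name] = -1.0
            print(f"[warmup] {name} failed: {exc!r}")
    return timings


if __name__ == "__main__":
    # Placeholder triggers; real ones would run each AITER op on a small input.
    warmup({
        "rmsnorm": lambda: None,
        "rotary_embedding": lambda: None,
        "activation": lambda: None,
    })
```

Running this once during dependency installation means the compile cost is paid before any test timer starts.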

3. TorchAO Test Skip

  • Skip TestTransformersFallbackTorchAO test on AMD GPUs
  • TorchAO's _convert_weight_to_int4pack_cuda requires CDNA2+ and is not supported on current AMD hardware
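A skip of this kind is commonly written with `unittest.skipIf`. The sketch below assumes a local `is_hip` helper based on `torch.version.hip` (which is `None` on non-ROCm builds); the helper and the placeholder test body are illustrative and may differ from what the PR uses.

```python
import unittest


def is_hip() -> bool:
    """Return True when PyTorch is a ROCm/HIP build, False otherwise."""
    try:
        import torch
    except ImportError:
        # Without torch there is no GPU backend to detect.
        return False
    return getattr(torch.version, "hip", None) is not None


@unittest.skipIf(is_hip(), "TorchAO int4 weight packing is not supported on AMD GPUs")
class TestTransformersFallbackTorchAO(unittest.TestCase):
    def test_placeholder(self):
        # Stand-in body; the real test exercises the TorchAO fallback path.
        self.assertTrue(True)
```

Decorating the class rather than individual methods skips the expensive setup as well, so no model is loaded on AMD runners.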

4. test_rope_rocm.py Tolerance Fix

  • Relax tolerance from 1e-2 to 2e-2 in rotary embedding test
  • Test was failing with marginal differences (0.0146 vs 0.01) on 1 element out of 4 million due to HIP floating-point precision variations
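The effect of the tolerance change can be illustrated with a plain absolute-difference check. This sketch assumes the tolerance is an absolute bound on the largest element-wise difference; the real test may combine absolute and relative tolerances, and the sample values are made up to reproduce a 0.0146 outlier.

```python
def max_abs_diff(actual, expected):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(a - e) for a, e in zip(actual, expected))


def within_tolerance(actual, expected, atol):
    """True if every element of actual is within atol of expected."""
    return max_abs_diff(actual, expected) <= atol


if __name__ == "__main__":
    expected = [0.0, 0.5, 1.0]
    actual = [0.0, 0.5, 1.0146]  # a single element is off by about 0.0146
    print(within_tolerance(actual, expected, atol=1e-2))  # False
    print(within_tolerance(actual, expected, atol=2e-2))  # True
```

A single outlier is enough to fail a max-difference check, which is why one element in 4 million was sufficient to break the test at 1e-2.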

Test Plan

  • All AMD CI tests pass:
    • stage-a-test-1-amd ✓
    • unit-test-backend-1-gpu-amd (all 12 partitions) ✓
    • unit-test-backend-2-gpu-amd (all partitions) ✓
    • accuracy-test-1-gpu-amd ✓
    • performance-test-1-gpu-part-1-amd ✓
    • performance-test-1-gpu-part-2-amd ✓
    • performance-test-2-gpu-amd ✓
    • pr-test-amd-finish ✓

@github-actions github-actions Bot added the amd label Dec 17, 2025
@sunxxuns sunxxuns force-pushed the test-amd-ci branch 5 times, most recently from 0610b5f to 522d8b5 Compare December 18, 2025 09:32
@sunxxuns sunxxuns changed the title [CI] Test AMD CI on main branch [AMD] Clear pre-built AITER kernels and warmup to prevent segfaults and test timeouts Dec 18, 2025
@sunxxuns sunxxuns force-pushed the test-amd-ci branch 4 times, most recently from 1ca18d8 to fca5278 Compare December 18, 2025 10:46
@sunxxuns
Collaborator Author

CI Status Update

AMD CI ✅ ALL PASSING

All AMD CI tests are now passing:

  • pr-test-amd-finish: ✅
  • stage-a-test-1-amd: ✅
  • unit-test-backend-1-gpu-amd (all 12 partitions): ✅
  • unit-test-backend-2-gpu-amd: ✅
  • accuracy-test-1-gpu-amd: ✅
  • performance-test-1-gpu-part-1-amd: ✅
  • performance-test-1-gpu-part-2-amd: ✅
  • performance-test-2-gpu-amd: ✅

NVIDIA CI ❌ Infrastructure Issue (Unrelated)

The NVIDIA CI tests are failing with "python3: command not found" in the install dependency script. This is a runner infrastructure issue unrelated to this PR, which only modifies AMD CI workflows.

This PR is ready for review/merge as all AMD CI tests pass.

@sunxxuns sunxxuns force-pushed the test-amd-ci branch 2 times, most recently from 884d826 to b51e85f Compare December 18, 2025 14:51
…nd test timeouts

The Docker image contains pre-compiled AITER kernels that may be incompatible
with the current environment, causing segfaults when imported. After clearing
these kernels, they need to be rebuilt at runtime (~156s for module_rmsnorm
alone), which causes test timeouts.

This fix:
1. Clears pre-built AITER kernels from the Docker image
2. Pre-builds commonly used AITER kernels (RMSNorm, rotary embedding, activation)
   during dependency installation, before any tests run

The warmup is done once in amd_ci_install_dependency.sh, which is shared across
all CI jobs, rather than in each individual workflow step.

Changes:
- Add scripts/ci/amd_ci_warmup_aiter.py to trigger JIT compilation
- Update scripts/ci/amd_ci_install_dependency.sh to clear and warmup AITER kernels

The previous commit cleared pre-built AITER kernels to avoid segfaults
from incompatible kernels in the Docker image. However, this caused
test timeouts because AITER kernels needed to be rebuilt at runtime
(~156s for module_rmsnorm alone).

This fix adds a warmup step that pre-builds commonly used AITER kernels
after clearing, so tests don't timeout waiting for JIT compilation.

Changes:
- Add scripts/ci/amd_ci_warmup_aiter.py to trigger JIT compilation
- Update workflow to run warmup after clearing AITER kernels
- Add timeout-minutes: 10 for the warmup step
@github-actions github-actions Bot added the documentation, quant, dependencies, lora, Multi-modal, deepseek, sgl-kernel, blackwell, diffusion, and model-gateway labels Dec 18, 2025
Remove the temporary disable flag to allow the 8-GPU test job to run.
@HaiShaw
Collaborator

HaiShaw commented Dec 18, 2025

@sunxxuns skip torchao one seems appropriate.

@HaiShaw
Collaborator

HaiShaw commented Dec 18, 2025

Merging this now, but @yctseng0211 to root-cause and provide an update.

@HaiShaw HaiShaw merged commit e0963a6 into sgl-project:main Dec 18, 2025
84 of 87 checks passed
xiaobaicxy added a commit to xiaobaicxy/sglang that referenced this pull request Dec 19, 2025
Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 23, 2025
jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026

Labels

amd, blackwell (SM100/SM120), deepseek, dependencies (Pull requests that update a dependency file), diffusion (SGLang Diffusion), documentation (Improvements or additions to documentation), lora, model-gateway, Multi-modal (multi-modal language model), quant (LLM Quantization), run-ci, sgl-kernel
