[AMD] WIP - end-to-end rocm.Dockerfile build and PR test workflow by yctseng0211 · Pull Request #21720 · sgl-project/sglang

yctseng0211 · 2026-03-31T02:24:40Z

Summary

Add a new CI workflow (pr-test-amd-dockerfile.yml) that enables end-to-end testing of rocm.Dockerfile changes before merging. Previously, Dockerfile changes could not be properly validated in CI because the test pipeline always used the latest release image.
Changes:

New workflow pr-test-amd-dockerfile.yml: builds temporary Docker images (MI30X + MI35X) from the PR's Dockerfile, then runs the full AMD CI test suite using these images
pr-test-amd.yml: add docker_image_mi35x input so MI30X and MI35X jobs use the correct architecture-specific image; remove docker/rocm.Dockerfile from path triggers to avoid duplicate CI runs
amd_ci_start_container_disagg.sh: add --custom-image argument support (already existed in amd_ci_start_container.sh)

How it works

Trigger conditions

The workflow triggers on PRs that modify docker/rocm.Dockerfile, but only executes when ALL of the following are true:

PR has the rocm-docker label
PR has the run-ci label (enforced by pr-gate.yml in the downstream pr-test-amd.yml)
PR is not a draft

If the rocm-docker label is absent, the workflow skips entirely and falls back to the normal pr-test-amd.yml behavior (which now ignores Dockerfile-only changes, since they can't be tested with the existing image).
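
The trigger conditions above can be read as a single predicate. The following is an illustrative Python model only, not the workflow's actual YAML expression: `PullRequest` and `should_run_dockerfile_workflow` are hypothetical names, and in the real setup the run-ci check is enforced separately by pr-gate.yml inside pr-test-amd.yml.

```python
from dataclasses import dataclass

# Path taken from the PR description.
DOCKERFILE_PATH = "docker/rocm.Dockerfile"


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical, minimal view of a PR used only for this sketch."""
    labels: frozenset
    is_draft: bool
    changed_files: tuple


def should_run_dockerfile_workflow(pr: PullRequest) -> bool:
    """Return True only when every condition listed in the description holds."""
    return (
        DOCKERFILE_PATH in pr.changed_files   # path trigger
        and "rocm-docker" in pr.labels        # opt-in label for this workflow
        and "run-ci" in pr.labels             # enforced downstream by pr-gate.yml
        and not pr.is_draft                   # draft PRs are blocked
    )


ready = PullRequest(
    labels=frozenset({"rocm-docker", "run-ci"}),
    is_draft=False,
    changed_files=("docker/rocm.Dockerfile",),
)
print(should_run_dockerfile_workflow(ready))
```

Dropping the rocm-docker label or marking the PR as a draft makes the predicate false, which corresponds to the "skips entirely" case described above.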

Pipeline

PR with rocm.Dockerfile changes + labels [rocm-docker, run-ci]

build-test-images (amd-docker-scale) 
    |-- Build MI30X image (gfx942) -> rocm/sgl-dev:v0.x.x-rocm700-mi30x-YYYYMMDD_test 
    |-- Build MI35X image (gfx950) -> rocm/sgl-dev:v0.x.x-rocm700-mi35x-YYYYMMDD_test 
run-amd-ci (calls pr-test-amd.yml) 
    |-- MI30X jobs use docker_image 
    |-- MI35X jobs use docker_image_mi35x
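
The tag pattern and the per-architecture input names in the pipeline above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `build_test_image_tag` and the two lookup tables are hypothetical, the version string is passed in because the description only shows the placeholder `v0.x.x`, and the real workflow may compute the tag differently.

```python
from datetime import date

# GPU targets and architectures as named in the pipeline description.
ARCH_BY_TARGET = {"mi30x": "gfx942", "mi35x": "gfx950"}

# Workflow input that each family of jobs reads, per the description.
INPUT_BY_TARGET = {"mi30x": "docker_image", "mi35x": "docker_image_mi35x"}


def build_test_image_tag(version: str, target: str, build_date: date,
                         rocm: str = "rocm700") -> str:
    """Build a tag shaped like rocm/sgl-dev:<version>-rocm700-<target>-YYYYMMDD_test."""
    if target not in ARCH_BY_TARGET:
        raise ValueError(f"unknown target: {target!r}")
    return f"rocm/sgl-dev:{version}-{rocm}-{target}-{build_date:%Y%m%d}_test"


# "v0.0.0" is a made-up version used only to exercise the function.
for target in ("mi30x", "mi35x"):
    tag = build_test_image_tag("v0.0.0", target, date(2026, 3, 31))
    print(INPUT_BY_TARGET[target], "=", tag, f"({ARCH_BY_TARGET[target]})")
```

Keeping the target-to-input mapping in one table mirrors the point of the new docker_image_mi35x input: MI35X jobs must not silently reuse the MI30X image.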

Usage

Create a PR that modifies docker/rocm.Dockerfile
Add label rocm-docker (and run-ci as usual)
The workflow builds both MI30X and MI35X images, then runs the full test suite
Can also be triggered manually via workflow_dispatch

No duplicate runs

docker/rocm.Dockerfile has been removed from pr-test-amd.yml's path triggers, so only one workflow runs at a time depending on the labels present.

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist · 2026-03-31T02:24:45Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

yctseng0211 · 2026-04-01T09:38:06Z

@amd-bot ci-status

amd-bot · 2026-04-01T09:43:45Z

@yctseng0211

There are two separate workflow contexts:

Run ID 23790604664 — PR Test (AMD) (existing workflow pr-test-amd.yml, triggered directly) — uses nightly images
Run ID 23790604853 — PR Test AMD Dockerfile (new workflow pr-test-amd-dockerfile.yml, introduced by this PR) — uses PR-built images with aiter v0.1.12

The failures are:

From run 23790604664: stage-a-test-1-gpu-small-amd (server startup timeout) + pr-test-amd-finish (gate)
From run 23790604853: massive failures across run-amd-ci / stage-b-* jobs (aiter SIGABRT crashes, CK FMHA assertion failures, OOM kills, perf regressions)

CI Status for PR #21720

PR: [AMD] WIP - end-to-end rocm.Dockerfile build and PR test workflow

Changed files: .github/workflows/pr-test-amd-dockerfile.yml (new, +143), .github/workflows/pr-test-amd.yml (+60/-2), docker/rocm.Dockerfile (+4/-4), scripts/ci/amd/amd_ci_start_container_disagg.sh (+14/-3)

What this PR does

Adds a new workflow pr-test-amd-dockerfile.yml that builds custom MI30X/MI35X Docker images from docker/rocm.Dockerfile and then calls pr-test-amd.yml with those custom images
Modifies pr-test-amd.yml to accept docker_image / docker_image_mi35x inputs and passes --custom-image to the container start scripts
Bumps aiter from v0.1.11.post1 → v0.1.12 in docker/rocm.Dockerfile
Removes docker/rocm.Dockerfile from the paths trigger of pr-test-amd.yml (moves it to the new dedicated workflow)

Two separate CI runs

| Run | Workflow | Images Used | Conclusion |
| --- | --- | --- | --- |
| 23790604664 | PR Test (AMD), direct trigger | Nightly images (pre-built, unchanged) | Failure (1 job) |
| 23790604853 | PR Test AMD Dockerfile, new workflow | PR-built images (aiter v0.1.12) | Failure (many jobs) |

Failure Summary Table

| Job | Error | Related? | Explanation | Log |
| --- | --- | --- | --- | --- |
| `stage-a-test-1-gpu-small-amd` (run 23790604664) | `TimeoutError: Server failed to start within the timeout period` | 🟢 Unlikely | Uses nightly image, not PR-built. Server startup timeout is an infra/flaky issue. No code changed by this PR runs in this path. | Log |
| `pr-test-amd-finish` (run 23790604664) | Gate job: upstream `stage-a-test-1-gpu-small-amd: failure` | 🟢 Unlikely | Propagated failure from above. | Log |
| `run-amd-ci / stage-b-test-1-gpu-small-amd` shards 1,3,4,5,6,7,8,10,13 | `Subprocess scheduler_0 crashed with exit code -6` (SIGABRT) in aiter `_mha_batch_prefill` → `aiter_backend.py:2365` | 🔴 Likely | All 9 shards crash in the aiter MHA prefill kernel. These jobs use the PR-built Docker image containing aiter v0.1.12 (bumped from v0.1.11.post1 by this PR). The crash is a C++ assertion failure inside the Composable Kernel FMHA at `fmha_batch_prefill_kernel.hpp:661` ("KV cache K offset overflow: exceed int32 max"). | Shard 1, Shard 3, Shard 5, etc. |
| `run-amd-ci / stage-b-test-1-gpu-small-amd` shards 9,11 | Perf regressions: VLM throughput 1117 < 2000, TTFT 88.9ms > 86ms | 🔴 Likely | Running on PR-built image with aiter v0.1.12. Performance regression in the new aiter version. | Shard 9, Shard 11 |
| `run-amd-ci / stage-b-test-1-gpu-large-amd` shard 1 | CK FMHA assertion crash + benchmark parse failure, timeout | 🔴 Likely | Same CK FMHA int32 overflow assertion in PR-built image. | Log |
| `run-amd-ci / stage-b-test-2-gpu-large-amd` shards 0,1 | Server OOM-killed (-9) / CK FMHA assertion crash | 🔴 Likely | Same pattern with PR-built image. Shard 1 has the explicit CK FMHA assertion. | Shard 0, Shard 1 |
| `run-amd-ci / stage-b-test-1-gpu-small-amd-mi35x` | GPT-OSS 20B server crash (FMHA assertion then OOM-killed) | 🔴 Likely | MI35X PR-built image with aiter v0.1.12. Same crash pattern. | Log |
| `run-amd-ci / stage-b-test-large-8-gpu-35x-disaggregation-amd` | PP disaggregation server OOM-killed (-9) | 🔴 Likely | MI35X PR-built image. Server crash during disaggregated PP serving. | Log |
| `run-amd-ci / pr-test-amd-finish` (run 23790604853) | Gate job: multiple upstream failures | 🔴 Likely | Propagated from all the above failures. | Log |
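
The negative exit codes in the table follow the POSIX convention used by Python's `subprocess` module, where a return code of -N means the process was terminated by signal N. A small sketch for decoding them, assuming a POSIX host (the helper name `describe_exit_code` is illustrative and not part of the CI scripts):

```python
import signal


def describe_exit_code(returncode: int) -> str:
    """Map a subprocess-style return code to a signal name or plain exit status."""
    if returncode < 0:
        # Negative values mean "killed by signal -returncode" on POSIX.
        return signal.Signals(-returncode).name
    return f"exit status {returncode}"


for code in (-6, -9, 1):
    print(code, "->", describe_exit_code(code))
```

Read this way, -6 is the abort raised by a failed C++ assertion, while -9 is a forced kill, which is consistent with the OOM-kill interpretation in the table.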

Details

🔴 Aiter v0.1.12 crash — all run-amd-ci failures

The dominant failure across ~15 jobs is a crash in the aiter Composable Kernel FMHA batch prefill kernel. The crash manifests as:

python3: .../composable_kernel/include/ck_tile/ops/fmha/fmha_batch_prefill_kernel.hpp:661:
Assertion `static_cast<int64_t>(num_total_pages - 1) * batch_stride_k <=
  static_cast<int64_t>(std::numeric_limits<index_t>::max()) &&
  "KV cache K offset overflow: exceed int32 max"' failed.

This assertion triggers SIGABRT (exit code -6) in scheduler_0, killing the SGLang server. The call path is: aiter/ops/mha.py:_mha_batch_prefill → aiter_backend.py:2365 forward_extend → radix_attention.py → model forward.

Key evidence: These failures occur exclusively in the run-amd-ci jobs (run 23790604853), which use the PR-built Docker images containing aiter v0.1.12. The directly-triggered PR Test (AMD) (run 23790604664) uses nightly images with the old aiter version and does NOT exhibit these crashes — its only failure is a server startup timeout, a different and unrelated issue.

This strongly indicates the aiter v0.1.11.post1 → v0.1.12 bump introduced a regression in the CK FMHA kernel that causes int32 overflow assertions on the KV cache offset calculation.
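
To make the failing condition concrete, the asserted inequality can be restated in Python. This is only a sketch of the arithmetic in the assertion message: the function names are hypothetical, the example stride is made up, and whether `batch_stride_k` is measured in elements or bytes is not stated in the logs quoted here.

```python
INT32_MAX = 2**31 - 1  # std::numeric_limits<int32_t>::max()


def k_offset_fits_int32(num_total_pages: int, batch_stride_k: int) -> bool:
    """Mirror of the asserted condition: the last page's K offset must fit in int32."""
    return (num_total_pages - 1) * batch_stride_k <= INT32_MAX


def max_pages_for_stride(batch_stride_k: int) -> int:
    """Largest page count that still satisfies the condition for a positive stride."""
    if batch_stride_k <= 0:
        raise ValueError("batch_stride_k must be positive")
    return INT32_MAX // batch_stride_k + 1


# Hypothetical stride: 16 tokens per page * 8 KV heads * head dim 128.
stride = 16 * 8 * 128
limit = max_pages_for_stride(stride)
print(limit, k_offset_fits_int32(limit, stride), k_offset_fits_int32(limit + 1, stride))
```

Under these made-up numbers the KV cache can hold at most 131072 pages before the check fails, so large page pools are enough to trip the assertion without any single request being large.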

🟢 stage-a-test-1-gpu-small-amd startup timeout (run 23790604664)

This failure uses the nightly image (not PR-built). The error is a generic server startup timeout (TimeoutError: Server failed to start within the timeout period), which is a common infra/flaky issue on AMD runners. No code changed by this PR is involved in this code path.

Overall Assessment

The PR's workflow infrastructure changes (new workflow, `--custom-image` plumbing) work correctly — the Docker images build and push successfully. However, the aiter version bump to v0.1.12 in `docker/rocm.Dockerfile` introduces a critical regression: the new aiter CK FMHA batch prefill kernel crashes with an int32 overflow assertion across multiple models and GPU configurations (MI325X and MI35X). This needs to be resolved before merging — either by fixing the aiter issue upstream, pinning back to `v0.1.11.post1`, or coordinating with the aiter team on a compatible version.

Generated by amd-bot using Claude Code CLI

e2e amd docker test

4279b17

github-actions Bot added the amd label Mar 31, 2026

Remove rocm.Dockerfile from pr-test-amd.yml path triggers to avoid duplicate CI runs

fd1d6cb

yctseng0211 added rocm-docker run-ci labels Mar 31, 2026

test upgrade aiter

2e649cc

yctseng0211 had a problem deploying to prod March 31, 2026 07:22 — with GitHub Actions Error

block draft pr

4d65cdb

yctseng0211 marked this pull request as ready for review March 31, 2026 07:27

yctseng0211 requested review from Fridge003, HaiShaw, Kangyan-Zhou, bingxche, ishandhanani, ispobock and merrymercy as code owners March 31, 2026 07:27

Merge branch 'main' into amd/e2e_amd_docker

715d75c

yctseng0211 had a problem deploying to prod March 31, 2026 07:27 — with GitHub Actions Error

gpu arch dispatch

8c8ecf5

yctseng0211 had a problem deploying to prod March 31, 2026 07:37 — with GitHub Actions Error

job parallel

7924461

yctseng0211 temporarily deployed to prod March 31, 2026 09:35 — with GitHub Actions Inactive

yctseng0211 changed the title ~~[AMD] WIP - end2end rocm docker pr test~~ [AMD] WIP - end-to-end rocm.Dockerfile build and PR test workflow Mar 31, 2026

HaiShaw mentioned this pull request Apr 22, 2026

AMD Development Roadmap (2026 Q2) #23494

Open

yctseng0211 wants to merge 7 commits into main from amd/e2e_amd_docker
