[AMD] WIP - end-to-end rocm.Dockerfile build and PR test workflow#21720

Open
yctseng0211 wants to merge 7 commits into main from amd/e2e_amd_docker

@yctseng0211 yctseng0211 commented Mar 31, 2026

Summary

Add a new CI workflow (pr-test-amd-dockerfile.yml) that enables end-to-end testing of rocm.Dockerfile changes before merging. Previously, Dockerfile changes could not be properly validated in CI because the test pipeline always used the latest release image.
Changes:

  • New workflow pr-test-amd-dockerfile.yml: builds temporary Docker images (MI30X + MI35X) from the PR's Dockerfile, then runs the full AMD CI test suite using these images
  • pr-test-amd.yml: add docker_image_mi35x input so MI30X and MI35X jobs use the correct architecture-specific image; remove docker/rocm.Dockerfile from path triggers to avoid duplicate CI runs
  • amd_ci_start_container_disagg.sh: add --custom-image argument support (already existed in amd_ci_start_container.sh)

How it works

Trigger conditions

The workflow triggers on PRs that modify docker/rocm.Dockerfile, but only executes when ALL of the following are true:

  1. PR has the rocm-docker label
  2. PR has the run-ci label (enforced by pr-gate.yml in the downstream pr-test-amd.yml)
  3. PR is not a draft

If the rocm-docker label is absent, the workflow skips entirely and falls back to the normal pr-test-amd.yml behavior (which now ignores Dockerfile-only changes, since they can't be tested with the existing image).
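
The three trigger conditions above amount to a single predicate. The following is a minimal Python sketch for illustration only; the real gating lives in GitHub Actions YAML (with the run-ci check enforced downstream by pr-gate.yml), and the function name and parameters here are hypothetical.

```python
from typing import Iterable

# Labels that must both be present, per the trigger conditions above.
REQUIRED_LABELS = frozenset({"rocm-docker", "run-ci"})


def should_run_dockerfile_workflow(labels: Iterable[str], is_draft: bool) -> bool:
    """Return True only when every trigger condition holds.

    Hypothetical helper: it collapses the label checks and the draft
    check into one expression, whereas the actual workflow splits them
    between pr-test-amd-dockerfile.yml and pr-gate.yml.
    """
    return REQUIRED_LABELS <= set(labels) and not is_draft
```

A PR labeled only run-ci therefore returns False here, which corresponds to the fallback to the normal pr-test-amd.yml behavior.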

Pipeline

PR with rocm.Dockerfile changes + labels [rocm-docker, run-ci]

build-test-images (amd-docker-scale) 
    |-- Build MI30X image (gfx942) -> rocm/sgl-dev:v0.x.x-rocm700-mi30x-YYYYMMDD_test 
    |-- Build MI35X image (gfx950) -> rocm/sgl-dev:v0.x.x-rocm700-mi35x-YYYYMMDD_test 
run-amd-ci (calls pr-test-amd.yml) 
    |-- MI30X jobs use docker_image 
    |-- MI35X jobs use docker_image_mi35x
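
The tag scheme and the per-architecture image selection in the pipeline above can be sketched as follows. This is an illustrative Python sketch, not code from the PR: the helper names are hypothetical, the `rocm700` default is read off the example tags, and the version string is left as the `v0.x.x` placeholder used in the description.

```python
from datetime import date

# GPU targets and their architectures, as listed in the pipeline above.
GFX_ARCH = {"mi30x": "gfx942", "mi35x": "gfx950"}


def build_test_image_tag(version: str, target: str, build_date: date,
                         rocm: str = "rocm700") -> str:
    """Format a temporary image tag following the pattern shown above."""
    if target not in GFX_ARCH:
        raise ValueError(f"unknown target: {target!r}")
    return f"rocm/sgl-dev:{version}-{rocm}-{target}-{build_date:%Y%m%d}_test"


def image_for_job(target: str, docker_image: str, docker_image_mi35x: str) -> str:
    """MI35X jobs take docker_image_mi35x; MI30X jobs take docker_image."""
    return docker_image_mi35x if target == "mi35x" else docker_image


# Illustrative inputs only: the date and version are placeholders.
tags = {t: build_test_image_tag("v0.x.x", t, date(2026, 3, 31)) for t in GFX_ARCH}
print(image_for_job("mi35x", tags["mi30x"], tags["mi35x"]))
```

Passing two separate image inputs keeps each job on an image built for its own architecture, which is the reason the description gives for adding docker_image_mi35x.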

Usage

  1. Create a PR that modifies docker/rocm.Dockerfile
  2. Add label rocm-docker (and run-ci as usual)
  3. The workflow builds both MI30X and MI35X images, then runs the full test suite
  4. Can also be triggered manually via workflow_dispatch

No duplicate runs

docker/rocm.Dockerfile has been removed from pr-test-amd.yml's path triggers, so only one workflow runs at a time depending on the labels present.

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the amd label Mar 31, 2026
@yctseng0211 yctseng0211 changed the title from "[AMD] WIP - end2end rocm docker pr test" to "[AMD] WIP - end-to-end rocm.Dockerfile build and PR test workflow" Mar 31, 2026
@yctseng0211
Collaborator Author

@amd-bot ci-status

amd-bot commented Apr 1, 2026

@yctseng0211

Now I have a clear picture. Let me separate the two workflow contexts:

  1. Run ID 23790604664: PR Test (AMD) (existing workflow pr-test-amd.yml, triggered directly), uses nightly images
  2. Run ID 23790604853: PR Test AMD Dockerfile (new workflow pr-test-amd-dockerfile.yml, introduced by this PR), uses PR-built images with aiter v0.1.12

The failures are:

  • From run 23790604664: stage-a-test-1-gpu-small-amd (server startup timeout) + pr-test-amd-finish (gate)
  • From run 23790604853: massive failures across run-amd-ci / stage-b-* jobs (aiter SIGABRT crashes, CK FMHA assertion failures, OOM kills, perf regressions)

Now let me compile the final report.

CI Status for PR #21720

PR: [AMD] WIP - end-to-end rocm.Dockerfile build and PR test workflow

Changed files: .github/workflows/pr-test-amd-dockerfile.yml (new, +143), .github/workflows/pr-test-amd.yml (+60/-2), docker/rocm.Dockerfile (+4/-4), scripts/ci/amd/amd_ci_start_container_disagg.sh (+14/-3)

What this PR does

  1. Adds a new workflow pr-test-amd-dockerfile.yml that builds custom MI30X/MI35X Docker images from docker/rocm.Dockerfile and then calls pr-test-amd.yml with those custom images
  2. Modifies pr-test-amd.yml to accept docker_image / docker_image_mi35x inputs and passes --custom-image to the container start scripts
  3. Bumps aiter from v0.1.11.post1 → v0.1.12 in docker/rocm.Dockerfile
  4. Removes docker/rocm.Dockerfile from the paths trigger of pr-test-amd.yml (moves it to the new dedicated workflow)

Two separate CI runs

| Run | Workflow | Images Used | Conclusion |
|---|---|---|---|
| 23790604664 | PR Test (AMD), direct trigger | Nightly images (pre-built, unchanged) | Failure (1 job) |
| 23790604853 | PR Test AMD Dockerfile, new workflow | PR-built images (aiter v0.1.12) | Failure (many jobs) |

Failure Summary Table

| Job | Error | Related? | Explanation | Log |
|---|---|---|---|---|
| stage-a-test-1-gpu-small-amd (run 23790604664) | TimeoutError: Server failed to start within the timeout period | 🟢 Unlikely | Uses nightly image, not PR-built. Server startup timeout is an infra/flaky issue. No code changed by this PR runs in this path. | Log |
| pr-test-amd-finish (run 23790604664) | Gate job: upstream stage-a-test-1-gpu-small-amd: failure | 🟢 Unlikely | Propagated failure from above. | Log |
| run-amd-ci / stage-b-test-1-gpu-small-amd shards 1,3,4,5,6,7,8,10,13 | Subprocess scheduler_0 crashed with exit code -6 (SIGABRT) in aiter _mha_batch_prefill → aiter_backend.py:2365 | 🔴 Likely | All 9 shards crash in the aiter MHA prefill kernel. These jobs use the PR-built Docker image containing aiter v0.1.12 (bumped from v0.1.11.post1 by this PR). The crash is a C++ assertion failure inside the Composable Kernel FMHA at fmha_batch_prefill_kernel.hpp:661 ("KV cache K offset overflow: exceed int32 max"). | Shard 1, Shard 3, Shard 5, etc. |
| run-amd-ci / stage-b-test-1-gpu-small-amd shards 9,11 | Perf regressions: VLM throughput 1117 < 2000, TTFT 88.9ms > 86ms | 🔴 Likely | Running on PR-built image with aiter v0.1.12. Performance regression in the new aiter version. | Shard 9, Shard 11 |
| run-amd-ci / stage-b-test-1-gpu-large-amd shard 1 | CK FMHA assertion crash + benchmark parse failure, timeout | 🔴 Likely | Same CK FMHA int32 overflow assertion in PR-built image. | Log |
| run-amd-ci / stage-b-test-2-gpu-large-amd shards 0,1 | Server OOM-killed (-9) / CK FMHA assertion crash | 🔴 Likely | Same pattern with PR-built image. Shard 1 has the explicit CK FMHA assertion. | Shard 0, Shard 1 |
| run-amd-ci / stage-b-test-1-gpu-small-amd-mi35x | GPT-OSS 20B server crash (FMHA assertion then OOM-killed) | 🔴 Likely | MI35X PR-built image with aiter v0.1.12. Same crash pattern. | Log |
| run-amd-ci / stage-b-test-large-8-gpu-35x-disaggregation-amd | PP disaggregation server OOM-killed (-9) | 🔴 Likely | MI35X PR-built image. Server crash during disaggregated PP serving. | Log |
| run-amd-ci / pr-test-amd-finish (run 23790604853) | Gate job: multiple upstream failures | 🔴 Likely | Propagated from all the above failures. | Log |
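
The negative exit codes in the table (-6, -9) follow the POSIX convention used by Python's `subprocess` module, where a return code of -N means the child was terminated by signal N. A small sketch of decoding them, assuming a Linux runner; the helper name `describe_exit_code` is hypothetical and not part of SGLang or the CI scripts:

```python
import signal


def describe_exit_code(returncode: int) -> str:
    """Translate a subprocess-style return code into a readable message."""
    if returncode < 0:
        try:
            # -N means "terminated by signal N" on POSIX systems.
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


for code in (-6, -9):
    print(code, describe_exit_code(code))
```

On Linux, -6 decodes to SIGABRT (raised by a failed C++ `assert`) and -9 to SIGKILL (the signal the kernel OOM killer sends), which matches how the two failure modes are labeled in the table.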

Details

🔴 Aiter v0.1.12 crash — all run-amd-ci failures

The dominant failure across ~15 jobs is a crash in the aiter Composable Kernel FMHA batch prefill kernel. The crash manifests as:

python3: .../composable_kernel/include/ck_tile/ops/fmha/fmha_batch_prefill_kernel.hpp:661:
Assertion `static_cast<int64_t>(num_total_pages - 1) * batch_stride_k <=
  static_cast<int64_t>(std::numeric_limits<index_t>::max()) &&
  "KV cache K offset overflow: exceed int32 max"' failed.

This assertion triggers SIGABRT (exit code -6) in scheduler_0, killing the SGLang server. The call path, innermost frame first, is: aiter/ops/mha.py:_mha_batch_prefill → aiter_backend.py:2365 forward_extend → radix_attention.py → model forward.
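
To make the asserted condition concrete, the check can be reproduced with plain integer arithmetic. This is a sketch under stated assumptions: `index_t` is taken to be a signed 32-bit integer (as the assertion message implies), and the stride value below is a hypothetical example (page size 16 × 8 KV heads × head dimension 128), not a figure from the failing CI jobs.

```python
INT32_MAX = 2**31 - 1


def kv_offset_fits_int32(num_total_pages: int, batch_stride_k: int) -> bool:
    """Mirror the assertion: the offset of the last page must fit in int32."""
    return (num_total_pages - 1) * batch_stride_k <= INT32_MAX


def max_pages_for_stride(batch_stride_k: int) -> int:
    """Largest num_total_pages that still satisfies the assertion."""
    return INT32_MAX // batch_stride_k + 1


# Hypothetical stride: 16 tokens/page * 8 KV heads * 128 head dim (in elements).
stride = 16 * 8 * 128
limit = max_pages_for_stride(stride)
print(limit, kv_offset_fits_int32(limit, stride), kv_offset_fits_int32(limit + 1, stride))
```

Under these assumed numbers the check passes up to 131072 pages and fails from 131073 onward, so a sufficiently large KV cache pool would trip the assertion regardless of the model being served.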

Key evidence: These failures occur exclusively in the run-amd-ci jobs (run 23790604853), which use the PR-built Docker images containing aiter v0.1.12. The directly-triggered PR Test (AMD) (run 23790604664) uses nightly images with the old aiter version and does NOT exhibit these crashes — its only failure is a server startup timeout, a different and unrelated issue.

This strongly indicates the aiter v0.1.11.post1 → v0.1.12 bump introduced a regression in the CK FMHA kernel that causes int32 overflow assertions on the KV cache offset calculation.

🟢 stage-a-test-1-gpu-small-amd startup timeout (run 23790604664)

This failure uses the nightly image (not PR-built). The error is a generic server startup timeout (TimeoutError: Server failed to start within the timeout period), which is a common infra/flaky issue on AMD runners. No code changed by this PR is involved in this code path.

Overall Assessment

The PR's workflow infrastructure changes (new workflow, --custom-image plumbing) work correctly — the Docker images build and push successfully. However, the aiter version bump to v0.1.12 in docker/rocm.Dockerfile introduces a critical regression: the new aiter CK FMHA batch prefill kernel crashes with an int32 overflow assertion across multiple models and GPU configurations (MI325X and MI35X). This needs to be resolved before merging — either by fixing the aiter issue upstream, pinning back to v0.1.11.post1, or coordinating with the aiter team on a compatible version.

Generated by amd-bot using Claude Code CLI
