[AMD] WIP - end-to-end rocm.Dockerfile build and PR test workflow #21720
yctseng0211 wants to merge 7 commits into main
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
@amd-bot ci-status
Now I have a clear picture. Let me separate the two workflow contexts:
The failures are:
Now let me compile the final report.
CI Status for PR #21720
PR: [AMD] WIP - end-to-end rocm.Dockerfile build and PR test workflow
Changed files:
What this PR does
Two separate CI runs
Failure Summary Table
Details
🔴 Aiter v0.1.12 crash - all
The dominant failure across ~15 jobs is a crash in the aiter Composable Kernel FMHA batch prefill kernel. The crash manifests as:
This assertion triggers SIGABRT (exit code -6) in
Key evidence:
These failures occur exclusively in the
This strongly indicates the aiter v0.1.11.post1 → v0.1.12 bump introduced a regression in the CK FMHA kernel that causes int32 overflow assertions on the KV cache offset calculation.
🟢
This failure uses the nightly image (not PR-built). The error is a generic server startup timeout (
Overall Assessment
The PR's workflow infrastructure changes (new workflow,
Summary
Add a new CI workflow (pr-test-amd-dockerfile.yml) that enables end-to-end testing of rocm.Dockerfile changes before merging. Previously, Dockerfile changes could not be properly validated in CI because the test pipeline always used the latest release image.
Changes:
- pr-test-amd-dockerfile.yml: builds temporary Docker images (MI30X + MI35X) from the PR's Dockerfile, then runs the full AMD CI test suite using these images
- pr-test-amd.yml: add docker_image_mi35x input so MI30X and MI35X jobs use the correct architecture-specific image; remove docker/rocm.Dockerfile from path triggers to avoid duplicate CI runs
- amd_ci_start_container_disagg.sh: add --custom-image argument support (already existed in amd_ci_start_container.sh)
How it works
Trigger conditions
The workflow triggers on PRs that modify docker/rocm.Dockerfile, but only executes when ALL of the following are true:
- the PR has the rocm-docker label
- the PR has the run-ci label (enforced by pr-gate.yml in the downstream pr-test-amd.yml)
If the rocm-docker label is absent, the workflow skips entirely and falls back to the normal pr-test-amd.yml behavior (which now ignores Dockerfile-only changes since they can't be tested with the existing image).
Pipeline
PR with rocm.Dockerfile changes + labels [rocm-docker, run-ci]
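The trigger conditions above can be sketched as a small decision function. This is only an illustrative model, not the actual workflow logic: `PullRequest` and `select_workflow` are hypothetical names, and the sketch assumes that any non-Dockerfile change is enough to trigger pr-test-amd.yml (the real workflow has its own path filters) and that the Dockerfile workflow takes precedence when the rocm-docker label is present.

```python
from dataclasses import dataclass

DOCKERFILE_PATH = "docker/rocm.Dockerfile"


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical, minimal view of a PR: changed paths and applied labels."""
    changed_files: frozenset
    labels: frozenset


def select_workflow(pr):
    """Return the name of the workflow expected to run tests, or None.

    Assumptions (not confirmed by the PR itself):
    - any changed file other than the Dockerfile triggers pr-test-amd.yml;
    - the Dockerfile workflow wins when the rocm-docker label is present.
    """
    touches_dockerfile = DOCKERFILE_PATH in pr.changed_files
    has_run_ci = "run-ci" in pr.labels

    if touches_dockerfile and "rocm-docker" in pr.labels:
        # Build-and-test path: still requires run-ci before anything executes.
        return "pr-test-amd-dockerfile.yml" if has_run_ci else None

    # Fallback path: the Dockerfile is no longer a path trigger here,
    # so a Dockerfile-only change does not start this workflow.
    other_changes = pr.changed_files - {DOCKERFILE_PATH}
    if other_changes and has_run_ci:
        return "pr-test-amd.yml"
    return None
```

Under these assumptions, at most one workflow name is ever returned for a given PR, which mirrors the "no duplicate runs" goal described later in the description.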
Usage
- docker/rocm.Dockerfile
- rocm-docker (and run-ci as usual)
- workflow_dispatch
No duplicate runs
docker/rocm.Dockerfile has been removed from pr-test-amd.yml's path triggers, so only one workflow runs at a time depending on the labels present.
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci