Skip to content

[ci] fix: treat cancelled run-main-script step as failure#4727

Merged
ko3n1g merged 2 commits into
NVIDIA:mainfrom
ko3n1g:ko3n1g/ci/fix-timeout-reports-passed
May 11, 2026
Merged

[ci] fix: treat cancelled run-main-script step as failure#4727
ko3n1g merged 2 commits into
NVIDIA:mainfrom
ko3n1g:ko3n1g/ci/fix-timeout-reports-passed

Conversation

@ko3n1g

@ko3n1g ko3n1g commented May 11, 2026

Copy link
Copy Markdown
Contributor
Claude summary

What

When the job-level 60-minute timeout kills the Run main script step in .github/actions/action.yml, the step's exit_code output is never written. The downstream Check result step (if: always()) then expands ${{ steps.run-main-script.outputs.exit_code }} to an empty string, and bash treats [[ "" -eq 0 ]] as true — so a test that hit the 1-hour wall reports ✅ PASSED.

Observed instance: gpt_grpo_basic_function falsely reported as PASSED after timing out.

Fix

Use the step's GitHub-tracked conclusion (success / failure / cancelled) as the source of truth, not the missing exit-code output. The exit-code output is still used for the display message when it exists, falling back to the conclusion string (e.g. FAILED (exit code: cancelled)).

env:
  MAIN_CONCLUSION: ${{ steps.run-main-script.conclusion }}
  MAIN_EXIT_CODE: ${{ steps.run-main-script.outputs.exit_code }}
run: |
  EXIT_CODE="${MAIN_EXIT_CODE:-${MAIN_CONCLUSION}}"
  if [[ "$MAIN_CONCLUSION" == "success" ]]; then
    IS_SUCCESS=true
  else
    IS_SUCCESS=false
  fi

When the job-level 60-minute timeout kills the Run main script step,
the exit_code output is never written. The Check result step then
expands the missing output to an empty string, and bash's `[[ "" -eq 0 ]]`
evaluates as true, so the test was reported as PASSED.

Use the step's GitHub-tracked conclusion as the source of truth so
cancellations and timeouts are reported as FAILED.

Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g ko3n1g requested a review from a team as a code owner May 11, 2026 16:42
@ko3n1g

ko3n1g commented May 11, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test fab5c39

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft May 11, 2026 16:42
@github-actions

Copy link
Copy Markdown
Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

@copy-pr-bot

copy-pr-bot Bot commented May 11, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@ko3n1g ko3n1g marked this pull request as ready for review May 11, 2026 16:44
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team May 11, 2026 16:44
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Approved All necessary approvals have been made label May 11, 2026
@ko3n1g ko3n1g merged commit 33d47e0 into NVIDIA:main May 11, 2026
12 checks passed
svcnvidia-nemo-ci added a commit that referenced this pull request May 12, 2026
Merges 8 commits from main into dev. Dev already contains yesterday's
sync (PR #4716) plus follow-up fixes, so this PR only carries main
commits made after that sync.

Notable changes:
- 434368c build(deps): bump nvidia-modelopt to 0.43 (#4723)
- e42e2fa ci: Major refactor of release-workflows (#4602)
- 33d47e0 [ci] fix: treat cancelled run-main-script step as failure (#4727)
- 5123f6a ci: revert bad uv.lock bump and label future bumps with
  Run functional tests (#4730)
- ad58411 Add Python-side guardrail for DeepEP IB limits (#4719)
- e93755e chore(beep boop): Bump (main) (2026-05-11)
- a2ec5c1 Revert Add Python-side guardrail for HybridEP IB limit (#4718)
- 5e31514 Create a Protocol for the MLP layer of TransformerLayer (#3435)

Kept dev's pyproject.toml, uv.lock, docker/Dockerfile.ci.dev, and
.github/CODEOWNERS (per nightly-sync skill).

Ran black + isort on changed Python files.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Approved All necessary approvals have been made complexity: low

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants