Skip to content

Add a workflow file for AWS#386

Merged
shi-eric merged 1 commit into
newton-physics:mainfrom
shi-eric:ershi/aws-workflow
Jul 12, 2025
Merged

Add a workflow file for AWS#386
shi-eric merged 1 commit into
newton-physics:mainfrom
shi-eric:ershi/aws-workflow

Conversation

@shi-eric

@shi-eric shi-eric commented Jul 12, 2025

Copy link
Copy Markdown
Member

Description

Add a workflow for testing Newton on AWS EC2 instances.

Newton Migration Guide

Please ensure the migration guide for warp.sim users is up-to-date with the changes made in this MR.

  • The migration guide in docs/migration.rst is up-to date

Before your PR is "Ready for review"

  • All commits are signed-off to indicate that your contribution adheres to the Developer Certificate of Origin requirements
  • I understand that GitHub does not perform any GPU testing of this pull request
  • Necessary tests have been added
  • Documentation is up-to-date
  • Code passes formatting and linting checks with pre-commit run -a

Summary by CodeRabbit

  • New Features
    • Introduced automated GPU-accelerated unit testing on AWS for pull requests, main/release branches, and version tags.
    • Test results and coverage reports are now displayed and uploaded automatically.
    • Ensured automatic cleanup of AWS resources after test completion to prevent lingering instances.
  • Chores
    • Updated pull request template to clarify testing requirements and remove outdated GPU testing disclaimer.

@coderabbitai

coderabbitai Bot commented Jul 12, 2025

Copy link
Copy Markdown
Contributor
📝 Walkthrough

"""

Walkthrough

A new GitHub Actions workflow, "GPU Unit Tests on AWS," is introduced to automate running GPU-accelerated unit tests on AWS EC2 instances. The workflow provisions a GPU-enabled EC2 instance, executes remote tests via AWS SSM, collects artifacts from S3, and ensures resource cleanup by terminating the instance post-execution. Additionally, the pull request template was updated to remove a GPU testing disclaimer and clarify test requirements.

Changes

File(s) Change Summary
.github/workflows/aws-gpu-tests.yml Added workflow to automate GPU unit testing on AWS EC2, including provisioning, remote test execution, artifact handling, and cleanup.
.github/PULL_REQUEST_TEMPLATE.md Removed GPU testing disclaimer and revised test checklist to explicitly require added tests and example tests in newton/tests/test_examples.py.

Sequence Diagram(s)

sequenceDiagram
    participant GitHub Actions
    participant AWS EC2
    participant AWS SSM
    participant AWS S3

    GitHub Actions->>AWS EC2: Launch GPU-enabled EC2 instance
    GitHub Actions->>AWS SSM: Send remote test script to EC2 instance
    AWS SSM->>AWS EC2: Execute test script
    AWS EC2->>AWS S3: Upload test artifacts and logs
    GitHub Actions->>AWS S3: Download test artifacts and logs
    GitHub Actions->>AWS EC2: Terminate EC2 instance (cleanup)
Loading

"""


📜 Recent review details

Configuration used: .coderabbit.yml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1927051f0f4d78d428ebf7b3aeabbe4dab02534b and 6f831caef9592fb9b33f7bb66edbd5740873497e.

📒 Files selected for processing (2)
  • .github/PULL_REQUEST_TEMPLATE.md (1 hunks)
  • .github/workflows/aws-gpu-tests.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/PULL_REQUEST_TEMPLATE.md
  • .github/workflows/aws-gpu-tests.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: run-newton-tests / newton-unittests (ubuntu-latest)
  • GitHub Check: run-newton-tests / newton-unittests (windows-latest)
  • GitHub Check: run-asv-benchmarks

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@shi-eric shi-eric force-pushed the ershi/aws-workflow branch from f0eac3e to 6e3c9e9 Compare July 12, 2025 01:13

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🔭 Outside diff range comments (1)
.github/workflows/aws-gpu-tests.yml (1)

238-255: Cleanup job may skip termination if the test job fails early.

needs.run-unit-tests-on-aws.outputs.instance_id is only populated when the whole job succeeds.
If any later step (artifact download, Codecov upload) fails, the output is blank → instance leaks.

Safer pattern:

-    needs: run-unit-tests-on-aws
-    if: github.repository == 'newton-physics/newton' && always()
+    needs: run-unit-tests-on-aws
+    if: always()  # always run, independent of previous job result

and pass the ID via an artefact or workflow-level environment file ($GITHUB_ENV) written in the launch step and picked up here:

- if: needs.run-unit-tests-on-aws.outputs.instance_id != ''
+ env:
+   INSTANCE_ID: ${{ needs.run-unit-tests-on-aws.outputs.instance_id || env.INSTANCE_ID }}
  ...
- aws ec2 terminate-instances --instance-ids ${{ needs.run-unit-tests-on-aws.outputs.instance_id }}
+ aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"

Ensures termination under every failure mode.

🧹 Nitpick comments (2)
.github/workflows/aws-gpu-tests.yml (2)

160-178: Trailing spaces break YAML-lint and make reviews noisy.

Lines 166 and 174 carry stray spaces → yamllint error.
Quick fix:

-            echo "Current status: $STATUS. Waited $elapsed_time seconds..."␠
+            echo "Current status: $STATUS. Waited $elapsed_time seconds..."

[to delete the two offending spaces]


136-138: jq munging a full script into a single array element is brittle.

AWS-RunShellScript expects an array of commands. Embedding the entire multi-line script as one string works only if the remote interpreter honours embedded newlines; quoting errors silently turn the script into a single long line.

Simpler & safer:

- jq -n --arg script_body "$(cat remote_script.sh)" \
- '{ "commands": [$script_body] }' > ssm_params.json
+ aws ssm send-command \
+   --instance-ids "$INSTANCE_ID" \
+   --document-name "AWS-RunShellScript" \
+   --comment "Running Newton GPU unit tests" \
+   --parameters "commands=$(cat remote_script.sh | base64)" \
+   --cli-binary-format raw-in-base64-out \
+   ...

(or upload the script to S3 and execute bash /tmp/script.sh). Reduces JSON-escaping headaches.

📜 Review details

Configuration used: .coderabbit.yml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 11f31bc and 6e3c9e98eeeab50729e12ed4c36a98d99fd5211a.

📒 Files selected for processing (1)
  • .github/workflows/aws-gpu-tests.yml (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/aws-gpu-tests.yml

75-75: "github.head_ref" is potentially untrusted. avoid using it directly in inline scripts. instead, pass it through an environment variable. see https://docs.github.com/en/actions/security-for-github-actions/security-guides/security-hardening-for-github-actions for more details

(expression)

🪛 YAMLlint (1.37.1)
.github/workflows/aws-gpu-tests.yml

[error] 166-166: trailing spaces

(trailing-spaces)


[error] 174-174: trailing spaces

(trailing-spaces)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: run-newton-tests / newton-unittests (ubuntu-latest)
  • GitHub Check: run-newton-tests / newton-unittests (windows-latest)
  • GitHub Check: run-asv-benchmarks
🔇 Additional comments (1)
.github/workflows/aws-gpu-tests.yml (1)

56-66: Verify g6.xlarge availability in us-east-2 & spot-price.

g6.xlarge (Ada-L4) isn’t yet GA in every region; us-east-2 often lags.
A failed capacity reservation will abort the workflow after the 75-min timeout yet still incur runner cost.

Consider:

-  AWS_INSTANCE_TYPE: g6.xlarge
+  AWS_INSTANCE_TYPE: g5.xlarge   # wider regional support

or add a fallback strategy (ec2 run-instances --instance-type-override).

Comment thread .github/workflows/aws-gpu-tests.yml Outdated
Comment thread .github/workflows/aws-gpu-tests.yml Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
.github/workflows/aws-gpu-tests.yml (2)

10-11: AMI filter pattern is still fragile – refine or pin to a stable AMI.
Wildcard sequence ???????? is brittle; the comment from the previous review remains valid. Use a build-pattern that will not break on the next AMI release or pin to an explicit owner alias instead.


125-129: Shell-injection risk: branch name is still interpolated directly.
BRANCH_NAME="${{ github.head_ref || github.ref_name }}" passes attacker-controlled data straight into the remote shell. The earlier mitigation proposal (use a sanitized env var) hasn’t been applied.

-          BRANCH_NAME="${{ github.head_ref || github.ref_name }}"
+          BRANCH_NAME="${BRANCH_NAME_ENV}"

and add at the step level:

env:
  BRANCH_NAME_ENV: ${{ github.event.pull_request.head.ref || github.ref_name }}
🧹 Nitpick comments (1)
.github/workflows/aws-gpu-tests.yml (1)

172-190: Trailing whitespace triggers YAML-lint failures.
Lines 178 and 186 have stray spaces → yamllint error. Strip them to keep the CI green.

📜 Review details

Configuration used: .coderabbit.yml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6e3c9e98eeeab50729e12ed4c36a98d99fd5211a and 97cf854473c8c4fc56cdfb7d069d9678019212ec.

📒 Files selected for processing (1)
  • .github/workflows/aws-gpu-tests.yml (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/aws-gpu-tests.yml

87-87: "github.head_ref" is potentially untrusted. avoid using it directly in inline scripts. instead, pass it through an environment variable. see https://docs.github.com/en/actions/security-for-github-actions/security-guides/security-hardening-for-github-actions for more details

(expression)

🪛 YAMLlint (1.37.1)
.github/workflows/aws-gpu-tests.yml

[error] 178-178: trailing spaces

(trailing-spaces)


[error] 186-186: trailing spaces

(trailing-spaces)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: run-newton-tests / newton-unittests (windows-latest)
  • GitHub Check: run-newton-tests / newton-unittests (ubuntu-latest)
  • GitHub Check: run-asv-benchmarks
🔇 Additional comments (1)
.github/workflows/aws-gpu-tests.yml (1)

262-267: needs.*.outputs.instance_id may be undefined when launch fails.
If the launch step aborts before setting its output, the expression will raise a runtime error and block cleanup. Initialise instance_id to an empty string on job-level defaults or guard with if: failure() && ….

Comment thread .github/workflows/aws-gpu-tests.yml Outdated
@shi-eric shi-eric force-pushed the ershi/aws-workflow branch 4 times, most recently from 0610dce to 9fedd24 Compare July 12, 2025 17:10

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
.github/workflows/aws-gpu-tests.yml (2)

154-188: final_status still misses edge cases like Cancelled or DeliveryTimedOut.

The catch-all branch sets final_status="$STATUS" and breaks, but STATUS can be values that are not actionable failures (Cancelled,TimedOut,Undelivered, …).
Recommend enumerating all success statuses and treat everything else as failure, e.g.

if [[ "$STATUS" == "Success" || "$STATUS" == "Cancelled" ]]; then
  :
elif [[ "$STATUS" == "InProgress" || "$STATUS" == "Pending" ]]; then
  ...
else
  final_status="$STATUS"
  break
fi

(or invert the condition: if [[ "$STATUS" != "InProgress" && "$STATUS" != "Pending" ]]; then … fi) to avoid silent masking.


160-179: Trailing whitespace flagged by YAMLlint

YAMLlint reports trailing spaces on lines 166 and 179 – minor, but fixing avoids noisy lint failures.

-              --output text)␠
+              --output text)
...
-            echo "Current status: $STATUS. Waited $elapsed_time seconds..."␠
+            echo "Current status: $STATUS. Waited $elapsed_time seconds..."
📜 Review details

Configuration used: .coderabbit.yml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 97cf854473c8c4fc56cdfb7d069d9678019212ec and 9fedd2483846d00ec6279c6012be7b4cb6a57033.

📒 Files selected for processing (1)
  • .github/workflows/aws-gpu-tests.yml (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
.github/workflows/aws-gpu-tests.yml

[error] 166-166: trailing spaces

(trailing-spaces)


[error] 179-179: trailing spaces

(trailing-spaces)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: run-newton-tests / newton-unittests (windows-latest)
  • GitHub Check: run-newton-tests / newton-unittests (ubuntu-latest)
  • GitHub Check: run-asv-benchmarks

Comment thread .github/workflows/aws-gpu-tests.yml Outdated
Comment thread .github/workflows/aws-gpu-tests.yml Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
.github/workflows/aws-gpu-tests.yml (1)

88-96: Here-doc still unquoted → variables can expand on the runner side

The delimiter << EOF permits interpolation of all $FOO, $( … ) and back-ticks before the file is even copied to the EC2 box. A malicious value that slipped past safe_branch or any future variable introduced in the runner’s env would be executed locally with full GitHub-host permissions.

Quoting the delimiter eliminates this class of bugs with zero downside:

-          cat << EOF > remote_script.sh
+          cat <<'EOF' > remote_script.sh

Everything that really must be substituted (e.g. S3 bucket, key) can be injected explicitly after the file is generated or passed via SSM parameters/environment.

Also applies to: 120-123

🧹 Nitpick comments (3)
.github/workflows/aws-gpu-tests.yml (3)

14-17: pull_request_target + AWS role = high-risk surface

Running untrusted PR code with an assumed AWS role (even if scoped) can expose the AWS account. A safer pattern is:

• Trigger on pull_request (not *_target) so the fork’s workflow file must reference the role – impossible.
• Or keep pull_request_target but checkout the merge commit instead of the attacker’s branch and pass --depth 1.

Please re-evaluate the threat model before merging.

Also applies to: 30-32


74-76: Possible job timeout before SSM command finishes

The job has a 75-minute timeout but:

  1. EC2 boot (wait instance-status-ok) can take >10 min on bursty regions.
  2. The polling loop waits up to 60 min for SSM.
  3. Tests themselves may run 30 min on a cold GPU AMI.

A narrow margin (<5 min) remains. Either shorten TIMEOUT, increase timeout-minutes, or push polling into a separate job so cleanup never races the watchdog.

Also applies to: 160-195


172-173: Trailing whitespace – YAMLlint error

Lines 172 and 185 contain stray spaces that fail the default yamllint ruleset.

-              --output text)␠
+              --output text)
...
-              final_status="$STATUS"␠
+              final_status="$STATUS"

Minor, but it will break pipelines that enforce linting.

Also applies to: 185-186

📜 Review details

Configuration used: .coderabbit.yml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9fedd2483846d00ec6279c6012be7b4cb6a57033 and 335f8fe9ac08ea80251b08fd8eb6ae45c01d1973.

📒 Files selected for processing (1)
  • .github/workflows/aws-gpu-tests.yml (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
.github/workflows/aws-gpu-tests.yml

[error] 172-172: trailing spaces

(trailing-spaces)


[error] 185-185: trailing spaces

(trailing-spaces)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: run-newton-tests / newton-unittests (windows-latest)
  • GitHub Check: run-newton-tests / newton-unittests (ubuntu-latest)
  • GitHub Check: run-asv-benchmarks

Comment thread .github/workflows/aws-gpu-tests.yml Outdated
Comment thread .github/workflows/aws-gpu-tests.yml Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
.github/workflows/aws-gpu-tests.yml (1)

84-120: Remove double-quoting of an already printf %q-escaped branch name

printf '%q' wraps the value in single-quotes when needed.
Embedding that string again inside double-quotes (BRANCH_NAME="${safe_branch}") makes the quotes part of the value, so the remote git clone --branch looks for a ref literally called 'feature-xyz' and fails.

- safe_branch=$(printf '%q' "${BRANCH_NAME_ENV}")
- ...
- BRANCH_NAME="${safe_branch}"
+safe_branch=$(printf '%q' -- "${BRANCH_NAME_ENV}")   # keep existing escaping
+# no extra quotes – value is already shell-safe
+BRANCH_NAME=${safe_branch}
🧹 Nitpick comments (2)
.github/workflows/aws-gpu-tests.yml (2)

18-24: Cancellation can orphan EC2 instances – put termination in an independent cleanup job

When a newer run hits the same concurrency group, GitHub cancels the entire previous job; step-level if: always() hooks never fire, so aws ec2 terminate-instances is skipped and the GPU instance leaks.

Create a separate cleanup job that

  • has if: always()
  • is not in the concurrency group (omit needs: or use a different group key)
  • fetches the INSTANCE_ID artifact (or tags) and terminates the instance.

This guarantees the teardown runs even after a workflow-level cancellation.

Also applies to: 247-255


164-186: Minor: trailing whitespace trips yamllint

Static analysis already reports trailing spaces on lines 170 & 183.
Strip them to keep the workflow lint-clean.

-          while [ $elapsed_time -lt $TIMEOUT ]; do␠
+          while [ $elapsed_time -lt $TIMEOUT ]; do
...
-            echo "Current status: $STATUS. Waited $elapsed_time seconds..."␠
+            echo "Current status: $STATUS. Waited $elapsed_time seconds..."
📜 Review details

Configuration used: .coderabbit.yml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 335f8fe9ac08ea80251b08fd8eb6ae45c01d1973 and cb08cb4152f0737a5abe03b21960588ee8e91733.

📒 Files selected for processing (1)
  • .github/workflows/aws-gpu-tests.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
.github/workflows/aws-gpu-tests.yml (1)
Learnt from: shi-eric
PR: newton-physics/newton#386
File: .github/workflows/aws-gpu-tests.yml:86-87
Timestamp: 2025-07-12T17:40:10.078Z
Learning: Git does not allow spaces in branch names. Branch names with spaces will result in "is not a valid branch name" errors.
🪛 YAMLlint (1.37.1)
.github/workflows/aws-gpu-tests.yml

[error] 170-170: trailing spaces

(trailing-spaces)


[error] 183-183: trailing spaces

(trailing-spaces)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: run-asv-benchmarks
  • GitHub Check: run-newton-tests / newton-unittests (ubuntu-latest)
  • GitHub Check: run-newton-tests / newton-unittests (windows-latest)
🔇 Additional comments (1)
.github/workflows/aws-gpu-tests.yml (1)

140-142: jq may not be available on self-hosted runners

Ubuntu-latest currently bundles jq, but this isn’t guaranteed and self-hosted runners often omit it.
Add an explicit install (e.g., sudo apt-get -y install jq) or use aws ssm send-command --parameters "commands=[\"$(<remote_script.sh | base64 -w0)\"]" to avoid the dependency.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
.github/workflows/aws-gpu-tests.yml (4)

90-96: Redundant re-quoting may leak literal quote characters into $BRANCH_NAME

printf '%q' already returns a shell-escaped token. Wrapping it again in double quotes (line 128) can embed the quotes themselves when no escaping is needed, yielding git clone --branch "'feature-x'" …. Consider dropping the extra double quotes:

-SAFE_BRANCH=$(printf '%q' "${BRANCH_NAME_ENV}")
+SAFE_BRANCH=$(printf '%q' -- "${BRANCH_NAME_ENV}")   # -- guards against leading ‘-’-BRANCH_NAME="${SAFE_BRANCH}"
+BRANCH_NAME=${SAFE_BRANCH}

Minor, but removes one layer of quoting and avoids surprises.


50-61: AMI lookup is region-hard-coded – breaks in non-us-east-2 regions

The SSM parameter path is valid only where AWS publishes that specific DLAMI build. If the repository later sets AWS_REGION to another region (or contributors fork with a different default), get-parameter returns ParameterNotFound and the job aborts.

Recommend either:

  1. Pin the AMI ID in the workflow inputs (cheap, deterministic), or
  2. Drive the parameter path from the region:
PARAM="/aws/service/deeplearning/ami/x86_64/base-oss-nvidia-driver-gpu-ubuntu-24.04/${{ env.AWS_REGION }}/latest/ami-id"
LATEST_AMI_ID=$(aws ssm get-parameter --region "${{ env.AWS_REGION }}" \
              --name "$PARAM" --query 'Parameter.Value' --output text)

150-152: Risk of exceeding SSM 16 KB payload limit

$(cat remote_script.sh) inlines the entire script into a single JSON string. Once the script grows beyond ~16 KB the send-command call is rejected.

Mitigations:

  • gzip → base64 the script and decode on the instance, or
  • store the script in S3 and reference it via --document-version / --parameters sourceType=S3 ….

Nice-to-have now, but future-proofs the workflow.


173-196: Trailing whitespace flagged by yamllint

Lines 179 and 192 contain stray spaces that yamllint reports. They’re inside a shell heredoc, so harmless at runtime, but cleaning them avoids noisy CI lint failures.

-            echo "Current status: $STATUS. Waited $elapsed_time seconds..."␠
+            echo "Current status: $STATUS. Waited $elapsed_time seconds..."
📜 Review details

Configuration used: .coderabbit.yml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb08cb4152f0737a5abe03b21960588ee8e91733 and 135909831a85af63b768a637d98e443f749e4523.

📒 Files selected for processing (2)
  • .github/PULL_REQUEST_TEMPLATE.md (1 hunks)
  • .github/workflows/aws-gpu-tests.yml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .github/PULL_REQUEST_TEMPLATE.md
🧰 Additional context used
🧠 Learnings (1)
.github/workflows/aws-gpu-tests.yml (1)
Learnt from: shi-eric
PR: newton-physics/newton#386
File: .github/workflows/aws-gpu-tests.yml:86-87
Timestamp: 2025-07-12T17:40:10.078Z
Learning: Git does not allow spaces in branch names. Branch names with spaces will result in "is not a valid branch name" errors.
🪛 YAMLlint (1.37.1)
.github/workflows/aws-gpu-tests.yml

[error] 179-179: trailing spaces

(trailing-spaces)


[error] 192-192: trailing spaces

(trailing-spaces)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: run-newton-tests / newton-unittests (ubuntu-latest)
  • GitHub Check: run-newton-tests / newton-unittests (windows-latest)
  • GitHub Check: run-asv-benchmarks

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
.github/workflows/aws-gpu-tests.yml (1)

182-195: Trailing whitespace – fails yamllint

yamllint flags lines 182 and 195 for trailing spaces.
They are inside a bash heredoc, so removal is safe and keeps the CI green.

-            
+
📜 Review details

Configuration used: .coderabbit.yml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 135909831a85af63b768a637d98e443f749e4523 and 8a29822a92ef09c10c694cb2b50145eab4711285.

📒 Files selected for processing (1)
  • .github/workflows/aws-gpu-tests.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
.github/workflows/aws-gpu-tests.yml (1)
Learnt from: shi-eric
PR: newton-physics/newton#386
File: .github/workflows/aws-gpu-tests.yml:86-87
Timestamp: 2025-07-12T17:40:10.078Z
Learning: Git does not allow spaces in branch names. Branch names with spaces will result in "is not a valid branch name" errors.
🪛 YAMLlint (1.37.1)
.github/workflows/aws-gpu-tests.yml

[error] 182-182: trailing spaces

(trailing-spaces)


[error] 195-195: trailing spaces

(trailing-spaces)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: run-asv-benchmarks
  • GitHub Check: run-newton-tests / newton-unittests (ubuntu-latest)
  • GitHub Check: run-newton-tests / newton-unittests (windows-latest)

Comment thread .github/workflows/aws-gpu-tests.yml Outdated
Comment thread .github/workflows/aws-gpu-tests.yml Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/aws-gpu-tests.yml (1)

126-129: Forked-PR code is never executed – clone URL & ref still point at base repository

REPO_URL and BRANCH_NAME are hard-wired to the target repo / branch.
When somebody opens a PR from a fork, the branch of the contributor does not exist in github.repository; the workflow therefore tests stale code and can report false-green.

-REPO_URL="https://github.com/${{ github.repository }}.git"
-BRANCH_NAME="${SAFE_BRANCH}"
+if [[ "${{ github.event.pull_request.head.repo.full_name }}" != "${{ github.repository }}" ]]; then
+  # External fork – test the exact commit under review
+  REPO_URL="${{ github.event.pull_request.head.repo.clone_url }}"
+  BRANCH_NAME="${{ github.event.pull_request.head.sha }}"
+else
+  REPO_URL="https://github.com/${{ github.repository }}.git"
+  BRANCH_NAME="${SAFE_BRANCH}"
+fi

Without this, GPU tests give a misleading sense of safety for every external contribution.

🧹 Nitpick comments (2)
.github/workflows/aws-gpu-tests.yml (2)

90-96: printf %q adds complexity for little gain

SAFE_BRANCH=$(printf '%q' …) produces shell-escaped fragments like 'feature/xyz'.
Because you later embed the value in an assignment (BRANCH_NAME=${SAFE_BRANCH}) the extra quoting is stripped anyway for normal branch names, while exotic names (: ~ etc.) are already rejected by Git. Dropping this makes the script easier to read:

-SAFE_BRANCH=$(printf '%q' "${BRANCH_NAME_ENV}")
+SAFE_BRANCH="${BRANCH_NAME_ENV}"

180-195: Trailing whitespace flagged by yamllint

Lines 181 and 194 contain stray spaces which fail the lint step. Quick tidy-up:

-              --output text)␠
+              --output text)-              echo "SSM command failed or returned unexpected status: $STATUS"␠
+              echo "SSM command failed or returned unexpected status: $STATUS"

Pure style, but keeps CI green.

📜 Review details

Configuration used: .coderabbit.yml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8a29822a92ef09c10c694cb2b50145eab4711285 and 2739be88c83fc8c7753d8c2d7ecea6450d02dcf4.

📒 Files selected for processing (1)
  • .github/workflows/aws-gpu-tests.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
.github/workflows/aws-gpu-tests.yml (1)
Learnt from: shi-eric
PR: newton-physics/newton#386
File: .github/workflows/aws-gpu-tests.yml:86-87
Timestamp: 2025-07-12T17:40:10.078Z
Learning: Git does not allow spaces in branch names. Branch names with spaces will result in "is not a valid branch name" errors.
🪛 YAMLlint (1.37.1)
.github/workflows/aws-gpu-tests.yml

[error] 181-181: trailing spaces

(trailing-spaces)


[error] 194-194: trailing spaces

(trailing-spaces)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: run-newton-tests / newton-unittests (windows-latest)
  • GitHub Check: run-newton-tests / newton-unittests (ubuntu-latest)
  • GitHub Check: run-asv-benchmarks

Comment thread .github/workflows/aws-gpu-tests.yml Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (4)
.github/workflows/aws-gpu-tests.yml (4)

100-113: printf %q + double-quoting still injects quotes into the remote variable

SAFE_BRANCH may already be quoted ('feature/xyz').
Assigning it again inside double quotes

BRANCH_NAME="${SAFE_BRANCH}"

preserves the literal single quotes and yields a branch called 'feature/xyz' on the remote host. git clone --branch will fail.

Either:

-BRANCH_NAME="${SAFE_BRANCH}"
+BRANCH_NAME=${SAFE_BRANCH}   # no extra layer of quoting

or drop printf %q altogether and rely on Git’s own ref rules.


165-167: SSM payload can exceed the 16 KB limit – externalise the script

send-command embeds the entire remote_script.sh as a single JSON element.
SSM RunShellScript rejects payloads > 16 KB with ValidationException, and this script is already sizeable.

Recommended pattern:

  1. aws s3 cp remote_script.sh s3://$S3_BUCKET/$S3_KEY/remote_script.sh
  2. Send a tiny SSM command: aws s3 cp … && bash remote_script.sh

Keeps the payload small and future-proofs the workflow.


275-283: Instance-leak on concurrency cancellation persists

The termination step lives inside the same job that can be cancelled by
concurrency.cancel-in-progress.
When a new push arrives, the running job is aborted before reaching this
step and the GPU instance keeps running.

Move the cleanup to a separate cleanup job with if: always() and no needs
dependency, pass the INSTANCE_ID via an output or artifact, and terminate the
instance there.


144-148: Run the tests against the PR’s code, not the base repository

For PRs coming from forks https://github.com/${{ github.repository }}.git does not contain the contributor’s commits.
Result: the workflow happily green-lights stale code.

Quick fix:

-REPO_URL="https://github.com/${{ github.repository }}.git"
-BRANCH_NAME="${SAFE_BRANCH}"
+if [[ -n "${{ github.event.pull_request.head.repo.clone_url }}" ]]; then
+  REPO_URL="${{ github.event.pull_request.head.repo.clone_url }}"
+  BRANCH_NAME="${{ github.event.pull_request.head.ref }}"
+else
+  REPO_URL="https://github.com/${{ github.repository }}.git"
+  BRANCH_NAME=${SAFE_BRANCH}
+fi

This clones the exact commit under review while still covering internal pushes.

🧹 Nitpick comments (1)
.github/workflows/aws-gpu-tests.yml (1)

73-75: Avoid embedding literal single-quotes in the GitHub-Repository tag

TAGS="$TAGS,{Key=GitHub-Repository,Value='${{ github.repository }}'}"
The inner single quotes become part of the tag value ('newton-physics/newton').
Drop them:

-TAGS="$TAGS,{Key=GitHub-Repository,Value='${{ github.repository }}'}"
+TAGS="$TAGS,{Key=GitHub-Repository,Value=${{ github.repository }}}"

Minor, but keeps tag values clean and easier to query.

📜 Review details

Configuration used: .coderabbit.yml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2739be88c83fc8c7753d8c2d7ecea6450d02dcf4 and 234f57697bd408609ffcd11d534201fb494a722d.

📒 Files selected for processing (1)
  • .github/workflows/aws-gpu-tests.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
.github/workflows/aws-gpu-tests.yml (1)
Learnt from: shi-eric
PR: newton-physics/newton#386
File: .github/workflows/aws-gpu-tests.yml:86-87
Timestamp: 2025-07-12T17:40:10.078Z
Learning: Git does not allow spaces in branch names. Branch names with spaces will result in "is not a valid branch name" errors.
🪛 YAMLlint (1.37.1)
.github/workflows/aws-gpu-tests.yml

[error] 199-199: trailing spaces

(trailing-spaces)


[error] 212-212: trailing spaces

(trailing-spaces)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: run-newton-tests / newton-unittests (ubuntu-latest)
  • GitHub Check: run-newton-tests / newton-unittests (windows-latest)
  • GitHub Check: run-asv-benchmarks

Comment thread .github/workflows/aws-gpu-tests.yml Outdated
@shi-eric shi-eric force-pushed the ershi/aws-workflow branch from 5f5dea8 to 6032b20 Compare July 12, 2025 20:28
@shi-eric

Copy link
Copy Markdown
Member Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jul 12, 2025

Copy link
Copy Markdown
Contributor
✅ Actions performed

Full review triggered.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
.github/PULL_REQUEST_TEMPLATE.md (1)

25-25: Checklist item is clear – consider explicitly mentioning GPU tests

You removed the outdated “GitHub does not perform GPU testing” disclaimer, which is great.
Since GPU tests are now executed by the new AWS workflow, you might add a short hint such as “GPU tests pass in CI” so contributors are aware this box covers both CPU and GPU paths.

.github/workflows/aws-gpu-tests.yml (2)

63-72: Tag value contains literal single-quotes

GitHub-Repository is assembled as
Value='${{ github.repository }}' which results in the tag value literally including the ' characters ('newton-physics/newton').
AWS accepts it, but it bloats searches/filters and looks odd in the console.

-TAGS="$TAGS,{Key=GitHub-Repository,Value='${{ github.repository }}'}"
+TAGS="$TAGS,{Key=GitHub-Repository,Value=${{ github.repository }}}"

108-112: Harden the remote script with set -u

set -e and set -o pipefail are enabled, but missing set -u means unset variables silently expand to empty strings (e.g., if $PR_NUMBER or $S3_BUCKET were ever unset).
Adding it improves safety with negligible risk.

-set -e
-set -o pipefail
+set -euo pipefail
📜 Review details

Configuration used: .coderabbit.yml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 11f31bc and 6032b20c1fa1a571684378d7de9028bf39312672.

📒 Files selected for processing (2)
  • .github/PULL_REQUEST_TEMPLATE.md (1 hunks)
  • .github/workflows/aws-gpu-tests.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: shi-eric
PR: newton-physics/newton#386
File: .github/workflows/aws-gpu-tests.yml:144-156
Timestamp: 2025-07-12T19:46:06.538Z
Learning: In the newton-physics/newton repository workflow discussions, the maintainer shi-eric prefers that forked PR testing behavior is intentionally designed to fail rather than test untrusted code from forks on AWS infrastructure.
Learnt from: shi-eric
PR: newton-physics/newton#386
File: .github/workflows/aws-gpu-tests.yml:65-82
Timestamp: 2025-07-12T19:46:47.510Z
Learning: In the newton-physics/newton repository AWS GPU workflow, the workflow only triggers on `pull_request_target` and `push` events, not on `pull_request` events. The `pull_request_target` event is used for testing PRs from forks safely on AWS infrastructure.
Learnt from: shi-eric
PR: newton-physics/newton#386
File: .github/workflows/aws-gpu-tests.yml:275-283
Timestamp: 2025-07-12T19:45:04.640Z
Learning: In GitHub Actions, the `if: always()` condition runs even when a job is cancelled due to concurrency settings with `cancel-in-progress: true`. This means cleanup steps with `if: always()` will execute properly to prevent resource leaks, even if the main workflow is cancelled by a newer run.
.github/PULL_REQUEST_TEMPLATE.md (2)
Learnt from: shi-eric
PR: newton-physics/newton#386
File: .github/workflows/aws-gpu-tests.yml:65-82
Timestamp: 2025-07-12T19:46:47.510Z
Learning: In the newton-physics/newton repository AWS GPU workflow, the workflow only triggers on `pull_request_target` and `push` events, not on `pull_request` events. The `pull_request_target` event is used for testing PRs from forks safely on AWS infrastructure.
Learnt from: shi-eric
PR: newton-physics/newton#386
File: .github/workflows/aws-gpu-tests.yml:144-156
Timestamp: 2025-07-12T19:46:06.538Z
Learning: In the newton-physics/newton repository workflow discussions, the maintainer shi-eric prefers that forked PR testing behavior is intentionally designed to fail rather than test untrusted code from forks on AWS infrastructure.
.github/workflows/aws-gpu-tests.yml (6)
Learnt from: shi-eric
PR: newton-physics/newton#386
File: .github/workflows/aws-gpu-tests.yml:65-82
Timestamp: 2025-07-12T19:46:47.510Z
Learning: In the newton-physics/newton repository AWS GPU workflow, the workflow only triggers on `pull_request_target` and `push` events, not on `pull_request` events. The `pull_request_target` event is used for testing PRs from forks safely on AWS infrastructure.
Learnt from: shi-eric
PR: newton-physics/newton#386
File: .github/workflows/aws-gpu-tests.yml:144-156
Timestamp: 2025-07-12T19:46:06.538Z
Learning: In the newton-physics/newton repository workflow discussions, the maintainer shi-eric prefers that forked PR testing behavior is intentionally designed to fail rather than test untrusted code from forks on AWS infrastructure.
Learnt from: shi-eric
PR: newton-physics/newton#386
File: .github/workflows/aws-gpu-tests.yml:275-283
Timestamp: 2025-07-12T19:45:04.640Z
Learning: In GitHub Actions, steps with `if: always()` continue to run even when the workflow is canceled via `concurrency.cancel-in-progress: true`. GitHub Actions re-evaluates if conditions during cancellation, and since `if: always()` always evaluates to true, those steps are not canceled and will continue to execute. This means cleanup steps with `if: always()` will run properly to prevent resource leaks even when workflows are cancelled due to concurrency limits.
Learnt from: shi-eric
PR: newton-physics/newton#386
File: .github/workflows/aws-gpu-tests.yml:275-283
Timestamp: 2025-07-12T19:45:04.640Z
Learning: In GitHub Actions, the `if: always()` condition runs even when a job is cancelled due to concurrency settings with `cancel-in-progress: true`. This means cleanup steps with `if: always()` will execute properly to prevent resource leaks, even if the main workflow is cancelled by a newer run.
Learnt from: shi-eric
PR: newton-physics/newton#386
File: .github/workflows/aws-gpu-tests.yml:0-0
Timestamp: 2025-07-12T19:46:37.456Z
Learning: For AWS SSM send-command payload limits, always measure the actual script size before suggesting optimizations. The 16KB limit is rarely hit by typical GitHub Actions workflow scripts, which are usually only a few KB in size.
Learnt from: shi-eric
PR: newton-physics/newton#386
File: .github/workflows/aws-gpu-tests.yml:86-87
Timestamp: 2025-07-12T17:40:10.078Z
Learning: Git does not allow spaces in branch names. Branch names with spaces will result in "is not a valid branch name" errors.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: run-newton-tests / newton-unittests (windows-latest)
  • GitHub Check: run-newton-tests / newton-unittests (ubuntu-latest)

Comment thread .github/workflows/aws-gpu-tests.yml Outdated
@shi-eric shi-eric force-pushed the ershi/aws-workflow branch from 6032b20 to 1927051 Compare July 12, 2025 20:48
Signed-off-by: Eric Shi <ershi@nvidia.com>
@shi-eric shi-eric force-pushed the ershi/aws-workflow branch 2 times, most recently from 6f831ca to e7b5535 Compare July 12, 2025 21:16
@shi-eric shi-eric marked this pull request as ready for review July 12, 2025 21:22
@shi-eric shi-eric merged commit ff38d78 into newton-physics:main Jul 12, 2025
7 checks passed
@shi-eric shi-eric deleted the ershi/aws-workflow branch July 12, 2025 21:29
jvonmuralt pushed a commit to jvonmuralt/newton that referenced this pull request Jul 14, 2025
Signed-off-by: Eric Shi <ershi@nvidia.com>
@coderabbitai coderabbitai Bot mentioned this pull request Jul 27, 2025
5 tasks
@coderabbitai coderabbitai Bot mentioned this pull request Oct 11, 2025
4 tasks
eric-heiden pushed a commit to eric-heiden/newton that referenced this pull request Jan 28, 2026
Signed-off-by: Eric Shi <ershi@nvidia.com>
vidurv-nvidia pushed a commit to vidurv-nvidia/newton that referenced this pull request Mar 6, 2026
# Description

Adds license files for additional 3rd party OSS dependencies added to
Isaac Lab.

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I have updated the changelog and the corresponding version in the
extension's `config/extension.toml` file
- [ ] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there

<!--
As you go through the checklist above, you can mark something as done by
putting an x character in it

For example,
- [x] I have done this task
- [ ] I have not done this task
-->
@coderabbitai coderabbitai Bot mentioned this pull request Mar 13, 2026
3 tasks
mmacklin pushed a commit to mmacklin/newton that referenced this pull request Apr 7, 2026
Signed-off-by: Eric Shi <ershi@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant