fix: job id race condition with large, dynamic matrices by jemc · Pull Request #451 · OP5dev/TF-via-PR

jemc · 2025-04-17T23:02:50Z

When you run a matrix job whose matrix is dynamic (i.e. based on the output of a previous job, such as a change-detection job that looks for workspaces with changes), GitHub won't immediately show the full job list for the current workflow run. You can observe this in the GitHub UI, where the matrix gradually gains job instances even as the matrix job instances are starting.

Unfortunately, even after a matrix job instance has started executing, it may not yet be visible in the UI or API. This means that the API call which TF-via-PR makes to get the job ID is not guaranteed to succeed. In practice, it reliably fails if the dynamic matrix is large enough (e.g. 50 instances) and the `identifier` step of the `TF-via-PR` action is reached quickly enough.

This PR adds a workaround for that issue: the API call is retried with exponential backoff, up to a maximum number of attempts. In practice, this should avoid the race condition without introducing too much complexity, despite being a bit inelegant.
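
For readers who want to see the shape of the workaround, here is a minimal sketch of a retry loop with exponential backoff. It is not the code merged into `action.yml`: the function name `retry_with_backoff`, its parameters, and the `MAX_RETRY_INTERVAL` constant are illustrative assumptions. The only detail borrowed from the review is a cap of 64 on the retry interval, which this sketch treats as seconds.

```shell
#!/usr/bin/env bash

# Illustrative cap on the wait between attempts, in seconds (an assumption
# of this sketch; override it by setting MAX_RETRY_INTERVAL beforehand).
MAX_RETRY_INTERVAL="${MAX_RETRY_INTERVAL:-64}"

# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_INTERVAL COMMAND [ARGS...]
# Runs COMMAND until it succeeds, doubling the wait after each failure.
# Returns 0 on the first success, or 1 once MAX_ATTEMPTS have all failed.
retry_with_backoff() {
  local max_attempts="$1" retry_interval="$2"
  shift 2
  local attempt=1

  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after $attempt attempts." >&2
      return 1
    fi

    echo "Attempt $attempt failed; retrying in ${retry_interval}s." >&2
    sleep "$retry_interval"

    attempt=$((attempt + 1))
    # Double the interval, but never wait longer than the cap.
    retry_interval=$((retry_interval * 2))
    if [ "$retry_interval" -gt "$MAX_RETRY_INTERVAL" ]; then
      retry_interval="$MAX_RETRY_INTERVAL"
    fi
  done
}
```

A step could then call something like `retry_with_backoff 8 1 fetch_job_id`, where `fetch_job_id` is a hypothetical function that queries the GitHub API and returns non-zero while the current job is not yet listed. With those arguments the waits would be 1, 2, 4, 8, 16, 32, and 64 seconds before the final attempt.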

jemc · 2025-04-17T23:04:26Z

Note: I have tested this by integrating my branch with our private repo that has ~50 Terraform workspaces. It solved the issue.

Copilot

Pull Request Overview

This PR addresses a race condition when retrieving a job ID from a dynamic matrix by implementing an exponential backoff retry mechanism.

- Introduces a while loop to retry fetching the job ID via the GitHub API.
- Implements an exponential backoff strategy with a maximum retry interval.

Comments suppressed due to low confidence (1)

action.yml:114

[nitpick] Consider defining a constant (e.g. MAX_RETRY_INTERVAL=64) instead of a hard-coded numeric literal to improve readability and maintainability.

if [[ $retry_interval -gt 64 ]]; then

rdhar

Really impressive, and better still to hear that you've been dogfooding it!

I have to ask, as a contributor, does it make it easier or harder that this Action is composed in Bash/shell rather than JS/TS?

rdhar · 2025-04-21T22:26:51Z

Happy to see this shipped with v13.3.1 (v13), where your contribution has been credited!

Please consider ⭐ this project, if you or your team find it useful.

@jemc BIG thanks (once again!) for contributing this enhancement! I'd still love to know about your experience contributing to this Action, and more than open to any feedback you'd like to share.

jemc · 2025-04-24T17:04:14Z

> I have to ask, as a contributor, does it make it easier or harder that this Action is composed in Bash/shell rather than JS/TS?

I personally prefer Bash/shell actions over JS/TS ones. Correctly piping stdin/stdout/stderr things around in JS/TS is more ceremony and people often do it poorly as a result.

I will say that this action might benefit from moving the more complicated bash/shell steps into dedicated scripts outside the YAML files, though, which would allow for unit-testing individual steps more easily.
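
As a rough illustration of that suggestion, the sketch below shows what one extracted piece might look like. The file name `backoff.sh`, the function `next_interval`, and the test function are all hypothetical; nothing here is taken from the Action's actual code. The point is only that logic living in a standalone script can be sourced and checked with plain shell assertions (or a framework such as Bats) without running a workflow.

```shell
#!/usr/bin/env bash
# backoff.sh (hypothetical file name): logic moved out of the workflow YAML
# so that it can be sourced by both the Action step and a test script.

# Print the next retry interval: double the current one, capped at a maximum.
next_interval() {
  local current="$1" max="$2"
  local next=$((current * 2))
  if [ "$next" -gt "$max" ]; then
    next="$max"
  fi
  echo "$next"
}

# A plain-shell unit test for the function above. It returns non-zero on
# the first failing expectation.
test_next_interval() {
  [ "$(next_interval 1 64)" -eq 2 ] || return 1
  [ "$(next_interval 32 64)" -eq 64 ] || return 1
  [ "$(next_interval 64 64)" -eq 64 ] || return 1
}
```

The composite step would then shrink to sourcing the script and calling its functions, while a CI job could run `test_next_interval` on every push.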

rdhar · 2025-04-28T11:45:27Z

> > I have to ask, as a contributor, does it make it easier or harder that this Action is composed in Bash/shell rather than JS/TS?
>
> I personally prefer Bash/shell actions over JS/TS ones. Correctly piping stdin/stdout/stderr things around in JS/TS is more ceremony and people often do it poorly as a result.
>
> I will say that this action might benefit from moving the more complicated bash/shell steps into dedicated scripts outside the YAML files, though, which would allow for unit-testing individual steps more easily.

Have to agree on both counts. The initial reason for opting for Bash was that the Action's sole purpose is to interact with the Terraform/OpenTofu CLI, so there was no reason to introduce a Node.js dependency into the mix. However, it has since grown from a few dozen lines to over 200, so it stands to gain from some modularization and file separation.

Another complicating factor is GitHub dependencies, mainly the GitHub CLI. There's heavy use of jq throughout, which isn't pretty. What's more, the sole other Action dependency of TF-via-PR is GitHub's own actions/upload-artifact. In fact, it's called twice because GitHub deprecated v3 support even though GitHub Enterprise is still reliant on v3. 🤦

Purely for ease of GitHub interoperability, use of typescript-action would make more sense, provided that Terraform/OpenTofu outputs can be wrapped accurately. Really just trying to find the balance between maintainability/simplicity and right-tool-for-the-job.

jemc requested a review from rdhar as a code owner April 17, 2025 23:02

rdhar self-assigned this Apr 21, 2025

rdhar added the enhancement New feature or request label Apr 21, 2025

rdhar requested a review from Copilot April 21, 2025 18:17

Copilot AI reviewed Apr 21, 2025

rdhar previously approved these changes Apr 21, 2025

rdhar dismissed their stale review via 538c75a April 21, 2025 22:13

rdhar approved these changes Apr 21, 2025

rdhar merged commit 5788866 into OP5dev:main Apr 21, 2025
10 checks passed

rdhar merged 2 commits into OP5dev:main from jemc:fix/job-id-race-condition
