Skip to content

fix: job id race condition with large, dynamic matrices#451

Merged
rdhar merged 2 commits intoOP5dev:mainfrom
jemc:fix/job-id-race-condition
Apr 21, 2025
Merged

fix: job id race condition with large, dynamic matrices#451
rdhar merged 2 commits intoOP5dev:mainfrom
jemc:fix/job-id-race-condition

Conversation

@jemc
Copy link
Copy Markdown
Contributor

@jemc jemc commented Apr 17, 2025

When you execute a matrix job, and the matrix is dynamic (i.e. based on the output of a previous job; for example, a change detection job that looks for workspaces that had changes), then GitHub won't immediately show the full job list for the current workflow run. You can observe this in the GitHub UI, where the matrix slowly gets more job instances, even as the matrix jobs instances are starting.

Unfortunately, even after a matrix job instance has started executing, it may not be visible in the UI or API yet. This means that the API call which TF-via-PR makes to get the job id is not guaranteed to succeed, and in practice it will reliably fail if the dynamic matrix is large enough (e.g. 50 instances), and if the identifier step of the TF-via-PR action is reached quickly enough.

This PR adds a workaround for that issue, where the API call will be retried with exponential backoff, up to a maximum limit of attempts. In practice, this should avoid the race condition without introducing too much complexity, despite being a bit inelegant.

When you execute a matrix job, and the matrix is dynamic
(i.e. based on the output of a previous job; for example,
a change detection job that looks for workspaces that had changes),
then GitHub won't immediately show the full job list for
the current workflow run. You can observe this in the GitHub UI,
where the matrix slowly gets more job instances, even as the
matrix jobs instances are starting.

Unfortunately, even after a matrix job instance has started
executing, it may not be visible in the UI or API yet.
This means that the API call which TF-via-PR makes to get the job id
is not guaranteed to succeed, and in practice it will reliably
fail if the dynamic matrix is large enough (e.g. 50 instances),
and if the `identifier` step of the `TF-via-PR` action is
reached quickly enough.

This PR adds a workaround for that issue, where the API call
will be retried with exponential backoff, up to a maximum limit
of attempts. In practice, this should avoid the race condition
without introducing too much complexity, despite being a bit inelegant.
@jemc jemc requested a review from rdhar as a code owner April 17, 2025 23:02
@jemc
Copy link
Copy Markdown
Contributor Author

jemc commented Apr 17, 2025

Note: I have tested this by integrating my branch with our private repo that has ~50 Terraform workspaces. It solved the issue.

@rdhar rdhar self-assigned this Apr 21, 2025
@rdhar rdhar added the enhancement New feature or request label Apr 21, 2025
@rdhar rdhar requested a review from Copilot April 21, 2025 18:17
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses a race condition when retrieving a job ID from a dynamic matrix by implementing an exponential backoff retry mechanism.

  • Introduces a while loop to retry fetching the job ID via the GitHub API.
  • Implements an exponential backoff strategy with a maximum retry interval.
Comments suppressed due to low confidence (1)

action.yml:114

  • [nitpick] Consider defining a constant (e.g. MAX_RETRY_INTERVAL=64) instead of a hard-coded numeric literal to improve readability and maintainability.
if [[ $retry_interval -gt 64 ]]; then

rdhar
rdhar previously approved these changes Apr 21, 2025
Copy link
Copy Markdown
Member

@rdhar rdhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really impressive, and better still to hear that you've been dogfooding it!

I have to ask, as a contributor, does it make it easier or harder that this Action is composed in Bash/shell rather than JS/TS?

@rdhar rdhar merged commit 5788866 into OP5dev:main Apr 21, 2025
10 checks passed
@rdhar
Copy link
Copy Markdown
Member

rdhar commented Apr 21, 2025

Happy to see this shipped with v13.3.1 (v13), where your contribution has been credited!

Please consider ⭐ this project, if you or your team find it useful.

GitHub repository stargazers


@jemc BIG thanks (once again!) for contributing this enhancement! I'd still love to know about your experience contributing to this Action, and more than open to any feedback you'd like to share.

@jemc
Copy link
Copy Markdown
Contributor Author

jemc commented Apr 24, 2025

I have to ask, as a contributor, does it make it easier or harder that this Action is composed in Bash/shell rather than JS/TS?

I personally prefer Bash/shell actions over JS/TS ones. Correctly piping stdin/stdout/stderr things around in JS/TS is more ceremony and people often do it poorly as a result.

I will say that this action might benefit from moving the more complicated bash/shell steps into dedicated scripts outside the YAML files, though, which would allow for unit-testing individual steps more easily.

@rdhar
Copy link
Copy Markdown
Member

rdhar commented Apr 28, 2025

I have to ask, as a contributor, does it make it easier or harder that this Action is composed in Bash/shell rather than JS/TS?

I personally prefer Bash/shell actions over JS/TS ones. Correctly piping stdin/stdout/stderr things around in JS/TS is more ceremony and people often do it poorly as a result.

I will say that this action might benefit from moving the more complicated bash/shell steps into dedicated scripts outside the YAML files, though, which would allow for unit-testing individual steps more easily.

Have to agree on both counts. The initial reason for opting to use Bash is because the Action's sole purpose is to interact with Terraform/OpenTofu CLI and there's no reason to introduce NodeJS dependency into the mix. However, it's since grown from a few dozen lines to over 200 so it stands to gain from some modularization and file separation.

Another complicating factor is GitHub dependencies; mainly GitHub CLI. There's heavy use of jq throughout, which isn't pretty. What's more, the sole other Action dependency of TF-via-PR is GitHub's own actions/upload-artifact—in fact, it's called twice because GitHub deprecated v3 support even though GitHub Enterprise is still reliant on v3. 🤦

Purely for ease of GitHub interoperability, use of typescript-action would make more sense, provided that Terraform/OpenTofu outputs can be wrapped accurately. Really just trying to find the balance between maintainability/simplicity and right-tool-for-the-job.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants