fix: job id race condition with large, dynamic matrices#451
fix: job id race condition with large, dynamic matrices#451rdhar merged 2 commits intoOP5dev:mainfrom
Conversation
When you execute a matrix job, and the matrix is dynamic (i.e. based on the output of a previous job; for example, a change detection job that looks for workspaces that had changes), then GitHub won't immediately show the full job list for the current workflow run. You can observe this in the GitHub UI, where the matrix slowly gets more job instances, even as the matrix jobs instances are starting. Unfortunately, even after a matrix job instance has started executing, it may not be visible in the UI or API yet. This means that the API call which TF-via-PR makes to get the job id is not guaranteed to succeed, and in practice it will reliably fail if the dynamic matrix is large enough (e.g. 50 instances), and if the `identifier` step of the `TF-via-PR` action is reached quickly enough. This PR adds a workaround for that issue, where the API call will be retried with exponential backoff, up to a maximum limit of attempts. In practice, this should avoid the race condition without introducing too much complexity, despite being a bit inelegant.
|
Note: I have tested this by integrating my branch with our private repo that has ~50 Terraform workspaces. It solved the issue. |
There was a problem hiding this comment.
Pull Request Overview
This PR addresses a race condition when retrieving a job ID from a dynamic matrix by implementing an exponential backoff retry mechanism.
- Introduces a while loop to retry fetching the job ID via the GitHub API.
- Implements an exponential backoff strategy with a maximum retry interval.
Comments suppressed due to low confidence (1)
action.yml:114
- [nitpick] Consider defining a constant (e.g. MAX_RETRY_INTERVAL=64) instead of a hard-coded numeric literal to improve readability and maintainability.
if [[ $retry_interval -gt 64 ]]; then
rdhar
left a comment
There was a problem hiding this comment.
Really impressive, and better still to hear that you've been dogfooding it!
I have to ask, as a contributor, does it make it easier or harder that this Action is composed in Bash/shell rather than JS/TS?
|
Happy to see this shipped with v13.3.1 (v13), where your contribution has been credited! Please consider ⭐ this project, if you or your team find it useful. @jemc BIG thanks (once again!) for contributing this enhancement! I'd still love to know about your experience contributing to this Action, and more than open to any feedback you'd like to share. |
I personally prefer Bash/shell actions over JS/TS ones. Correctly piping stdin/stdout/stderr things around in JS/TS is more ceremony and people often do it poorly as a result. I will say that this action might benefit from moving the more complicated bash/shell steps into dedicated scripts outside the YAML files, though, which would allow for unit-testing individual steps more easily. |
Have to agree on both counts. The initial reason for opting to use Bash is because the Action's sole purpose is to interact with Terraform/OpenTofu CLI and there's no reason to introduce NodeJS dependency into the mix. However, it's since grown from a few dozen lines to over 200 so it stands to gain from some modularization and file separation. Another complicating factor is GitHub dependencies; mainly GitHub CLI. There's heavy use of Purely for ease of GitHub interoperability, use of typescript-action would make more sense, provided that Terraform/OpenTofu outputs can be wrapped accurately. Really just trying to find the balance between maintainability/simplicity and right-tool-for-the-job. |
When you execute a matrix job, and the matrix is dynamic (i.e. based on the output of a previous job; for example, a change detection job that looks for workspaces that had changes), then GitHub won't immediately show the full job list for the current workflow run. You can observe this in the GitHub UI, where the matrix slowly gets more job instances, even as the matrix jobs instances are starting.
Unfortunately, even after a matrix job instance has started executing, it may not be visible in the UI or API yet. This means that the API call which TF-via-PR makes to get the job id is not guaranteed to succeed, and in practice it will reliably fail if the dynamic matrix is large enough (e.g. 50 instances), and if the
identifierstep of theTF-via-PRaction is reached quickly enough.This PR adds a workaround for that issue, where the API call will be retried with exponential backoff, up to a maximum limit of attempts. In practice, this should avoid the race condition without introducing too much complexity, despite being a bit inelegant.