Fix: Adding a retry mechanism in case the addMoreGameServers function call fails.#4214
Conversation
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
|
/gcbrun |
|
Build Failed 😭 Build Id: b17dfa63-46dd-49e0-84c0-c2cb31558f88 Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
|
Build Succeeded 🥳 Build Id: 74c02649-f424-490b-ad05-ee6459b3b3e5 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version: |
|
/gcbrun |
|
Build Succeeded 🥳 Build Id: bf8d4493-7f6a-4ab0-833e-8133900ef684 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version: |
|
/gcbrun |
|
Build Succeeded 🥳 Build Id: 7a741d83-08db-4428-b04e-9924722a854c The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version: |
What type of PR is this?
/kind bug
Which issue(s) this PR fixes:
Closes #4162
Fix GameServer Not Created on addMoreGameServers Failure
What this PR does / Why we need it
affected version <= 1.50.0The core of my pull request is that I identified an issue where, if
all GameServersare deleted and the addMoreGameServers function fails, no further GameServers are created. To resolve this, I added logic to reprocess the GameServerSet usingc.workerqueue.EnqueueAfter.Note
In the case of partial GameServer deletion (not all GameServers), the GameServer informer’s resync triggers an Update event. This leads to a call to the computeReconciliationAction function, which calculates the number of GameServers to add (numServersToAdd), followed by a call to the addMoreGameServers function.
Without this fix, since no GameServers exist, there are no further GameServer informer updates or resyncs triggered. As a result, the addMoreGameServers function is never called again, and no new GameServers are created.
POC
Imagine that the Agones Controller and Agones Extension are deployed on top of Kubernetes, and there is one GameServerSet (test-gss) with two GameServers (test-gss-1 and test-gss-2).
Delete the two GameServers (test-gss-1 and test-gss-2) while the Agones Extension is in the TLS certs changing state.
Note
Delete all deployed GameServers at once.
# Agones Extension Pod http: TLS handshake error from ip:port: EOF http: TLS handshake error from ip:port: EOF http: TLS handshake error from ip:port: EOF [...]# my cli terminal kubectl delete pod test-gss-1 test-gss-2 -n defaultIn a normal scenario, the addMoreGameServers function is triggered when a GameServer transitions to the Shutdown state or when its Pod is deleted.
but we can see this error message when invoke
addMoreGameServers{ "error": "error creating gameserver for gameserverset test-gss: Internal error occurred: failed calling webhook \"mutations.agones.dev\": failed to call webhook: Post \"https://agones-controller-service.agones-system.svc:443/mutate?timeout=10s\": tls: failed to verify certificate: x509~~" }Due to the above error, the addMoreGameServers function is no longer invoked.
Solution
The solution is simple: if the addMoreGameServers function call fails, it should be reprocessed using
c.workerqueue.EnqueueAfter.Limitation
A potential drawback is that retries may continue indefinitely until success.
If you have any suggestions for a better approach, I'd appreciate your comments.
Special notes for your reviewer:
N/A