Skip to content

Fix handling of "leader changed" errors#11426

Merged
hickeyma merged 2 commits intohelm:mainfrom
cenkalti:leader-changed
Nov 16, 2022
Merged

Fix handling of "leader changed" errors#11426
hickeyma merged 2 commits intohelm:mainfrom
cenkalti:leader-changed

Conversation

@cenkalti
Copy link
Copy Markdown
Contributor

What this PR does / why we need it:
This PR aims to fix temporary "etcdserver: leader changed" errors from kube-apiserver that is previously attempted by #11401 by adding a single retry when this kind of error is detected.

/cc @dims

Special notes for your reviewer:
I also reverted the previous fix that didn't solve the issue. One commit is the revert, the other one is the new fix. It's better if you review commits separately.

If applicable:

  • this PR contains documentation
  • this PR contains unit tests
  • this PR has been tested for backwards compatibility

@pull-request-size pull-request-size bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 11, 2022
@dims
Copy link
Copy Markdown
Contributor

dims commented Oct 11, 2022

@hickeyma @technosophos @mattfarina Folks, @cenkalti tested a bunch of scenarios recreating the problem(s) with etcd with this patch.

This reverts commit ebc79fa.

Signed-off-by: Cenk Alti <cenkalti@gmail.com>
Signed-off-by: Cenk Alti <cenkalti@gmail.com>
@cenkalti
Copy link
Copy Markdown
Contributor Author

Hey @technosophos @hickeyma. Bumping up this for visibility. It's a small fix. Can you take a look please?

Copy link
Copy Markdown
Contributor

@hickeyma hickeyma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cenkalti thanks for pushing the PR. The code looks good in principle. However, we don't have any easy way of testing this and this is the 2nd fix following #11401. Would it be possible to provide some tests for this?

@cenkalti
Copy link
Copy Markdown
Contributor Author

cenkalti commented Nov 8, 2022

@hickeyma I tested this fix manually by setting up a custom k8s cluster with 3 etcd nodes and running etcdctl move-leader with random member ID continuously to trigger the error case. Even with this special setup, the error does not occur every time when I run helm install command as the timing of API calls are very critical that they must overlap with the leader change event.

I checked https://github.com/helm/acceptance-testing project to see If I can write this scenario as an acceptance test but I couldn't find a solution as the clusters are created with kind and there is no easy way to trigger a leader change event.

Copy link
Copy Markdown
Contributor

@hickeyma hickeyma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @cenkalti

@hickeyma
Copy link
Copy Markdown
Contributor

hickeyma commented Nov 8, 2022

Needs another maintainer approval for merge.

@hickeyma hickeyma requested a review from dims November 8, 2022 16:22
Copy link
Copy Markdown
Contributor

@dims dims left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dims
Copy link
Copy Markdown
Contributor

dims commented Nov 8, 2022

thanks @cenkalti @hickeyma

@cenkalti
Copy link
Copy Markdown
Contributor Author

@technosophos @mattfarina Can you take a look please?

Copy link
Copy Markdown
Member

@technosophos technosophos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@technosophos
Copy link
Copy Markdown
Member

@hickeyma feel free to merge whenever you feel good about it.

@hickeyma hickeyma merged commit 3974136 into helm:main Nov 16, 2022
@hickeyma hickeyma added this to the 3.10.3 milestone Nov 16, 2022
@hickeyma hickeyma added the needs-pick Indicates that a PR needs to be cherry-picked into the next release candidate. label Nov 16, 2022
@hickeyma hickeyma modified the milestones: 3.10.3, 3.11.0 Dec 14, 2022
@hickeyma hickeyma removed the needs-pick Indicates that a PR needs to be cherry-picked into the next release candidate. label Dec 14, 2022
@sruthiwander
Copy link
Copy Markdown

Hi, Any chance this could be patched to the 3.2.x release.

luisdavim added a commit to luisdavim/helm-controller that referenced this pull request Oct 8, 2024
Fixes fluxcd/flux2/#4804 by copying the solution used in helm/helm#11426
luisdavim added a commit to luisdavim/helm-controller that referenced this pull request Oct 8, 2024
Fixes fluxcd/flux2/#4804 by copying the solution used in helm/helm#11426

Signed-off-by: Luis Davim <luis.davim@gmail.com>
luisdavim added a commit to luisdavim/helm-controller that referenced this pull request Nov 16, 2024
Fixes fluxcd/flux2/#4804 by copying the solution used in helm/helm#11426

Signed-off-by: Luis Davim <luis.davim@gmail.com>
luisdavim added a commit to luisdavim/helm-controller that referenced this pull request Nov 16, 2024
Fixes fluxcd/flux2/#4804 by copying the solution used in helm/helm#11426

Signed-off-by: Luis Davim <luis.davim@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants