Add leader election for autodiscover#20281
Conversation
Signed-off-by: chrismark <chrismarkou92@gmail.com>
❕ Build Aborted
Expand to view the summary
Build stats
Test stats 🧪
Log outputExpand to view the last 100 lines of log output
|
Signed-off-by: chrismark <chrismarkou92@gmail.com>
Signed-off-by: chrismark <chrismarkou92@gmail.com>
Signed-off-by: chrismark <chrismarkou92@gmail.com>
Signed-off-by: chrismark <chrismarkou92@gmail.com>
|
Tested this on a 3-node cluster on GKE. Seems to work:
I will open it for review by now. |
|
Pinging @elastic/integrations-platforms (Team:Platforms) |
Signed-off-by: chrismark <chrismarkou92@gmail.com>
Signed-off-by: chrismark <chrismarkou92@gmail.com>
Signed-off-by: chrismark <chrismarkou92@gmail.com>
Signed-off-by: chrismark <chrismarkou92@gmail.com>
Signed-off-by: chrismark <chrismarkou92@gmail.com>
|
Thanks for the review and the (offline) suggestions @jsoriano! |
Signed-off-by: chrismark <chrismarkou92@gmail.com>
Signed-off-by: chrismark <chrismarkou92@gmail.com>
Signed-off-by: chrismark <chrismarkou92@gmail.com>
Signed-off-by: chrismark <chrismarkou92@gmail.com>
jsoriano
left a comment
There was a problem hiding this comment.
LGTM, only some small details.
|
|
||
| // Start for EventManager interface. | ||
| func (p *leaderElectionManager) Start() { | ||
| ctx, cancel := context.WithCancel(context.Background()) |
There was a problem hiding this comment.
Nit.
| ctx, cancel := context.WithCancel(context.Background()) | |
| ctx, cancel := context.WithCancel(context.TODO()) |
| func (p *leaderElectionManager) startLeaderElector(ctx context.Context, lec leaderelection.LeaderElectionConfig) { | ||
| le, err := leaderelection.NewLeaderElector(lec) | ||
| if err != nil { | ||
| p.logger.Errorf("leader election lock GAINED, id %v", err) |
There was a problem hiding this comment.
This doesn't look like an error 🙂 Log it at the info or debug level?
There was a problem hiding this comment.
This actually an error..but with wrong error message. This piece is "copied" from RunOrDie (https://github.com/kubernetes/client-go/blob/36233866f1c7c0ad3bdac1fc466cb5de3746cfa2/tools/leaderelection/leaderelection.go#L214). See the watchdog related comment for more info.
| if err != nil { | ||
| p.logger.Errorf("leader election lock GAINED, id %v", err) | ||
| } | ||
| if lec.WatchDog != nil { |
There was a problem hiding this comment.
What is this watchdog? we are never setting it.
There was a problem hiding this comment.
This was mostly because I tried to "copy" the RunOrDie of https://github.com/kubernetes/client-go/blob/36233866f1c7c0ad3bdac1fc466cb5de3746cfa2/tools/leaderelection/leaderelection.go#L219, because of an async problem described at https://github.com/kubernetes/client-go/issues/837 .
Good catch I will remove it, it's redundant.
CHANGELOG.next.asciidoc
Outdated
| - Added the `max_cached_sessions` option to the script processor. {pull}19562[19562] | ||
| - Add support for DNS over TLS for the dns_processor. {pull}19321[19321] | ||
| - Set index.max_docvalue_fields_search in index template to increase value to 200 fields. {issue}20215[20215] | ||
| - Add leader election for autodiscover. {pull}20281[20281] |
There was a problem hiding this comment.
| - Add leader election for autodiscover. {pull}20281[20281] | |
| - Add leader election for Kubernetes autodiscover. {pull}20281[20281] |
Signed-off-by: chrismark <chrismarkou92@gmail.com>
|
jenkins run the tests please |
jsoriano
left a comment
There was a problem hiding this comment.
LGTM, but let's wait to have a greener build.
💔 Tests FailedExpand to view the summary
Build stats
Test stats 🧪
Test errorsExpand to view the tests failures
Steps errorsExpand to view the steps failures
Log outputExpand to view the last 100 lines of log output
|
|
@jsoriano I think the only issue with the tests is https://travis-ci.org/github/elastic/beats/jobs/716456468#L1590 which is also failing in general currently. I think we can proceed with merging, wdyt? |
Yes, I think the issue in CI is not related. |
(cherry picked from commit 9ab9b97)
After PR elastic#20281 was merged, the signature of the `autodiscover.ProviderBuilder` changed to accept an additional name (string) parameter.
What does this PR do?
Implements leader election as part of kubernetes autodiscover provider. With this, when the beat is elected as leader it will start cluster scope metricsets.
Under the hood, all leader candidates will try to gain the lock on Lease object (https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#lease-v1-coordination-k8s-io). When the lock is gained then the configured cluster wide mericsets will get started.
Why is it important?
To get rid of
Deploymentprocedure for the singleton Metricbeat instance which is needed for the cluster scope metricsets.Sample configuration that will start
state_nodemetricset when a leadership is gained:Then we can monitor the
leaseto check which provider holds the lock:Checklist
CHANGELOG.next.asciidocorCHANGELOG-developer.next.asciidoc.How to test this PR locally
It is recommended to testing this along with #20512 for easiness of configuration/setup etc
Make sure that events for
state_nodeare being collected from only one metricbeat instance. You can the in the logs and verify that only one metricbeat has gained the lock so far.Monitor the
leaseobject to keep track of the lock holder:watch kubectl describe lease beats-cluster-leaderStop the Metricbeat instance that currently holds the lock and make sure that another one takes over and we keep getting events from
state-node. Make sure that theleasechanged holder (logs will also let us know that a new metricset is getting started)Also test without setting
identifierand check that it takes the default value properly likemetricbeat-cluster-leader.Check with old autodiscover (template based/hints based) for possible regressions.
Related issues