o365input: Restart after fatal error #21258
Merged
adriansr merged 3 commits into elastic:master on Sep 29, 2020
Update the o365input to restart the input after a fatal error is encountered, for example an authentication token refresh error or a parsing error. This enables the input to be more resilient against transient errors. Before this patch, the input would index an error document and terminate. Now it will index an error and restart after a fixed timeout of 5 minutes.
Contributor
Pinging @elastic/siem (Team:SIEM)
Contributor
Author
@urso can you have a quick look? Is there a better way to "restart" an input?
Contributor
Thanks for taking this on and resolving it. Can you please let me know when this is planned to be rolled out in a new version?
marc-gr
approved these changes
Sep 28, 2020
Contributor
Author
@27bharath this fix will be available in the next versions, 7.9.3 and 7.10.0
v1v
added a commit
to v1v/beats
that referenced
this pull request
Sep 29, 2020
* upstream/master:
  feat: prepare release pipelines (elastic#21238)
  Add IP validation to Security module (elastic#21325)
  Fixes for new 7.10 rsa2elk datasets (elastic#21240)
  o365input: Restart after fatal error (elastic#21258)
  Fix panic in cgroups monitoring (elastic#21355)
  Handle multiple upstreams in ingress-controller (elastic#21215)
  [CI] Fix runbld when workspace does not exist (elastic#21350)
  [Filebeat] Fix checkpoint (elastic#21344)
  [CI] Archive build reasons (elastic#21347)
  Add dashboard for pubsub metricset in googlecloud module (elastic#21326)
  [Elastic Agent] Allow embedding of certificate (elastic#21179)
  Adds a default for failure_cache.min_ttl (elastic#21085)
  [libbeat] Disk queue implementation (elastic#21176)
adriansr
added a commit
to adriansr/beats
that referenced
this pull request
Sep 29, 2020
Update the o365input to restart the input after a fatal error is encountered, for example an authentication token refresh error or a parsing error. This enables the input to be more resilient against transient errors. Before this patch, the input would index an error document and terminate. Now it will index an error and restart after a fixed timeout of 5 minutes. (cherry picked from commit 8716d98)
adriansr
added a commit
that referenced
this pull request
Sep 30, 2020
adriansr
added a commit
to adriansr/beats
that referenced
this pull request
Oct 5, 2020
adriansr
added a commit
that referenced
this pull request
Oct 6, 2020
urso
reviewed
Oct 6, 2020
    publisher.Publish(event, nil)
    ctx.Logger.Errorf("Input failed: %v", err)
    ctx.Logger.Infof("Restarting in %v", failureRetryInterval)
    time.Sleep(failureRetryInterval)
This can block proper Filebeat shutdown for 5 minutes. Better to use timed.Wait from go-concert, which is cancellable.
adriansr
added a commit
to adriansr/beats
that referenced
this pull request
Oct 7, 2020
PR elastic#21258 introduced a restart mechanism for o365input so that it didn't stop working once a fatal error was found. This updates the restart delay to use a cancellation-context-aware method so that the input doesn't block Filebeat termination.
adriansr
added a commit
that referenced
this pull request
Oct 9, 2020
adriansr
added a commit
to adriansr/beats
that referenced
this pull request
Oct 9, 2020
PR elastic#21258 introduced a restart mechanism for o365input so that it didn't stop working once a fatal error was found. This updates the restart delay to use a cancellation-context-aware method so that the input doesn't block Filebeat termination. (cherry picked from commit 1abe97b)
adriansr
added a commit
to adriansr/beats
that referenced
this pull request
Oct 9, 2020
adriansr
added a commit
to adriansr/beats
that referenced
this pull request
Oct 9, 2020
leweafan
pushed a commit
to leweafan/beats
that referenced
this pull request
Apr 28, 2023
elastic#21386) Update the o365input to restart the input after a fatal error is encountered, for example an authentication token refresh error or a parsing error. This enables the input to be more resilient against transient errors. Before this patch, the input would index an error document and terminate. Now it will index an error and restart after a fixed timeout of 5 minutes. (cherry picked from commit c723c1e)
leweafan
pushed a commit
to leweafan/beats
that referenced
this pull request
Apr 28, 2023
PR elastic#21258 introduced a restart mechanism for o365input so that it didn't stop working once a fatal error was found. This updates the restart delay to use a cancellation-context-aware method so that the input doesn't block Filebeat termination. (cherry picked from commit f2ab428)
What does this PR do?
Updates o365input to restart the input after a fatal error is encountered, for example an authentication token refresh error or a parsing error. This enables the input to be more resilient against errors.
Before this patch, the input would index an error document and terminate. Now it will index an error and restart after a fixed timeout of 5 minutes.
Why is it important?
Some users are reporting that the o365 module stops ingesting events after some days. In all cases it's been observed that the input terminated at some point due to errors contacting the Azure authentication server to refresh a token.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files
[ ] I have added tests that prove my fix is effective or that my feature works
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
Author's Checklist
How to test this PR locally
Testing the case of token refresh errors is difficult, as tokens are refreshed only once every ~12h. But the restart behavior can be tested by starting Filebeat without an internet connection.