Fix(inputs/mongodb): better handling of conn errors#10086
Fix(inputs/mongodb): better handling of conn errors#10086papey wants to merge 2 commits intoinfluxdata:masterfrom
Conversation
There was a problem hiding this comment.
Thank you for taking the time to write this pull request, I've added some comments for you to review. I think adding a retry mechanism to this plugin would require some sort of configuration setting for users to opt into it, otherwise this new behavior could break existing setups where people do expect it to fail early. Also including a test for the retry logic would be great.
| client, err := mongo.Connect(ctx, opts) | ||
| if err != nil { | ||
| return fmt.Errorf("unable to connect to MongoDB: %q", err) | ||
| m.Log.Errorf("unable to create MongoDB client: %q", err) |
There was a problem hiding this comment.
I think we still want to return here in case of an error, otherwise client.Ping will get called and this could lead to a panic because client could be nil if there was an error.
There was a problem hiding this comment.
I think this is what causes the behavior I wanted to avoid with this PR. If you return an error in an init, Telegraf exits.
| // is not reachable try a reconnect | ||
| disconnectCtx, disconnectCancel := context.WithTimeout(context.Background(), 1*time.Second) | ||
| defer disconnectCancel() | ||
| err := srv.client.Disconnect(disconnectCtx) |
There was a problem hiding this comment.
client needs to be checked if it isn't nil before calling Disconnect, because src.reachable could be set to false when mongo.Connect fails resulting in a nil client.
| err := srv.client.Disconnect(disconnectCtx) | ||
| if err != nil { | ||
| m.Log.Errorf("unable to reconnect to MongoDB: %q", err) | ||
| return |
There was a problem hiding this comment.
Because we aren't returning an error, this would cause a retry to happen everytime Gather is called right? This could cause a lot of "unable to reconnect to mongoDB" log messages to show up. Maybe adding a configuration option for the user to set how often to retry would help? Or how many times to retry before either exiting or only stop retrying for this instance?
There was a problem hiding this comment.
Yes totally, this the first behavior we wanted on our side (as a draft I did not invest to much time on it as I wanted to be shure it will have an interest for Telegraf upstream). Can you share an example of this kind of configuration. I think we want it plugin scoped and not instance scoped ?
| connectCtx, connectCancel := context.WithTimeout(context.Background(), 1*time.Second) | ||
| defer connectCancel() |
There was a problem hiding this comment.
Reusing the connect context from init might be a good idea, is there a reason you picked 1*time.Second for the timeout?
There was a problem hiding this comment.
No reason for 1 second, it was just looking like a simple default.
|
Hi @sspaink, thanks for taking the time to review. Here are quick feedback of what I remember from the initial idea behind this PR. I will take another look at the code ASAP to have a fresh look. |
Hipska
left a comment
There was a problem hiding this comment.
Actually, the trying to connect to MongoDB should be in the Start() method and not in the Init() method. Would you be able to refactor this?
|
Thanks for the feedback @Hipska, I look at it ASAP ! Just one question/detail. I am no longer a member of @bearstech org (where the fork is), I think we shoud close this PR and open a new one with the changes you requested. What do you think ? |
|
Yeah sure go ahead. Leave a note to reference the new PR |
|
Replaced by #11629, thx for the feedback. |
Required for all PRs:
resolves #10078
Here is a draft for improvements in the MongoDB input plugin.
The current issue is described and tracked in #10078.
For now, Telegraf exits if a conn to a configured MongoDB server failed on startup.
This PR tries to add a more reliable "retry after" mechanism.