Skip to content

Fix(inputs/mongodb): better handling of conn errors#10086

Closed
papey wants to merge 2 commits intoinfluxdata:masterfrom
bearstech:fix/mongodb-failed-conn-handling
Closed

Fix(inputs/mongodb): better handling of conn errors#10086
papey wants to merge 2 commits intoinfluxdata:masterfrom
bearstech:fix/mongodb-failed-conn-handling

Conversation

@papey
Copy link
Copy Markdown
Contributor

@papey papey commented Nov 10, 2021

Required for all PRs:

resolves #10078

Here is a draft for improvements in the MongoDB input plugin.
The current issue is described and tracked in #10078.
For now, Telegraf exits if a conn to a configured MongoDB server failed on startup.
This PR tries to add a more reliable "retry after" mechanism.

@telegraf-tiger telegraf-tiger bot added the fix pr to fix corresponding bug label Nov 10, 2021
@papey papey marked this pull request as ready for review November 17, 2021 14:39
Copy link
Copy Markdown
Contributor

@sspaink sspaink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking the time to write this pull request, I've added some comments for you to review. I think adding a retry mechanism to this plugin would require some sort of configuration setting for users to opt into it, otherwise this new behavior could break existing setups where people do expect it to fail early. Also including a test for the retry logic would be great.

client, err := mongo.Connect(ctx, opts)
if err != nil {
return fmt.Errorf("unable to connect to MongoDB: %q", err)
m.Log.Errorf("unable to create MongoDB client: %q", err)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still want to return here in case of an error, otherwise client.Ping will get called and this could lead to a panic because client could be nil if there was an error.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is what causes the behavior I wanted to avoid with this PR. If you return an error in an init, Telegraf exits.

// is not reachable try a reconnect
disconnectCtx, disconnectCancel := context.WithTimeout(context.Background(), 1*time.Second)
defer disconnectCancel()
err := srv.client.Disconnect(disconnectCtx)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

client needs to be checked if it isn't nil before calling Disconnect, because src.reachable could be set to false when mongo.Connect fails resulting in a nil client.

err := srv.client.Disconnect(disconnectCtx)
if err != nil {
m.Log.Errorf("unable to reconnect to MongoDB: %q", err)
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we aren't returning an error, this would cause a retry to happen everytime Gather is called right? This could cause a lot of "unable to reconnect to mongoDB" log messages to show up. Maybe adding a configuration option for the user to set how often to retry would help? Or how many times to retry before either exiting or only stop retrying for this instance?

Copy link
Copy Markdown
Contributor Author

@papey papey May 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes totally, this the first behavior we wanted on our side (as a draft I did not invest to much time on it as I wanted to be shure it will have an interest for Telegraf upstream). Can you share an example of this kind of configuration. I think we want it plugin scoped and not instance scoped ?

Comment on lines +180 to +181
connectCtx, connectCancel := context.WithTimeout(context.Background(), 1*time.Second)
defer connectCancel()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reusing the connect context from init might be a good idea, is there a reason you picked 1*time.Second for the timeout?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason for 1 second, it was just looking like a simple default.

@sspaink sspaink added the waiting for response waiting for response from contributor label Apr 27, 2022
@papey
Copy link
Copy Markdown
Contributor Author

papey commented May 6, 2022

Hi @sspaink, thanks for taking the time to review. Here are quick feedback of what I remember from the initial idea behind this PR. I will take another look at the code ASAP to have a fresh look.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label May 6, 2022
@Hipska Hipska added area/mongodb plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins labels Jun 15, 2022
Copy link
Copy Markdown
Contributor

@Hipska Hipska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the trying to connect to MongoDB should be in the Start() method and not in the Init() method. Would you be able to refactor this?

@papey
Copy link
Copy Markdown
Contributor Author

papey commented Jun 18, 2022

Thanks for the feedback @Hipska, I look at it ASAP !

Just one question/detail. I am no longer a member of @bearstech org (where the fork is), I think we shoud close this PR and open a new one with the changes you requested. What do you think ?

@Hipska
Copy link
Copy Markdown
Contributor

Hipska commented Jun 18, 2022

Yeah sure go ahead. Leave a note to reference the new PR

@papey
Copy link
Copy Markdown
Contributor Author

papey commented Aug 6, 2022

Replaced by #11629, thx for the feedback.

@papey papey closed this Aug 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/mongodb fix pr to fix corresponding bug plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[inputs.mongodb] Telegraf crash on init if conn to MongoDB fails

3 participants