Skip to content

fix: directory monitor input plugin when data format is CSV and csv_skip_rows>0 and csv_header_row_count>=1#9865

Merged
reimda merged 6 commits intoinfluxdata:masterfrom
etycomputer:fix-csv-skip-row-for-input-plugins
Nov 16, 2021
Merged

fix: directory monitor input plugin when data format is CSV and csv_skip_rows>0 and csv_header_row_count>=1#9865
reimda merged 6 commits intoinfluxdata:masterfrom
etycomputer:fix-csv-skip-row-for-input-plugins

Conversation

@etycomputer
Copy link
Copy Markdown
Contributor

@etycomputer etycomputer commented Oct 6, 2021

Required for all PRs:

resolves #9898
resolves #9903
resolves #5338

🐛 Fixed an issue with directory monitor input plugin when data format is CSV and csv_skip_rows>0 and csv_header_row_count>=1
✅ Added new test cases that cover these scenarios

@telegraf-tiger
Copy link
Copy Markdown
Contributor

telegraf-tiger bot commented Oct 6, 2021

Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

@telegraf-tiger telegraf-tiger bot added the fix pr to fix corresponding bug label Oct 6, 2021
@etycomputer
Copy link
Copy Markdown
Contributor Author

!signed-cla

@telegraf-tiger
Copy link
Copy Markdown
Contributor

telegraf-tiger bot commented Oct 6, 2021

Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

2 similar comments
@telegraf-tiger
Copy link
Copy Markdown
Contributor

telegraf-tiger bot commented Oct 6, 2021

Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

@telegraf-tiger
Copy link
Copy Markdown
Contributor

telegraf-tiger bot commented Oct 6, 2021

Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

@etycomputer
Copy link
Copy Markdown
Contributor Author

!signed-cla

@etycomputer
Copy link
Copy Markdown
Contributor Author

For this PR, did I have to create an issue first?

@powersj
Copy link
Copy Markdown
Contributor

powersj commented Oct 6, 2021

It would be good to create an issue describing a bit more of what you are fixing, expected vs actual results, etc :) Then you can update the PR to reference that issue #.

Also, note that the lint tests failed, so you will want to clean those up.

Thanks!

@etycomputer etycomputer force-pushed the fix-csv-skip-row-for-input-plugins branch 4 times, most recently from 5c18217 to d9dc03c Compare October 10, 2021 06:43

// Related to CSV data format type
CSVHeaderRowCount int `toml:"csv_header_row_count"`
CSVSkipRows int `toml:"csv_skip_rows"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this apply to other parsers as well, say if they have a comment?

Also, could this simplify this to a single "SkipHeaderRowCount" to define the number of rows to skip at the top of a file? I don't think we need two different values that are only used to add each other together.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Joshua,

I have looked at the documentation, and the only code pattern I could find similar was in the inputs.tail. I realized this while I was fixing this issue and would create a new issue for that. I would not resolve it in this job. It is better to keep things small and simple. The other reason I would not like to tough the inputs.tail is its use cases.

Regarding your second comment, I did not want to introduce a new parameter for inputs.directory_monitor this is a parse parameter. I am just adding the extra support for it. In this job, we don't just want to skip rows; the header rows we do want to read and parse.

I hope that answers your questions and concerns.

Kind regards,
Ehsan

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just created this issue #9903.
I would need more feedback on it before I make the change.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any parser option works already out of the box. I don't see a reason why this should be added here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I have already created test cases that fail without the changes.

The issue is that this input reads the file one line at a time.

The parse is designed to have all the data at the same time.

That is why this input type and tail would fail the tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the input is designed around processing large files. We dont want to read the whole file. We have another input type that reads the whole file. We need to fix this issue without making drastic modifications that change the initial input specifications and requirements.

What I have done is just that. I have not changed any input parameters and have just resolved the issue by reading the skip rows and headers all at once before reading the rest of the file.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I understand, but it seems this is also a problem with the tail plugin currently. Or not?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is why I created #9903 to deal with tail

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, it looks like this plugin does not work with xml files that get dropped into a directory?
(since it is reading it line by line) I think an option wether or not to read the complete file would be better I guess.

Copy link
Copy Markdown
Contributor Author

@etycomputer etycomputer Oct 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tested this for XML?

I agree, having that option makes things easier. But, that would be a feature to add. I will create a new job for that #9936. I like this feature.
I do however like to fix this issue, since lots of people have problems with it, and they might have large files that they like to process.
This new flag should only be added to the input.directory_monitor. input.tail is designed to tail a file and read it one line at a time. and we already have inputs.file that does read the whole file.

@Hipska Hipska requested a review from srebhan October 14, 2021 14:16
Copy link
Copy Markdown
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @etycomputer, thanks for the PR! I don't think this is the right place to fix this. Instead you should change the CSV parser's expectation of getting all rows at once by decrementing the SkipRows by the rows you already skipped and check if there are remaining ones... This will then also fix tail.

@srebhan srebhan self-assigned this Oct 15, 2021
@etycomputer
Copy link
Copy Markdown
Contributor Author

etycomputer commented Oct 15, 2021

Hey @etycomputer, thanks for the PR! I don't think this is the right place to fix this. Instead you should change the CSV parser's expectation of getting all rows at once by decrementing the SkipRows by the rows you already skipped and check if there are remaining ones... This will then also fix tail.

Hi Sven,
Your approach would work but would require a lot of changes.
It is not as easy as just decrementing one variable. That would have been my first choice. Both directory_monitor and tail, first parse the first line, which they are assuming contains the header, then for everyone after that they run the parseLine().
The way the current CSV parser is written, we have the parse() function that handles both skips and headers, and we have the parseLine() that only handles rows. So I would have to alter both functions so that they handle both.
The parseLine() has a comment at the top the explicitly outlines the expectation of when this function should be used, and that is when the headers are processed.
ParseLine does not use any information in header and assumes DataColumns is set it will also not skip any rows
Do you still think that is the best approach? I still think what I have done has the lowest impact.
Kind regards,
Ehsan

@etycomputer etycomputer requested review from Hipska and srebhan October 15, 2021 13:59
@etycomputer
Copy link
Copy Markdown
Contributor Author

Hi @srebhan, can you please, let me know if you still think I need to make any changes.
Thank

@srebhan
Copy link
Copy Markdown
Member

srebhan commented Oct 19, 2021

@etycomputer we will bring this topic up with the core developers and let you know what they think...

@etycomputer
Copy link
Copy Markdown
Contributor Author

etycomputer commented Oct 20, 2021

Hi @srebhan, can you tell me when you are going to have your core developer meeting?

@Hipska
Copy link
Copy Markdown
Contributor

Hipska commented Oct 20, 2021

That would be today in a few hours, we will come back to you after that.

@etycomputer
Copy link
Copy Markdown
Contributor Author

etycomputer commented Oct 21, 2021

What did the dev team decided in the meeting?

@srebhan
Copy link
Copy Markdown
Member

srebhan commented Nov 9, 2021

So, do you think this is ready to be merged or do we need to get feedback from anyone else like @srebhan?

After my approval (which you just got) I set the "ready for final review" tag. With this someone from the InfluxData telegraf team will give the PR another review and, if he/she does not have any comment, merge it.

Also, do I need to update the CHANGELOG or you is it done automatically?

You don't have to care for this, it's done (semi)automatically IIRC.

@etycomputer
Copy link
Copy Markdown
Contributor Author

Thank you very much @srebhan you were also have me good comments and pointed out excellent alternative solution. I really enjoyed working with you on this PR.

Kind regarding,
Ehsan

@etycomputer
Copy link
Copy Markdown
Contributor Author

Hey @powersj,

Are you able to do the final review and push my changes into master, please?

Cheers,
Ehsan

@etycomputer etycomputer force-pushed the fix-csv-skip-row-for-input-plugins branch from 3b30d90 to 5d59242 Compare November 10, 2021 21:39
@telegraf-tiger
Copy link
Copy Markdown
Contributor

🥳 This pull request decreases the Telegraf binary size by -0.02 % for linux amd64 (new size: 131.5 MB, nightly size 131.5 MB)

📦 Looks like new artifacts were built from this PR.

Expand this list to get them **here**! 🐯

Artifact URLs

DEB RPM TAR GZ ZIP
amd64.deb aarch64.rpm darwin_amd64.tar.gz windows_amd64.zip
arm64.deb armel.rpm freebsd_amd64.tar.gz windows_i386.zip
armel.deb armv6hl.rpm freebsd_armv7.tar.gz
armhf.deb i386.rpm freebsd_i386.tar.gz
i386.deb ppc64le.rpm linux_amd64.tar.gz
mips.deb s390x.rpm linux_arm64.tar.gz
mipsel.deb x86_64.rpm linux_armel.tar.gz
ppc64el.deb linux_armhf.tar.gz
s390x.deb linux_i386.tar.gz
linux_mips.tar.gz
linux_mipsel.tar.gz
linux_ppc64le.tar.gz
linux_s390x.tar.gz
static_linux_amd64.tar.gz

@srebhan
Copy link
Copy Markdown
Member

srebhan commented Nov 11, 2021

@etycomputer please be patient waiting for the final review. Rest assured that your PR is on the list and will not be forgotten. As the change is touching a tricky part of telegraf, it might be possible your PR will only be included in the next version (1.21).

@etycomputer
Copy link
Copy Markdown
Contributor Author

Hey @srebhan,

Sorry, the reason I wanted this change to go in was because my metadata feature that I wanted to add depends on these changes.

I have already based my new PR #10083 off this branch. So most probably they will all go in at the same time.

Is there a timeline when you are expecting to go to the next version?

It would be awesome if you could also have a look at #10083 and let me know what you think.

I have just started programming in Go and I really like it so far.

I would be happy to work on some of the smaller jobs if you come across any please let me know.

Kind regards,
Ehsan

@etycomputer etycomputer requested a review from reimda November 12, 2021 01:55
@telegraf-tiger
Copy link
Copy Markdown
Contributor

@etycomputer
Copy link
Copy Markdown
Contributor Author

Hey @reimda,

Thanks for reviewing my work.
I have added the extra test you requested.

Regarding your other comment, I have chosen to use the combination to cover scenarios that someone may use the parse.Parse() for its first line and the parse.ParseLine() for any line after that. This was the approach that is currently is being used on the master branch but going forward I have removed the need to differentiate between the initial line (header column) and other lines (records).

Let me know if you like to change the p.parse() to p.ParseLine().

Kind regards,
Ehsan

ehsan-yazdi and others added 6 commits November 12, 2021 12:10
🐛 Merging the CSV Parse and ParseLine Code into one function
✅ Updating the Test to code the scenario that the CSV file is parsed line by line
🐛 Modifying the directory_monitor and tail input plugin to handle the new error exceptions
✅ Adding new tests to the directory_monitor and tail input plugin to cover the new scenarios
🚨 Run Go fmt on all altered files
✅ Added and Updated unit tests
@etycomputer etycomputer force-pushed the fix-csv-skip-row-for-input-plugins branch from 2cf9e2d to 1c938e7 Compare November 12, 2021 02:11
@telegraf-tiger
Copy link
Copy Markdown
Contributor

@srebhan
Copy link
Copy Markdown
Member

srebhan commented Nov 16, 2021

I would be happy to work on some of the smaller jobs if you come across any please let me know.

@etycomputer, you might want to go through help wanted PRs or check the issues (best would be bug reports) if you can find anything you like. Furthermore, you might want to join our Slack channel #telegraf-dev and/or #telegraf. Happy to see you there. :-)

@reimda reimda merged commit db86904 into influxdata:master Nov 16, 2021
@etycomputer
Copy link
Copy Markdown
Contributor Author

Thank you All for helping me with this PR.

@srebhan
Copy link
Copy Markdown
Member

srebhan commented Nov 17, 2021

@etycomputer thank you for fixing this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/csv csv parser/serialiser related fix pr to fix corresponding bug ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.

Projects

None yet

7 participants