fix: directory monitor input plugin when data format is CSV and csv_skip_rows>0 and csv_header_row_count>=1 by etycomputer · Pull Request #9865 · influxdata/telegraf

etycomputer · 2021-10-06T00:44:52Z

Required for all PRs:

Updated associated README.md.
Wrote appropriate unit tests.
Pull request title or commits are in conventional commit format

resolves #9898
resolves #9903
resolves #5338

🐛 Fixed an issue with directory monitor input plugin when data format is CSV and csv_skip_rows>0 and csv_header_row_count>=1
✅ Added new test cases that cover these scenarios

telegraf-tiger · 2021-10-06T00:44:55Z

Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

etycomputer · 2021-10-06T00:45:10Z

!signed-cla

telegraf-tiger · 2021-10-06T00:45:12Z

Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

telegraf-tiger · 2021-10-06T02:42:35Z

Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

telegraf-tiger · 2021-10-06T03:09:41Z

Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

etycomputer · 2021-10-06T03:20:03Z

!signed-cla

telegraf-tiger · 2021-10-06T03:25:39Z

Looks like new artifacts were built from this PR. Get them here!

Artifact URLs

etycomputer · 2021-10-06T19:07:48Z

For this PR, did I have to create an issue first?

powersj · 2021-10-06T19:10:16Z

It would be good to create an issue describing a bit more of what you are fixing, expected vs actual results, etc :) Then you can update the PR to reference that issue #.

Also, note that the lint tests failed, so you will want to clean those up.

Thanks!

powersj · 2021-10-11T14:45:44Z

plugins/inputs/directory_monitor/directory_monitor.go


+	// Related to CSV data format type
+	CSVHeaderRowCount int `toml:"csv_header_row_count"`
+	CSVSkipRows       int `toml:"csv_skip_rows"`


Could this apply to other parsers as well, say if they have a comment?

Also, could this simplify this to a single "SkipHeaderRowCount" to define the number of rows to skip at the top of a file? I don't think we need two different values that are only used to add each other together.

Hi Joshua,

I have looked at the documentation, and the only code pattern I could find similar was in the inputs.tail. I realized this while I was fixing this issue and would create a new issue for that. I would not resolve it in this job. It is better to keep things small and simple. The other reason I would not like to tough the inputs.tail is its use cases.

Regarding your second comment, I did not want to introduce a new parameter for inputs.directory_monitor this is a parse parameter. I am just adding the extra support for it. In this job, we don't just want to skip rows; the header rows we do want to read and parse.

I hope that answers your questions and concerns.

Kind regards,
Ehsan

I just created this issue #9903.
I would need more feedback on it before I make the change.

Any parser option works already out of the box. I don't see a reason why this should be added here.

Hi, I have already created test cases that fail without the changes.

The issue is that this input reads the file one line at a time.

The parse is designed to have all the data at the same time.

That is why this input type and tail would fail the tests.

No, the input is designed around processing large files. We dont want to read the whole file. We have another input type that reads the whole file. We need to fix this issue without making drastic modifications that change the initial input specifications and requirements.

What I have done is just that. I have not changed any input parameters and have just resolved the issue by reading the skip rows and headers all at once before reading the rest of the file.

Okay, I understand, but it seems this is also a problem with the tail plugin currently. Or not?

Yes, that is why I created #9903 to deal with tail

Still, it looks like this plugin does not work with xml files that get dropped into a directory?
(since it is reading it line by line) I think an option wether or not to read the complete file would be better I guess.

Have you tested this for XML?

I agree, having that option makes things easier. But, that would be a feature to add. I will create a new job for that #9936. I like this feature.
I do however like to fix this issue, since lots of people have problems with it, and they might have large files that they like to process.
This new flag should only be added to the input.directory_monitor. input.tail is designed to tail a file and read it one line at a time. and we already have inputs.file that does read the whole file.

plugins/inputs/directory_monitor/directory_monitor.go

srebhan

Hey @etycomputer, thanks for the PR! I don't think this is the right place to fix this. Instead you should change the CSV parser's expectation of getting all rows at once by decrementing the SkipRows by the rows you already skipped and check if there are remaining ones... This will then also fix tail.

etycomputer · 2021-10-15T13:49:25Z

Hey @etycomputer, thanks for the PR! I don't think this is the right place to fix this. Instead you should change the CSV parser's expectation of getting all rows at once by decrementing the SkipRows by the rows you already skipped and check if there are remaining ones... This will then also fix tail.

Hi Sven,
Your approach would work but would require a lot of changes.
It is not as easy as just decrementing one variable. That would have been my first choice. Both directory_monitor and tail, first parse the first line, which they are assuming contains the header, then for everyone after that they run the parseLine().
The way the current CSV parser is written, we have the parse() function that handles both skips and headers, and we have the parseLine() that only handles rows. So I would have to alter both functions so that they handle both.
The parseLine() has a comment at the top the explicitly outlines the expectation of when this function should be used, and that is when the headers are processed.
ParseLine does not use any information in header and assumes DataColumns is set it will also not skip any rows
Do you still think that is the best approach? I still think what I have done has the lowest impact.
Kind regards,
Ehsan

etycomputer · 2021-10-18T18:20:57Z

Hi @srebhan, can you please, let me know if you still think I need to make any changes.
Thank

srebhan · 2021-10-19T07:18:49Z

@etycomputer we will bring this topic up with the core developers and let you know what they think...

etycomputer · 2021-10-20T10:58:38Z

Hi @srebhan, can you tell me when you are going to have your core developer meeting?

Hipska · 2021-10-20T11:23:07Z

That would be today in a few hours, we will come back to you after that.

etycomputer · 2021-10-21T09:25:40Z

What did the dev team decided in the meeting?

srebhan · 2021-11-09T08:42:18Z

So, do you think this is ready to be merged or do we need to get feedback from anyone else like @srebhan?

After my approval (which you just got) I set the "ready for final review" tag. With this someone from the InfluxData telegraf team will give the PR another review and, if he/she does not have any comment, merge it.

Also, do I need to update the CHANGELOG or you is it done automatically?

You don't have to care for this, it's done (semi)automatically IIRC.

etycomputer · 2021-11-09T12:05:26Z

Thank you very much @srebhan you were also have me good comments and pointed out excellent alternative solution. I really enjoyed working with you on this PR.

Kind regarding,
Ehsan

etycomputer · 2021-11-10T13:55:27Z

Hey @powersj,

Are you able to do the final review and push my changes into master, please?

Cheers,
Ehsan

telegraf-tiger · 2021-11-10T22:00:07Z

🥳 This pull request decreases the Telegraf binary size by -0.02 % for linux amd64 (new size: 131.5 MB, nightly size 131.5 MB)

📦 Looks like new artifacts were built from this PR.

Expand this list to get them **here**! 🐯

Artifact URLs

DEB	RPM	TAR GZ	ZIP
amd64.deb	aarch64.rpm	darwin_amd64.tar.gz	windows_amd64.zip
arm64.deb	armel.rpm	freebsd_amd64.tar.gz	windows_i386.zip
armel.deb	armv6hl.rpm	freebsd_armv7.tar.gz
armhf.deb	i386.rpm	freebsd_i386.tar.gz
i386.deb	ppc64le.rpm	linux_amd64.tar.gz
mips.deb	s390x.rpm	linux_arm64.tar.gz
mipsel.deb	x86_64.rpm	linux_armel.tar.gz
ppc64el.deb		linux_armhf.tar.gz
s390x.deb		linux_i386.tar.gz
		linux_mips.tar.gz
		linux_mipsel.tar.gz
		linux_ppc64le.tar.gz
		linux_s390x.tar.gz
		static_linux_amd64.tar.gz

srebhan · 2021-11-11T08:27:10Z

@etycomputer please be patient waiting for the final review. Rest assured that your PR is on the list and will not be forgotten. As the change is touching a tricky part of telegraf, it might be possible your PR will only be included in the next version (1.21).

etycomputer · 2021-11-11T09:09:43Z

Hey @srebhan,

Sorry, the reason I wanted this change to go in was because my metadata feature that I wanted to add depends on these changes.

I have already based my new PR #10083 off this branch. So most probably they will all go in at the same time.

Is there a timeline when you are expecting to go to the next version?

It would be awesome if you could also have a look at #10083 and let me know what you think.

I have just started programming in Go and I really like it so far.

I would be happy to work on some of the smaller jobs if you come across any please let me know.

Kind regards,
Ehsan

plugins/parsers/csv/parser_test.go

telegraf-tiger · 2021-11-12T02:00:39Z

👍 This pull request doesn't change the Telegraf binary size

📦 Looks like new artifacts were built from this PR.

Expand this list to get them here ! 🐯

Artifact URLs

DEB	RPM	TAR GZ	ZIP
amd64.deb	aarch64.rpm	darwin_amd64.tar.gz	windows_amd64.zip
arm64.deb	armel.rpm	freebsd_amd64.tar.gz	windows_i386.zip
armel.deb	armv6hl.rpm	freebsd_armv7.tar.gz
armhf.deb	i386.rpm	freebsd_i386.tar.gz
i386.deb	ppc64le.rpm	linux_amd64.tar.gz
mips.deb	s390x.rpm	linux_arm64.tar.gz
mipsel.deb	x86_64.rpm	linux_armel.tar.gz
ppc64el.deb		linux_armhf.tar.gz
s390x.deb		linux_i386.tar.gz
		linux_mips.tar.gz
		linux_mipsel.tar.gz
		linux_ppc64le.tar.gz
		linux_s390x.tar.gz
		static_linux_amd64.tar.gz

etycomputer · 2021-11-12T02:04:14Z

Hey @reimda,

Thanks for reviewing my work.
I have added the extra test you requested.

Regarding your other comment, I have chosen to use the combination to cover scenarios that someone may use the parse.Parse() for its first line and the parse.ParseLine() for any line after that. This was the approach that is currently is being used on the master branch but going forward I have removed the need to differentiate between the initial line (header column) and other lines (records).

Let me know if you like to change the p.parse() to p.ParseLine().

Kind regards,
Ehsan

🐛 Merging the CSV Parse and ParseLine Code into one function ✅ Updating the Test to code the scenario that the CSV file is parsed line by line 🐛 Modifying the directory_monitor and tail input plugin to handle the new error exceptions ✅ Adding new tests to the directory_monitor and tail input plugin to cover the new scenarios 🚨 Run Go fmt on all altered files

🚨 Fixed Linter warnings

✅ Added and Updated unit tests

telegraf-tiger · 2021-11-12T02:32:13Z

👍 This pull request doesn't change the Telegraf binary size

📦 Looks like new artifacts were built from this PR.

Expand this list to get them here ! 🐯

Artifact URLs

DEB	RPM	TAR GZ	ZIP
amd64.deb	aarch64.rpm	darwin_amd64.tar.gz	windows_amd64.zip
arm64.deb	armel.rpm	freebsd_amd64.tar.gz	windows_i386.zip
armel.deb	armv6hl.rpm	freebsd_armv7.tar.gz
armhf.deb	i386.rpm	freebsd_i386.tar.gz
i386.deb	ppc64le.rpm	linux_amd64.tar.gz
mips.deb	s390x.rpm	linux_arm64.tar.gz
mipsel.deb	x86_64.rpm	linux_armel.tar.gz
ppc64el.deb		linux_armhf.tar.gz
s390x.deb		linux_i386.tar.gz
		linux_mips.tar.gz
		linux_mipsel.tar.gz
		linux_ppc64le.tar.gz
		linux_s390x.tar.gz
		static_linux_amd64.tar.gz

srebhan · 2021-11-16T08:32:59Z

I would be happy to work on some of the smaller jobs if you come across any please let me know.

@etycomputer, you might want to go through help wanted PRs or check the issues (best would be bug reports) if you can find anything you like. Furthermore, you might want to join our Slack channel #telegraf-dev and/or #telegraf. Happy to see you there. :-)

etycomputer · 2021-11-16T22:42:50Z

Thank you All for helping me with this PR.

srebhan · 2021-11-17T14:20:57Z

@etycomputer thank you for fixing this!

telegraf-tiger bot added the fix pr to fix corresponding bug label Oct 6, 2021

etycomputer mentioned this pull request Oct 6, 2021

feat: Adds the ability to create and name a tag containing the filename using the directory monitor input plugin #9860

Merged

3 tasks

etycomputer force-pushed the fix-csv-skip-row-for-input-plugins branch 4 times, most recently from 5c18217 to d9dc03c Compare October 10, 2021 06:43

powersj reviewed Oct 11, 2021

View reviewed changes

etycomputer requested a review from powersj October 11, 2021 19:09

etycomputer mentioned this pull request Oct 11, 2021

inputs.tail does not support csv_skip_rows>0 and csv_header_row_count>=1 #9903

Closed

Hipska reviewed Oct 14, 2021

View reviewed changes

plugins/inputs/directory_monitor/directory_monitor.go Outdated Show resolved Hide resolved

Hipska requested a review from srebhan October 14, 2021 14:16

srebhan requested changes Oct 15, 2021

View reviewed changes

srebhan self-assigned this Oct 15, 2021

etycomputer requested review from Hipska and srebhan October 15, 2021 13:59

etycomputer mentioned this pull request Nov 9, 2021

Adding the option to read the whole file before parsing for inputs.directory_monitor #9936

Closed

etycomputer force-pushed the fix-csv-skip-row-for-input-plugins branch from 3b30d90 to 5d59242 Compare November 10, 2021 21:39

etycomputer mentioned this pull request Nov 11, 2021

feat(parsers/csv): Add metadata support to CSV parser plugin #10083

Merged

3 tasks

srebhan approved these changes Nov 11, 2021

View reviewed changes

reimda reviewed Nov 11, 2021

View reviewed changes

plugins/parsers/csv/parser_test.go Show resolved Hide resolved

plugins/parsers/csv/parser_test.go Show resolved Hide resolved

etycomputer requested a review from reimda November 12, 2021 01:55

ehsan-yazdi and others added 6 commits November 12, 2021 12:10

🚨 Fixed Linter warnings

c7d522e

🐛 Applied the requested changes and updated tests

0b0146a

🚨 Fixed Linter warnings

♻️ Refactoring code, and add error if more than one metric is returned

4ec20f5

♻️ Refactored code

f31b851

✅ Added and Updated unit tests

✅ Added tests to check expected Tags

1c938e7

etycomputer force-pushed the fix-csv-skip-row-for-input-plugins branch from 2cf9e2d to 1c938e7 Compare November 12, 2021 02:11

reimda approved these changes Nov 16, 2021

View reviewed changes

reimda merged commit db86904 into influxdata:master Nov 16, 2021

DennisBayer mentioned this pull request Dec 24, 2021

Regression: CSV parser does not handle csv_skip_rows correctly #10347

Closed

Hipska added the area/csv csv parser/serialiser related label Feb 15, 2022

srebhan mentioned this pull request Oct 26, 2022

inputs.exec with data_format = "csv" #12022

Closed

Conversation

etycomputer commented Oct 6, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Required for all PRs:

Uh oh!

telegraf-tiger bot commented Oct 6, 2021

Uh oh!

etycomputer commented Oct 6, 2021

Uh oh!

telegraf-tiger bot commented Oct 6, 2021

Uh oh!

telegraf-tiger bot commented Oct 6, 2021

Uh oh!

telegraf-tiger bot commented Oct 6, 2021

Uh oh!

etycomputer commented Oct 6, 2021

Uh oh!

telegraf-tiger bot commented Oct 6, 2021

Artifact URLs

Uh oh!

etycomputer commented Oct 6, 2021

Uh oh!

powersj commented Oct 6, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

etycomputer Oct 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

srebhan left a comment

Choose a reason for hiding this comment

Uh oh!

etycomputer commented Oct 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

etycomputer commented Oct 18, 2021

Uh oh!

srebhan commented Oct 19, 2021

Uh oh!

etycomputer commented Oct 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Hipska commented Oct 20, 2021

Uh oh!

etycomputer commented Oct 21, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

srebhan commented Nov 9, 2021

Uh oh!

etycomputer commented Nov 9, 2021

Uh oh!

etycomputer commented Nov 10, 2021

Uh oh!

telegraf-tiger bot commented Nov 10, 2021

Artifact URLs

Uh oh!

srebhan commented Nov 11, 2021

Uh oh!

etycomputer commented Nov 11, 2021

Uh oh!

Uh oh!

Uh oh!

etycomputer commented Oct 6, 2021 •

edited

Loading

etycomputer Oct 15, 2021 •

edited

Loading

etycomputer commented Oct 15, 2021 •

edited

Loading

etycomputer commented Oct 20, 2021 •

edited

Loading

etycomputer commented Oct 21, 2021 •

edited

Loading