fix: directory monitor input plugin when data format is CSV and csv_skip_rows>0 and csv_header_row_count>=1 #9865
Conversation
Thanks so much for the pull request!

Looks like new artifacts were built from this PR. Get them here! Artifact URLs
For this PR, did I have to create an issue first?

It would be good to create an issue describing a bit more of what you are fixing, expected vs actual results, etc. :) Then you can update the PR to reference that issue #. Also, note that the lint tests failed, so you will want to clean those up. Thanks!
Force-pushed 5c18217 to d9dc03c (compare)
```go
// Related to CSV data format type
CSVHeaderRowCount int `toml:"csv_header_row_count"`
CSVSkipRows       int `toml:"csv_skip_rows"`
```
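For context, the two options under discussion are set per input plugin in the Telegraf configuration. A minimal sketch of how they might be combined in an `inputs.directory_monitor` block (directory paths are illustrative, not from this PR):

```toml
[[inputs.directory_monitor]]
  directory = "/tmp/incoming"    # illustrative path
  finished_directory = "/tmp/done"
  data_format = "csv"
  csv_skip_rows = 1              # rows to discard before the header
  csv_header_row_count = 1       # rows to read as column names
```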
Could this apply to other parsers as well, say if they have a comment?
Also, could we simplify this to a single "SkipHeaderRowCount" that defines the number of rows to skip at the top of a file? I don't think we need two different values that are only ever added together.
Hi Joshua,
I have looked at the documentation, and the only similar code pattern I could find was in inputs.tail. I noticed this while fixing this issue and will create a new issue for it rather than resolve it here; it is better to keep things small and simple. The other reason I would rather not touch inputs.tail is its different use cases.
Regarding your second comment, I did not want to introduce a new parameter for inputs.directory_monitor; these are parser parameters, and I am just adding extra support for them. In this change we don't simply want to skip rows; the header rows we do want to read and parse.
I hope that answers your questions and concerns.
Kind regards,
Ehsan
I just created issue #9903.
I would need more feedback on it before I make the change.
Any parser option already works out of the box. I don't see a reason why this should be added here.
Hi, I have already created test cases that fail without these changes.
The issue is that this input reads the file one line at a time,
while the parser is designed to receive all the data at once.
That is why this input type and tail fail the tests.
No, the input is designed around processing large files. We don't want to read the whole file; we have another input type that reads the whole file. We need to fix this issue without drastic modifications that change the initial input specifications and requirements.
What I have done is just that: I have not changed any input parameters and have resolved the issue by reading the skip rows and header rows all at once before reading the rest of the file line by line.
Okay, I understand, but it seems this is also currently a problem with the tail plugin. Or not?
Yes, that is why I created #9903 to deal with tail.
Still, it looks like this plugin does not work with XML files that get dropped into a directory (since it is reading them line by line)? I think an option for whether or not to read the complete file would be better, I guess.
Have you tested this with XML?
I agree, having that option makes things easier, but that would be a feature to add. I will create a new issue for that: #9936. I like this feature.
I do, however, want to fix this issue, since lots of people have problems with it, and they might have large files that they would like to process.
This new flag should only be added to inputs.directory_monitor. inputs.tail is designed to tail a file and read it one line at a time, and we already have inputs.file, which does read the whole file.
srebhan
left a comment
Hey @etycomputer, thanks for the PR! I don't think this is the right place to fix this. Instead you should change the CSV parser's expectation of getting all rows at once by decrementing SkipRows by the rows you have already skipped and checking whether any remain... This will then also fix tail.
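The reviewer's suggested alternative can be sketched as a stateful parser that decrements its remaining skip and header counts on every line it sees, so line-by-line inputs such as tail and directory_monitor work without buffering. The type and method names below are illustrative, not the actual Telegraf CSV parser API, and the sketch naïvely appends header cells rather than merging multi-row headers column-wise as the real parser does.

```go
package main

import (
	"fmt"
	"strings"
)

// csvParser tracks how many skip rows and header rows it still has to
// consume, so it no longer assumes it receives the whole file at once.
type csvParser struct {
	remainingSkipRows   int
	remainingHeaderRows int
	columns             []string
}

// parseLine handles one line at a time: skip rows first, then header
// rows (collected as column names), then data records.
func (p *csvParser) parseLine(line string) (map[string]string, error) {
	if p.remainingSkipRows > 0 {
		p.remainingSkipRows--
		return nil, nil // skipped row, no metric produced
	}
	if p.remainingHeaderRows > 0 {
		p.remainingHeaderRows--
		p.columns = append(p.columns, strings.Split(line, ",")...)
		return nil, nil // header row, no metric produced
	}
	cells := strings.Split(line, ",")
	record := make(map[string]string)
	for i, name := range p.columns {
		if i < len(cells) {
			record[name] = cells[i]
		}
	}
	return record, nil
}

func main() {
	// csv_skip_rows = 1, csv_header_row_count = 1
	p := &csvParser{remainingSkipRows: 1, remainingHeaderRows: 1}
	for _, line := range []string{"junk", "name,value", "foo,1", "bar,2"} {
		if rec, _ := p.parseLine(line); rec != nil {
			fmt.Println(rec)
		}
	}
}
```

Because the state lives in the parser rather than the input, any input that feeds lines one at a time gets correct skip/header behavior for free.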
Hi Sven,

Hi @srebhan, can you please let me know if you still think I need to make any changes?

@etycomputer we will bring this topic up with the core developers and let you know what they think...

Hi @srebhan, can you tell me when you are going to have your core developer meeting?

That would be today in a few hours; we will come back to you after that.

What did the dev team decide in the meeting?
After my approval (which you just got) I set the "ready for final review" tag. With this, someone from the InfluxData Telegraf team will give the PR another review and, if he/she does not have any comments, merge it.
You don't have to take care of this; it's done (semi-)automatically, IIRC.

Thank you very much @srebhan, you also gave me good comments and pointed out an excellent alternative solution. I really enjoyed working with you on this PR. Kind regards,

Hey @powersj, are you able to do the final review and push my changes into master, please? Cheers,
Force-pushed 3b30d90 to 5d59242 (compare)
🥳 This pull request decreases the Telegraf binary size by -0.02 % for linux amd64 (new size: 131.5 MB, nightly size 131.5 MB)
@etycomputer please be patient waiting for the final review. Rest assured that your PR is on the list and will not be forgotten. As the change touches a tricky part of Telegraf, it is possible your PR will only be included in the next version (1.21).
Hey @srebhan, sorry, the reason I wanted this change to go in is that a metadata feature I want to add depends on these changes. I have already based my new PR #10083 off this branch, so most probably they will all go in at the same time. Is there a timeline for when you are expecting to go to the next version? It would be awesome if you could also have a look at #10083 and let me know what you think. I have just started programming in Go and I really like it so far. I would be happy to work on some of the smaller jobs, so if you come across any, please let me know. Kind regards,
Hey @reimda, thanks for reviewing my work. Regarding your other comment, I chose to use the combination to cover scenarios where someone uses parse.Parse() for the first line and parse.ParseLine() for every line after that. This is the approach currently used on the master branch, but going forward I have removed the need to differentiate between the initial line (header columns) and other lines (records). Let me know if you would like me to change p.parse() to p.ParseLine(). Kind regards,
🐛 Merged the CSV Parse and ParseLine code into one function
✅ Updated the test to cover the scenario where the CSV file is parsed line by line
🐛 Modified the directory_monitor and tail input plugins to handle the new error exceptions
✅ Added new tests to the directory_monitor and tail input plugins to cover the new scenarios
🚨 Ran go fmt on all altered files
🚨 Fixed linter warnings
✅ Added and updated unit tests
Force-pushed 2cf9e2d to 1c938e7 (compare)
@etycomputer, you might want to go through

Thank you all for helping me with this PR.

@etycomputer thank you for fixing this!
Required for all PRs:
resolves #9898
resolves #9903
resolves #5338
🐛 Fixed an issue with the directory monitor input plugin when the data format is CSV, csv_skip_rows > 0, and csv_header_row_count >= 1
✅ Added new test cases that cover these scenarios