Conversation
This change adds new ingest processor that breaks line from CSV file into separate fields. By default it conforms to RFC 4180 but can be tweaked. Closes elastic#49113
|
Pinging @elastic/es-core-features (:Core/Features/Ingest) |
|
@probakowski Can you add some docs with this pull request? Some REST tests would be nice too. |
|
@jasontedor sure, I'll add both on Monday I've also run JMH benchmark to compare version above with https://github.com/johtani/elasticsearch-ingest-csv mentioned in #49113 (the less the better) (I've used version 7.4.2 of @johtani library, which is thread safe but suffers from synchronization) |
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvProcessor.java
Outdated
Show resolved
Hide resolved
|
@elasticmachine update branch |
martijnvg
left a comment
There was a problem hiding this comment.
I left two more comments, otherwise LGTM.
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvProcessor.java
Outdated
Show resolved
Hide resolved
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvProcessor.java
Outdated
Show resolved
Hide resolved
|
@elasticmachine update branch |
jbaiera
left a comment
There was a problem hiding this comment.
Some small things and a question regarding the trim setting when parsing quoted values.
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvParser.java
Outdated
Show resolved
Hide resolved
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvParser.java
Outdated
Show resolved
Hide resolved
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvParser.java
Outdated
Show resolved
Hide resolved
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvParser.java
Outdated
Show resolved
Hide resolved
| boolean shouldSetField = true; | ||
| for (; currentIndex < length; currentIndex++) { | ||
| c = currentChar(); | ||
| if (isWhitespace(c)) { |
There was a problem hiding this comment.
If the value we are parsing is a quoted string, are the spaces around it always trimmed?
I've taken a look at the RFC but I'm still not sure what the right path here is. It states "Spaces are considered part of a field and should not be ignored." In the implementation we have a trim option which is a fine extension, but if reading a quoted string, the spaces are trimmed regardless of if the trim option is set to true.
There was a problem hiding this comment.
Spaces around quotes are always trimmed since it's well defined where start and end of data is. It's deviation from RFC, where no spaces are allowed both before and after quotes at all. But it's common for CSV files in the wild to have them, so this change make it easier for a user with no ambiguity introduced.
For unquoted fields situation is different, there's no way to tell automatically where the data starts so a user must decide here if he wants to trim whitespaces or not. That's why it's parametrized.
Now when I think of it, my gut feeling is most use cases would trim leading/trailing whitespaces so I would even consider default trim to true. What do you think? @martijnvg you opinion would be appreciated as well.
There was a problem hiding this comment.
I tend to agree with trimming leading/trailing whitespaces by default, because it seems more practical to me.
There was a problem hiding this comment.
@droberts195 convinced me otherwise: #49509 (comment)
There was a problem hiding this comment.
yep, me too, reverted back to false
…ommon/CsvParser.java Co-Authored-By: James Baiera <james.baiera@gmail.com>
|
Thanks for changing the trim functionality 👍 |
|
@elasticmachine update branch |
|
Some reasons not to trim CSV by default are:
I agree it's quite common to find non-standard CSV that needs spaces trimming, so it's good to have the option, but it seems more defensible to me to default to the standard and make people with non-standard data customise an option. |
|
@droberts195 these are valid points, I'll revert default value for trim back to false |
jbaiera
left a comment
There was a problem hiding this comment.
Good points about the clear start and end for quoted content. LGTM!
|
@elasticmachine update branch |
|
@elasticmachine run elasticsearch-ci/2 |
* CSV Processor for Ingest This change adds new ingest processor that breaks line from CSV file into separate fields. By default it conforms to RFC 4180 but can be tweaked. Closes elastic#49113
* CSV Processor for Ingest This change adds new ingest processor that breaks line from CSV file into separate fields. By default it conforms to RFC 4180 but can be tweaked. Closes elastic#49113
|
@probakowski What's the intended behavior of this processor for quoted fields containing line-breaks? It seems that the processor is interpreting this as a new message, which is in contrast to how Excel and Numbers are dealing with line breaks in quoted strings. |
|
Hi @dschneiter, this processor handles only single line from CSV so it doesn't care for line breaks at all (unless you set it as separator) and it doesn't have any notion of "new message" |
This change adds new ingest processor that breaks line from CSV file into separate fields.
By default it conforms to RFC 4180 but can be tweaked.
Closes #49113