[CSV-196] Track byte position#502
Conversation
…#9) Add support in Commons CSV for tracking byte positions during parsing
There was a problem hiding this comment.
Hello @DarrenJAN
Thank you for providing a PR!
Please see my comments. Note that some comments will apply to more to one location in the code.
Make sure you rebase on git master.
| * @throws IOException If an I/O error occurs | ||
| * @throws CSVException Thrown on invalid input. | ||
| */ | ||
| public CSVParser parse(final Reader reader, final long characterOffset, final long recordNumber, String encoding) throws IOException { |
There was a problem hiding this comment.
No new public constructors please. You can augment the builder and add a private constructor.
All new and protected elements should have a Java @since 1.13.0.
There was a problem hiding this comment.
I noticed that there is only builder class in CSVFormat, there are no builder class in CSVParser, would you like me to add one?
There are CSVParserBuilder() from com.opencsv.CSVParser, that is another package.
There was a problem hiding this comment.
Uh? See
There was a problem hiding this comment.
It seems like I did not rebase on the master branch. Thanks.
There was a problem hiding this comment.
This section is not resolved.
| * If there is a problem reading the header or skipping the first record | ||
| * @throws CSVException Thrown on invalid input. | ||
| */ | ||
| public CSVParser(final Reader reader, final CSVFormat format, final long characterOffset, final long recordNumber, |
| } | ||
|
|
||
| /** | ||
| * Returns the start byte of this record as a character byte in the source stream. |
There was a problem hiding this comment.
Please follow the same Javadoc patterns as in other getter methods: A getter "Gets...".
| private long bytesReadMark; | ||
|
|
||
| /** Encoder used to calculate the bytes of characters */ | ||
| CharsetEncoder encoder; |
There was a problem hiding this comment.
Make this new instance variable private.
| super(reader); | ||
| } | ||
|
|
||
| ExtendedBufferedReader(final Reader reader, String encoding) { |
There was a problem hiding this comment.
Use a Charset instead of a Charset name.
| /** | ||
| * Returns the start byte of this record as a character byte in the source stream. | ||
| * | ||
| * @return the start byte of this record as a character byte in the source stream. |
There was a problem hiding this comment.
Either the code or the comment is wrong. The return value is a long but the comment talks about a byte. It would help to make the comment clear as to why the return type is a long if that is indeed correct.
There was a problem hiding this comment.
New public and protected elements should have a Javadoc tag of @since 1.13.0.
Please clarify the comment: Specifically document this data as "position".
| private final long characterPosition; | ||
|
|
||
| /** | ||
| * The start byte of this record as a character byte in the source stream. |
There was a problem hiding this comment.
This comment is confusing because it talks about a byte but the type is a long. It would be better to explain the mismatch, if that is indeed the intent.
There was a problem hiding this comment.
We are calculating the number of bytes required to encode a single character and storing this value as a long. This choice provides a larger range for the position and ensures consistency with the design, as position is also defined as a long.
There was a problem hiding this comment.
Please update the comment to clarify.
| * @throws CSVException Thrown on invalid input. | ||
| */ | ||
| public CSVParser(final Reader reader, final CSVFormat format, final long characterOffset, final long recordNumber, | ||
| String encoding) throws IOException { |
There was a problem hiding this comment.
Use a Charset, not a charset name.
…#12) Add support in Commons CSV for tracking byte positions during parsing
|
Hi Gary,
Thanks |
garydgregory
left a comment
There was a problem hiding this comment.
Hello @DarrenJAN
Thank you for your updates. There is one adjustment to make where code is not needed because it is provided in the superclass.
|
Hi Gary, Just submit another change. Please note that due to the encoding algorithm used, it only supports UTF-8 for now. Full encoding can be supported; I just need to put more effort into this. Thanks |
garydgregory
left a comment
There was a problem hiding this comment.
Hello @DarrenJAN
Now that I look at this again, we are always using the Charset, which is not enough to only have the feature enabled as an opt-in. This is due to the Charset being used to read the file in the first place, it's likely that the Charset will be often set, so I think we need a boolean flag that says "recordByteCount". The flag should drive whether the feature is enabled or not, not the Charset. You still need the Charset, but that's not what enables the feature.
Does that make sense? I think adding a boolean recordByteCount to the builder is what's needed here. WDYT?
garydgregory
left a comment
There was a problem hiding this comment.
Hello @DarrenJAN
There are many comments that were incorrectly marked resolved. See also my comment about adding a boolean to drive the new opt-in behavior.
|
Hi Gary, Yes, that makes sense. I agree that relying solely on the Charset to enable the feature isn’t sufficient, as the Charset is often set independently of whether the feature should be enabled. In our original design, instead of using a boolean, we used a String variable because it served a dual purpose: first, it controlled whether the feature was enabled or not, and second, it specified the Charset and the corresponding byte-counting algorithm to be used. Do you consider using a String? |
|
Hello @DarrenJAN I don't understand how a String can work. Let's say I specify I want to read a CSV file encoded in Charset X. The Charset (encoder) that is used by the new counting feature MUST match X, otherwise the counting risks being mismatched. Or am I missing something? Hence, the need for a boolean or some other type that's not a Charset, maybe even an Enum if there a need for something more than on and off. If the PR only supports a subset of Charsets, then the Javadoc of the setter for the feature toggle must document this. It's also likely that I am not seeing how the PR code only supports some Charsets and not others. What's missing? TY. |
|
Hi Gary, In ExtendedBufferedReader.java, this function getCharBytes(int current), the logic of this code is based on the UTF-8 encoding specification. We need extra implementation to support the full set of characters. "Hence, the need for a boolean or some other type that's not a Charset, maybe even an Enum if there a need for something more than on and off." |
|
Hi Gary, any thoughts on this? |
|
Hello @DarrenJAN |
Adding a boolean to drive byte tracking opt-in behavior
|
Hi @garydgregory, |
garydgregory
left a comment
There was a problem hiding this comment.
Hello @DarrenJAN
Thank you for your updates. Please see my comments.
| private CSVFormat format; | ||
| private long characterOffset; | ||
| private long recordNumber = 1; | ||
| private boolean enableByteTracking = false; |
There was a problem hiding this comment.
Don't initialize to the default value.
| } | ||
|
|
||
| /** | ||
| * Returns the start byte of this record as a character byte in the source stream. |
| } | ||
|
|
||
| /** | ||
| * In Java, the {@code char} data type is based on the original Unicode |
There was a problem hiding this comment.
Getter get, IOW, the Javadoc should start with "Gets ...".
| } | ||
|
|
||
| /** | ||
| * Returns the number of bytes read |
|
|
||
| @Test | ||
| public void testGetRecordFourBytesRead() throws Exception { | ||
| String code = "id,a,b,c\n" + |
* Fix comments
garydgregory
left a comment
There was a problem hiding this comment.
Hello @DarrenJAN
Thank you for updates.
I think we have the basic logic done but I have some additional change requests.
TY!
| * @throws IOException | ||
| * If there is a problem reading the header or skipping the first record. | ||
| * @throws CSVException Thrown on invalid input. | ||
| * @since 1.13.0. |
There was a problem hiding this comment.
New private elements do not need a Javadoc since tag.
| * @return the byte length of the character. | ||
| * @throws CharacterCodingException if the character cannot be encoded. | ||
| */ | ||
| private long getCharBytes(int current) throws CharacterCodingException { |
There was a problem hiding this comment.
The return type here should be an int because this API returns 0 or CharBuffer.limit() which itself returns an int.
| * @return the byte length of the character. | ||
| * @throws CharacterCodingException if the character cannot be encoded. | ||
| */ | ||
| private long getCharBytes(int current) throws CharacterCodingException { |
There was a problem hiding this comment.
I think a better method name here is getEncodedCharLength().
| * @param recordNumber | ||
| * The next record number to assign. | ||
| * @param charset | ||
| * The character encoding to be used for the reader. |
There was a problem hiding this comment.
"The character encoding to be used for the reader."
->
"The character encoding to be used for the reader when enableByteTracking is true."
| /** | ||
| * The start byte of this record as a character byte in the source stream. | ||
| */ | ||
| private final long characterByte; |
There was a problem hiding this comment.
This name is too confusing IMO, please rename to bytePosition to mirror the existing characterPosition.
|
Hi @garydgregory, Happy Holidays! The changes that I made:
|
garydgregory
left a comment
There was a problem hiding this comment.
Hello @DarrenJAN
Thank you for updating the PR.
You missed a couple of my comments regarding documentation.
Please rebase on git master to pickup updates that will automatically show you where you missed using final and other Checkstyle updates. You'll have to resolve conflicts in one test file.
TY & happy hols!
| private final long characterPosition; | ||
|
|
||
| /** | ||
| * The start byte of this record as a character byte in the source stream. |
There was a problem hiding this comment.
Please update the comment to clarify.
| /** | ||
| * Returns the start byte of this record as a character byte in the source stream. | ||
| * | ||
| * @return the start byte of this record as a character byte in the source stream. |
There was a problem hiding this comment.
New public and protected elements should have a Javadoc tag of @since 1.13.0.
Please clarify the comment: Specifically document this data as "position".
garydgregory
left a comment
There was a problem hiding this comment.
Hello @DarrenJAN
I cannot edit files in this PR, which is unusual; otherwise, I would have added the missing Javadoc @since tags. Please do so and see my unresolved comments.
TY!
|
Hi @garydgregory, I checked and made the enhancements:
|
|
Hi @garydgregory , Thank you very much. When is the next release? Yuzhan |
|
Hello @DarrenJAN https://repository.apache.org/content/repositories/snapshots/ |


Add support in Commons CSV for tracking byte positions during parsing.
Summary of Modifications
Constructor Enhancements
a. Added support for an optional parameter -- String encoding--, which specifies the encoding to use for the reader.
Add new Constructor: support track byte positions in record class
Test result:


mvn
Pass unit tests and other restrictions