Skip to content

[CSV-196] Track byte position#502

Merged
garydgregory merged 9 commits into
apache:masterfrom
marklogic:CSV-196-master
Jan 2, 2025
Merged

[CSV-196] Track byte position#502
garydgregory merged 9 commits into
apache:masterfrom
marklogic:CSV-196-master

Conversation

@DarrenJAN

Copy link
Copy Markdown
Contributor

Add support in Commons CSV for tracking byte positions during parsing.

Summary of Modifications

  1. Test Data Files: Added new test data files, and updated pom.xml to exclude these files from RAT checks, avoiding unapproved license checks.
  2. CSVParser class:
    Constructor Enhancements
    a. Added support for an optional parameter -- String encoding--, which specifies the encoding to use for the reader.
  3. CSVRecord class
  • private long characterByte: start byte position of this record
    Add new Constructor: support track byte positions in record class
  1. ExtendedBufferedReader Class:
  • private long bytesRead: Tracks the number of bytes read so far.
  • private long bytesReadMark: Stores the marked byte position.
  • CharsetEncoder encoder: Encoder used to calculate byte size of characters.
  • getCharBytes(int current): This function calculates character bytes based on UTF-8 encoding. Note: it only supports UTF-8 due to the encoding algorithm used. Full encoding can be supported and we just need more effort on this.
  • reset() and mark() Methods: Enhanced to prevent consuming characters and bytes unintentionally.

Test result:
mvn
image
image
Pass unit tests and other restrictions

…#9)

Add support in Commons CSV for tracking byte positions during parsing

@garydgregory garydgregory left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @DarrenJAN

Thank you for providing a PR!

Please see my comments. Note that some comments will apply to more to one location in the code.

Make sure you rebase on git master.

* @throws IOException If an I/O error occurs
* @throws CSVException Thrown on invalid input.
*/
public CSVParser parse(final Reader reader, final long characterOffset, final long recordNumber, String encoding) throws IOException {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No new public constructors please. You can augment the builder and add a private constructor.

All new and protected elements should have a Java @since 1.13.0.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that there is only builder class in CSVFormat, there are no builder class in CSVParser, would you like me to add one?
There are CSVParserBuilder() from com.opencsv.CSVParser, that is another package.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh? See

public static class Builder extends AbstractStreamBuilder<CSVParser, Builder> {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like I did not rebase on the master branch. Thanks.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section is not resolved.

* If there is a problem reading the header or skipping the first record
* @throws CSVException Thrown on invalid input.
*/
public CSVParser(final Reader reader, final CSVFormat format, final long characterOffset, final long recordNumber,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment.

}

/**
* Returns the start byte of this record as a character byte in the source stream.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow the same Javadoc patterns as in other getter methods: A getter "Gets...".

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping

private long bytesReadMark;

/** Encoder used to calculate the bytes of characters */
CharsetEncoder encoder;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this new instance variable private.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not resolved.

super(reader);
}

ExtendedBufferedReader(final Reader reader, String encoding) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a Charset instead of a Charset name.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not resolved.

Comment thread src/test/java/org/apache/commons/csv/CSVParserTest.java Outdated
/**
* Returns the start byte of this record as a character byte in the source stream.
*
* @return the start byte of this record as a character byte in the source stream.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either the code or the comment is wrong. The return value is a long but the comment talks about a byte. It would help to make the comment clear as to why the return type is a long if that is indeed correct.

@garydgregory garydgregory Dec 27, 2024

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New public and protected elements should have a Javadoc tag of @since 1.13.0.
Please clarify the comment: Specifically document this data as "position".

private final long characterPosition;

/**
* The start byte of this record as a character byte in the source stream.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is confusing because it talks about a byte but the type is a long. It would be better to explain the mismatch, if that is indeed the intent.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are calculating the number of bytes required to encode a single character and storing this value as a long. This choice provides a larger range for the position and ensures consistency with the design, as position is also defined as a long.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the comment to clarify.

* @throws CSVException Thrown on invalid input.
*/
public CSVParser(final Reader reader, final CSVFormat format, final long characterOffset, final long recordNumber,
String encoding) throws IOException {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a Charset, not a charset name.

Comment thread src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java Outdated
@DarrenJAN

Copy link
Copy Markdown
Contributor Author

Hi Gary,
I submitted another commit. Here are the changes that I made:

  1. Add augment the builder and add a private constructor to create CVSParser class
  2. Delete unneeded constructors in CSVformat class
  3. Use a try-with-resources block
  4. Improve the comments and fix the indentation
  5. Use a Charset instead of a string Charset name in ExtendedBufferedReader class

Thanks
Yuzhan

@garydgregory garydgregory left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @DarrenJAN

Thank you for your updates. There is one adjustment to make where code is not needed because it is provided in the superclass.

Comment thread src/main/java/org/apache/commons/csv/CSVParser.java Outdated
Comment thread src/main/java/org/apache/commons/csv/CSVParser.java Outdated
Comment thread src/main/java/org/apache/commons/csv/CSVParser.java Outdated
@DarrenJAN

Copy link
Copy Markdown
Contributor Author

Hi Gary,

Just submit another change.

Please note that due to the encoding algorithm used, it only supports UTF-8 for now. Full encoding can be supported; I just need to put more effort into this.

Thanks
Yuzhan

@garydgregory garydgregory left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @DarrenJAN

Now that I look at this again, we are always using the Charset, which is not enough to only have the feature enabled as an opt-in. This is due to the Charset being used to read the file in the first place, it's likely that the Charset will be often set, so I think we need a boolean flag that says "recordByteCount". The flag should drive whether the feature is enabled or not, not the Charset. You still need the Charset, but that's not what enables the feature.

Does that make sense? I think adding a boolean recordByteCount to the builder is what's needed here. WDYT?

@garydgregory garydgregory left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @DarrenJAN

There are many comments that were incorrectly marked resolved. See also my comment about adding a boolean to drive the new opt-in behavior.

@DarrenJAN

Copy link
Copy Markdown
Contributor Author

Hi Gary,

Yes, that makes sense. I agree that relying solely on the Charset to enable the feature isn’t sufficient, as the Charset is often set independently of whether the feature should be enabled.

In our original design, instead of using a boolean, we used a String variable because it served a dual purpose: first, it controlled whether the feature was enabled or not, and second, it specified the Charset and the corresponding byte-counting algorithm to be used. Do you consider using a String?

@garydgregory

Copy link
Copy Markdown
Member

Hello @DarrenJAN

I don't understand how a String can work. Let's say I specify I want to read a CSV file encoded in Charset X. The Charset (encoder) that is used by the new counting feature MUST match X, otherwise the counting risks being mismatched. Or am I missing something? Hence, the need for a boolean or some other type that's not a Charset, maybe even an Enum if there a need for something more than on and off.

If the PR only supports a subset of Charsets, then the Javadoc of the setter for the feature toggle must document this.

It's also likely that I am not seeing how the PR code only supports some Charsets and not others. What's missing?

TY.

@DarrenJAN

Copy link
Copy Markdown
Contributor Author

Hi Gary,

In ExtendedBufferedReader.java, this function getCharBytes(int current), the logic of this code is based on the UTF-8 encoding specification. We need extra implementation to support the full set of characters.

"Hence, the need for a boolean or some other type that's not a Charset, maybe even an Enum if there a need for something more than on and off."
--- We used a String here. We only did a simple check whether String is null to enable this encoder feature since our customers only asked for UTF-8 encoder. I agree with adding a flag to control this feature.

@DarrenJAN

Copy link
Copy Markdown
Contributor Author

Hi Gary, any thoughts on this?

@garydgregory

Copy link
Copy Markdown
Member

Hello @DarrenJAN
Yes, follow up on my comments to add a boolean feature toggle 😉

Adding a boolean to drive byte tracking opt-in behavior
@DarrenJAN

Copy link
Copy Markdown
Contributor Author

Hi @garydgregory,
I submitted another commit to add a boolean flag

@garydgregory garydgregory left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @DarrenJAN

Thank you for your updates. Please see my comments.

private CSVFormat format;
private long characterOffset;
private long recordNumber = 1;
private boolean enableByteTracking = false;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't initialize to the default value.

Comment thread src/main/java/org/apache/commons/csv/CSVParser.java
}

/**
* Returns the start byte of this record as a character byte in the source stream.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping

Comment thread src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java
Comment thread src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java
}

/**
* In Java, the {@code char} data type is based on the original Unicode

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getter get, IOW, the Javadoc should start with "Gets ...".

}

/**
* Returns the number of bytes read

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Returns" -> "Gets".


@Test
public void testGetRecordFourBytesRead() throws Exception {
String code = "id,a,b,c\n" +

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use final.

Comment thread src/test/java/org/apache/commons/csv/CSVParserTest.java
* Fix comments

@garydgregory garydgregory left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @DarrenJAN

Thank you for updates.

I think we have the basic logic done but I have some additional change requests.

TY!

Comment thread src/main/java/org/apache/commons/csv/CSVParser.java
* @throws IOException
* If there is a problem reading the header or skipping the first record.
* @throws CSVException Thrown on invalid input.
* @since 1.13.0.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New private elements do not need a Javadoc since tag.

* @return the byte length of the character.
* @throws CharacterCodingException if the character cannot be encoded.
*/
private long getCharBytes(int current) throws CharacterCodingException {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type here should be an int because this API returns 0 or CharBuffer.limit() which itself returns an int.

* @return the byte length of the character.
* @throws CharacterCodingException if the character cannot be encoded.
*/
private long getCharBytes(int current) throws CharacterCodingException {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better method name here is getEncodedCharLength().

* @param recordNumber
* The next record number to assign.
* @param charset
* The character encoding to be used for the reader.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The character encoding to be used for the reader."
->
"The character encoding to be used for the reader when enableByteTracking is true."

Comment thread src/main/java/org/apache/commons/csv/CSVParser.java
Comment thread src/main/java/org/apache/commons/csv/CSVRecord.java
/**
* The start byte of this record as a character byte in the source stream.
*/
private final long characterByte;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is too confusing IMO, please rename to bytePosition to mirror the existing characterPosition.

@DarrenJAN

Copy link
Copy Markdown
Contributor Author

Hi @garydgregory,

Happy Holidays! The changes that I made:

  1. New private elements do not need a Javadoc since tag.
  2. Return type of getCharByte is int
  3. Change getCharBytes to getEncodedCharLength
  4. Add Javadoc @param for enableByteTracking
  5. CSVRecord: Remove the old constructors and adapt the existing test.
  6. Change characterByte to bytePosition @yunzvanessa How do you think this change?

Test:
mvn
image

@garydgregory garydgregory left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @DarrenJAN

Thank you for updating the PR.

You missed a couple of my comments regarding documentation.

Please rebase on git master to pickup updates that will automatically show you where you missed using final and other Checkstyle updates. You'll have to resolve conflicts in one test file.

TY & happy hols!

Comment thread src/main/java/org/apache/commons/csv/CSVParser.java
private final long characterPosition;

/**
* The start byte of this record as a character byte in the source stream.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the comment to clarify.

/**
* Returns the start byte of this record as a character byte in the source stream.
*
* @return the start byte of this record as a character byte in the source stream.

@garydgregory garydgregory Dec 27, 2024

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New public and protected elements should have a Javadoc tag of @since 1.13.0.
Please clarify the comment: Specifically document this data as "position".

Comment thread src/test/java/org/apache/commons/csv/CSVParserTest.java
Comment thread src/test/java/org/apache/commons/csv/CSVParserTest.java

@garydgregory garydgregory left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @DarrenJAN
I cannot edit files in this PR, which is unusual; otherwise, I would have added the missing Javadoc @since tags. Please do so and see my unresolved comments.
TY!

@DarrenJAN

Copy link
Copy Markdown
Contributor Author

Hi @garydgregory, I checked and made the enhancements:

  1. git rebase master
  2. For new public and protected methods, add javadoc since 1.13.0
  3. Clarify the comment:

Test:
mvn
image

@garydgregory garydgregory merged commit b40039b into apache:master Jan 2, 2025
@garydgregory garydgregory changed the title CSV-196-TrackBytePositions [CSV-196] Track byte position Jan 2, 2025
asfgit pushed a commit that referenced this pull request Jan 2, 2025
@garydgregory

Copy link
Copy Markdown
Member

@DarrenJAN

Copy link
Copy Markdown
Contributor Author

Hi @garydgregory ,

Thank you very much. When is the next release?

Yuzhan

@garydgregory

garydgregory commented Jan 3, 2025

Copy link
Copy Markdown
Member

Hello @DarrenJAN
I have three open release candidates ATM, so when that's done, I'll start the release process for this component. In the meantime, please test a build from our Maven snapshot repository:

https://repository.apache.org/content/repositories/snapshots/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants