[CSV-196] Track byte position by DarrenJAN · Pull Request #502 · apache/commons-csv

DarrenJAN · 2024-11-07T01:47:22Z

Add support in Commons CSV for tracking byte positions during parsing.

Summary of Modifications

Test data files: Added new test data files and updated pom.xml to exclude them from RAT checks, so they are not flagged as having unapproved licenses.
CSVParser class:
Constructor enhancements:
a. Added an optional parameter, String encoding, which specifies the encoding to use for the reader.
CSVRecord class:

private long characterByte: The start byte position of this record.
New constructor: Supports tracking byte positions in the record class.

ExtendedBufferedReader class:

private long bytesRead: Tracks the number of bytes read so far.
private long bytesReadMark: Stores the marked byte position.
CharsetEncoder encoder: Encoder used to calculate byte size of characters.
getCharBytes(int current): Calculates the number of bytes in a character based on UTF-8 encoding. Note: it only supports UTF-8 because of the encoding algorithm used. Other encodings can be supported, but that needs more work.
reset() and mark() methods: Enhanced to prevent consuming characters and bytes unintentionally.
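
As a rough illustration of the byte counting described above, the following sketch measures how many bytes a code point occupies once encoded. It is not the PR's getCharBytes implementation; the class and method names (EncodedLengthSketch, encodedLength) are hypothetical, and it relies only on the JDK's CharsetEncoder.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodedLengthSketch {

    // Hypothetical helper, not the PR's code: number of bytes needed to encode
    // one code point (passed as a String so a surrogate pair stays together).
    static int encodedLength(final String codePoint, final Charset charset) {
        try {
            // encode() returns a buffer whose limit is the number of bytes written
            return charset.newEncoder().encode(CharBuffer.wrap(codePoint)).limit();
        } catch (final CharacterCodingException e) {
            // A new encoder reports malformed or unmappable input by default
            throw new IllegalArgumentException("Cannot encode input in " + charset, e);
        }
    }

    public static void main(final String[] args) {
        final Charset utf8 = StandardCharsets.UTF_8;
        System.out.println(encodedLength("a", utf8));            // 1 (ASCII)
        System.out.println(encodedLength("\u00e9", utf8));       // 2 (e with acute accent)
        System.out.println(encodedLength("\u20ac", utf8));       // 3 (euro sign)
        System.out.println(encodedLength("\uD83D\uDE00", utf8)); // 4 (U+1F600, a surrogate pair)
    }
}
```

Passing the Charset rather than hard-coding UTF-8 keeps the helper independent of any one encoding, although only the UTF-8 results are shown here.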

Test result:
mvn

Passes unit tests and other build checks.

garydgregory

Hello @DarrenJAN

Thank you for providing a PR!

Please see my comments. Note that some comments will apply to more than one location in the code.

Make sure you rebase on git master.

garydgregory · 2024-11-07T12:34:14Z

+     * @throws IOException If an I/O error occurs
+     * @throws CSVException Thrown on invalid input.
+     */
+    public CSVParser parse(final Reader reader, final long characterOffset, final long recordNumber, String encoding) throws IOException {


No new public constructors please. You can augment the builder and add a private constructor.

All new and protected elements should have a Java @since 1.13.0.

I noticed that there is only a builder class in CSVFormat and no builder class in CSVParser. Would you like me to add one?
There is CSVParserBuilder() from com.opencsv.CSVParser, but that is another package.

Uh? See

commons-csv/src/main/java/org/apache/commons/csv/CSVParser.java

Line 151 in 96427fc

public static class Builder extends AbstractStreamBuilder<CSVParser, Builder> {

It seems like I did not rebase on the master branch. Thanks.

This section is not resolved.

garydgregory · 2024-11-07T12:35:34Z

+     *             If there is a problem reading the header or skipping the first record
+     * @throws CSVException Thrown on invalid input.
+     */
+    public CSVParser(final Reader reader, final CSVFormat format, final long characterOffset, final long recordNumber,


See previous comment.

garydgregory · 2024-11-07T12:36:41Z

    }

+    /**
+     * Returns the start byte of this record as a character byte in the source stream.


Please follow the same Javadoc patterns as in other getter methods: A getter "Gets...".

garydgregory · 2024-11-07T12:37:15Z

+    private long bytesReadMark;
+
+    /** Encoder used to calculate the bytes of characters */
+    CharsetEncoder encoder;


Make this new instance variable private.

Not resolved.

garydgregory · 2024-11-07T12:37:35Z

        super(reader);
    }

+    ExtendedBufferedReader(final Reader reader, String encoding) {


Use a Charset instead of a Charset name.

Not resolved.

garydgregory · 2024-11-07T12:43:00Z

+    /**
+     * Returns the start byte of this record as a character byte in the source stream.
+     *
+     * @return the start byte of this record as a character byte in the source stream.


Either the code or the comment is wrong. The return value is a long but the comment talks about a byte. It would help to make the comment clear as to why the return type is a long if that is indeed correct.

New public and protected elements should have a Javadoc tag of @since 1.13.0.
Please clarify the comment: Specifically document this data as "position".

garydgregory · 2024-11-07T12:44:00Z

    private final long characterPosition;

+    /**
+     * The start byte of this record as a character byte in the source stream.


This comment is confusing because it talks about a byte but the type is a long. It would be better to explain the mismatch, if that is indeed the intent.

We are calculating the number of bytes required to encode a single character and storing this value as a long. This choice provides a larger range for the position and ensures consistency with the design, as position is also defined as a long.

Please update the comment to clarify.
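
To make the distinction in this thread concrete: the encoded length of a single character is a small int, while a record's position in the stream is a running total that can exceed the int range on large inputs, hence the long. A minimal sketch, assuming UTF-8 input and '\n' record separators; the class and method names are hypothetical and this is not the parser's actual logic:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class BytePositionSketch {

    // Hypothetical helper: start byte position of each '\n'-terminated record.
    static List<Long> recordStartBytePositions(final String csv) {
        final List<Long> starts = new ArrayList<>();
        long bytePosition = 0; // running total, a long like characterPosition
        int lineStart = 0;
        for (int i = 0; i < csv.length(); i++) {
            if (csv.charAt(i) == '\n') {
                starts.add(bytePosition);
                // Bytes of this record, including its line terminator
                bytePosition += csv.substring(lineStart, i + 1).getBytes(StandardCharsets.UTF_8).length;
                lineStart = i + 1;
            }
        }
        if (lineStart < csv.length()) {
            starts.add(bytePosition); // last record without a trailing newline
        }
        return starts;
    }

    public static void main(final String[] args) {
        // "id,name\n" is 8 bytes; the second record is 4 chars but 5 bytes in UTF-8
        System.out.println(recordStartBytePositions("id,name\n1,\u00e9\n2,\u20ac\n")); // [0, 8, 13]
    }
}
```

The third record starts at character 12 but byte 13, which is the kind of difference a byte position next to characterPosition is meant to capture.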

garydgregory · 2024-11-07T12:44:37Z

+     * @throws CSVException Thrown on invalid input.
+     */
+    public CSVParser(final Reader reader, final CSVFormat format, final long characterOffset, final long recordNumber,
+        String encoding) throws IOException {


Use a Charset, not a charset name.

DarrenJAN · 2024-11-19T20:27:36Z

Hi Gary,
I submitted another commit. Here are the changes that I made:

Augment the builder and add a private constructor to create the CSVParser class
Delete unneeded constructors in the CSVFormat class
Use a try-with-resources block
Improve the comments and fix the indentation
Use a Charset instead of a string Charset name in ExtendedBufferedReader class

Thanks
Yuzhan

garydgregory

Hello @DarrenJAN

Thank you for your updates. There is one adjustment to make where code is not needed because it is provided in the superclass.

DarrenJAN · 2024-11-19T22:45:03Z

Hi Gary,

I just submitted another change.

Please note that due to the encoding algorithm used, it only supports UTF-8 for now. Full encoding can be supported; I just need to put more effort into this.

Thanks
Yuzhan

garydgregory

Hello @DarrenJAN

Now that I look at this again, we are always using the Charset, so setting it cannot by itself make the feature opt-in. The Charset is needed to read the file in the first place, so it will often be set. I think we need a boolean flag that says "recordByteCount". The flag should drive whether the feature is enabled or not, not the Charset. You still need the Charset, but that's not what enables the feature.

Does that make sense? I think adding a boolean recordByteCount to the builder is what's needed here. WDYT?
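
One way to picture the proposal: the Charset stays a plain input setting, and a separate boolean decides whether byte counting happens at all. The sketch below uses a hypothetical stand-alone builder (ReaderConfig and all of its methods are illustrative names, not the Commons CSV API) to show the two settings being independent.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReaderConfig {

    private final Charset charset;
    private final boolean recordByteCount;

    private ReaderConfig(final Builder builder) {
        this.charset = builder.charset;
        this.recordByteCount = builder.recordByteCount;
    }

    static Builder builder() {
        return new Builder();
    }

    Charset getCharset() {
        return charset;
    }

    // Driven only by the flag, never by whether a Charset was set
    boolean isByteCountingEnabled() {
        return recordByteCount;
    }

    static final class Builder {
        private Charset charset = Charset.defaultCharset();
        private boolean recordByteCount; // false unless set: the feature is opt-in

        Builder setCharset(final Charset charset) {
            this.charset = charset;
            return this;
        }

        Builder setRecordByteCount(final boolean recordByteCount) {
            this.recordByteCount = recordByteCount;
            return this;
        }

        ReaderConfig get() {
            return new ReaderConfig(this);
        }
    }

    public static void main(final String[] args) {
        final ReaderConfig plain = ReaderConfig.builder().setCharset(StandardCharsets.UTF_8).get();
        final ReaderConfig tracking = ReaderConfig.builder().setCharset(StandardCharsets.UTF_8)
                .setRecordByteCount(true).get();
        System.out.println(plain.isByteCountingEnabled());    // false
        System.out.println(tracking.isByteCountingEnabled()); // true
    }
}
```

Both configurations set the same Charset, yet only the second one turns counting on, which is the behavior the flag is meant to guarantee.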

garydgregory

Hello @DarrenJAN

There are many comments that were incorrectly marked resolved. See also my comment about adding a boolean to drive the new opt-in behavior.

DarrenJAN · 2024-11-20T16:55:04Z

Hi Gary,

Yes, that makes sense. I agree that relying solely on the Charset to enable the feature isn’t sufficient, as the Charset is often set independently of whether the feature should be enabled.

In our original design, instead of using a boolean, we used a String variable because it served a dual purpose: first, it controlled whether the feature was enabled or not, and second, it specified the Charset and the corresponding byte-counting algorithm to be used. Would you consider using a String?

garydgregory · 2024-11-20T18:42:16Z

Hello @DarrenJAN

I don't understand how a String can work. Let's say I specify I want to read a CSV file encoded in Charset X. The Charset (encoder) that is used by the new counting feature MUST match X, otherwise the counting risks being mismatched. Or am I missing something? Hence, the need for a boolean or some other type that's not a Charset, maybe even an Enum if there is a need for something more than on and off.
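
The mismatch risk can be seen with plain JDK calls: the same record occupies a different number of bytes depending on the Charset, so counting with an encoder other than the one the file was written in gives wrong positions. A small sketch; the class and method names are hypothetical:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatchSketch {

    // Hypothetical helper: size in bytes of the text when encoded with the given Charset
    static int byteLength(final String text, final Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(final String[] args) {
        final String record = "caf\u00e9,1\n"; // 7 chars, one of them non-ASCII

        System.out.println(byteLength(record, StandardCharsets.ISO_8859_1)); // 7
        System.out.println(byteLength(record, StandardCharsets.UTF_8));      // 8
    }
}
```

If the file is ISO-8859-1 but the counter assumes UTF-8, every record after this one would be reported one byte too far into the stream.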

If the PR only supports a subset of Charsets, then the Javadoc of the setter for the feature toggle must document this.

It's also likely that I am not seeing how the PR code only supports some Charsets and not others. What's missing?

TY.

DarrenJAN · 2024-11-20T20:45:08Z

Hi Gary,

In ExtendedBufferedReader.java, the logic of getCharBytes(int current) is based on the UTF-8 encoding specification. We need additional implementation work to support the full set of characters.

"Hence, the need for a boolean or some other type that's not a Charset, maybe even an Enum if there is a need for something more than on and off."
--- We used a String here. We only checked whether the String is null to enable this encoder feature, since our customers only asked for a UTF-8 encoder. I agree with adding a flag to control this feature.

DarrenJAN · 2024-11-25T15:43:59Z

Hi Gary, any thoughts on this?

garydgregory · 2024-11-25T16:07:14Z

Hello @DarrenJAN
Yes, follow up on my comments to add a boolean feature toggle 😉

DarrenJAN · 2024-12-04T16:42:08Z

Hi @garydgregory,
I submitted another commit to add a boolean flag.

garydgregory

Hello @DarrenJAN

Thank you for your updates. Please see my comments.

garydgregory · 2024-12-05T20:40:25Z

        private CSVFormat format;
        private long characterOffset;
        private long recordNumber = 1;
+        private boolean enableByteTracking = false;


Don't initialize to the default value.

garydgregory · 2024-12-05T20:41:18Z

    }

+    /**
+     * Returns the start byte of this record as a character byte in the source stream.


garydgregory · 2024-12-05T20:42:18Z

    }

+    /**
+     * In Java, the {@code char} data type is based on the original Unicode


Getter get, IOW, the Javadoc should start with "Gets ...".

garydgregory · 2024-12-05T20:42:34Z

    }

+    /**
+     * Returns the number of bytes read


"Returns" -> "Gets".

garydgregory · 2024-12-05T20:42:54Z

+
+    @Test
+    public void testGetRecordFourBytesRead() throws Exception {
+        String code = "id,a,b,c\n" +


Use final.

garydgregory

Hello @DarrenJAN

Thank you for the updates.

I think we have the basic logic done but I have some additional change requests.

TY!

garydgregory · 2024-12-16T15:40:05Z

+     * @throws IOException
+     *             If there is a problem reading the header or skipping the first record.
+     * @throws CSVException Thrown on invalid input.
+     * @since 1.13.0.


New private elements do not need a Javadoc since tag.

garydgregory · 2024-12-16T15:41:53Z

+     * @return the byte length of the character.
+     * @throws CharacterCodingException if the character cannot be encoded.
+     */
+    private long getCharBytes(int current) throws CharacterCodingException {


The return type here should be an int because this API returns 0 or CharBuffer.limit() which itself returns an int.

garydgregory · 2024-12-16T15:46:40Z

+     * @return the byte length of the character.
+     * @throws CharacterCodingException if the character cannot be encoded.
+     */
+    private long getCharBytes(int current) throws CharacterCodingException {


I think a better method name here is getEncodedCharLength().

garydgregory · 2024-12-16T15:52:20Z

+     * @param recordNumber
+     *            The next record number to assign.
+     * @param charset
+     *            The character encoding to be used for the reader.


"The character encoding to be used for the reader."
->
"The character encoding to be used for the reader when enableByteTracking is true."

garydgregory · 2024-12-16T16:05:49Z

+    /**
+     * The start byte of this record as a character byte in the source stream.
+     */
+    private final long characterByte;


This name is too confusing IMO, please rename to bytePosition to mirror the existing characterPosition.

DarrenJAN · 2024-12-27T00:10:01Z

Hi @garydgregory,

Happy Holidays! The changes that I made:

New private elements do not need a Javadoc since tag.
Return type of getCharBytes is now int
Change getCharBytes to getEncodedCharLength
Add Javadoc @param for enableByteTracking
CSVRecord: Remove the old constructors and adapt the existing test.
Change characterByte to bytePosition @yunzvanessa What do you think of this change?

Test:
mvn

garydgregory

Hello @DarrenJAN

Thank you for updating the PR.

You missed a couple of my comments regarding documentation.

Please rebase on git master to pick up updates that will automatically show you where you missed using final and other Checkstyle updates. You'll have to resolve conflicts in one test file.

TY & happy hols!

garydgregory · 2024-12-27T12:38:52Z

    private final long characterPosition;

+    /**
+     * The start byte of this record as a character byte in the source stream.


Please update the comment to clarify.

garydgregory · 2024-12-27T12:40:03Z

+    /**
+     * Returns the start byte of this record as a character byte in the source stream.
+     *
+     * @return the start byte of this record as a character byte in the source stream.


New public and protected elements should have a Javadoc tag of @since 1.13.0.
Please clarify the comment: Specifically document this data as "position".

garydgregory

Hello @DarrenJAN
I cannot edit files in this PR, which is unusual; otherwise, I would have added the missing Javadoc @since tags. Please do so and see my unresolved comments.
TY!

DarrenJAN · 2024-12-31T22:11:28Z

Hi @garydgregory, I checked and made the enhancements:

git rebase master
For new public and protected methods, add Javadoc @since 1.13.0
Clarify the comment:

Test:
mvn

garydgregory · 2025-01-02T21:23:57Z

https://issues.apache.org/jira/browse/CSV-196

DarrenJAN · 2025-01-03T18:54:45Z

Hi @garydgregory ,

Thank you very much. When is the next release?

Yuzhan

garydgregory · 2025-01-03T19:34:36Z

Hello @DarrenJAN
I have three open release candidates ATM, so when that's done, I'll start the release process for this component. In the meantime, please test a build from our Maven snapshot repository:

https://repository.apache.org/content/repositories/snapshots/

Add support in Commons CSV for tracking byte positions during parsing (…

7dca281

…#9) Add support in Commons CSV for tracking byte positions during parsing

garydgregory requested changes Nov 7, 2024

DarrenJAN added 2 commits November 12, 2024 11:05

Merge branch 'apache:master' into CSV-196-master

b244cb1

Add support in Commons CSV for tracking byte positions during parsing (…

3599f5b

…#12) Add support in Commons CSV for tracking byte positions during parsing

garydgregory requested changes Nov 19, 2024

CSV-196: Remove duplicated Charset (#13)

344f282

garydgregory requested changes Nov 19, 2024

garydgregory requested changes Nov 20, 2024

Adding a boolean to drive byte tracking opt-in behavior (#14)

27511be

Adding a boolean to drive byte tracking opt-in behavior

garydgregory requested changes Dec 5, 2024

Fix comments (#15)

8387f79

* Fix comments

garydgregory requested changes Dec 16, 2024

CSV-196-master: More changes (#16)

bdd152f

garydgregory requested changes Dec 27, 2024

Merge branch 'apache:master' into CSV-196-master

110fea8

garydgregory requested changes Dec 31, 2024

CSV-196: Comments changes on Dec30 (#17)

d403084

garydgregory merged commit b40039b into apache:master Jan 2, 2025

garydgregory changed the title ~~CSV-196-TrackBytePositions~~ [CSV-196] Track byte position Jan 2, 2025

asfgit pushed a commit that referenced this pull request Jan 2, 2025

[CSV=196] Track byte position #502

eddb332
