Skip to content

GH-37841: [Java] Dictionary decoding not using the compression factory from the ArrowReader#38371

Merged
lidavidm merged 12 commits intoapache:mainfrom
vibhatha:gh-37841
Feb 1, 2024
Merged

GH-37841: [Java] Dictionary decoding not using the compression factory from the ArrowReader#38371
lidavidm merged 12 commits intoapache:mainfrom
vibhatha:gh-37841

Conversation

@vibhatha
Copy link
Copy Markdown
Contributor

@vibhatha vibhatha commented Oct 20, 2023

Rationale for this change

This PR addresses #37841.

What changes are included in this PR?

Adding compression-based write and read for Dictionary data.

Are these changes tested?

Yes.

Are there any user-facing changes?

No

@github-actions
Copy link
Copy Markdown

⚠️ GitHub issue #37841 has been automatically assigned in GitHub to PR creator.

@vibhatha vibhatha marked this pull request as ready for review November 1, 2023 08:16
@vibhatha vibhatha requested a review from lidavidm as a code owner November 1, 2023 08:16
@vibhatha vibhatha marked this pull request as draft November 1, 2023 08:16
@vibhatha vibhatha marked this pull request as ready for review November 14, 2023 16:58
Copy link
Copy Markdown
Member

@danepitkin danepitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, @vibhatha !

Comment on lines 69 to 73
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be marked private final and grouped with the other private final vars?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, way neater that way.

Comment on lines 146 to 145
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a helper function for creating a codec?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use getCodec() here, too!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: Somewhat unrelated to the issue, but should we parameterize the tests to use all the different types of compression? See TestCompressionCodec.java as an example.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can get rid of these in favor of this.allocator if you keep the @Before/@After functions

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. I refactored the code to use this approach.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the @Before/@After functionality isn't used for all @Tests, it might be better to move this to a helper function instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I will update.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use @BeforeEach so that the allocator and root is reset for each test? It might make the tests slower, but not sure if it's better to have a fresh allocator for each test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I think that is better and safe.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unaddressed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw @lidavidm

I updated the JUnit annotations for consistency and compatibility. The @beforeeach annotation from JUnit 5 was being used in conjunction with @test from JUnit 4, causing the setup method not to run as expected before each test.

Furthermore do we need to make the usage of JUnit consistent across tests?

Copy link
Copy Markdown
Contributor Author

@vibhatha vibhatha Dec 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change okay?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we still need the original test function testArrowFileZstdRoundTrip? Is this new test case possibly testing the same code path + dictionaries?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the same path + dictionaries, I refactored and reorganized the test cases, does it make sense or useful? I think my previous code had duplication.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Nov 14, 2023
@vibhatha
Copy link
Copy Markdown
Contributor Author

@danepitkin Thanks a lot for the review comments, I will address them.

Copy link
Copy Markdown
Member

@danepitkin danepitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! The tests look very clean now. left a few small additional comments.

Comment on lines 146 to 145
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use getCodec() here, too!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use @BeforeEach so that the allocator and root is reset for each test? It might make the tests slower, but not sure if it's better to have a fresh allocator for each test.

Comment on lines 108 to 114
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional nit: I think the readArrowFile() and readArrowStream() functions are not needed. While they improve code duplication, I think they also decrease code readability. Personally, I like to quickly see what is asserted in the test functions themselves. I do like how you've separated out other functionality into their own functions like createAndWriteArrowFile().

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had doubts after the cleanup 😄
Let me update the PR.

@vibhatha
Copy link
Copy Markdown
Contributor Author

@danepitkin I am not sure if this one is practical though 🤔

@danepitkin
Copy link
Copy Markdown
Member

@danepitkin I am not sure if this one is practical though 🤔

I'm okay with leaving it as-is if there are issues with using @BeforeEach instead of @Before.

Overall, LGTM! I think it's ready for final review from a committer. Excellent job.

@vibhatha
Copy link
Copy Markdown
Contributor Author

@lidavidm appreciate your feedback.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we should be able to initialize the codec once and reuse it in the constructor, rather than add all the new fields?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidavidm an issue was filed here: https://github.com/apache/arrow/issues/39222
Also, I can work on this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again: why did we pull this out? It's called only once. Why are we adding a bunch of new fields? We don't use them.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Nov 27, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 13, 2023
@vibhatha vibhatha requested a review from lidavidm December 13, 2023 11:16
@vibhatha
Copy link
Copy Markdown
Contributor Author

@lidavidm I updated the PR. Appreciate another round of reviews.

Copy link
Copy Markdown
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vibhatha can you file a followup to add a new integration test to cover this scenario?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this removing a public constructor?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidavidm for my clarification.

Maybe I have misunderstood your comment here.

I thought it would be rather cleaner to pass the CompressionCodec rather than passing all the other parameters which make this object.

Did you mean something else?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can delegate constructors without removing public ones (which breaks API).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidavidm does the updated change make sense?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unaddressed.

@github-actions github-actions bot added awaiting review Awaiting review awaiting changes Awaiting changes awaiting committer review Awaiting committer review and removed awaiting change review Awaiting change review awaiting review Awaiting review awaiting changes Awaiting changes labels Dec 13, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Feb 1, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Feb 1, 2024
@vibhatha
Copy link
Copy Markdown
Contributor Author

vibhatha commented Feb 1, 2024

@github-actions crossbow submit java

@github-actions
Copy link
Copy Markdown

github-actions bot commented Feb 1, 2024

Revision: 907195a

Submitted crossbow builds: ursacomputing/crossbow @ actions-27dab1a4d2

Task Status
java-jars GitHub Actions
verify-rc-source-java-linux-almalinux-8-amd64 GitHub Actions
verify-rc-source-java-linux-conda-latest-amd64 GitHub Actions
verify-rc-source-java-linux-ubuntu-20.04-amd64 GitHub Actions
verify-rc-source-java-linux-ubuntu-22.04-amd64 GitHub Actions
verify-rc-source-java-macos-amd64 GitHub Actions

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Feb 1, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Feb 1, 2024
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Feb 1, 2024
@lidavidm lidavidm merged commit f9b7ac2 into apache:main Feb 1, 2024
@lidavidm lidavidm removed the awaiting merge Awaiting merge label Feb 1, 2024
@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f9b7ac2.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…factory from the ArrowReader (apache#38371)

### Rationale for this change

This PR addresses apache#37841. 

### What changes are included in this PR?

Adding compression-based write and read for Dictionary data. 

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No
* Closes: apache#37841

Lead-authored-by: Vibhatha Lakmal Abeykoon <vibhatha@gmail.com>
Co-authored-by: vibhatha <vibhatha@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
…factory from the ArrowReader (apache#38371)

### Rationale for this change

This PR addresses apache#37841. 

### What changes are included in this PR?

Adding compression-based write and read for Dictionary data. 

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No
* Closes: apache#37841

Lead-authored-by: Vibhatha Lakmal Abeykoon <vibhatha@gmail.com>
Co-authored-by: vibhatha <vibhatha@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Java] Dictionary decoding not using the compression factory from the ArrowReader

3 participants