-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-5902: [Java] Implement hash table and equals & hashCode API for dictionary encoding #4846
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@emkornfield Hi Micah, I made a simple implementation for dictionary hash table. |
java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/VarCharVector.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/VarCharVector.java
Outdated
Show resolved
Hide resolved
|
@liyafan82 Thanks for your comments, This PR is just a demo as I commented as WIP and this will work for VarCharVector type encoding in TestDictionaryVector to test correctness. default and memory copy already commented as TODO to make it perfect since we have to use memory level hashCode and implement equals in all type vectors. |
Codecov Report
@@ Coverage Diff @@
## master #4846 +/- ##
==========================================
+ Coverage 87.42% 89.58% +2.16%
==========================================
Files 994 661 -333
Lines 140102 96617 -43485
Branches 1418 0 -1418
==========================================
- Hits 122481 86555 -35926
+ Misses 17259 10062 -7197
+ Partials 362 0 -362Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a no-op I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for comments, IMO, what dose this mean? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield Could you please explain this a little more? thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Integer.hashCode(value) == value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, you are right, thanks, fixed.
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good start do you see performance improvements? If so I think the next step would be to add unit tests.
Also, CC @praveenbingo and @pravindra to make sure they don't have concerns with the new interfaces.
java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/complex/FixedSizeListVector.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryHashTable.java
Outdated
Show resolved
Hide resolved
java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestExtensionType.java
Outdated
Show resolved
Hide resolved
java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestExtensionType.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java
Outdated
Show resolved
Hide resolved
Sure, I have already added a DictionaryEncoderBenchmark and list its previous performance data in #4786. And add what kind of UT? now TestDictionaryVector already support many types like Struct, List, Binary, Varchar etc. |
|
Sorry forgot about the existing unit tests. LGTM. |
|
+1, than you |
… dictionary encoding As discussed in #4792 Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API. Author: tianchen <niki.lj@alibaba-inc.com> Closes #4846 from tianchen92/hasher and squashes the following commits: 2db7302 <tianchen> fix 5facc2a <tianchen> resolve comments 175192a <tianchen> fix test and style 7a87526 <tianchen> implementation of equals and hashCode c89608b <tianchen> fix 8f2e1a2 <tianchen> hash table prototype
… dictionary encoding As discussed in apache#4792 Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API. Author: tianchen <niki.lj@alibaba-inc.com> Closes apache#4846 from tianchen92/hasher and squashes the following commits: 2db7302 <tianchen> fix 5facc2a <tianchen> resolve comments 175192a <tianchen> fix test and style 7a87526 <tianchen> implementation of equals and hashCode c89608b <tianchen> fix 8f2e1a2 <tianchen> hash table prototype
As discussed in #4792
Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API.