Skip to content

Conversation

@tianchen92
Copy link
Contributor

Related to ARROW-5835.
Now is not implemented because byte array is not supported to be HashMap key.
One possible way is that wrap them with something to implement equals and hashcode.

@codecov-io
Copy link

Codecov Report

Merging #4792 into master will increase coverage by 2.16%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4792      +/-   ##
==========================================
+ Coverage   87.43%   89.59%   +2.16%     
==========================================
  Files         996      661     -335     
  Lines      139677    96300   -43377     
  Branches     1418        0    -1418     
==========================================
- Hits       122124    86282   -35842     
+ Misses      17191    10018    -7173     
+ Partials      362        0     -362
Impacted Files Coverage Δ
r/src/recordbatch.cpp
r/R/Table.R
js/src/util/fn.ts
go/arrow/array/bufferbuilder.go
r/src/symbols.cpp
rust/datafusion/src/execution/projection.rs
rust/datafusion/src/execution/filter.rs
rust/arrow/src/csv/writer.rs
rust/datafusion/src/bin/main.rs
go/arrow/ipc/file_reader.go
... and 325 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a9a82ec...cdbcf41. Read the comment docs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems String is also a good wrapper for byte array, with hashCode & equals properly defined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments, @liyafan82 . Sure we could use String as the wrapper, but String will convert byte[] to char[] which I'm afraid it will affect performance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about a ByteBuffer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable, fixed now, thanks a lot!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to wrap the underlying data directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Micah, do you mean create a wrapper class to hold byte[] like ByteArrayWrapper?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explaining my comment more on the last review about the custom hash-table:
Instead of this approach it might be nice to see we can somehow use either the comparators that @liyafan82 has introduced and also introduce a concept of a Hasher. If we use those, then we can avoid calling the getObject() method at all, which in some cases might be expensive? (string copying?). What do you thing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield , I agree with you.
There are significant performance overhead here, due to repeated memory copy, and conversions between java object & binary bytes. So some significant rework is required before it can be applied to our scenario.
We are preparing for the rework, the sort/search functionalities are prerequisites for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments. @emkornfield Let me understand, you suggest something like that to remove lookup HashMap, right?

for (int i = 0; i < count; i++) {
//vector represents the vector to encode
Object value = vector.getObject(i);
int index = dictionary.getVector().search(value); //comparators or hasher
encodedVector.setWithPossibleTruncate(i, index);
}

In some cases, will search perform worse than getObject? Since comparators are not supported supported or will introduce many changes with a concept of a Hasher, I would prefer to test and work for this in a follow-up PR, what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up PR is fine because I think it . Not necessarily with search. I was thinking of still having a hash table. But something like:

class Hasher { ValueVector wrappedVector; // specialized per vector type. Avoids returning an object // so no object creation is required. abstract int getHash(int index); }
Then the HashTable would have a have comparator:
HashTable {
Comparator comparator;
Hasher hasher;
int getIndex(int arrayToEncodeIndex) {
entries = table.get(Hasher(arrayToEncodeindex));
for (entriy : entries) {
if (comparator(entry.index, arrayToEncodeIndex) == 0){
return entry.index
}
}
return -1;
}
}
`
A few open questions

  1. Do we want to move comparator to this module?
  2. Does this really give a performance benefit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And 3. Should there be either a new method on ValueVector or interface for direct equality checking instead of using comparator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your prototype, we need carefully think of the follow-up design and test the perf, BWT, I think it's a good start whatever method is used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up PR is fine because I think it . Not necessarily with search. I was thinking of still having a hash table. But something like:

class Hasher { ValueVector wrappedVector; // specialized per vector type. Avoids returning an object // so no object creation is required. abstract int getHash(int index); }
Then the HashTable would have a have comparator:
HashTable {
Comparator comparator;
Hasher hasher;
int getIndex(int arrayToEncodeIndex) {
entries = table.get(Hasher(arrayToEncodeindex));
for (entriy : entries) {
if (comparator(entry.index, arrayToEncodeIndex) == 0){
return entry.index
}
}
return -1;
}
}
`
A few open questions

  1. Do we want to move comparator to this module?
  2. Does this really give a performance benefit?

@emkornfield , thanks for your comments. Since dictionary encoding is key to the performance of our scenario, I would like provide some comments:

  1. We can move the dictionary related code to the algorithm module, but not vice versa, because a cyclic dependency will be created.

  2. A big +1 for your Hash interface. It will be of great help for 1) reducing the conversions between Java objects and Arrow bufer; 2) avoiding unnecessary memory copy.

In addition, another lower level hasher based on memory buffer should be provided. I will start a new issue to track that.

  1. Equality can be determined in two ways: 1) by a comparator; 2) by hashCode + equals. I think it is OK to add a member for ValueVector to provide the default equality behavior.

However, it is also beneficial to provide some interface for calculating the hash code. According to our experience, different algorithms for computing the hash code manifest widely different behaviors and have significant performance implications, so they are suitable for different scenarios.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liyafan82

We can move the dictionary related code to the algorithm module, but not vice versa, because a cyclic dependency will be created.

For now to avoid API breakages I would prefer to leave it here. Once we have more examples of dependencies we can figure out the best way to factor the code.

In addition, another lower level hasher based on memory buffer should be provided. I will start a new issue to track that.

SGTM

I think it is OK to add a member for ValueVector to provide the default equality behavior.

I tend to agree that an equality method would be helpful, I need to go back and look through the classes again because I'm surprised there isn't something already available.

However, it is also beneficial to provide some interface for calculating the hash code. According to our experience, different algorithms for computing the hash code manifest widely different behaviors and have significant performance implications, so they are suitable for different scenarios.

Can you elaborate on the different scenarios is it more than just varying by data type? How do you choose a hasher for any particular scenario

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now to avoid API breakages I would prefer to leave it here. Once we have more examples of dependencies we can figure out the best way to factor the code.

Sure. Sounds reasonable.
My point is to leave this one alone and gradually deprecated it, and create a new one in the algorithm module. The reason is that the current one can hardly be used in any practical scenario, because of heavy performance overhead:

  1. There are repeated conversions between Java objects and bytes (e.g. vector.getObject(i)).
  2. Unnecessary memory copy (the vector data must be copied to the hash table).
  3. The hash table cannot be reused for encoding multiple vectors (other data structure & results cannot be reused either).
  4. The output vector should not be created/managed by the encoder (just like in the out-of-place sorter)
  5. The hash table requires that the hashCode & equals methods be implemented appropriately, but this is not guaranteed.

I will start a new issue to redesign the dictionary encoder.

Can you elaborate on the different scenarios is it more than just varying by data type? How do you choose a hasher for any particular scenario.

Sure.
Generally, we want the hash code to be uniformly distributed in the universe. This will have great benefits in practice. For example, in a hash table with open addressing, a uniform hash function will lead to small number of collisions, short bucket clusters, etc., which makes insert/search/update operations for the hash table much faster.

A uniform hash function, however, is usually compute-intensive. On the other hand, a simple hash function can be easy to compute, but it causes severe problem in practice (e.g. in hash table operations). Both hash functions are valid in the sense that if two objects are equal, they must have identical hash code.

We have such a experience, our hash join operator performs at least two orders of magnitude slower, just because of a poorly selected hash function.

Therefore, the hash function has significant performance implications. The key is the balance between being uniform and computational cost, and this balance should depend on concrete scenario.

So we should not rely on a single hash function. We should give the user the ability to plug-in the hash function as they like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conversations and memory copy could be avoidable with new designed hash table & Hasher interface and new hash & equals API without changing decoder interface. #4846, #4844
+1 for provide multiple hash code implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general comment is that, we should be careful when introducing new interfaces, especially vector interfaces, since they are the core for Arrow.
New interfaces make the class hierarchy difficult to manage, and once an interface is added, it is difficult to remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you we should be careful introducing new interfaces, if you all think this is not needed, it can be removed. But in some case we might use two "if" to judge the type separately like DictionaryEncoder#encode in this PR which seems a little ugly.

@emkornfield
Copy link
Contributor

emkornfield commented Jul 10, 2019

+1, thanks

@emkornfield
Copy link
Contributor

Actually, I would like to get a second opinion on the interface addition before merging @praveenbingo @pravindra?

@emkornfield
Copy link
Contributor

@praveenbingo @pravindra any concerns about the new interface?

@emkornfield
Copy link
Contributor

@tianchen92 you'll need to fix the conflicts from merging your other PR. Thanks

@tianchen92
Copy link
Contributor Author

@tianchen92 you'll need to fix the conflicts from merging your other PR. Thanks

Thanks for your efforts, fixed now.

@pravindra
Copy link
Contributor

@praveenbingo @pravindra any concerns about the new interface?

I'll review this over the weekend.

Copy link
Contributor

@pravindra pravindra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@emkornfield
Copy link
Contributor

+1, CI failure appears to be flight failure.

emkornfield pushed a commit that referenced this pull request Jul 19, 2019
… dictionary encoding

As discussed in #4792

Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API.

Author: tianchen <niki.lj@alibaba-inc.com>

Closes #4846 from tianchen92/hasher and squashes the following commits:

2db7302 <tianchen> fix
5facc2a <tianchen> resolve comments
175192a <tianchen> fix test and style
7a87526 <tianchen> implementation of equals and hashCode
c89608b <tianchen> fix
8f2e1a2 <tianchen> hash table prototype
kszucs pushed a commit that referenced this pull request Jul 22, 2019
Related to [ARROW-5835](https://issues.apache.org/jira/browse/ARROW-5835).
Now is not implemented because byte array is not supported to be HashMap key.
One possible way is that wrap them with something to implement equals and hashcode.

Author: tianchen <niki.lj@alibaba-inc.com>

Closes #4792 from tianchen92/ARROW-5835 and squashes the following commits:

f50a19e <tianchen> fix UNION regression
8267c2b <tianchen> fix style
a039bc1 <tianchen> Support Dictionary Encoding for binary type
kszucs pushed a commit that referenced this pull request Jul 22, 2019
… dictionary encoding

As discussed in #4792

Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API.

Author: tianchen <niki.lj@alibaba-inc.com>

Closes #4846 from tianchen92/hasher and squashes the following commits:

2db7302 <tianchen> fix
5facc2a <tianchen> resolve comments
175192a <tianchen> fix test and style
7a87526 <tianchen> implementation of equals and hashCode
c89608b <tianchen> fix
8f2e1a2 <tianchen> hash table prototype
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
Related to [ARROW-5835](https://issues.apache.org/jira/browse/ARROW-5835).
Now is not implemented because byte array is not supported to be HashMap key.
One possible way is that wrap them with something to implement equals and hashcode.

Author: tianchen <niki.lj@alibaba-inc.com>

Closes apache#4792 from tianchen92/ARROW-5835 and squashes the following commits:

f50a19e <tianchen> fix UNION regression
8267c2b <tianchen> fix style
a039bc1 <tianchen> Support Dictionary Encoding for binary type
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
… dictionary encoding

As discussed in apache#4792

Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API.

Author: tianchen <niki.lj@alibaba-inc.com>

Closes apache#4846 from tianchen92/hasher and squashes the following commits:

2db7302 <tianchen> fix
5facc2a <tianchen> resolve comments
175192a <tianchen> fix test and style
7a87526 <tianchen> implementation of equals and hashCode
c89608b <tianchen> fix
8f2e1a2 <tianchen> hash table prototype
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants