use HashMap instead of TreeMap for OnHeapHnswGraph neighbors#12248
use HashMap instead of TreeMap for OnHeapHnswGraph neighbors#12248jbellis wants to merge 13 commits intoapache:mainfrom
Conversation
Switch OnHeapHnswGraph from representing a single node's neighbors as a TreeMap, to a HashMap, and update its callers to no longer assume that NodeIterator results are ordered by ordinal.
…nd Collections.sort instead
msokolov
left a comment
There was a problem hiding this comment.
I left some style questions but overall seems reasonable - thank you! We only need the nodes sorted when we finally write the graph. In the meantime we want fast insertion and lookup and don't care about the sorting.
Could you say how you used that test harness? I looked at it, and didn't see a method that would measure and print out multiple iterations of indexing times as you showed in the overview.
lucene/core/src/test/org/apache/lucene/util/hnsw/HnswGraphTestCase.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
Outdated
Show resolved
Hide resolved
|
(My mistake; it looks like tidy/spotlessApply don't care about imports; it must have been intellij's fault.) |
|
I saw NeighborQueue construction taking up a decent chunk of time on the profiler data, so I refactored HGSearcher a bit to only allocate once per call to [HashMap build only] [HashMap + NQ change, build only] |
|
I'm not entirely sure why just from inspection, but this seems to have broken some of the backwards-compatibility tests. What it means is this can no longer read indexes written by a prior point release. Maybe consider reverting the change to pre-allocate the NeighborQueue so we can move forward with the change from TreeMap to HashMap? |
lucene/core/src/test/org/apache/lucene/util/hnsw/HnswGraphTestCase.java
Outdated
Show resolved
Hide resolved
lucene/core/src/test/org/apache/lucene/util/hnsw/HnswGraphTestCase.java
Outdated
Show resolved
Hide resolved
lucene/core/src/test/org/apache/lucene/util/hnsw/HnswGraphTestCase.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
Outdated
Show resolved
Hide resolved
|
I've added the types in, but I'd like to push back a bit on requiring Locale for string.formatted in test code -- the alternative isn't really using Locale and localizing the assertion messages, the realistic alternative is doing old school string concatenation with +, and that's just worse all around. |
|
^ I think that's everything, I'll look at what's going on w/ the tests for the NeighborQueue piece and create a new PR |
|
^ last commit addresses the legacy test failures, not clear to me why they didn't fail before on this branch, but with this change everything passes with and w/o the NQ commit |
I don't understand - can't we use |
Lucene's tests are executed with randomized settings, so to reproduce them you often have to provide the same test seed that was used. This gets printed out in a reproduce command in the failure message; see where it has "-Dtests.seed=xxxxx" |
|
also - in the future I recommend not using force-push with github PRs since it loses all the history and (I think) can erase the comments that were tied to specific commits. At any rate it makes things difficult for reviewers since we have to re-review the entire contribution rather than being able to see what changed since the last review |
|
My bad, I missed that one. |
|
Thanks, Michael! |
Switches OnHeapHnswGraph from representing a single node's neighbors as a TreeMap, to a HashMap, and updates its callers to no longer assume that NodeIterator results are ordered by ordinal.
This means that looking up neighbors goes from an O(log N) operation to an O(1) operation for all upper levels. Only callers that need ordered neighbors need to pay the cost of sorting.
On my machine, building the graph and running the queries from the SIFT dataset at http://corpus-texmex.irisa.fr/ sees about a 4% speedup. These are dimension 128 vectors; the relative importance of looking up neighbors vs computing similarities will vary inversely with dimensionality.
(First I did three runs for each codebase, but there was enough variance that I then ran five more.)
[TreeMap]
Run 0 took 708.876145328 seconds
Run 1 took 754.700354185 seconds
Run 2 took 710.236167725 seconds
Run 0 took 717.61476478 seconds
Run 1 took 742.494343683 seconds
Run 2 took 736.983143255 seconds
Run 3 took 719.305541186 seconds
Run 4 took 722.831859596 seconds
[HashMap]
Run 0 took 724.875610053 seconds
Run 1 took 682.65666933 seconds
Run 2 took 682.655977613 seconds
Run 0 took 716.165341298 seconds
Run 1 took 684.314657618 seconds
Run 2 took 686.456432263 seconds
Run 3 took 717.067567184 seconds
Run 4 took 702.009706983 seconds
The timing harness is here: https://github.com/jbellis/hnswdemo/tree/lucene-bench