LUCENE-9838: simd version of VectorUtil.dotProduct#18
LUCENE-9838: simd version of VectorUtil.dotProduct#18rmuir wants to merge 1 commit intoapache:mainfrom
Conversation
gradle/defaults-java.gradle
Outdated
| } | ||
|
|
||
| configure(project(":lucene:core")) { | ||
| plugins.withType(JavaPlugin) { |
There was a problem hiding this comment.
If this is MR-jar code then it'd be cleaner to create separate sourcesets for different jdk levels required and then apply these compiler options to each sourceset. This makes it possible to work with multiple "levels" of compatibility on the same source data.
You could also (conditionally) disable compilation of these extra sourcesets if the compiler version is lower than the sourceset's required version... this would enable people to work with the code without installing all the (alpha) JDKs.
I can take care of this if needed - let me know.
There was a problem hiding this comment.
The problem is currently: it is incubating API, so it will break with JDK17 again.
We have to wait that the incubating module get's into java.base / java.whatever, so it can be used and API is stable. The problem with MR-JARs is the following:
- it does not handle module system, so you would get ClassNotFound if you dont start with correct command line flag
- You can't say: "use this only is Java version exactly 16"
Like with my own Pull Request for MMAP: We should test this, e.g. on Jenkins, but not add to productive code.
There was a problem hiding this comment.
I feel the same for MMapDirectory v2: "it should really go into Lucene Core", with some trick, but at the moment I have to wait. Netty has something similar at moment, you have to download a separate JAR and add it to your classpath.
So one way to implement this: Add a separate lucene module that's optional and use some "plug in" mechanism (e.g., SPI). People who want to use it must use the exact correct version for their JDK version and must add some command line flag to use it (enable it in their JDK).
There was a problem hiding this comment.
For MMapDirectory my plan would be to have a separate lucene-mmap-foreign-jdk16.jar that compiled against JDK 16 that provides another directory. Code that wants to use it knows:
- you have to add the JAR file
- you have to hardcode new JDK16MMapDirectory() instead of FSDircetory.open()
- add command line flags
There was a problem hiding this comment.
Is it really impossible though? Wit MR-JAR could we define a class version for 16+, but define slow version for 17+? Effectively, we define a version of the code for only exactly version 16?
Furthermore, couldn't we check with static initializer and if the incubator module isn't available, link methodhandle to a slow version?
There was a problem hiding this comment.
I think: for this use case, we should not use MR-JARs at all. Just use a MethodHandle. It's just a single method and a few classes. Maybe you can construct this using a single MethodHandle using those combiners (MethodHandles.adapt.... and MH.guard, and all other methods).... Another MethodHandle would go to our slow version.
There was a problem hiding this comment.
I'll just repeat what I think I'm hearing since I'm slow ... It sounds like a way forward is something like this?
- Use reflection to search for this new implementation of
static float SuperDuperVectors.dotProduct(float[], float[])and/or fall back to the old way? - Provide a {{vector-api.jar}} with
SuperDuperVectorsin it (or some such) just to wrap this object, and maybe other Vector stuff like Euclidean norm. - Get the build sorted out so that the plugin can be built conditionally, depending on JDK used.
Then users can incorporate that jar and provide the proper JDK command line flag to include the incubating vector api support, and that's it?
There was a problem hiding this comment.
Not reflection, but a combined MethodHandle.
gradle/validation/error-prone.gradle
Outdated
|
|
||
| allprojects { prj -> | ||
| plugins.withType(JavaPlugin) { | ||
| prj.apply plugin: 'net.ltgt.errorprone' |
There was a problem hiding this comment.
Does it fails with JDK16, even if it's not used?
There was a problem hiding this comment.
yes, unfortunately it does:
$ ./gradlew -Pruntime.java.home=/home/rmuir/Downloads/jdk-16 compileJava
...
> Task :lucene:core:compileJava FAILED
Exception in thread "main" java.lang.IllegalAccessError: class com.google.errorprone.ErrorProneJavacPlugin (in unnamed module @0x31610302) cannot access class com.sun.tools.javac.api.BasicJavacTask (in module jdk.compiler) because module jdk.compiler does not export com.sun.tools.javac.api to unnamed module @0x31610302
at com.google.errorprone.ErrorProneJavacPlugin.init(ErrorProneJavacPlugin.java:38)
at jdk.compiler/com.sun.tools.javac.api.BasicJavacTask.initPlugin(BasicJavacTask.java:255)
at jdk.compiler/com.sun.tools.javac.api.BasicJavacTask.initPlugins(BasicJavacTask.java:229)
at jdk.compiler/com.sun.tools.javac.main.Main.compile(Main.java:292)
at jdk.compiler/com.sun.tools.javac.main.Main.compile(Main.java:176)
at jdk.compiler/com.sun.tools.javac.Main.compile(Main.java:64)
at jdk.compiler/com.sun.tools.javac.Main.main(Main.java:50)
There was a problem hiding this comment.
I'll handle this separately.
There was a problem hiding this comment.
I committed a different variant of jdk16 workaround and ran check, worked for me.
There was a problem hiding this comment.
I think the problem is that the errorprone plugin has some META-INF/service, so as soon as it appears on classpath it is hooked into javac.
So you latest fix seem to work around it (keep the dependency, but remove it from config, so it is not added to classpath). I did not fully understand you patch, but this looks fine.
| float res = 0; | ||
| // if the array size is large (2x platform vector size), its worth the overhead to vectorize | ||
| // vector loop is unrolled a single time (2 accumulators in parallel) | ||
| if (a.length >= 2 * SPECIES.length()) { |
There was a problem hiding this comment.
very elegant! We had played around with this API last summer and failed to achieve gains. Maybe it has incubated! Or maybe we fumbled the attempt; anyway this is awesome gains for a very small amount of code which is right at the hot spot for vector search. I'll leave some thoughts about what strategy to pursue for making this available on the JIRA, but as for the code, it's great
There was a problem hiding this comment.
@msokolov here's a few notes of troubles i ran into in case it helps for the next time:
- start with "full-performance" example in the JEP/javadocs and then iterate from there.
- don't use "single masked vector op" at the end. don't use masked ops at all :)
- don't put kernel "last" like current code in master. it must be "first" (alignment restrictions i think).
- loop structure is important, otherwise whole loop gets slow (bounds checks I think). See the tricks i had to do with upperBound: that's the only way i was able to get a fast loop.
I may revisit the existing scalar code in master and see if things can be improved there, too.
This version requires no build changes or MR-JAR. The user can opt-in to "testing" the JDK-16 incubating vector api in the usual fashion, by passing "--add-modules jdk.incubator.vector" on the command line. If they do this, the dot product is accelerated, otherwise it is not. User can also check that their setup is correct via a boolean VectorUtil.DOTPRODUCT_VECTORIZATION_SUPPORTED.
|
OK i added a version that has no mr-jar and no build-system changes, but still allows the user to opt-in to the new |
|
By the way, the precommit and tests pass (despite the pure evil happening here). The github checker is just flaky right now (read timeouts and 502 errors downloading the internet). You may just re-run it later when github's networking is again functional. |
|
|
||
| // org.apache.lucene.util.VectorUtilSIMD#dotProduct(float[], float[]) | ||
| private static final String SIMD_BASE64 = | ||
| "yv66vgAAADwAbQoAAgADBwAEDAAFAAYBABBqYXZhL2xhbmcvT2JqZWN0AQAGPGluaXQ+AQADKClW\n" |
There was a problem hiding this comment.
oh man, this is pretty funny, evil whatever. It's nice that it's self contained and doesn't require any build shenanigans. I guess it might be difficult to modify the source code though :) I'll run a test in a vector benchmark to see what happens there
There was a problem hiding this comment.
So I ran my usual HNSW benchmark using KnnGraphTester, which is 1M 256-dim vectors. This was run on my 2-core i7 laptop . Indexing performance improved from 412 sec to 292 sec, and search times improved somewhat less (latency column in table below). These benchmarks are kind of noisy, but they are predictive of what performance we'd expect in an application. Probably we aren't able to pump vectors through the L2 cache fast enough to keep up with the CPU at this point. We might be able to restructure the search code to try to do some pipelining?
JDK16 --add-modules jdk.incubator.vector
recall latency nDoc fanout maxConn beamWidth visited index ms
0.881 0.49 1000000 10 20 50 1154 291684
0.897 0.60 1000000 25 20 50 1262 0
0.913 0.63 1000000 50 20 50 1444 0
0.934 0.92 1000000 100 20 50 1791 0
JDK-11
0.884 0.63 1000000 10 20 50 1143 412111
0.888 0.66 1000000 25 20 50 1248 0
0.913 0.78 1000000 50 20 50 1434 0
0.929 1.00 1000000 100 20 50 1783 0
There was a problem hiding this comment.
Thanks for testing (high level) @msokolov ! I think you are correct that is expected, it would not translate exactly to 5x faster indexing or whatever, as it isn't the only thing that is happening.
From my inspection, there are a lot of other hotspots there, too.
But yeah, it is true that maybe we can start working those other hotspots off as well. For example, IMO it is silly with mmap directory for us to be decoding byte[] slowly into a float[] (readLEFloats or whatever). Vector API can use byte[] or even ByteBuffer directly (I assume any conversions are vectorized too, have not experimented with that).
There was a problem hiding this comment.
There was a problem hiding this comment.
I think we should have the source code of this class file available for regeneration. So anybody can compile with Java 11, but regeneration is only possible with exact Java 16 version.
As it currently looks like I would not be sure if we can release this.
If we don't add the original source code as a java file for regeneration, we should add source code here (as comment).
There was a problem hiding this comment.
But yeah, it is true that maybe we can start working those other hotspots off as well. For example, IMO it is silly with mmap directory for us to be decoding byte[] slowly into a float[] (readLEFloats or whatever). Vector API can use byte[] or even ByteBuffer directly (I assume any conversions are vectorized too, have not experimented with that).
It gets even worse with MMapDirectory version 2 for Java 16. So IMHO, once we are really on JDK 17 minimum, we should change the method signatures of IndexInpout and replace our void readLEFloats(float[]) by FloatVector readFloatVector(), on MMapDirectory this can use a ByteBuffer oder MemorySegmrnt directly on the mmapped contents (if platform endianess is LE). This would spare millions of native->heap arraycopy actions for nonsense.
There was a problem hiding this comment.
start with changing it to use byte[] instead of float[]. For the case of vector 256 this replaces 64 calls each to bytebuffer position, readInt, bswap, etc etc with a single readBytes() call. And it is easiest next step as there will be no alignment issues.
going from bytebuffer is harder.
| } | ||
|
|
||
| try { | ||
| Class<?> clazz = |
There was a problem hiding this comment.
This doesn't require additional permissions (custom class loader definition)?
There was a problem hiding this comment.
It's safe, because we catch Throwable. To be safe, maybe add AccessController around it. But the necessity for this requires more investigation, I am sure Robert checked this. :-)
I think, we don't need extra permissions, as we only work in same java package. The code here is copypasted from Expressions module.
There was a problem hiding this comment.
it requires createClassLoader which is documented as evil. sorry, i didnt try to make this PR "clean" or refine it.
I tried to work on solutions, and got very frustrated. this little function should be 5 lines of code in almost any programming language. But a combination of things here add up to make the java programming language TERRIBLE. I quickly through this together as an act of retaliation.
So I need a break from it as I am too frustrated. If you wish to take over the PR, please feel free!
| return (float) DOTPRODUCT.invokeExact(a, b); | ||
| } catch (RuntimeException e) { | ||
| throw e; | ||
| } catch (Throwable e) { |
There was a problem hiding this comment.
I'd add another catch with Error.
|
|
||
| // org.apache.lucene.util.VectorUtilSIMD#dotProduct(float[], float[]) | ||
| private static final String SIMD_BASE64 = | ||
| "yv66vgAAADwAbQoAAgADBwAEDAAFAAYBABBqYXZhL2xhbmcvT2JqZWN0AQAGPGluaXQ+AQADKClW\n" |
There was a problem hiding this comment.
I think we should have the source code of this class file available for regeneration. So anybody can compile with Java 11, but regeneration is only possible with exact Java 16 version.
As it currently looks like I would not be sure if we can release this.
If we don't add the original source code as a java file for regeneration, we should add source code here (as comment).
| super(parent); | ||
| } | ||
|
|
||
| public Class<?> define(byte[] code) { |
There was a problem hiding this comment.
this signature is strange as it takes the bytes, but not the class name. Either take class name, too, or alternatively decode pass the base64 directly here: Base64.getMimeDecoder().decode(SIMD_BASE64), so method gets parameterless.
Ideally (as we are on Java 16), we could have used the new anonymous classes. But this code is also not visible, damn!
| } | ||
|
|
||
| try { | ||
| Class<?> clazz = |
There was a problem hiding this comment.
It's safe, because we catch Throwable. To be safe, maybe add AccessController around it. But the necessity for this requires more investigation, I am sure Robert checked this. :-)
I think, we don't need extra permissions, as we only work in same java package. The code here is copypasted from Expressions module.
| Class<?> clazz = | ||
| new Loader(VectorUtil.class.getClassLoader()) | ||
| .define(Base64.getMimeDecoder().decode(SIMD_BASE64)); | ||
| impl = MethodHandles.lookup().findStatic(clazz, "dotProduct", DOTPRODUCT_TYPE); |
There was a problem hiding this comment.
Move MethodHandles.lookup() to top of static block, as we need it multiple times.
|
Another idea: We can remove the Methodhandles with the crazy try-catch block, if we'd define a static functional interface in this class, that is implemented by VectorUtil itsself, but also by the base64-encoded class. I am not sure if it's worth the trouble, as it makes regenerating the base64 harder, but then we only need the Loader and use |
|
We should add a small test that verifies the following: If JDK version is exactly 16 (not 15, not 17 - as we don't know if this works later anymore when incubator releases theclass), then the boolean should be true. Of course we then have to add the module-option to the test classpath (which we really should do). Currenty we have no testing at all. |
|
Looks like openjdk is just teasing us here. Incubating API is not really intended to transition to a real API anytime soon: https://openjdk.java.net/jeps/8261663 So we may expect java 21 or java 24 before we can use this stuff :( |
Actually, this is good news in the second incubator JEP (but unrelated to that issue): "On condition that the Foreign-Memory Access API exits incubation, loading and storing vectors from and to a MemorySegment." -> so it looks like the Foreign Memory Access API is planned to leave incubation in java 17. So at least: I AM HAPPY! |
|
Maybe leave this pull request open as a DRAFT (like mine). So we can keep an eye on it. |
See https://issues.apache.org/jira/browse/LUCENE-9838 for benchmarks and details.
I'm not sure practically what we can/should do with this PR, but the performance improvement is large enough for someone using vectors that maybe we should explore options?
For now the PR is just a hack, make it easier for people to try out and play with at least. I borrowed @uschindler build logic from one of his pull requests: apache/lucene-solr#2176