LUCENE-9047: Move the Directory APIs to be little endian (take 2)#107
LUCENE-9047: Move the Directory APIs to be little endian (take 2)#107iverase merged 10 commits intoapache:mainfrom
Conversation
All backwards codec should use this version instead the one in core.
…ecs to create IndexInput / IndexOutput. This will allow to wrap those objects when changing the endianness of the directory API.
…refore we have no information if the file in Big or little endian. In order to make this easy, we always read / write this file on Big endian. In addition codec headers and footers are writen in big endian as well.
…ddition it introduces the IndexInput / IndexOutput wrapper for backwards codecs.
jpountz
left a comment
There was a problem hiding this comment.
+1 to the approach, the change looks great to me
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/store/DirectoryUtil.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/store/ByteBufferIndexInput.java
Outdated
Show resolved
Hide resolved
dweiss
left a comment
There was a problem hiding this comment.
I peeked at the patch and it looks good to me.
| * <p>Unlike PackedInts, it optimizes for read i/o operations and supports > 2B values. Example | ||
| * usage: | ||
| * | ||
| * <pre class="prettyprint"> |
There was a problem hiding this comment.
You can use code blocks too - then html entities don't need to be escaped..
There was a problem hiding this comment.
This is a copy / paste from DirectWriter so I would prefer to change it in a follow up PR.
| try { | ||
| return in.readLong(offset + (index << 3)); | ||
| } catch (IOException e) { | ||
| throw new RuntimeException(e); |
There was a problem hiding this comment.
Same as above, maybe a follow up PR? this is already pretty big.
|
I will set the PR ready to review as feedback has been positive so far. I want to stress that the most interesting part is how to deal with reading In addition as this files have a codec header / footer, then all headers / footers will be written using BE order. Maybe @rmuir and @mikemccand have an opinion here. |
Agreed, let's try to land the current patch! I think its fine that |
|
Thanks @rmuir! I will wait until Monday, if there is no more feedback I will proceed merging the current patch. |
rmuir
left a comment
There was a problem hiding this comment.
Thanks for all the effort reworking this change: looks great!
zacharymorn
left a comment
There was a problem hiding this comment.
Just trying to see how it's being implemented & the changes look great!
|
Hi @iverase If you fix it, make sure to run the replicator tests with -Dtests.nightly=true, so it runs all tests (takes a few minutes) I will backport the test fixes also to 8.x later. It's great that we fixed it yesterday :-) The NRT tests were not running since now YEARS! |
|
This is great - by the way! I will make a new PR based on apache/lucene-solr#2176 to check how the little endian by default improves the new MMapDirectory v2 that will hopefully go into Java 17! :-) |
Here is a proposal for changing the directory API to be little endian meanwhile keeping backwards compatibility. This effort build on top of the efforts of having new codecs for Lucene 9.0. In order to illustrate the approach, the PR is divided in 5 commits. Here is the explanation for each commit:
commit 1: When changing the Directory API, the ways we write and read integers from DirectReader / DirectWriter needs to change as well. Therefore we need to make a copy of those classes to backwards codecs and make sure those codecs are using that version instead of the one in core. This applies as well to DirectMonotonicReader / DirectMonotonicWriter.
commit 2: I am still proposing to use IndexInput / IndexOutput wrappers to handle backwards compatibility. In order to make it easier to read, this commit wraps all directory calls in backward codecs that create an IndexInput / IndexOutput.
commit 3: There is one file that we need to open that it does not belong to a codec and it might be written in big endian or little endian. The file is
segment.gen. In order to make it easy, I took the approach to write this file always using big endian encoding. In addition I am doing the same for codec headers / footers. This can be improved but this approach helped me moving forward at this point.commit 4: This commit actually changes the Directory endianness and updates the DirectReader / DirectWriter integer packers to work with this endianness. It introduces the IndexInput / IndexOutput wrappers for backwards codecs.
commit 5: Last changes to fix the last failing tests. In particular adds backwards compatibility for FST.
The idea with this PR is to agree in the procedure. If we can agree my proposal is to. add commit 1 and 2 first as they are only refactors. Then we can focus in the last commits.