LUCENE-9047: Move the Directory APIs to be little endian (take 2) by iverase · Pull Request #107 · apache/lucene

iverase · 2021-04-26T08:59:08Z

Here is a proposal for changing the Directory API to be little endian while keeping backwards compatibility. This effort builds on the work on new codecs for Lucene 9.0. To illustrate the approach, the PR is divided into 5 commits. Here is the explanation for each commit:

commit 1: When changing the Directory API, the way we write and read integers in DirectReader / DirectWriter needs to change as well. Therefore we need to copy those classes to the backward codecs and make sure those codecs use that version instead of the one in core. The same applies to DirectMonotonicReader / DirectMonotonicWriter.
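To make the dependency on byte order concrete, here is a minimal sketch of bit-packed values that are read back by loading one long and then shifting and masking. The class `PackedOrderSketch` and its layout are assumptions made for illustration; this is not the DirectWriter / DirectReader code from the PR. It only shows that bytes packed for one byte order do not decode when the long is loaded in the other order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch; this is not Lucene's DirectWriter / DirectReader code. */
final class PackedOrderSketch {

  /** Packs values so that value i occupies bits [i*bpv, (i+1)*bpv) of a little-endian bit stream. */
  static byte[] pack(long[] values, int bitsPerValue) {
    int dataBytes = (values.length * bitsPerValue + 7) / 8;
    byte[] out = new byte[dataBytes + 8]; // 8 padding bytes so a full long can always be read
    for (int i = 0; i < values.length; i++) {
      long firstBit = (long) i * bitsPerValue;
      for (int b = 0; b < bitsPerValue; b++) {
        if (((values[i] >>> b) & 1L) != 0) {
          long bit = firstBit + b;
          out[(int) (bit >>> 3)] |= (byte) (1 << (int) (bit & 7));
        }
      }
    }
    return out;
  }

  /** Reads value {@code index} by loading one long in the given byte order, then shifting and masking. */
  static long get(byte[] packed, int bitsPerValue, int index, ByteOrder order) {
    long firstBit = (long) index * bitsPerValue;
    long word = ByteBuffer.wrap(packed).order(order).getLong((int) (firstBit >>> 3));
    return (word >>> (int) (firstBit & 7)) & ((1L << bitsPerValue) - 1);
  }
}
```

Reading with `ByteOrder.LITTLE_ENDIAN` returns the packed values, while reading the same bytes with `ByteOrder.BIG_ENDIAN` generally does not. That mismatch is why a codec that wrote its data under the old byte order needs its own copy of the reader.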

commit 2: I am still proposing to use IndexInput / IndexOutput wrappers to handle backwards compatibility. In order to make it easier to read, this commit wraps all directory calls in backward codecs that create an IndexInput / IndexOutput.

commit 3: There is one file we need to open that does not belong to a codec and might be written in big endian or little endian. The file is segment.gen. To keep things simple, I took the approach of always writing this file using big endian encoding. In addition, I am doing the same for codec headers / footers. This can be improved, but this approach helped me move forward at this point.
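As an illustration of pinning a header to big endian no matter what the surrounding API defaults to, the following sketch writes and reads an int one byte at a time, most significant byte first. The class `FixedOrderHeaderSketch`, its method names, and the value `0xCAFEBABE` are assumptions for the example and are not taken from the PR.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch of a header int that is always big endian on disk. */
final class FixedOrderHeaderSketch {

  /** Writes v most significant byte first, ignoring the buffer's configured byte order. */
  static void writeBEInt(ByteBuffer out, int v) {
    out.put((byte) (v >>> 24));
    out.put((byte) (v >>> 16));
    out.put((byte) (v >>> 8));
    out.put((byte) v);
  }

  /** Reads an int most significant byte first, ignoring the buffer's configured byte order. */
  static int readBEInt(ByteBuffer in) {
    return ((in.get() & 0xFF) << 24)
        | ((in.get() & 0xFF) << 16)
        | ((in.get() & 0xFF) << 8)
        | (in.get() & 0xFF);
  }

  /** Encodes through a little-endian buffer to show that the default order does not matter. */
  static byte[] encode(int v) {
    ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
    writeBEInt(buf, v);
    return buf.array();
  }

  static int decode(byte[] bytes) {
    return readBEInt(ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN));
  }
}
```

Because only single-byte operations are used, a reader can check the header without first knowing which byte order the rest of the file uses.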

commit 4: This commit actually changes the Directory endianness and updates the DirectReader / DirectWriter integer packers to work with this endianness. It introduces the IndexInput / IndexOutput wrappers for backwards codecs.
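The wrapper idea can be sketched as a decorator that reverses the bytes of every multi-byte read, so that code written against a little-endian input can still decode files written as big endian. The interface `Input` and the classes below are hypothetical stand-ins, not the IndexInput / IndexOutput wrappers added by this PR.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class EndianWrapperSketch {

  /** Hypothetical stand-in for the read side of the Directory API. */
  interface Input {
    short readShort();
    int readInt();
    long readLong();
  }

  /** Stand-in for the new default: multi-byte values are decoded as little endian. */
  static final class LittleEndianInput implements Input {
    private final ByteBuffer buf;

    LittleEndianInput(byte[] bytes) {
      this.buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    }

    public short readShort() { return buf.getShort(); }
    public int readInt() { return buf.getInt(); }
    public long readLong() { return buf.getLong(); }
  }

  /** Wrapper a backward codec could use to keep reading files written as big endian. */
  static final class ReversingInput implements Input {
    private final Input in;

    ReversingInput(Input in) { this.in = in; }

    public short readShort() { return Short.reverseBytes(in.readShort()); }
    public int readInt() { return Integer.reverseBytes(in.readInt()); }
    public long readLong() { return Long.reverseBytes(in.readLong()); }
  }

  /** Reads a long from bytes that an old codec wrote in big-endian order. */
  static long readLegacyLong(byte[] bigEndianBytes) {
    return new ReversingInput(new LittleEndianInput(bigEndianBytes)).readLong();
  }

  /** Reads the same bytes without the wrapper, for comparison. */
  static long readRawLong(byte[] bytes) {
    return new LittleEndianInput(bytes).readLong();
  }
}
```

For the bytes `{1, 2, 3, 4, 5, 6, 7, 8}`, the wrapped read yields `0x0102030405060708L` and the unwrapped read yields `0x0807060504030201L`. One design consequence is that single-byte and variable-length reads built from single bytes pass through unchanged, so only the fixed-width methods need wrapping in a sketch like this.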

commit 5: Last changes to fix the remaining failing tests. In particular, it adds backwards compatibility for FST.

The idea with this PR is to agree on the procedure. If we can agree, my proposal is to add commits 1 and 2 first, as they are only refactors. Then we can focus on the last commits.

…refore we have no information whether the file is big or little endian. In order to make this easy, we always read / write this file in big endian. In addition, codec headers and footers are written in big endian as well.

jpountz

+1 to the approach, the change looks great to me

lucene/core/src/java/org/apache/lucene/codecs/CodecUtil.java

lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/store/DirectoryUtil.java

lucene/core/src/java/org/apache/lucene/store/ByteBufferIndexInput.java

dweiss

I peeked at the patch and it looks good to me.

dweiss · 2021-04-27T08:02:36Z

...ne/backward-codecs/src/java/org/apache/lucene/backward_codecs/packed/LegacyDirectWriter.java

+ * <p>Unlike PackedInts, it optimizes for read i/o operations and supports &gt; 2B values. Example
+ * usage:
+ *
+ * <pre class="prettyprint">


You can use code blocks too - then html entities don't need to be escaped.

This is a copy / paste from DirectWriter so I would prefer to change it in a follow up PR.
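For reference, the suggestion about code blocks can be sketched with Javadoc's `{@code ...}` inline tag, inside which `>` and `<` may be written literally. The class below is hypothetical and exists only to carry the comment.

```java
/**
 * Hypothetical class, used only to illustrate the Javadoc markup.
 *
 * <p>Supports {@code > 2B} values. Example usage:
 *
 * <pre>{@code
 * long count = 3_000_000_000L;
 * boolean big = JavadocSketch.exceedsInt(count); // true, since count > Integer.MAX_VALUE
 * }</pre>
 */
final class JavadocSketch {

  /** Returns true if {@code count} does not fit in an int. */
  static boolean exceedsInt(long count) {
    return count > Integer.MAX_VALUE;
  }
}
```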

dweiss · 2021-04-27T08:02:53Z

...ne/backward-codecs/src/java/org/apache/lucene/backward_codecs/packed/LegacyDirectReader.java

+      try {
+        return in.readLong(offset + (index << 3));
+      } catch (IOException e) {
+        throw new RuntimeException(e);


UncheckedIOException(e)?

Here and in other places.

Same as above, maybe a follow-up PR? This is already pretty big.
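A minimal sketch of the suggested change, assuming a hypothetical `LongSource` interface in place of the real random-access input: the checked `IOException` is rethrown as `java.io.UncheckedIOException`, so callers can still tell I/O failures apart from other runtime errors and recover the original exception via `getCause()`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

final class UncheckedSketch {

  /** Hypothetical stand-in for an input that reads a long at an absolute position. */
  interface LongSource {
    long readLong(long pos) throws IOException;
  }

  /** Reads the index-th 8-byte value, converting I/O failures to an unchecked exception. */
  static long get(LongSource in, long offset, long index) {
    try {
      return in.readLong(offset + (index << 3));
    } catch (IOException e) {
      // UncheckedIOException keeps the IOException as its cause.
      throw new UncheckedIOException(e);
    }
  }
}
```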

iverase · 2021-04-28T14:42:52Z

I will set the PR ready to review as feedback has been positive so far. I want to stress that the most interesting part is how to deal with reading segment.gen. This file does not belong to a codec so we open it blind without knowing the endianness. Therefore the approach I have taken is to write that file always big endian as we are doing until now.

In addition, as these files have a codec header / footer, all headers / footers will be written using BE order.

Maybe @rmuir and @mikemccand have an opinion here.

rmuir · 2021-04-28T17:49:36Z

> I will set the PR ready to review as feedback has been positive so far. I want to stress that the most interesting part is how to deal with reading segment.gen. This file does not belong to a codec so we open it blind without knowing the endianness. Therefore the approach I have taken is to write that file always big endian as we are doing until now.

Agreed, let's try to land the current patch! I think it's fine that segments_N and codec headers just stay big endian, as the code for this is nicely self-contained.

iverase · 2021-04-29T07:18:26Z

Thanks @rmuir! I will wait until Monday; if there is no more feedback I will proceed with merging the current patch.

rmuir

Thanks for all the effort reworking this change: looks great!

zacharymorn

Just trying to see how it's being implemented & the changes look great!

uschindler · 2021-05-03T11:51:05Z

Hi @iverase
the merge broke tests: over the weekend we fixed the NRT tests. They had a bug in the security policy, so no NRT test was running at all. That is why you did not see the test failure.

If you fix it, make sure to run the replicator tests with -Dtests.nightly=true, so it runs all tests (takes a few minutes)

I will also backport the test fixes to 8.x later. It's great that we fixed it yesterday :-) The NRT tests had not been running for YEARS!

uschindler · 2021-05-03T13:32:46Z

This is great - by the way!

I will make a new PR based on apache/lucene-solr#2176 to check how the little endian by default improves the new MMapDirectory v2 that will hopefully go into Java 17! :-)

iverase added 5 commits April 26, 2021 10:31

Make a copy of DirectReader / DirectWriter to backwards codec.

05bdebc

All backwards codec should use this version instead the one in core.

Introduce a DirectoryUtil that wraps all the calls from backwards cod…

f8ed4fb

…ecs to create IndexInput / IndexOutput. This will allow to wrap those objects when changing the endianness of the directory API.

This commit actually changes the Directory API to little endian. In a…

20676d0

…ddition it introduces the IndexInput / IndexOutput wrapper for backwards codecs.

Last fixes: FST needs to be backwards compatible.

1e6bd2f

iverase requested review from dweiss and jpountz April 26, 2021 08:59

iverase mentioned this pull request Apr 26, 2021

LUCENE-9047: Move the Directory APIs to be little endian apache/lucene-solr#2094

Closed

iverase marked this pull request as draft April 26, 2021 09:01

Ouch: missing license headers

8deca69

jpountz approved these changes Apr 27, 2021

jpountz reviewed Apr 27, 2021

dweiss reviewed Apr 27, 2021

iverase added 2 commits April 28, 2021 16:33

Ieration on review comments

28fdfa8

Merge branch 'main' into littleEndian

555645b

iverase marked this pull request as ready for review April 28, 2021 14:38

rmuir approved these changes Apr 29, 2021

zacharymorn approved these changes May 1, 2021

iverase added 2 commits May 3, 2021 07:16

Update CHANGES.txt

a177a7e

Merge branch 'main' into littleEndian

d66e8ba

iverase merged commit b84e0c2 into apache:main May 3, 2021

iverase deleted the littleEndian branch May 3, 2021 05:49

iverase mentioned this pull request May 3, 2021

LUCENE-9047: Write checksum as big endian in NRT replicator #123

Merged

uschindler mentioned this pull request Jun 7, 2021

Initial rewrite of MMapDirectory for JDK-16 preview (incubating) Panama APIs (>= JDK-16-ea-b32) #173

Closed
