Skip to content

LUCENE-9047: Move the Directory APIs to be little endian (take 2)#107

Merged
iverase merged 10 commits intoapache:mainfrom
iverase:littleEndian
May 3, 2021
Merged

LUCENE-9047: Move the Directory APIs to be little endian (take 2)#107
iverase merged 10 commits intoapache:mainfrom
iverase:littleEndian

Conversation

@iverase
Copy link
Contributor

@iverase iverase commented Apr 26, 2021

Here is a proposal for changing the directory API to be little endian meanwhile keeping backwards compatibility. This effort build on top of the efforts of having new codecs for Lucene 9.0. In order to illustrate the approach, the PR is divided in 5 commits. Here is the explanation for each commit:

commit 1: When changing the Directory API, the ways we write and read integers from DirectReader / DirectWriter needs to change as well. Therefore we need to make a copy of those classes to backwards codecs and make sure those codecs are using that version instead of the one in core. This applies as well to DirectMonotonicReader / DirectMonotonicWriter.

commit 2: I am still proposing to use IndexInput / IndexOutput wrappers to handle backwards compatibility. In order to make it easier to read, this commit wraps all directory calls in backward codecs that create an IndexInput / IndexOutput.

commit 3: There is one file that we need to open that it does not belong to a codec and it might be written in big endian or little endian. The file is segment.gen. In order to make it easy, I took the approach to write this file always using big endian encoding. In addition I am doing the same for codec headers / footers. This can be improved but this approach helped me moving forward at this point.

commit 4: This commit actually changes the Directory endianness and updates the DirectReader / DirectWriter integer packers to work with this endianness. It introduces the IndexInput / IndexOutput wrappers for backwards codecs.

commit 5: Last changes to fix the last failing tests. In particular adds backwards compatibility for FST.

The idea with this PR is to agree in the procedure. If we can agree my proposal is to. add commit 1 and 2 first as they are only refactors. Then we can focus in the last commits.

iverase added 5 commits April 26, 2021 10:31
All backwards codec should use this version instead the one in core.
…ecs to create IndexInput / IndexOutput. This will allow to wrap those objects when changing the endianness of the directory API.
…refore we have no information if the file in Big or little endian.

In order to make this easy, we always read / write this file on Big endian. In addition codec headers and footers are writen in big endian as well.
…ddition it introduces the IndexInput / IndexOutput wrapper for backwards codecs.
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to the approach, the change looks great to me

Copy link
Contributor

@dweiss dweiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I peeked at the patch and it looks good to me.

* <p>Unlike PackedInts, it optimizes for read i/o operations and supports &gt; 2B values. Example
* usage:
*
* <pre class="prettyprint">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use code blocks too - then html entities don't need to be escaped..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a copy / paste from DirectWriter so I would prefer to change it in a follow up PR.

try {
return in.readLong(offset + (index << 3));
} catch (IOException e) {
throw new RuntimeException(e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UncheckedIOException(e)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and in other places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, maybe a follow up PR? this is already pretty big.

@iverase iverase marked this pull request as ready for review April 28, 2021 14:38
@iverase
Copy link
Contributor Author

iverase commented Apr 28, 2021

I will set the PR ready to review as feedback has been positive so far. I want to stress that the most interesting part is how to deal with reading segment.gen. This file does not belong to a codec so we open it blind without knowing the endianness. Therefore the approach I have taken is to write that file always big endian as we are doing until now.

In addition as this files have a codec header / footer, then all headers / footers will be written using BE order.

Maybe @rmuir and @mikemccand have an opinion here.

@rmuir
Copy link
Member

rmuir commented Apr 28, 2021

I will set the PR ready to review as feedback has been positive so far. I want to stress that the most interesting part is how to deal with reading segment.gen. This file does not belong to a codec so we open it blind without knowing the endianness. Therefore the approach I have taken is to write that file always big endian as we are doing until now.

Agreed, let's try to land the current patch! I think its fine that segments_N and codec headers just stay bigendian, as the code for this is nicely self-contained.

@iverase
Copy link
Contributor Author

iverase commented Apr 29, 2021

Thanks @rmuir! I will wait until Monday, if there is no more feedback I will proceed merging the current patch.

Copy link
Member

@rmuir rmuir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the effort reworking this change: looks great!

Copy link
Contributor

@zacharymorn zacharymorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just trying to see how it's being implemented & the changes look great!

@iverase iverase merged commit b84e0c2 into apache:main May 3, 2021
@iverase iverase deleted the littleEndian branch May 3, 2021 05:49
@uschindler
Copy link
Contributor

uschindler commented May 3, 2021

Hi @iverase
the merge broke tests: We fixed over the weekend the NRT tests. They had a bug in the security policy, so no NRT Test wasn't running at all. So you did not see the test failure.

If you fix it, make sure to run the replicator tests with -Dtests.nightly=true, so it runs all tests (takes a few minutes)

I will backport the test fixes also to 8.x later. It's great that we fixed it yesterday :-) The NRT tests were not running since now YEARS!

@uschindler
Copy link
Contributor

This is great - by the way!

I will make a new PR based on apache/lucene-solr#2176 to check how the little endian by default improves the new MMapDirectory v2 that will hopefully go into Java 17! :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants