Use IOContext#RANDOM when appropriate.#13267
Merged
jpountz merged 2 commits intoapache:branch_9xfrom Apr 5, 2024
Merged
Conversation
This switches the following files to `IOContext#RANDOM`: - Stored fields data file. - Term vectors data file. - HNSW graph. - Temporary file storing vectors at merge time that we use to construct the merged HNSW graph. - Vector data files, including quantized data files. I hesitated using `IOContext.RANDOM` on terms, since they have a random access pattern when running term queries, but a more sequential access pattern when running multi-term queries. I erred on the conservative side and did not switch them to `IOContext.RANDOM` for now. For simplicity, I'm only touching the current codec, not previous codecs.
uschindler
approved these changes
Apr 4, 2024
Contributor
uschindler
left a comment
There was a problem hiding this comment.
Looks fine.
I think changing the default to random (using a sysprop) can be done optionally once we have enough benchmarks.
| vectorDataInput = | ||
| segmentWriteState.directory.openInput( | ||
| tempVectorData.getName(), segmentWriteState.context); | ||
| tempVectorData.getName(), IOContext.DEFAULT.withRandomAccess()); |
Contributor
There was a problem hiding this comment.
Here you could also use IOContext.RANDOM directly.
| this.flushInfo = ctxt.flushInfo; | ||
| this.readOnce = readOnce; | ||
| this.randomAccess = ctxt.randomAccess; | ||
| this.randomAccess = false; |
Contributor
There was a problem hiding this comment.
I think this should be documented.
uschindler
reviewed
Apr 4, 2024
lucene/CHANGES.txt
Outdated
| * GITHUB#13222: Vector data files - including quantized vectors, HNSW graphs, | ||
| stored fields data files and term vectors data files now advise the | ||
| underlying `Directory` to optimize for random-access. On recent Java versions | ||
| (>= 21) and POSIX systems, `MMapDirectory` would use MADV_RANDOM for these |
Contributor
There was a problem hiding this comment.
It is a bit unclear: both conditions must apply. "On Linux and macOS systems using recent Java versions (>=21), MMapDirectory would use POSIX_MADV_RANDOM"
Contributor
There was a problem hiding this comment.
Maybe combine this with above entry about madvise.
uschindler
approved these changes
Apr 5, 2024
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This switches the following files to
IOContext#RANDOM:I hesitated using
IOContext.RANDOMon terms, since they have a random access pattern when running term queries, but a more sequential access pattern when running multi-term queries. I erred on the conservative side and did not switch them toIOContext.RANDOMfor now.For simplicity, I'm only touching the current codec, not previous codecs.