Skip to content

Use IOContext#RANDOM when appropriate.#13267

Merged
jpountz merged 2 commits intoapache:branch_9xfrom
jpountz:random_io_context_9x
Apr 5, 2024
Merged

Use IOContext#RANDOM when appropriate.#13267
jpountz merged 2 commits intoapache:branch_9xfrom
jpountz:random_io_context_9x

Conversation

@jpountz
Copy link
Contributor

@jpountz jpountz commented Apr 4, 2024

This switches the following files to IOContext#RANDOM:

  • Stored fields data file.
  • Term vectors data file.
  • HNSW graph.
  • Temporary file storing vectors at merge time that we use to construct the merged HNSW graph.
  • Vector data files, including quantized data files.

I hesitated using IOContext.RANDOM on terms, since they have a random access pattern when running term queries, but a more sequential access pattern when running multi-term queries. I erred on the conservative side and did not switch them to IOContext.RANDOM for now.

For simplicity, I'm only touching the current codec, not previous codecs.

This switches the following files to `IOContext#RANDOM`:
 - Stored fields data file.
 - Term vectors data file.
 - HNSW graph.
 - Temporary file storing vectors at merge time that we use to construct the merged HNSW graph.
 - Vector data files, including quantized data files.

I hesitated using `IOContext.RANDOM` on terms, since they have a random access
pattern when running term queries, but a more sequential access pattern when
running multi-term queries. I erred on the conservative side and did not switch
them to `IOContext.RANDOM` for now.

For simplicity, I'm only touching the current codec, not previous codecs.
@jpountz jpountz added this to the 9.11.0 milestone Apr 4, 2024
@jpountz jpountz requested a review from uschindler April 4, 2024 17:51
Copy link
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine.

I think changing the default to random (using a sysprop) can be done optionally once we have enough benchmarks.

vectorDataInput =
segmentWriteState.directory.openInput(
tempVectorData.getName(), segmentWriteState.context);
tempVectorData.getName(), IOContext.DEFAULT.withRandomAccess());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you could also use IOContext.RANDOM directly.

this.flushInfo = ctxt.flushInfo;
this.readOnce = readOnce;
this.randomAccess = ctxt.randomAccess;
this.randomAccess = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be documented.

* GITHUB#13222: Vector data files - including quantized vectors, HNSW graphs,
stored fields data files and term vectors data files now advise the
underlying `Directory` to optimize for random-access. On recent Java versions
(>= 21) and POSIX systems, `MMapDirectory` would use MADV_RANDOM for these
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit unclear: both conditions must apply. "On Linux and macOS systems using recent Java versions (>=21), MMapDirectory would use POSIX_MADV_RANDOM"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe combine this with above entry about madvise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants