
Introduces IndexInput#updateReadAdvice to change the ReadAdvice while merging vectors #13985

Merged
ChrisHegarty merged 11 commits into apache:main from shatejas:force-merge-fix on Nov 22, 2024

Conversation

@shatejas
Contributor

@shatejas shatejas commented Nov 9, 2024

The change is needed to reduce force-merge time. Lucene99FlatVectorsReader is opened with IOContext.RANDOM, which optimizes searches by advising the OS (madvise) with RANDOM. Merges instead need sequential access and the ability to preload pages to shorten the merge time.

The change switches to ReadAdvice.SEQUENTIAL before the merge starts and reverts to ReadAdvice.RANDOM at the end of the merge for Lucene99FlatVectorsReader.
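The swap-and-revert lifecycle described above can be sketched roughly as follows. This is a simplified, hypothetical model: the enum and class names are stand-ins, not the actual Lucene API.

```java
// Hypothetical sketch of the advice-swapping lifecycle; not the real
// Lucene99FlatVectorsReader implementation.
enum ReadAdvice { NORMAL, RANDOM, SEQUENTIAL }

class FlatVectorsReaderSketch {
    // Opened with RANDOM advice, which suits HNSW graph search.
    private ReadAdvice advice = ReadAdvice.RANDOM;

    ReadAdvice advice() { return advice; }

    // When a merge begins: switch to SEQUENTIAL so the OS can read
    // ahead while vectors are copied to the new segment.
    FlatVectorsReaderSketch getMergeInstance() {
        advice = ReadAdvice.SEQUENTIAL;
        return this;
    }

    // When the merge finishes: revert so subsequent graph building
    // and searches see RANDOM advice again.
    void finishMerge() {
        advice = ReadAdvice.RANDOM;
    }
}
```

The key property is that the swap is scoped to the merge: any reader obtained outside `getMergeInstance()`/`finishMerge()` keeps the search-friendly advice.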

Description

Benchmarking results are coming up. Opening as a draft to get initial feedback.

Related issues

#13920

@shatejas
Contributor Author

shatejas commented Nov 9, 2024

cc: @uschindler @jpountz @navneet1v

@shatejas shatejas force-pushed the force-merge-fix branch 2 times, most recently from ffb35a1 to 515edb3 on November 9, 2024 09:59
@ChrisHegarty
Contributor

@shatejas I'm curious how much this actually helps, and I know that you said that benchmark results would be posted.

I do like that we can update the ReadAdvice on an index input 👍 What is unclear is how much the sequential advice here helps over something like a load (similar to MemorySegment::load) that touches every page.

@uschindler
Contributor

Hi,
I am currently on travel, so I can't review this. Will look into it possibly later this week. Greetings from Costa Rica!

@shatejas
Contributor Author

I'm curious how much this actually helps, and I know that you said that benchmark results would be posted.

@ChrisHegarty Preliminary results showed approximately a 40-minute (~13%) reduction in total force-merge time for the 10M dataset. You should be able to find details here opensearch-project/k-NN#2134 (comment).
Though the setup does not use Lucene HNSW, it does use Lucene99FlatVectorsReader.

What is unclear is how much the sequential advise here helps over something like a load (similar to MemorySegment::load), that touches every page.

I did try preload, which does MemorySegment::load, and there was an improvement in force-merge time. The advantage I see with updateReadAdvice is being able to delay the loading of vectors until needed.

@ChrisHegarty
Contributor

Thanks @shatejas

For clarity, the bottleneck being fixed here is the reading of all vector data from the to-be-merged segments when copying that data to the new segment - all non-deleted vectors are accessed sequentially, just once, as they are copied. We can then switch back to random access when building the graph.

@ChrisHegarty
Contributor

Generally, I think that the direction in this PR is good. I wanna help get it moved forward. I'll do some local testing and perf runs to verify the impact. I can also commit some tests and improvements directly to the PR branch. @shatejas ?

@shatejas
Contributor Author

Generally, I think that the direction in this PR is good. I wanna help get it moved forward. I'll do some local testing and perf runs to verify the impact. I can also commit some tests and improvements directly to the PR branch. @shatejas ?

Feel free to commit changes, added you as a collaborator to the shatejas/lucene repo. Do let me know if you need any other access

I will be working on the benchmarks within next few days.

reading IndexInput

The change is needed to be able to reduce the force merge time.
Lucene99FlatVectorsReader is opened with IOContext.RANDOM, this optimizes
searches with madvise as RANDOM. For merges we need sequential access and
ability to preload pages to be able to shorten the merge time.

The change updates the ReadAdvice.SEQUENTIAL before the merge starts and reverts it
to ReadAdvice.RANDOM at the end of the merge for
Lucene99FlatVectorsReader.
@ChrisHegarty ChrisHegarty marked this pull request as ready for review November 15, 2024 12:36
@ChrisHegarty ChrisHegarty changed the title from "Introduces IndexInput#updateReadAdvice to change the readadvice while …" to "Introduces IndexInput#updateReadAdvice to change the ReadAdvice while merging vectors" Nov 15, 2024
@ChrisHegarty
Contributor

I added some asserts to the asserting wrappers to ensure that finishMerge is always called. I think the code is in good shape now, pending some benchmark runs.

@shatejas no need to force-push. If merged, we collapse the commits into a single one.

@shatejas
Contributor Author

Benchmarks

Setup 1 - Opensearch cluster

Ran with opensearch benchmarks

Total data nodes - 3
Total shards - 6 (2 per node), no replicas
Memory - 128gb
vCPU - 16

Dataset used: cohere-10m

Baseline - OS 2.18 and lucene 9.12
Candidate - OS 2.16 and lucene 9.12 with readAdvice changes

Why was this tested with Lucene 9.12?
OpenSearch does not use Lucene >9.12 in any of its versions, and upgrading to Lucene 10 requires significant changes. For the candidate, the required commits were cherry-picked.

Run 1: sequence of operations: delete-index -> create-index -> add documents -> force-merge -> search

Results

|           | Force-merge (ms)  | Force-merge (hrs) | Search p50 | Search p90 | Search p99 |
|-----------|-------------------|-------------------|------------|------------|------------|
| Baseline  | 15795889.88313920 | 4 hrs 23 mins     | 9.6        | 10.8       | 14.7       |
| Candidate | 15204143.95724240 | 4 hrs 13 mins     | 10.7       | 12.0       | 15.0       |

Run 2: Search performed on already indexed data from above run

|           | Search p50 | Search p90 | Search p99 |
|-----------|------------|------------|------------|
| Baseline  | 9.7        | 10.6       | 12.1       |
| Candidate | 10.4       | 11.3       | 12.5       |

Setup 2: Used luceneutil knnPerfTest.py

Baseline - Lucene main
Candidate - Lucene main with current commit

Baseline

| recall | latency (ms) | nDoc  | topK | fanout | maxConn | beamWidth | quantized | index s | index docs/s | force merge s | num segments | index size (MB) |
|--------|--------------|-------|------|--------|---------|-----------|-----------|---------|--------------|---------------|--------------|-----------------|
| 0.644  | 0.428        | 50000 | 10   | 64     | 64      | 250       | no        | 18.97   | 2635.18      | 1.89          | 1            | 20.62           |

Candidate

| recall | latency (ms) | nDoc  | topK | fanout | maxConn | beamWidth | quantized | index s | index docs/s | force merge s | num segments | index size (MB) |
|--------|--------------|-------|------|--------|---------|-----------|-----------|---------|--------------|---------------|--------------|-----------------|
| 0.644  | 0.436        | 50000 | 10   | 64     | 64      | 250       | no        | 20.20   | 2474.76      | 1.77          | 1            | 20.62           |

There is a small effect on search latencies; it's hard to say if it's due to the change or just fluctuation between runs. I couldn't think of a reason the change would affect search latencies.

@jpountz @ChrisHegarty thoughts?

Contributor

@ChrisHegarty ChrisHegarty left a comment


Restoring sequential advice is the right thing to do here, and has clear perf benefits. The code LGTM. Let me know if you need anything further to get this merged, and backported to 10.x.

@shatejas
Contributor Author

Thanks @ChrisHegarty!

I do need help with merging this. I have updated CHANGES.txt. Do let me know the process for backporting it - if it needs cherry-picking and raising a new PR on the 10.x branch, I can do it.

Let me know if I need anything else to merge this

@ChrisHegarty
Contributor

ChrisHegarty commented Nov 20, 2024

The org.apache.lucene.index.TestConcurrentMergeScheduler.testNoWaitClose test hits a new assert that I added - sorry. I need to look to see if it is a test issue or more of a design issue with finishMerge.

./gradlew test --tests TestConcurrentMergeScheduler.testNoWaitClose -Dtests.file.encoding=UTF-8 -Dtests.iters=10

@shatejas
Contributor Author

The org.apache.lucene.index.TestConcurrentMergeScheduler.testNoWaitClose test hits a new assert that I added - sorry. I need to look to see if it is a test issue or more of a design issue with finishMerge.

./gradlew test --tests TestConcurrentMergeScheduler.testNoWaitClose -Dtests.file.encoding=UTF-8 -Dtests.iters=10

I took a look at the failure. Here is what I found: mergeInstanceCount is not decremented if the merge is aborted between getMergeInstance and the start of the merge (ref). In that scenario finishMerge is never called, the readers are closed in the finally block, and the assert mergeInstanceCount == 0 in AssertingKnnVectorsReader fails.
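The abort path described above can be sketched in miniature. This is an illustrative model, not the real AssertingKnnVectorsReader: the counter increments in getMergeInstance() but only decrements in finishMerge(), so an abort between the two leaves the count dangling.

```java
// Hypothetical sketch of the counting leak; names are illustrative.
class AssertingReaderSketch {
    int mergeInstanceCount = 0;

    void getMergeInstance() { mergeInstanceCount++; }

    void finishMerge() { mergeInstanceCount--; }

    // Simulated merge: if the merge is aborted before the merge body
    // runs, finishMerge() is never called (the readers are closed in a
    // finally block elsewhere), so the close-time assert fails.
    void merge(boolean aborted) {
        getMergeInstance();
        if (aborted) {
            return; // abort path: no finishMerge()
        }
        finishMerge();
    }

    // The invariant the asserting wrapper checks at close time.
    boolean closeAssertHolds() { return mergeInstanceCount == 0; }
}
```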

@ChrisHegarty
Contributor

I broke the tests again by adding more asserts, as I'm a little uncomfortable with the finishMergeCount <= 0. Now I broke something else. I'll get to fixing this, or back out the extra asserts. For reference this reproduces:

./gradlew :lucene:core:test --tests "org.apache.lucene.codecs.perfield.TestPerFieldKnnVectorsFormat.testByteVectorScorerIteration" -Ptests.jvms=6 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1" -Ptests.seed=C47EA6F578AC0A04 -Ptests.iters=1 -Ptests.gui=false -Ptests.file.encoding=UTF-8 -Ptests.vectorsize=128 -Ptests.forceintegervectors=true -Dtests.iters=10

@shatejas
Contributor Author

shatejas commented Nov 21, 2024

I broke the tests again, by adding more asserts, as I'm a little uncomfortable with the finishMergeCount <= 0. Now I broke something else. I'll get to fixing this, or backout the extra asserts. For reference this reproduces:

Took a look. The mismatch between mergeInstanceCount and mergeInstance is because mergeInstanceCount is updated in the parent while mergeInstance is only set to true during getMergeInstance(). If a search happens during a merge, there is a good chance it runs into a scenario where mergeInstanceCount == 1 but mergeInstance is false.

Considering the test has a commit() call followed by a search, I think it's running into that scenario.

@ChrisHegarty
Contributor

..
Took a look. The mismatch between mergeInstanceCount and mergeInstance is because mergeInstanceCount is updated in the parent while mergeInstance is only set to true during getMergeInstance(). If a search happens during a merge, there is a good chance it runs into a scenario where mergeInstanceCount == 1 but mergeInstance is false.

Considering the test has a commit() call followed by a search, I think it's running into that scenario.

Oh yeah. I get it, and the asserts were too strong. I backed them off a bit. At a high level, I wanna ensure that we're asserting the right execution model, since the backing memory holding the vector data is shared across merge and search at the same time. I think this is fine: search may access the memory while the advice is still SEQUENTIAL, but when the merge finishes it'll switch back.

@ChrisHegarty ChrisHegarty merged commit 46204f6 into apache:main Nov 22, 2024
ChrisHegarty pushed a commit that referenced this pull request Nov 22, 2024
… merging vectors (#13985)

The commit updates the ReadAdvice.SEQUENTIAL before the merge starts and reverts it to ReadAdvice.RANDOM at the end of the merge for Lucene99FlatVectorsReader.
@shatejas
Contributor Author

Thanks a lot @ChrisHegarty for adding tight tests and merging this!

@uschindler
Contributor

Thanks @ChrisHegarty for taking care. If I have any additional comments about API I will open another followup PR.
Still in 🇨🇷, flying back this evening.

@msokolov
Contributor

msokolov commented Dec 1, 2024

The impact on search times is surprising, as you said. Can you clarify one thing about the benchmark setup: does it perform searches concurrently with indexing (and merging) on the same box, or are they completely separate?

jimczi added a commit to jimczi/lucene that referenced this pull request Dec 17, 2024
@shatejas
Contributor Author

The impact on search times is surprising, as you said. Can you clarify one thing about the benchmark setup: does it perform searches concurrently with indexing (and merging) on the same box, or are they completely separate?

Searches were run after the completion of indexing and force-merge.

benchaplin pushed a commit to benchaplin/lucene that referenced this pull request Dec 31, 2024
… merging vectors (apache#13985)

The commit updates the ReadAdvice.SEQUENTIAL before the merge starts and reverts it to ReadAdvice.RANDOM at the end of the merge for Lucene99FlatVectorsReader.
public void finishMerge() throws IOException {
  // This makes sure that the access pattern hint is reverted back, since the
  // HNSW implementation needs it
  this.vectorData.updateReadAdvice(ReadAdvice.RANDOM);
}
Contributor


This is too opinionated. If we want to have this swapping behavior I think we need to find a way to cache the original ReadAdvice and restore it, since we don't know what it was - that decision is made by the Directory and its hints and iocontexts and sysprops and so on. But I don't think we currently have IOContext.getReadAdvice()?

Contributor


I think this makes sense. We should flip back to the read advice that was originally present.

Contributor Author


Directory and its hints and iocontexts and sysprops and so on. But I don't think we currently have IOContext.getReadAdvice()?

At the time of the change the initial advice was RANDOM, so this reverts back to RANDOM. But I agree, we can simply cache the original value and revert to it.
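The cache-and-restore idea can be sketched as follows, under the assumption that the reader can capture whatever advice the Directory chose at open time. All names here are hypothetical, since (as noted above) there is currently no IOContext.getReadAdvice().

```java
// Hypothetical sketch: remember the open-time advice instead of
// hard-coding RANDOM on revert.
enum Advice { NORMAL, RANDOM, SEQUENTIAL }

class AdviceCachingReader {
    private final Advice originalAdvice; // cached at open time
    private Advice current;

    AdviceCachingReader(Advice openAdvice) {
        this.originalAdvice = openAdvice;
        this.current = openAdvice;
    }

    void startMerge() { current = Advice.SEQUENTIAL; }

    // Revert to whatever the Directory originally chose, not a constant.
    void finishMerge() { current = originalAdvice; }

    Advice current() { return current; }
}
```

This way the Directory keeps sole ownership of the advice decision; the merge only borrows SEQUENTIAL temporarily.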

@msokolov
Contributor

msokolov commented Aug 8, 2025

I'm really not sure about this change any more -- looking back at the (merging) performance improvement, it was not very large, and somewhat offset by the mysterious search slowdowns. At the same time, to get this working we needed to add a lot of machinery: reference counting to ensure proper closing, and access to memory-mapping hints. Are we still convinced it is worth it?

@shatejas
Contributor Author

it was not very large, and somewhat offset by the mysterious search slowdowns

We can switch to an approach where the madvise is not changed at all. The approach leverages prefetch: the read pattern still changes, but the reads are not as aggressive as with SEQUENTIAL advice. A prefetch call after each read (inside the merge) will make sure merges don't slow down, while minimizing the search perf impact, since prefetch is not aggressive.

#14076 (comment)
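The prefetch-per-read alternative could look roughly like this sketch. The prefetch method here is a stand-in for an asynchronous OS page hint (loosely modeled on IndexInput#prefetch), and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: keep the original advice and instead hint the
// next vector after each read during the merge copy.
class PrefetchingMergeReader {
    final List<Long> prefetched = new ArrayList<>(); // offsets hinted so far
    private final long vectorByteSize;

    PrefetchingMergeReader(long vectorByteSize) {
        this.vectorByteSize = vectorByteSize;
    }

    // Stand-in for an asynchronous page hint (e.g. madvise WILLNEED).
    void prefetch(long offset, long length) {
        prefetched.add(offset);
    }

    // Read vector i for the merge copy, then hint the following vector
    // so its pages are likely resident by the time they are needed.
    void readForMerge(int i, int numVectors) {
        long offset = i * vectorByteSize;
        // ... copy bytes at [offset, offset + vectorByteSize) ...
        if (i + 1 < numVectors) {
            prefetch(offset + vectorByteSize, vectorByteSize);
        }
    }
}
```

Compared with switching the advice, this keeps the mapping's access hint stable for concurrent searches while still hiding read latency inside the merge loop.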

@msokolov
Contributor

Perhaps if we have a case where there is no random access (from Lucene) at all and we are only using Lucene to store the vector data - any search indexing is being done by a native plugin (I think this is what you are targeting?) - then we don't really want to be switching back and forth between access modes; rather we'd want to set SEQUENTIAL and leave it that way, ideally. In that case, what if the reader could recognize that it has no associated HNSW graph (somehow) and provide some kind of hint to the directory, which it could then use to decide on sequential access? Also, I wonder if SEQUENTIAL is really that much better than NORMAL. If not, then we could simply revert this, given that we are returning to NORMAL as the default.

@msokolov
Contributor

I'll open an issue for backporting #14702 to 10.x to fix this.

@shatejas
Contributor Author

shatejas commented Aug 19, 2025

Perhaps if we have a case where there is no random access (from Lucene) at all and we are only using Lucene to store the vector data - any search indexing is being done by a native plugin (I think this is what you are targeting?) - then we don't really want to be switching back and forth between access modes

In certain scenarios, random access is critical for optimal performance. Our initial hypothesis, which we are currently validating through benchmarks, is that random access lookups are significantly more efficient than sequential lookups even for flat vector data under high memory pressure, especially when doing an exact search on filtered documents. This is because random access avoids data prefetching, thereby reducing memory swapping.

To provide users with greater control, it's best to allow them to configure the initial IOContext value based on their specific workloads, rather than keeping it a fixed constant. This approach offers flexibility while maintaining sensible defaults.

Furthermore, we can mitigate the impact on search performance during merges by implementing a dedicated prefetch functionality for vector merges. This eliminates the need to switch between access methods, ensuring a minimal impact on ongoing searches.
