Embeddings: optimize memory usage during decode #54972
return nil, err
}

ei.Embeddings = make([]int8, 0, numChunks*ei.ColumnDimension)
First change: we should be using embeddingsChunkSize instead of ColumnDimension here, because each chunk is not a single vector but a chunk of 10_000 floats.
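To make the sizing point concrete, here is a minimal sketch (not the PR's code) that counts how often `append` has to reallocate the destination when its capacity is computed from a per-vector dimension versus from the chunk size. `countGrowths` and the dimension `768` are hypothetical and only for illustration; the chunk size of 10_000 comes from the comment above.

```go
package main

import "fmt"

// embeddingsChunkSize mirrors the constant named in the review comment.
// The value 10_000 is taken from that comment.
const embeddingsChunkSize = 10_000

// countGrowths (hypothetical helper) appends numChunks chunks of chunkSize
// values to a slice with the given initial capacity and returns how many
// times append had to move to a new backing array.
func countGrowths(initialCap, numChunks, chunkSize int) int {
	dst := make([]int8, 0, initialCap)
	chunk := make([]int8, chunkSize)
	growths := 0
	for i := 0; i < numChunks; i++ {
		before := cap(dst)
		dst = append(dst, chunk...)
		if cap(dst) != before {
			// The capacity only changes when append reallocates.
			growths++
		}
	}
	return growths
}

func main() {
	const numChunks = 100
	const columnDimension = 768 // assumed dimension, for illustration only

	fmt.Println("sized by dimension: ", countGrowths(numChunks*columnDimension, numChunks, embeddingsChunkSize))
	fmt.Println("sized by chunk size:", countGrowths(numChunks*embeddingsChunkSize, numChunks, embeddingsChunkSize))
}
```

With the capacity sized by chunk size, every append fits in the original backing array. With the undersized capacity, each growth copies everything written so far, and the old and new arrays are both live during the copy.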
- ei.Embeddings = make([]int8, 0, numChunks*ei.ColumnDimension)
+ ei.Embeddings = make([]int8, 0, numChunks*embeddingsChunkSize)
+ embeddingsBuf := make([]float32, 0, embeddingsChunkSize)
Second change: move the decode buffer outside the loop so we don't allocate on every iteration.
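As a rough illustration of this pattern (a sketch under assumptions, not the decoder from this PR), the loop below allocates one scratch buffer up front and resets it with `buf[:0]` for each chunk. `fillChunk` is a hypothetical stand-in for decoding one chunk from the stream.

```go
package main

import "fmt"

// fillChunk is a hypothetical stand-in for decoding one chunk. It resets buf
// to length zero and refills it, so the backing array is reused whenever its
// capacity is large enough.
func fillChunk(buf []float32, chunkIndex, n int) []float32 {
	buf = buf[:0]
	for i := 0; i < n; i++ {
		buf = append(buf, float32(chunkIndex*n+i))
	}
	return buf
}

// sumChunks processes numChunks chunks with a single scratch buffer that is
// allocated once, outside the loop. It returns the sum of all decoded values
// and the number of backing arrays the scratch buffer needed.
func sumChunks(numChunks, chunkSize int) (sum float64, bufAllocs int) {
	buf := make([]float32, 0, chunkSize)
	bufAllocs = 1
	for c := 0; c < numChunks; c++ {
		before := cap(buf)
		buf = fillChunk(buf, c, chunkSize)
		if cap(buf) != before {
			bufAllocs++
		}
		for _, v := range buf {
			sum += float64(v)
		}
	}
	return sum, bufAllocs
}

func main() {
	sum, allocs := sumChunks(3, 4)
	fmt.Println("sum:", sum, "scratch buffers allocated:", allocs)
}
```

The scratch buffer's contents are overwritten on every iteration, so anything that must outlive the iteration has to be copied out first.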
return nil, err
}
- ei.Embeddings = append(ei.Embeddings, Quantize(embeddingSlice)...)
+ ei.Embeddings = append(ei.Embeddings, Quantize(embeddingsBuf, quantizeBuf)...)
Third change: update Quantize to accept an optional buffer, which it will use if it's large enough to fit the output.
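One way such an optional-buffer signature could look is sketched below. This is not the actual `Quantize` from the PR: the lowercase `quantize` name, the assumption that inputs lie in [-1, 1], the mapping to [-127, 127], and the clamping are all illustrative choices.

```go
package main

import (
	"fmt"
	"math"
)

// quantize is a hypothetical sketch of a quantizer that accepts an optional
// output buffer. If buf has enough capacity it is reused, otherwise a new
// slice is allocated. Inputs are assumed to lie in [-1, 1] and are mapped to
// [-127, 127]; out-of-range values are clamped.
func quantize(input []float32, buf []int8) []int8 {
	out := buf
	if cap(out) < len(input) {
		out = make([]int8, len(input))
	}
	out = out[:len(input)]
	for i, v := range input {
		if v > 1 {
			v = 1
		} else if v < -1 {
			v = -1
		}
		out[i] = int8(math.Round(float64(v) * 127))
	}
	return out
}

func main() {
	buf := make([]int8, 8)
	out := quantize([]float32{-1, 0, 1}, buf)
	fmt.Println(out, "reused buffer:", &out[0] == &buf[0])
}
```

Because the result may alias the buffer, the caller has to copy it (for example with `append`, as the loop in the diff does) before calling the function again with the same buffer.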
}
}

func BenchmarkCustomRepoEmbeddingIndexDownload(b *testing.B) {
I added this benchmark to prove to myself that this cuts down the allocations. See PR description for results.
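For readers who want to reproduce this kind of measurement outside the repository, here is a self-contained sketch using `testing.Benchmark` from the standard library. It is not the benchmark added in this PR: `decodeNaive`, `decodeReuse`, and the sizes in `main` are hypothetical stand-ins for the old and new decode loops, and the numbers printed will vary with Go version and machine.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the results reachable so the compiler cannot drop the work.
var sink []int8

// decodeNaive (hypothetical) undersizes the destination by using a vector
// dimension and creates a fresh scratch slice for every chunk.
func decodeNaive(numChunks, dim, chunkSize int) []int8 {
	out := make([]int8, 0, numChunks*dim)
	for c := 0; c < numChunks; c++ {
		scratch := make([]int8, chunkSize)
		out = append(out, scratch...)
	}
	return out
}

// decodeReuse (hypothetical) sizes the destination exactly and reuses a
// single scratch slice across all chunks.
func decodeReuse(numChunks, chunkSize int) []int8 {
	out := make([]int8, 0, numChunks*chunkSize)
	scratch := make([]int8, chunkSize)
	for c := 0; c < numChunks; c++ {
		out = append(out, scratch...)
	}
	return out
}

func main() {
	const numChunks, dim, chunkSize = 200, 768, 10_000 // illustrative sizes
	cases := []struct {
		name string
		fn   func() []int8
	}{
		{"naive", func() []int8 { return decodeNaive(numChunks, dim, chunkSize) }},
		{"reuse", func() []int8 { return decodeReuse(numChunks, chunkSize) }},
	}
	for _, c := range cases {
		res := testing.Benchmark(func(b *testing.B) {
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				sink = c.fn()
			}
		})
		fmt.Printf("%s: %d B/op, %d allocs/op\n", c.name, res.AllocedBytesPerOp(), res.AllocsPerOp())
	}
}
```

Inside a repository, the same comparison would normally live in a `_test.go` file and be run with `go test -bench . -benchmem`.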
keegancsmith
left a comment
I guess once we switch to a third-party vector store we won't keep this store around? I.e., it isn't worth making further improvements to how we store stuff?
Yeah, there is a lot more room for improvement here, but hopefully (🤞) this won't be around too much longer
The backport to
To backport manually, run these commands in your terminal:
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-5.1 5.1
# Navigate to the new working tree
cd .worktrees/backport-5.1
# Create a new branch
git switch --create backport-54972-to-5.1
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 2880b4ffa0b8591d7a46ace259c8f522931dd617
# Push it to GitHub
git push --set-upstream origin backport-54972-to-5.1
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-5.1
Then, create a pull request where the
This makes memory usage ~constant during the decode process. Previously, we were undersizing the large allocation (which caused it to reallocate ~10 extra times) for the index and making a bunch of additional smaller allocations per loop. This meant we were spiking memory usage during decoding, and even if the garbage collector could keep up (which I wouldn't expect it to), we'd still need 2x the index size at the peak for live memory. This was causing an OOM.
Before:
After:
That's a 2.2x reduction in allocated bytes and a 1.4x reduction in allocation count. This is quite significant when our indexes are many GBs in memory.
Test plan
A new benchmark to demonstrate better memory behavior, a new quick test on the quantize changes, existing tests on encode/decode, and manual E2E tests for generating and searching with an embeddings index.