
Update queue dont keep ops in ram#7951

Merged
IvanPleshkov merged 7 commits into dev from update-queue-dont-keep-ops-in-ram-showcase
Jan 27, 2026

Conversation

@IvanPleshkov
Contributor

@IvanPleshkov IvanPleshkov commented Jan 20, 2026

Summary

This PR reduces the RAM usage of pending operations in the Update Worker.

Description

The Update Worker has a channel of pending operations. Each pending operation has an OperationData, which contains the request data.

During a bulk upload this channel may grow huge and consume a lot of RAM.

This PR changes the logic of the Update Worker: instead of storing the request in RAM, the Update Worker reads the request back from the WAL.
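The gist of the new logic can be sketched as follows. This is a minimal illustration only; `OperationSource`, `RAM_BUFFER`, and `enqueue` are hypothetical names, not the actual Qdrant types:

```rust
// Hypothetical sketch: an enqueued operation either carries its payload
// inline or only a pointer into the WAL, from which it is re-read on apply.
enum OperationSource {
    InRam(Vec<u8>), // payload kept in the channel
    WalRef(u64),    // only the WAL record index
}

const RAM_BUFFER: usize = 200;

fn enqueue(pending: usize, wal_index: u64, payload: Vec<u8>) -> OperationSource {
    if pending < RAM_BUFFER {
        OperationSource::InRam(payload)
    } else {
        OperationSource::WalRef(wal_index)
    }
}

fn main() {
    // Queue almost empty: keep the payload inline.
    assert!(matches!(enqueue(10, 1, vec![1, 2, 3]), OperationSource::InRam(_)));
    // Queue beyond the buffer limit: keep only the WAL index.
    assert!(matches!(enqueue(500, 2, vec![4, 5]), OperationSource::WalRef(2)));
    println!("ok");
}
```

This keeps the hot path (small queue) allocation-free relative to dev, while a saturated queue degrades to WAL re-reads instead of unbounded RAM growth.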

Performance

With buffering, no performance drops were found.

Using the vector db benchmark, the size of the update queue didn't reach the buffer size limit.
Using BFB, I got a full update queue with --threads=8 --num-vectors=1000000. Even so, the difference between runs is larger than the difference between branches.

Without buffering (by changing DEFAULT_UPDATE_QUEUE_RAM_BUFFER to zero), there is a performance drop caused by deserialization: local measurements showed that deserialization may take 1-2 ms for a large batched upsert. That's a fair price for an unlimited update queue where RAM usage has a higher priority than deserialization cost.

@IvanPleshkov IvanPleshkov requested a review from agourlay January 20, 2026 13:59
@IvanPleshkov IvanPleshkov force-pushed the update-queue-dont-keep-ops-in-ram-showcase branch 2 times, most recently from 22b7ed6 to 7078662 on January 26, 2026 00:25
@IvanPleshkov IvanPleshkov changed the base branch from dev to save-applied-seq January 26, 2026 00:25
@IvanPleshkov IvanPleshkov changed the title Update queue dont keep ops in ram showcase Update queue dont keep ops in ram Jan 26, 2026
cound only pending in update worker operations

fix typo

use channel size instead of wal index
@IvanPleshkov IvanPleshkov force-pushed the update-queue-dont-keep-ops-in-ram-showcase branch from 5a254d8 to 87bcd5a on January 26, 2026 12:54
@IvanPleshkov IvanPleshkov changed the base branch from save-applied-seq to dev January 26, 2026 12:56
let keep_operation_in_ram = pending_operations_count < DEFAULT_UPDATE_QUEUE_RAM_BUFFER;
let operation = keep_operation_in_ram.then_some(Box::new(operation.operation));

channel_permit.send(UpdateSignal::Operation(OperationData {
Member

This struct is at least 152 bytes per operation, even for just a pointer to the WAL.

Just the hw_measurements field takes 120 bytes.

With that, we should probably box it too. An alternative would be an enum that satisfies all variants, but I cannot come up with a good one with these three fields.

I think it's worth a little bit of effort.

Contributor Author

Thanks for this measurement, I didn't attach it to the PR description.
Just boxing won't help because we still need this field; boxing would move the memory without any benefit. But I can try, as a separate PR, to refactor the hw counter and reduce the Arc count here (by arc-ing the whole structure):

pub struct HwSharedDrain {
    pub(crate) cpu_counter: Arc<AtomicUsize>,
    pub(crate) payload_io_read_counter: Arc<AtomicUsize>,
    pub(crate) payload_io_write_counter: Arc<AtomicUsize>,
    pub(crate) payload_index_io_read_counter: Arc<AtomicUsize>,
    pub(crate) payload_index_io_write_counter: Arc<AtomicUsize>,
    pub(crate) vector_io_read_counter: Arc<AtomicUsize>,
    pub(crate) vector_io_write_counter: Arc<AtomicUsize>,
}

Is it fine for you?
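A minimal sketch of the proposed "arc the whole structure" refactor, under the assumption that the counters can live behind a single shared allocation (field names are abbreviated; this is an illustration, not the actual patch):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

// All counters live in one shared allocation instead of seven Arc fields.
struct HwCounters {
    cpu: AtomicUsize,
    vector_io_read: AtomicUsize,
    // ...the remaining five counters would live here too
}

#[derive(Clone)]
struct HwSharedDrain {
    counters: Arc<HwCounters>,
}

fn main() {
    let drain = HwSharedDrain {
        counters: Arc::new(HwCounters {
            cpu: AtomicUsize::new(0),
            vector_io_read: AtomicUsize::new(0),
        }),
    };
    // Clones share the same counters through a single Arc.
    let clone = drain.clone();
    clone.counters.cpu.fetch_add(5, Ordering::Relaxed);
    clone.counters.vector_io_read.fetch_add(1, Ordering::Relaxed);
    assert_eq!(drain.counters.cpu.load(Ordering::Relaxed), 5);
    // The struct shrinks from seven Arcs to a single pointer.
    assert_eq!(
        std::mem::size_of::<HwSharedDrain>(),
        std::mem::size_of::<usize>()
    );
    println!("ok");
}
```

Cloning then costs one atomic refcount increment instead of seven, and the struct itself shrinks to pointer size.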

Member

Just a boxing wont help because we still need this field, box will move a memory without a profit. But I can try as a separate PR to refactor hw counter, where I can try to reduce Arc count here (by arc-ing the whole structure):

Good point. Even values that we drop from memory will still have their HW aggregator attached. Thanks for the correction.

If this is only to report hardware statistics for wait=true responses, maybe it'd be worth making wait=true and using the queue mutually exclusive. 🤔

But I can try as a separate PR to refactor hw counter, where I can try to reduce Arc count here (by arc-ing the whole structure):

Not sure on that one. Let's focus on the current feature first.

@IvanPleshkov IvanPleshkov marked this pull request as ready for review January 26, 2026 15:58

@IvanPleshkov IvanPleshkov requested a review from timvisee January 26, 2026 16:09
let update_sender = self.update_sender.load();
// Note: per the Tokio docs, `capacity()` returns the number of *remaining
// available* slots, so pending = max_capacity() - capacity().
let pending_operations_count = update_sender
    .max_capacity()
    .saturating_sub(update_sender.capacity());
Member

Can you add a little comment noting that capacity is actually the number of remaining available slots?
This is rather confusing in Tokio.

Contributor Author

Good point, I was also confused when I first met capacity. Added a comment.

/// Maximum number of operations which are stored in RAM in update worker queue.
/// If there are more pending operations, operation data
/// will be read from WAL when processing the operation.
pub const DEFAULT_UPDATE_QUEUE_RAM_BUFFER: usize = 50;
Member

Any strict reason for lowering this number from 100 or 200?

Contributor Author

To make the changes measurable. And you did a measurement where you found an upload speed drop.
I replied to your bfb message and reverted this constant back to 200. Now this PR does not affect dev performance while the update queue size is 100.

Member

@timvisee timvisee left a comment

I'm seeing a performance degradation which I don't expect.

Using bfb -t8 -p8 -b1 --num-vectors=250000 --skip-wait-index --indexing-threshold 0:

  • on PR: 50 seconds upsert time (~4800/s)
  • on dev: 38 seconds upsert time (~6500/s)

@IvanPleshkov
Copy link
Contributor Author

I'm seeing a performance degradation which I don't expect.

Using bfb -t8 -p8 -b1 --num-vectors=250000 --skip-wait-index --indexing-threshold 0:

  • on PR: 50 seconds upsert time (~4800/s)
  • on dev: 38 seconds upsert time (~6500/s)

Checked locally. Maybe my hardware was not good enough to measure the drop, as I have around 29K/s on both branches.

I described in the PR that deserialization has a high price, and it cannot be avoided while reading from the WAL. Your measurement is more expected than mine.
As a solution, I propose increasing the buffer size up to the default update queue size. In this case, the behaviour won't change in comparison with dev, but we'll be ready to increase the update queue up to millions in the next PRs.

@timvisee
Member

Checked locally. Maybe my hardware was not good enough to measure the drop, as I have around 29K/s on both branches.

Note that I tested on a debug build.

I described in PR that deserialization has a huge price, and while reading from WAL cannot be avoided. Your measurement is more expected than mine.
As a solution, I propose increasing the buffer size up to the default update queue size. In this case, the behaviour won't be changed in comparison with dev. But we are ready to increase the update queue up to millions in the next PRs

Are you suggesting that the 'lock, deserialize, unlock' cycles on the WAL are causing this? Yeah, I'll give setting the buffer size to unlimited a shot to see if it shows a different picture.

I'll also run two release builds to see how they behave.

@IvanPleshkov
Contributor Author

Are you suggesting that the 'lock, deserialize, unlock's on the WAL are causing this? Yeah I'll give setting the buffer size to unlimited a shot to see if it shows a different picture.

I'll also run two release builds to see how they behave.

To make sure that the reason is deserialization, I measured the possible drops (WAL blocking, IO, deserialization) using regular timers and got milliseconds for deserialization even in a release build. All other possible drop points were much smaller.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@lib/collection/src/shards/local_shard/shard_ops.rs`:
- Around line 74-80: Typo in the comment near pending_operations_count: change
"regarging" to "regarding" in the comment that explains Sender::capacity
behavior; update the comment referencing update_sender, max_capacity(), and
capacity() so it reads "regarding tokio docs" (or rephrase to "as per the Tokio
docs") to clear the codespell failure.

@timvisee
Member


@IvanPleshkov

Using debug build and bfb -t8 -p8 -b1 --num-vectors=250000 --skip-wait-index --indexing-threshold 0:

  • on PR(limit: 50): 50 seconds upsert time (~4800/s)
  • on PR(limit: 200): 38 seconds upsert time (~6500/s)
  • on PR(limit: 1m): 38 seconds upsert time (~6500/s)
  • on dev: 38 seconds upsert time (~6500/s)

Using perf build and bfb -t8 -p8 -b1 --num-vectors=1000000 --skip-wait-index --indexing-threshold 0:

  • on pr(limit: 50): 33 seconds upsert time
  • on pr(limit: 200): 33 seconds upsert time
  • on pr(limit: 1m): 33 seconds upsert time
  • on dev: 33 seconds upsert time

It shows the problem goes away in release builds, at least for this workload.

Any objection for reverting the queue size limit from 50 to 200, to at least make the debug build case better?

@IvanPleshkov
Contributor Author

Any objection for reverting the queue size limit from 50 to 200, to at least make the debug build case better?

No objections. I have already pushed the buffer size limit changes (to 500; past 100 it doesn't matter how large it is).

@IvanPleshkov IvanPleshkov merged commit 8db893a into dev Jan 27, 2026
15 checks passed
@IvanPleshkov IvanPleshkov deleted the update-queue-dont-keep-ops-in-ram-showcase branch January 27, 2026 14:18
IvanPleshkov added a commit that referenced this pull request Jan 27, 2026
* update queue dont keep ops in ram showcase

* always load operation from WAL

* revert operations buffering

cound only pending in update worker operations

fix typo

use channel size instead of wal index

* remove result expect

* decrease buffering const

* review remarks

* are you happy codespell
@agourlay agourlay added this to the Update queue milestone Jan 28, 2026
@coderabbitai coderabbitai bot mentioned this pull request Feb 4, 2026
generall pushed a commit that referenced this pull request Feb 9, 2026
* update queue dont keep ops in ram showcase

* always load operation from WAL

* revert operations buffering

cound only pending in update worker operations

fix typo

use channel size instead of wal index

* remove result expect

* decrease buffering const

* review remarks

* are you happy codespell