```rust
let keep_operation_in_ram = pending_operations_count < DEFAULT_UPDATE_QUEUE_RAM_BUFFER;
let operation = keep_operation_in_ram.then_some(Box::new(operation.operation));

channel_permit.send(UpdateSignal::Operation(OperationData {
```
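A hedged, dependency-free sketch of this gating logic, with a `String` standing in for the real operation type (an assumption for illustration; the PR uses the collection's own operation struct):

```rust
// Sketch: keep the operation body in the channel message only while the
// queue is short; otherwise the worker re-reads the body from the WAL.
const DEFAULT_UPDATE_QUEUE_RAM_BUFFER: usize = 200;

/// Returns Some(boxed operation) when the body should travel in RAM,
/// or None when the update worker should re-read it from the WAL.
fn operation_payload(pending_operations_count: usize, operation: String) -> Option<Box<String>> {
    let keep_operation_in_ram = pending_operations_count < DEFAULT_UPDATE_QUEUE_RAM_BUFFER;
    // `then` (not `then_some`) so the Box is only allocated when it is kept.
    keep_operation_in_ram.then(|| Box::new(operation))
}
```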
This struct is at least 152 bytes per operation, even for just a pointer to the WAL.
Just the hw_measurements field takes 120 bytes.
Given that, we should probably box it too. An alternative would be an enum that covers all variants, but I can't come up with a good one for these three fields.
I think it's worth a little bit of effort.
Thanks for this measurement, I didn't attach it to the PR description.
Boxing alone won't help, because we still need this field; a box would just move the memory around without any benefit. But I can try, as a separate PR, to refactor the hw counter and reduce the Arc count here (by Arc-ing the whole structure):
```rust
pub struct HwSharedDrain {
    pub(crate) cpu_counter: Arc<AtomicUsize>,
    pub(crate) payload_io_read_counter: Arc<AtomicUsize>,
    pub(crate) payload_io_write_counter: Arc<AtomicUsize>,
    pub(crate) payload_index_io_read_counter: Arc<AtomicUsize>,
    pub(crate) payload_index_io_write_counter: Arc<AtomicUsize>,
    pub(crate) vector_io_read_counter: Arc<AtomicUsize>,
    pub(crate) vector_io_write_counter: Arc<AtomicUsize>,
}
```
Is that fine with you?
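For illustration, a minimal sketch of what such a refactor could look like, assuming the counters move behind a single `Arc` so cloning the drain bumps one refcount instead of seven (the inner-type name is hypothetical, and only two of the counters are shown):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical inner struct: plain atomics, shared as one allocation.
struct HwSharedDrainInner {
    cpu_counter: AtomicUsize,
    vector_io_read_counter: AtomicUsize,
    // ... remaining counters elided for brevity
}

#[derive(Clone)]
pub struct HwSharedDrain {
    inner: Arc<HwSharedDrainInner>,
}

impl HwSharedDrain {
    pub fn new() -> Self {
        Self {
            inner: Arc::new(HwSharedDrainInner {
                cpu_counter: AtomicUsize::new(0),
                vector_io_read_counter: AtomicUsize::new(0),
            }),
        }
    }

    // All clones drain into the same shared counters.
    pub fn add_cpu(&self, n: usize) {
        self.inner.cpu_counter.fetch_add(n, Ordering::Relaxed);
    }

    pub fn cpu(&self) -> usize {
        self.inner.cpu_counter.load(Ordering::Relaxed)
    }
}
```

The handle itself then shrinks to one pointer, which also helps the per-operation size concern raised above.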
> Boxing alone won't help, because we still need this field; a box would just move the memory around without any benefit. But I can try, as a separate PR, to refactor the hw counter and reduce the Arc count here (by Arc-ing the whole structure):
Good point. Even values that we drop from memory will still have their HW aggregator attached. Thanks for the correction.
If this is only used to report hardware statistics for wait=true responses, maybe it'd be worth making wait=true and using the queue mutually exclusive. 🤔
> But I can try as a separate PR to refactor hw counter, where I can try to reduce Arc count here (by arc-ing the whole structure):
Not sure on that one. Let's focus on the current feature first.
```rust
let update_sender = self.update_sender.load();
let pending_operations_count = update_sender
    .max_capacity()
    .saturating_sub(update_sender.capacity());
```
Can you add a little comment noting that `capacity` is actually the number of remaining available slots? This is rather confusing in Tokio.
Good point, I was also confused when I first encountered `capacity`. Added a comment.
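For context: in `tokio::sync::mpsc`, `Sender::capacity()` returns the number of currently free permits, while `max_capacity()` returns the bound passed to `channel(n)`. A dependency-free mock of that accounting (the `MockSender` type is hypothetical), showing why the subtraction yields the pending count:

```rust
// Mock of the relevant tokio::sync::mpsc::Sender accounting:
// `capacity()` is the number of FREE slots right now,
// `max_capacity()` is the configured channel bound.
struct MockSender {
    max_capacity: usize,
    in_flight: usize, // messages sent but not yet received
}

impl MockSender {
    fn max_capacity(&self) -> usize {
        self.max_capacity
    }
    fn capacity(&self) -> usize {
        self.max_capacity - self.in_flight
    }
}

// Same expression as in the PR: pending = configured bound - free slots.
fn pending_operations_count(sender: &MockSender) -> usize {
    sender.max_capacity().saturating_sub(sender.capacity())
}
```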
```rust
/// Maximum number of operations which are stored in RAM in update worker queue.
/// If there are more pending operations, operation data
/// will be read from WAL when processing the operation.
pub const DEFAULT_UPDATE_QUEUE_RAM_BUFFER: usize = 50;
```
Any strict reason for lowering this number from 100 or 200?
To make the changes measurable. And you did a measurement where you found an upload speed drop.
I replied to your message with bfb and reverted this constant back to 200. Now this PR does not affect dev perf while the update queue is 100.
Checked locally. Maybe my hardware was not good enough to measure the drop, as I get around 29K/s on both branches. I described in the PR that deserialization has a huge price, while reading from the WAL cannot be avoided. Your measurement is more expected than mine.
Note that I tested on a debug build.
Are you suggesting that the 'lock, deserialize, unlock' cycles on the WAL are causing this? Yeah, I'll give setting the buffer size to unlimited a shot to see if it shows a different picture. I'll also run two release builds to see how they behave.
To make sure that the reason is deserialization, I measured the possible drop points (WAL blocking, IO, deserialization) using plain timers and got milliseconds for deserialization even in a release build. All other possible drop points were much smaller.
Actionable comments posted: 1
In `lib/collection/src/shards/local_shard/shard_ops.rs`, around lines 74-80: there is a typo in the comment near `pending_operations_count` that explains the `Sender::capacity` behavior. Change "regarging" to "regarding" (or rephrase to "as per the Tokio docs") to clear the codespell failure.
Using debug build: [benchmark results omitted]
Using perf build: [benchmark results omitted]
It shows the problem goes away in release builds, at least for this workload. Any objection to reverting the queue size limit from 50 to 200, to at least make the debug build case better?
No objections, I have already pushed the buffer size limit changes (to 500; beyond 100 it doesn't matter how large it is).
* update queue dont keep ops in ram showcase
* always load operation from WAL
* revert operations buffering; cound only pending in update worker operations; fix typo; use channel size instead of wal index
* remove result expect
* decrease buffering const
* review remarks
* are you happy codespell
Summary
This PR reduces the RAM usage of pending operations in the Update Worker.
Description
The Update Worker has a channel of pending operations. Each pending operation has an OperationData, which contains the request data. In case of a bulk upload the channel may become huge and consume a lot of RAM.
This PR changes the logic of the Update Worker: instead of storing the request in RAM, the Update Worker reads the request from the WAL.
Performance
With buffering, no performance drops were found.
Using the vector db benchmark, the size of the update didn't reach the buffer size limit.
Using BFB, I got a full update queue with the params --threads=8 --num-vectors=1000000. But still, the difference between runs is larger than the difference between branches.
Without buffering (by changing DEFAULT_UPDATE_QUEUE_RAM_BUFFER to zero), there is a performance drop because of deserialization: local measurements showed that deserialization might take 1-2 ms for a large batched upsert. That is a fine price for an unlimited update queue where RAM usage has higher priority than deserialization cost.
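The plain-timer measurement described in the discussion can be sketched generically. The decode step below is a stand-in (the real code deserializes the collection's operation type from WAL records with serde; here little-endian `u64`s are decoded so the sketch stays dependency-free):

```rust
use std::time::{Duration, Instant};

// Stand-in for the WAL read path: decode a raw record back into values.
fn read_and_deserialize(raw: &[u8]) -> Vec<u64> {
    raw.chunks_exact(8)
        .map(|chunk| u64::from_le_bytes(chunk.try_into().unwrap()))
        .collect()
}

// Wrap the suspected hot path in a plain timer, as done in the discussion
// above, to compare it against the other candidate drop points.
fn timed_read(raw: &[u8]) -> (Vec<u64>, Duration) {
    let start = Instant::now();
    let ops = read_and_deserialize(raw);
    (ops, start.elapsed())
}
```

Timing each candidate (WAL locking, IO, deserialization) separately this way is what isolated deserialization as the dominant cost.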