Reduce search spikes, don't lock Gridstore bitmask for duration of flush#8169
Conversation
```rust
    return Err(GridstoreError::FlushCancelled);
};

let mut bitmask_guard = bitmask.upgradable_read();
```
Holding the read guard for the whole operation guaranteed that the bitmask and tracker contents were consistent. That doesn't seem to be true anymore. Can it be a problem?
It should not be a problem.
The bitmask must be set before the mapping is put in the tracker, and mappings must be removed from the tracker before their slots are freed in the bitmask. That is the case here. These two operations don't have to be atomic with respect to each other.
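A minimal sketch of the ordering invariant described above, using hypothetical names (`Store`, `insert`, `remove` are illustrative, not Gridstore's actual API): as long as the bitmask bit is set before the mapping is published in the tracker, and retracted from the tracker before the bit is cleared, any mapping found in the tracker always points at an occupied slot, without the two updates being atomic.

```rust
use std::collections::HashMap;

// Simplified model: bitmask tracks occupied slots, tracker maps ids to slots.
struct Store {
    bitmask: Vec<bool>,
    tracker: HashMap<u64, usize>,
}

impl Store {
    fn new(slots: usize) -> Self {
        Store { bitmask: vec![false; slots], tracker: HashMap::new() }
    }

    fn insert(&mut self, id: u64, slot: usize) {
        // 1) reserve the slot in the bitmask first...
        self.bitmask[slot] = true;
        // 2) ...only then publish the mapping in the tracker.
        self.tracker.insert(id, slot);
    }

    fn remove(&mut self, id: u64) {
        // 1) retract the mapping from the tracker first...
        if let Some(slot) = self.tracker.remove(&id) {
            // 2) ...only then release the slot in the bitmask.
            self.bitmask[slot] = false;
        }
    }

    // Invariant: every tracked mapping points at a set bitmask bit.
    fn invariant_holds(&self) -> bool {
        self.tracker.values().all(|&slot| self.bitmask[slot])
    }
}
```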
coszio
left a comment
This looks OK to me, but was any testing performed to validate this is the source of latency spikes?
We're doing that separately. @JojiiOfficial and @IvanPleshkov are looking into it.
agourlay
left a comment
LGTM.
Did a pass with crasher 👍
Supersedes #8167
In the Gridstore flusher, a lock on the bitmask was held for the full duration of the flush, including IO. On big flushes, this means we could hold the lock for a long time.
This blocks writes on Gridstore, which can in turn block reads as well. We assume this has been causing latency spikes in searches.
It turns out we don't have to hold the lock that long. Instead, we can lock in two separate stages and keep the critical sections short.
As far as I can tell this change is sound. I've not been able to come up with a problematic scenario.
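The two-stage pattern described above can be sketched as follows. This is a hypothetical illustration, not Gridstore's code: it uses std's `RwLock` as a stand-in for the upgradable lock in the PR, a `HashSet` of page ids as a stand-in for the bitmask's dirty state, and a `Vec` push as a stand-in for writing a page to disk.

```rust
use std::collections::HashSet;
use std::sync::RwLock;

// Stage 1: short critical section to snapshot what needs flushing.
// Middle: slow IO with NO lock held, so writers/readers are not blocked.
// Stage 2: short critical section to clear exactly what was flushed.
fn flush(pending: &RwLock<HashSet<usize>>, flushed: &mut Vec<usize>) {
    // Stage 1: snapshot the dirty set under a short read lock.
    let snapshot: Vec<usize> = {
        let guard = pending.read().unwrap();
        guard.iter().copied().collect()
    }; // lock released here, before any IO happens

    // Slow part: perform the (simulated) IO without holding the lock.
    for page in &snapshot {
        flushed.push(*page); // stand-in for writing the page out
    }

    // Stage 2: take the lock again briefly and clear only the
    // entries we actually flushed; anything dirtied concurrently stays.
    let mut guard = pending.write().unwrap();
    for page in &snapshot {
        guard.remove(page);
    }
}
```

The key property is that the lock is never held across the IO in the middle, so writes that arrive during the flush only ever wait for one of the two short critical sections.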
Thanks @IvanPleshkov for finding this one.
All Submissions:

- [ ] Did you create your branch from `dev`?

Changes to Core Features: