Conversation
Patch `tonic` and `hyper` crates to expose `max_local_error_reset_streams`, and *disable* it when creating internal gRPC connections
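For reference, a minimal sketch of the knob itself, assuming the patched builders forward to the underlying h2 setting (`h2::server::Builder::max_local_error_reset_streams`); the actual patched tonic/hyper surface may differ. Passing `None` removes the limit:

```rust
// Sketch only: the real patched tonic/hyper API may look different.
// This shows the underlying h2 setting that the patch exposes.
fn main() {
    let mut builder = h2::server::Builder::new();
    // `None` disables the cap on streams reset due to local errors; this
    // is acceptable here because both ends of the connection are
    // cluster-internal and managed by us.
    builder.max_local_error_reset_streams(None);
}
```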
Some more context for the future 🦾 🤖 🦿 if we hit this again: we hit a security feature here that helps us clean up borked connections. Since this is for internal cluster communication and we manage both sides of the connection, we can disable this security feature, which is what this PR takes care of. We mainly hit this because we drop connections before waiting on the result. An error is counted when we drop a connection before we receive its response, and eventually we hit the limit. This happens a lot when fanning out reads, which races against the local replica. Forcefully dropping these connections once we get a result is much easier and fits our implementation better than waiting for all responses and dropping connections gracefully.
How can this fix be validated? :)
```shell
bfb \
  --uri $QDRANT_HOST \
  -n 10M \
  -d 512 \
  --skip-setup \
  --search \
  --keywords 5000 \
  --rps 300
```

@generall used this collection setup (not sure if critical, though; search is what repros the bug):

```shell
bfb \
  --uri $QDRANT_HOST \
  -n 10M \
  -d 512 \
  --shards 9 \
  --replication-factor 2 \
  --on-disk-vectors true \
  --keywords 5000 \
  --hnsw-m 0 \
  --hnsw-payload-m 16 \
  --tenants true \
  -b 10 \
  --timeout 60 \
  --rps 100
```
|
An important detail is the fan-out factor.

Repro'd with the default setup, without any explicit fan-out factor setting, on my machine. Behavior with and without the fix is different:
It seems to fix the problem on my repro setup.
Patch `tonic` and `hyper` crates to expose `max_local_error_reset_streams`, and disable it when creating internal gRPC connections.

All Submissions:

- Contributions should target the `dev` branch. Did you create your branch from `dev`?

New Feature Submissions:

- Has your code been formatted with the `cargo +nightly fmt --all` command prior to submission?
- Have you checked your code using the `cargo clippy --workspace --all-features` command?

Changes to Core Features: