feat: predictive active blocks for routing without load metrics #1731
alec-flowers
left a comment
I think another comparison point should be SGLang's Rust router. They also do a form of predictive routing.
In their benchmark they dump 1k requests all at once, and our old Python router couldn't handle that load. I would be interested to see how we do in such a scenario now, with this predictive router + KV cache events.
I assume we still won't do great, because the KV cache events are going to be delayed. With such a bursty test, the best thing you can do is have good prediction.
@alec-flowers we can set the overlap weight to 0 for super bursty patterns
ah yes, adding this as a TODO for a future PR

Overview:
The core is really a simple data structure, `HashMap<SequenceHash, HashSet<RequestId>>`, storing the active blocks. This should make all the operations we need O(1), namely reading and writing the number of active blocks. It extends to multiple workers by giving each worker its own instance, one per OS thread, with the various read / write requests performed via channels.

This data structure is held locked during the whole read, best-worker compute, and update cycle during scheduling. This avoids race conditions and staleness, and empirically gives considerably better results.
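As a rough illustration of the structure and the locked scheduling cycle described above, here is a minimal sketch. It is not the code in this PR: the type aliases, the names `ActiveBlocks`, `add_request`, `free_request`, `new_blocks`, and `schedule`, and the cost (active blocks plus blocks the request would add) are all assumptions made for the example. For brevity it also guards every worker with a single `Mutex` instead of one structure per OS thread behind channels.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

// Hypothetical aliases; the real types in the codebase may differ.
type SequenceHash = u64;
type RequestId = String;

/// Active blocks of one worker: block hash -> requests currently using it.
#[derive(Default)]
struct ActiveBlocks {
    blocks: HashMap<SequenceHash, HashSet<RequestId>>,
}

impl ActiveBlocks {
    /// Mark every block in `hashes` as in use by `request`.
    fn add_request(&mut self, request: &str, hashes: &[SequenceHash]) {
        for &h in hashes {
            self.blocks.entry(h).or_default().insert(request.to_string());
        }
    }

    /// Drop `request`'s references; a block with no users left is removed.
    fn free_request(&mut self, request: &str, hashes: &[SequenceHash]) {
        for h in hashes {
            if let Some(users) = self.blocks.get_mut(h) {
                users.remove(request);
                if users.is_empty() {
                    self.blocks.remove(h);
                }
            }
        }
    }

    /// Number of distinct active blocks; O(1) because the map tracks its length.
    fn active_blocks(&self) -> usize {
        self.blocks.len()
    }

    /// How many entries of `hashes` are not yet active on this worker.
    fn new_blocks(&self, hashes: &[SequenceHash]) -> usize {
        hashes.iter().filter(|h| !self.blocks.contains_key(*h)).count()
    }
}

/// Read, pick the best worker, and update under one lock, so the counts
/// used for the decision cannot go stale before the update lands.
/// Assumed cost: predicted active blocks after placing the request.
fn schedule(
    workers: &Mutex<Vec<ActiveBlocks>>,
    request: &str,
    hashes: &[SequenceHash],
) -> usize {
    let mut guard = workers.lock().unwrap();
    let best = (0..guard.len())
        .min_by_key(|&i| guard[i].active_blocks() + guard[i].new_blocks(hashes))
        .expect("at least one worker");
    guard[best].add_request(request, hashes);
    best
}
```

Because a block is only removed once its `HashSet` of request IDs is empty, requests sharing a prefix keep the shared blocks alive until the last of them is freed.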
The `KvPushRouter` has visibility into the output stream, so it is able to update the active blocks when an output token is generated, and free (or deref) the corresponding active blocks when the output stream is completed.

Turns out the performance is not bad at all! 8 x 8b model, L40S backend, 7000 ISL, 100 ISL, 10 prefix prompts of half ISL

Same as above but with varied ISL (introducing realistic randomness), where KV routing is expected to perform slightly better and it did

And with varied OSL (even more randomness)

(Note that the Python bindings for `KvRouter` are removed for now as they are not currently being used, and will be reworked / reintroduced in future PRs)

Closes #1723