feat: predictive active blocks for routing without load metrics#1731

Merged
PeaBrane merged 47 commits into main from rupei/router-predictive
Jul 8, 2025

Conversation

Contributor

@PeaBrane PeaBrane commented Jul 2, 2025

Overview:

  1. The core is really a simple data structure, HashMap<SequenceHash, HashSet<RequestId>>, storing the active blocks. This makes all the operations we need O(1), namely reading and writing the number of active blocks. It extends naturally to multiple workers by letting each worker have one such map per OS thread, with the various read / write requests performed via channels.

  2. This data structure is held locked throughout the read, best-worker computation, and update cycle during scheduling. This avoids race conditions and staleness, and empirically it is tested to give considerably better results.

  3. The KvPushRouter has visibility into the output stream, so it is able to update the active blocks when an output token is generated, and free (or deref) the corresponding active blocks when the output stream is completed.
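The points above can be sketched as a small Rust type. This is a minimal illustration, not the PR's actual implementation (the real types live in lib/llm/src/kv_router/sequence.rs); the names `ActiveBlocks`, `add`, `remove`, and `num_active_blocks` are hypothetical:

```rust
use std::collections::{HashMap, HashSet};

type SequenceHash = u64;
type RequestId = u64;

/// Tracks which requests currently hold each KV block.
/// Hypothetical sketch of the HashMap<SequenceHash, HashSet<RequestId>> idea.
#[derive(Default)]
struct ActiveBlocks {
    blocks: HashMap<SequenceHash, HashSet<RequestId>>,
}

impl ActiveBlocks {
    /// O(1) amortized: register a request as holding a block.
    fn add(&mut self, block: SequenceHash, request: RequestId) {
        self.blocks.entry(block).or_default().insert(request);
    }

    /// O(1) amortized: drop a request's reference to a block,
    /// freeing the block once no request holds it.
    fn remove(&mut self, block: SequenceHash, request: RequestId) {
        if let Some(holders) = self.blocks.get_mut(&block) {
            holders.remove(&request);
            if holders.is_empty() {
                self.blocks.remove(&block);
            }
        }
    }

    /// O(1): number of distinct active blocks, i.e. the predicted load.
    fn num_active_blocks(&self) -> usize {
        self.blocks.len()
    }
}
```

In the multi-worker setting described above, one such map would be kept per worker and mutated through channel messages, with the whole structure locked during each schedule cycle.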

Turns out the performance is not bad at all! 8 x 8B model, L40S backend, 7000 ISL, 100 OSL, 10 prefix prompts of half the ISL:
[figure: predictive_unnormalized_waiting]

Same as above but with varied ISL (introducing realistic randomness), where KV routing is expected to perform slightly better, and it did:
[figure: varied_isl]

And with varied OSL (even more randomness):
[figure: osl_varied]

(Note that the Python bindings for KvRouter are removed for now, as they are not currently being used; they will be reworked / reintroduced in future PRs.)

Closes #1723

Summary by CodeRabbit

  • New Features

    • Added advanced context-aware scheduling and token tracking for request streams, improving resource management and efficiency.
    • Introduced a new configuration option for controlling worker selection randomness via a "temperature" setting.
    • Enhanced metrics with new predictive load tracking and improved endpoint collection and filtering.
  • Refactor

    • Simplified and updated scheduling logic, consolidating configuration parameters and improving concurrency safety.
    • Deprecated several legacy Prometheus metrics and streamlined update logic for active KV blocks.
  • Bug Fixes

    • Improved handling of worker selection to support deterministic behavior when randomness is disabled.

Contributor

@alec-flowers alec-flowers left a comment


I think another comparison point should be to compare against SGLang's Rust router. They also do a form of predictive routing.

In their benchmark they dump 1k requests all at once, and our old Python router couldn't handle that load. I would be interested to see how we do in such a scenario now with this predictive router + KV cache events.

I assume we will actually still not do super great because the KV Cache Events are going to be delayed. With such a bursty test the best thing that you can do is have good prediction.

@PeaBrane
Contributor Author

PeaBrane commented Jul 8, 2025

@alec-flowers we can set the overlap weight to 0 for super bursty patterns
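To see why an overlap weight of 0 recovers pure load balancing, consider a hypothetical worker-selection score (this specific formula and the names `score` / `best_worker` are illustrative assumptions, not the router's actual code):

```rust
/// Hypothetical worker-selection score: higher is better.
/// Assumption: score = overlap_weight * overlap_blocks - active_blocks.
/// With overlap_weight = 0 this reduces to picking the worker with
/// the fewest predicted active blocks, i.e. pure load balancing.
fn score(overlap_weight: f64, overlap_blocks: usize, active_blocks: usize) -> f64 {
    overlap_weight * overlap_blocks as f64 - active_blocks as f64
}

/// Pick the index of the best worker, where each worker is described
/// by (overlap_blocks, active_blocks).
fn best_worker(overlap_weight: f64, workers: &[(usize, usize)]) -> usize {
    workers
        .iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| {
            score(overlap_weight, a.0, a.1).total_cmp(&score(overlap_weight, b.0, b.1))
        })
        .map(|(i, _)| i)
        .unwrap()
}
```

Under a bursty all-at-once load, cache-overlap predictions matter less than spreading work evenly, which is why zeroing the overlap term can help.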

@PeaBrane
Contributor Author

PeaBrane commented Jul 8, 2025

@tedzhouhk yes, it should. It just recovers the pure load-balancing behavior. Can just set the weight to 0. I can quickly test.

> @PeaBrane do we also need to inform sequence.rs whether kv reuse is enabled? the number of active blocks is different if kv reuse is enabled/disabled.

ah yes, adding this as a TODO for a future PR

@PeaBrane
Contributor Author

PeaBrane commented Jul 8, 2025

Looks like the relative TTFT is better under a smaller concurrency of 10 instead of 20, but the ITL is worse, suggesting that concurrency may still play a major factor in router performance; having a decode queue can potentially close this gap.
[figure: conc_10]



Development

Successfully merging this pull request may close these issues.

[FEATURE]: An Accurate, No-Change in Framework, and Router-Engine Communication-Free Method to Approximate Active KV Blocks in KV-Aware Router

4 participants