feat: predictive active blocks for routing without load metrics#1731

Merged
PeaBrane merged 47 commits into main from rupei/router-predictive
Jul 8, 2025

Conversation

Contributor

@PeaBrane PeaBrane commented Jul 2, 2025

Overview:

  1. The core is really a simple data structure, HashMap<SequenceHash, HashSet<RequestId>>, storing the active blocks. This makes all the operations we need O(1), namely reading and writing the number of active blocks. It extends naturally to multiple workers by letting each worker have one such map per OS thread, with the various read / write requests performed via channels.

  2. This data structure is held locked throughout the read, best-worker computation, and update cycle during scheduling. This avoids race conditions and staleness, and empirically it is tested to give considerably better results.

  3. The KvPushRouter has visibility into the output stream, so it is able to update the active blocks when an output token is generated, and free (or deref) the corresponding active blocks when the output stream is completed.
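The points above can be sketched as a small Rust type. This is a minimal illustration, not the PR's actual implementation (the real types live in lib/llm/src/kv_router/sequence.rs); the names `ActiveBlocks`, `add`, `remove`, and `num_active_blocks` are hypothetical:

```rust
use std::collections::{HashMap, HashSet};

type SequenceHash = u64;
type RequestId = u64;

/// Tracks which requests currently hold each KV block.
/// Hypothetical sketch of the HashMap<SequenceHash, HashSet<RequestId>> idea.
#[derive(Default)]
struct ActiveBlocks {
    blocks: HashMap<SequenceHash, HashSet<RequestId>>,
}

impl ActiveBlocks {
    /// O(1) amortized: register a request as holding a block.
    fn add(&mut self, block: SequenceHash, request: RequestId) {
        self.blocks.entry(block).or_default().insert(request);
    }

    /// O(1) amortized: drop a request's reference to a block,
    /// freeing the block once no request holds it.
    fn remove(&mut self, block: SequenceHash, request: RequestId) {
        if let Some(holders) = self.blocks.get_mut(&block) {
            holders.remove(&request);
            if holders.is_empty() {
                self.blocks.remove(&block);
            }
        }
    }

    /// O(1): number of distinct active blocks, i.e. the predicted load.
    fn num_active_blocks(&self) -> usize {
        self.blocks.len()
    }
}
```

In the multi-worker setting described above, one such map would be kept per worker and mutated through channel messages, with the whole structure locked during each schedule cycle.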

Turns out the performance is not bad at all! 8 x 8B model, L40S backend, 7000 ISL, 100 OSL, 10 prefix prompts of half the ISL:
[figure: predictive_unnormalized_waiting]

Same as above but with varied ISL (introducing realistic randomness), where KV routing is expected to perform slightly better, and it did:
[figure: varied_isl]

And with varied OSL (even more randomness):
[figure: osl_varied]

(Note that the Python bindings for KvRouter are removed for now, as they are not currently being used; they will be reworked / reintroduced in future PRs.)

Closes #1723

Summary by CodeRabbit

  • New Features

    • Added advanced context-aware scheduling and token tracking for request streams, improving resource management and efficiency.
    • Introduced a new configuration option for controlling worker selection randomness via a "temperature" setting.
    • Enhanced metrics with new predictive load tracking and improved endpoint collection and filtering.
  • Refactor

    • Simplified and updated scheduling logic, consolidating configuration parameters and improving concurrency safety.
    • Deprecated several legacy Prometheus metrics and streamlined update logic for active KV blocks.
  • Bug Fixes

    • Improved handling of worker selection to support deterministic behavior when randomness is disabled.

Contributor

@alec-flowers alec-flowers left a comment


I think another comparison point should be to compare against SGLang's Rust router. They also do a form of predictive routing.

In their benchmark they dump 1k requests all at once, and our old Python router couldn't handle that load. I would be interested to see how we do in such a scenario now with this predictive router + KV cache events.

I assume we will actually still not do super great because the KV Cache Events are going to be delayed. With such a bursty test the best thing that you can do is have good prediction.

@PeaBrane
Contributor Author

PeaBrane commented Jul 8, 2025

@alec-flowers we can set the overlap weight to 0 for super bursty patterns
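To see why an overlap weight of 0 recovers pure load balancing, consider a hypothetical worker-selection score (this specific formula and the names `score` / `best_worker` are illustrative assumptions, not the router's actual code):

```rust
/// Hypothetical worker-selection score: higher is better.
/// Assumption: score = overlap_weight * overlap_blocks - active_blocks.
/// With overlap_weight = 0 this reduces to picking the worker with
/// the fewest predicted active blocks, i.e. pure load balancing.
fn score(overlap_weight: f64, overlap_blocks: usize, active_blocks: usize) -> f64 {
    overlap_weight * overlap_blocks as f64 - active_blocks as f64
}

/// Pick the index of the best worker, where each worker is described
/// by (overlap_blocks, active_blocks).
fn best_worker(overlap_weight: f64, workers: &[(usize, usize)]) -> usize {
    workers
        .iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| {
            score(overlap_weight, a.0, a.1).total_cmp(&score(overlap_weight, b.0, b.1))
        })
        .map(|(i, _)| i)
        .unwrap()
}
```

Under a bursty all-at-once load, cache-overlap predictions matter less than spreading work evenly, which is why zeroing the overlap term can help.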

@PeaBrane
Contributor Author

PeaBrane commented Jul 8, 2025

@tedzhouhk yes, it should. It just recovers the pure load-balancing behavior. Can just set the weight to 0. I can quickly test.

> @PeaBrane do we also need to inform sequence.rs whether kv reuse is enabled? the number of active blocks is different if kv reuse is enabled/disabled.

ah yes, adding this as a TODO for a future PR

@PeaBrane
Contributor Author

PeaBrane commented Jul 8, 2025

Looks like the relative TTFT is better under a smaller concurrency of 10 instead of 20, but the ITL is worse, suggesting that concurrency may still play a major factor in router performance; having a decode queue can potentially close this gap.
[figure: conc_10]



Development

Successfully merging this pull request may close these issues.

[FEATURE]: An Accurate, No-Change in Framework, and Router-Engine Communication-Free Method to Approximate Active KV Blocks in KV-Aware Router

4 participants