Conversation
Signed-off-by: Shangming Cai <csmthu@gmail.com>
/rerun-stage stage-c-test-8-gpu-h20
Summary of Changes (Gemini Code Assist): This pull request refactors the handling of parallel information and data parallel (DP) rank resolution within the disaggregation system. The changes remove redundant checks and centralize DP rank resolution for pending requests, leading to a cleaner and potentially more efficient request processing flow.
Activity
✅ Triggered
Code Review
This pull request refactors the logic for resolving data parallelism (DP) ranks and ensuring parallel information in the disaggregated prefill/decode system. The changes centralize the DP rank resolution logic in the `_resolve_pending_reqs` method to avoid redundant calls, and move the `ensure_parallel_info` call to an earlier stage. These changes improve code clarity and maintainability. The implementation appears correct and aligns with the stated goals.
`add()` now does a cache-only lookup (fast path). All network I/O (`try_ensure_parallel_info` + `query_prefill_dp_ranks`) happens in `_resolve_pending_reqs()` during the scheduling cycle, with retry tracking and abort after max attempts.
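The split described above can be sketched as follows. This is an illustrative stand-in, not the actual sglang implementation: the class, field names, and request shape are assumptions; only the control flow (cache-only `add()`, deferred resolution via a pending queue) mirrors the PR description.

```python
# Minimal sketch (assumed, simplified) of the cache-only add() fast path:
# add() never performs network I/O; on a cache miss the request is parked
# in pending_reqs and resolved later during the scheduling cycle.

class DecodeQueueSketch:
    """Hypothetical decode-side queue; names mirror the PR description
    (add, pending_reqs) but the bodies are purely illustrative."""

    def __init__(self, prefill_info_cache):
        self.prefill_info_cache = prefill_info_cache  # bootstrap_addr -> info
        self.pending_reqs = []   # requests awaiting ensure/resolve
        self.ready_reqs = []     # requests with prefill info and a dp_rank

    def add(self, req):
        # Fast path: pure cache lookup -- no blocking, no HTTP.
        info = self.prefill_info_cache.get(req["bootstrap_addr"])
        if info is not None and req.get("dp_rank") is not None:
            self.ready_reqs.append(req)    # cache hit: enqueue immediately
        else:
            self.pending_reqs.append(req)  # cache miss: resolve next cycle

q = DecodeQueueSketch(prefill_info_cache={"10.0.0.1:8998": {"dp_size": 2}})
q.add({"bootstrap_addr": "10.0.0.1:8998", "dp_rank": 0})  # hit -> ready
q.add({"bootstrap_addr": "10.0.0.2:8998"})                # miss -> pending
```

The point of the design is that `add()` stays O(1) and non-blocking regardless of prefill-server availability; all failure handling lives in one place.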
Use `try_ensure_parallel_info` in pending queue, consolidate rank mapping into `PrefillServerInfo`
add(req)
│
┌───────────────┴───────────────┐
│ fake transfer? │ cache miss
│ prefill info cached │
│ + dp_rank known? │
▼ ▼
fast path pending_reqs ◄───────────────┐
(cache hit) │ │
│ │ every scheduling │
│ │ cycle │
│ ▼ │
│ ┌──── Pass 1: ensure prefill info ─────┐ │
│ │ (per bootstrap_addr) │ │
│ │ │ │
│ ▼ ▼ │
│ try_ensure_parallel_info fetch failed │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ cached? │── yes ─────► │ count++ │ │
│ │ │ │ │ │
│ │ fetch │── ok ───────► │ rank mapping │ │
│ │ │ │ + ready │ │
│ │ │ │ │ │
│ │ │ │ < 30 ? │ │
│ │ │ │ yes ───────┼──► remaining
│ │ │ │ no ───────┼──► ABORT
│ └──────────────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ Pass 2: resolve dp rank │
│ (per req in ready_addrs) │
│ │ │
│ ├─ dp_rank preset? ─────► resolved │
│ ├─ dp_size == 1? ─────► resolved │
│ ├─ follow_bootstrap? ─────► resolved │
│ │ │
│ ▼ (none of above) │
│ batch query dp_ranks (HTTP) │
│ │ │
│ ├─ rank found ─────► resolved │
│ └─ not found ─────► remaining ───────┘
│ (next cycle)
│
└──────────────┬──────────────┘
│ resolved
▼
create receiver & enqueue
│
▼
handshake → prealloc → transfer → decode

(charts generated with Claude)
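The two-pass cycle in the diagram above can be expressed as a small sketch. All names here (`resolve_pending`, `fetch_parallel_info`, `query_dp_ranks`, `MAX_ENSURE_ATTEMPTS`) are hypothetical; only the control flow (single non-blocking fetch per address, retry counter with abort at 30, dp-rank resolution with an HTTP fallback) follows the PR description.

```python
# Illustrative sketch of the two-pass _resolve_pending_reqs() cycle.

MAX_ENSURE_ATTEMPTS = 30  # abort a bootstrap addr after 30 failed cycles

def resolve_pending(pending, cache, attempts, fetch_parallel_info, query_dp_ranks):
    aborted, resolved, remaining = [], [], []

    # Pass 1: ensure prefill parallel info, once per bootstrap addr.
    for addr in {r["addr"] for r in pending}:
        if addr in cache:
            continue  # already cached from an earlier cycle
        info = fetch_parallel_info(addr)  # single non-blocking attempt
        if info is not None:
            cache[addr] = info            # addr becomes "ready"
        else:
            attempts[addr] = attempts.get(addr, 0) + 1
            if attempts[addr] >= MAX_ENSURE_ATTEMPTS:
                aborted.extend(r for r in pending if r["addr"] == addr)

    # Pass 2: resolve dp rank for requests whose addr is ready.
    for req in pending:
        if req in aborted or req["addr"] not in cache:
            if req not in aborted:
                remaining.append(req)     # retry next scheduling cycle
            continue
        if req.get("dp_rank") is not None or cache[req["addr"]]["dp_size"] == 1:
            resolved.append(req)          # preset rank or trivial dp_size
        else:
            rank = query_dp_ranks(req)    # batched HTTP fallback in the PR
            (resolved if rank is not None else remaining).append(req)
    return resolved, remaining, aborted

cache, attempts = {}, {}
pending = [{"addr": "a", "dp_rank": 0}, {"addr": "b"}]
resolved, remaining, aborted = resolve_pending(
    pending, cache, attempts,
    fetch_parallel_info=lambda addr: {"dp_size": 2} if addr == "a" else None,
    query_dp_ranks=lambda req: None,
)
```

In the driver above, address `a` is fetched and its request resolves via the preset `dp_rank`, while address `b` fails its fetch and stays in `remaining` for the next cycle, with its retry counter incremented.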
…nsolidate rank mapping into `PrefillServerInfo` (sgl-project#20785) Signed-off-by: Shangming Cai <csmthu@gmail.com> Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Summary

- `add()` becomes a pure cache lookup: no network calls, no blocking
- `try_ensure_parallel_info()` is only called in `_resolve_pending_reqs()` during the scheduling cycle, with retry tracking and abort after max attempts
- Rank mapping is computed once by `CommonKVManager._resolve_rank_mapping()` and stored on `PrefillServerInfo`, instead of recomputed per-receiver

Background

- `add()` previously called `ensure_parallel_info()`, which blocks up to 30s (5 retries × 1s sleep + 5s HTTP timeout)
- Rank mapping was computed in both the `ensure_parallel_info` path and `CommonKVReceiver.__init__()`, with identical results

Changes

Non-blocking ensure (`conn.py`)

- Replaced `ensure_parallel_info()` (retry loop + `time.sleep`) with `try_ensure_parallel_info()`: single attempt, no sleep

Pending queue as the single ensure site (`decode.py`)

- `add()` fast path: cache-only `_resolve_prefill_dp_rank()`, then enqueue or append to `pending_reqs`
- `_resolve_pending_reqs()` is the only place that calls `try_ensure_parallel_info()`, split into two passes:
  - `_ensure_prefill_info()`: per-addr non-blocking fetch with a retry counter, abort after 30 cycles
  - dp-rank resolution with a `query_prefill_dp_ranks` fallback

Unified rank mapping (`conn.py`)

- `_resolve_rank_mapping()` on `CommonKVManager` computes the TP/CP/PP mapping once and stores it on `PrefillServerInfo`
- `CommonKVReceiver.__init__()` now reads the pre-computed mapping instead of recomputing it
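The "compute once, store on `PrefillServerInfo`" idea can be illustrated with a toy sketch. The dataclass fields, the mapping rule, and `resolve_rank_mapping` are assumptions for illustration only; the real `_resolve_rank_mapping()` computes a TP/CP/PP mapping whose details are not shown in this PR description.

```python
# Sketch: derive the rank mapping one time per prefill server and cache it
# on the info object, so each receiver reads it instead of recomputing.

from dataclasses import dataclass, field

@dataclass
class PrefillServerInfoSketch:
    """Hypothetical stand-in for PrefillServerInfo."""
    tp_size: int
    dp_size: int
    # decode tp rank -> prefill tp rank; filled in once by the manager
    rank_mapping: dict = field(default_factory=dict)

def resolve_rank_mapping(info, decode_tp_size):
    # Toy rule: decode rank d reads from prefill rank
    # d * prefill_tp // decode_tp (a simple contiguous split).
    info.rank_mapping = {
        d: d * info.tp_size // decode_tp_size for d in range(decode_tp_size)
    }
    return info

info = resolve_rank_mapping(
    PrefillServerInfoSketch(tp_size=4, dp_size=1), decode_tp_size=2
)
# Receivers now just read info.rank_mapping in __init__.
```

Since every receiver for the same prefill server would derive an identical mapping (as the Background notes), hoisting the computation into the manager removes duplicated work without changing behavior.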