feat: vllm engine tensor parallel and pipeline parallel #16

Merged
grahamking merged 3 commits into main from gk-vllm-multi
Mar 4, 2025

@grahamking (Contributor)

Needs more testing but good enough for now. I get the same results with this as with `vllm serve`.

@github-actions (bot, Contributor) commented Mar 4, 2025

Test Results

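2 files · 2 suites · 52s ⏱️

|       | Total | ✅ Passed | 💤 Skipped | ❌ Failed |
|-------|------:|---------:|----------:|---------:|
| Tests |    77 |       77 |         0 |        0 |
| Runs  |    99 |       98 |         1 |        0 |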

Results for commit 04bca65.

♻️ This comment has been updated with latest results.

Comment thread on `launch/tio/README.md` (outdated):
Needs more testing but good enough for now. I get the same results with
this as with `vllm serve`.

@biswapanda (Contributor) left a comment

Lgtm

@rmccorm4 (Contributor) left a comment

Nice 🚀

So if I had 16 GPUs on 2x8GPU nodes, I could specify TP 16 on head node, omit TP on worker node, and the 8+8 split would be figured out internally?

@grahamking (Contributor, Author)

> Nice 🚀
>
> So if I had 16 GPUs on 2x8GPU nodes, I could specify TP 16 on head node, omit TP on worker node, and the 8+8 split would be figured out internally?

With vLLM you do TP 8, PP 2 for 2x8. World size is TP × PP (8 × 2 = 16).

https://github.com/vllm-project/vllm/blob/ae122b1cbde96c871fb74611363e04eecfbcce03/docs/source/serving/distributed_serving.md#running-vllm-on-multiple-nodes

With SGLang you do what you suggested: TP 16, nodes 2, and it divides TP by num_nodes (16 / 2 = 8 GPUs per node). The two projects are both very similar and very different at the same time.
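The GPU arithmetic in this exchange can be sketched as follows. This is a minimal Python illustration that assumes only what the comment states plus vLLM's documented multi-node convention (TP = GPUs per node, PP = number of nodes); `Layout`, `vllm_layout`, and `sglang_layout` are hypothetical helpers, not part of vLLM, SGLang, or this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    world_size: int     # total GPUs across all nodes
    nodes: int          # number of nodes
    gpus_per_node: int  # GPUs used on each node


def vllm_layout(tp: int, pp: int, gpus_per_node: int) -> Layout:
    """Hypothetical helper: vLLM's world size is TP * PP.

    Assumes every node contributes gpus_per_node GPUs, matching the
    common practice of TP within a node and PP across nodes.
    """
    world = tp * pp
    if world % gpus_per_node:
        raise ValueError(f"world size {world} is not a multiple of {gpus_per_node}")
    return Layout(world, world // gpus_per_node, gpus_per_node)


def sglang_layout(tp: int, nnodes: int) -> Layout:
    """Hypothetical helper: SGLang takes the full TP size and divides it by nnodes."""
    if tp % nnodes:
        raise ValueError(f"TP {tp} is not divisible by {nnodes} nodes")
    return Layout(tp, nnodes, tp // nnodes)


print(vllm_layout(tp=8, pp=2, gpus_per_node=8))  # Layout(world_size=16, nodes=2, gpus_per_node=8)
print(sglang_layout(tp=16, nnodes=2))            # Layout(world_size=16, nodes=2, gpus_per_node=8)
```

Both calls describe the same 2x8 cluster: vLLM reaches 16 GPUs as 8 × 2 (TP × PP), while SGLang is given the full TP 16 and splits it over 2 nodes.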

@grahamking deleted the gk-vllm-multi branch March 4, 2025 23:47
kylehh pushed a commit to kylehh/dynamo that referenced this pull request Apr 11, 2025
Needs more testing but good enough for now. I get the same results with this as with `vllm serve`.
ShounakRay added a commit to ShounakRay/fuzzy-dynamo that referenced this pull request Mar 20, 2026
…ers)

LaTeX book covering transformer inference, KV caching, distributed
serving, Dynamo architecture, KV-aware routing, token blocks, network
codecs, fuzzing methodology, and 16 bugs found across 36 fuzz targets.

All bug references use ch10 canonical numbering (ai-dynamo#1-ai-dynamo#16), all cross-
references resolve, all arithmetic verified. Includes TikZ diagrams,
whynotbox pedagogical environments, and marginnotes throughout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ShounakRay added a commit to ShounakRay/fuzzy-dynamo that referenced this pull request Mar 20, 2026
…iteups for 14 open bugs

- Reformat all 16 issue markdowns to match GitHub bug template (Describe the Bug,
  Steps to Reproduce, Expected/Actual Behavior, Environment, Additional Context)
- Add upstream/ directory with per-bug subfolders containing issue.md, pr.md, and
  discovery.md for each of the 16 bugs found via fuzzing
- Add fixes/ directory with annotated patch files and regression tests for all 14
  open bugs
- Create 14 minimal fix branches (fix/*-{bug#}) off upstream/main, each with a
  single commit touching only the affected file
- Verified all 14 bugs still present in upstream/main as of 50af343
- Confirmed 2 bugs already fixed upstream (ai-dynamo#15 RadixTree scoring, ai-dynamo#16 TwoPartCodec overflow)
ryanolson added a commit that referenced this pull request May 2, 2026
…(R-B Slice 3)

New `ConditionalDisaggCoordinator` in `disagg/coordinator/coordinator.rs`
holds the unified per-request state map (`DashMap<String, Arc<CdRequest>>`)
and ports the prefill flow off `PrefillCoordinatorImpl` against the
canonical types defined in Slice 2.

Public API mirrors `PrefillCoordinatorImpl` so tests swap by
constructor only:
- `new` / `new_with_watchdog`
- `observer` / `observer_callback`
- `active_count` / `status_for` / `has_active_request`
- `commit_output_blocks`
- `cleanup_failed_request` (single canonical sink — uses
  `CdRequest::failed_g1_block_ids()` from Slice 2)
- `PrefillCoordinator` trait impl (`ensure_started`, `on_usaa`,
  `observe_forward`, `on_request_finished`)

Lifecycle watcher uses `spawn_lifecycle_watcher` (the helper
from Slice 1).  `on_request_finished`'s observer-residual
finalize-deferral is preserved verbatim, including the 10s
watchdog that prevents the `commit_output_blocks` ↔
`session.finalize()` race (learning #16).

Decode-role bits are present on `CdRequest` but unreachable
through this slice's code paths — Slice 4 wires them.

Coexistence (Slice 3 only): `PrefillCoordinatorImpl` stays alive
because production wiring (`prefill_leader.rs`, `init.rs`,
`unified.rs`) still references it; `cd_loopback`, `cd_decode_e2e`,
`cd_bidirectional_e2e`, and the unified suites also still
construct it.  Slice 5 cuts production over; Slice 7 deletes it.

Deviation: `ConditionalDecodeG2Observer::new()` widened from
private to `pub` so the new coordinator (sibling module) can
construct one.  No behavior change.

Tests:
- cd_prefill_e2e: harness swapped to ConditionalDisaggCoordinator;
  all 11 tests pass against the new coord.
- lib unit: 197 → 198 (new `size_of_is_nonzero` constructor probe).
- All other suites unchanged.
- Total CD-scope: 229 active tests passing.
kangclzjc added a commit to kangclzjc/dynamo that referenced this pull request Jun 4, 2026
/ai-dynamo#8/ai-dynamo#15/ai-dynamo#16)

Test-only additions for the seams the review flagged as untested.

ai-dynamo#7 _project_scale_to with a real apply outcome (4 cases):
  both components changed → full ScalingDecision; both equal current →
  None (PSM-equivalent no-change); single-component proposal → other count
  stays None; non-apply execute_action → None. Previously every adapter.tick
  test hit only the None/empty path, so a regression in the projection /
  no-change detection would have shipped silently.

ai-dynamo#8 _tick_input_to_context + FPM encoding:
  build a TickInput with traffic (incl kv_hit_rate), worker counts (incl
  scaling-in-progress flags), and a real ForwardPassMetrics; assert the
  PipelineContext.observations mapping and that the FPM bytes decode back
  (key format "<worker_id>/<dp_rank>", canonical encoder). This is the
  ingress glue where the add_observations P1 + the projection live.

ai-dynamo#15 registry mutation during an in-flight tick:
  suspend a PROPOSE plugin mid-gather (asyncio.Event), register a new
  plugin while suspended, release, and assert the late plugin did NOT join
  the in-flight stage (pre-tick snapshot) and the tick completed cleanly —
  then a fresh tick picks it up. Exercises the no-locks invariant that
  scheduler.py/server.py document but no test covered.

ai-dynamo#16 test_tick_diagnostics_extended scope note:
  clarify in the module docstring that plugin_overrides / reconcile_reasons
  / held_over_plugins have no production populator in this PR; these tests
  lock the dataclass contract (defaults / no shared-mutable aliasing), not
  live behavior.

835 planner tests pass (+6).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
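The registry-mutation test (ai-dynamo#15 in the commit above) exercises a lock-free invariant: each tick works from a pre-tick snapshot of the plugin registry. A minimal Python/asyncio sketch of that invariant and of the suspend/register/release test pattern, assuming a registry keyed by plugin name; `Registry`, `tick`, `slow`, and `late` are hypothetical names, not the planner's actual scheduler.py/server.py code.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

Plugin = Callable[[], Awaitable[str]]


class Registry:
    """Hypothetical plugin registry with no locks; relies on a pre-tick snapshot."""

    def __init__(self) -> None:
        self.plugins: Dict[str, Plugin] = {}

    def register(self, name: str, plugin: Plugin) -> None:
        self.plugins[name] = plugin

    async def tick(self) -> Dict[str, str]:
        # Snapshot before awaiting: plugins registered while this tick is
        # in flight do not join it; the next tick picks them up.
        snapshot = dict(self.plugins)
        results = await asyncio.gather(*(p() for p in snapshot.values()))
        return dict(zip(snapshot.keys(), results))


async def demo() -> Tuple[Dict[str, str], Dict[str, str]]:
    registry = Registry()
    started = asyncio.Event()
    release = asyncio.Event()

    async def slow() -> str:
        started.set()
        await release.wait()  # suspend this plugin mid-gather
        return "slow"

    async def late() -> str:
        return "late"

    registry.register("slow", slow)
    in_flight = asyncio.create_task(registry.tick())
    await started.wait()             # the tick is now suspended inside gather
    registry.register("late", late)  # mutate the registry mid-tick
    release.set()
    first = await in_flight          # completes without the late plugin
    second = await registry.tick()   # a fresh tick includes the late plugin
    return first, second


first, second = asyncio.run(demo())
print(first)   # {'slow': 'slow'}
print(second)  # {'slow': 'slow', 'late': 'late'}
```

Taking the snapshot once per tick keeps every stage of that tick consistent with a single view of the registry, which is what lets the scheduler avoid locks in a single-threaded event loop.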

Labels

enhancement (New feature or request)

3 participants