feat: OAI compatible endpoints for TRTLLM by NVShreyas · Pull Request #14 · ai-dynamo/dynamo

NVShreyas · 2025-03-04T19:16:32Z

What does the PR do?

Adds OAI compatible HTTP endpoints to the TRTLLM example.
- Integrates chat completions from Tanmay's branch.
Refactor some percent of duplicate code
Refactor input of LLMAPI to make it more extensible

…sm/trtllm-http

rmccorm4

Nice work 🚀

Co-authored-by: Tanmay Verma <tanmayv@nvidia.com> Co-authored-by: Tanmay Verma <tanmayv@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Tanmay Verma <tanmay2592@gmail.com> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

Adds power-aware intelligence to the Dynamo planner across three layers, plus tooling that pipecleans the Phase 4 selection path with currently available single-TDP AIC data. Reimplementation of unmerged PR #5280 against post-refactor ToT (planner/utils/ no longer exists; configuration is Pydantic; layout is config/, connectors/, core/, monitoring/). Full design: docs/design-docs/powerplanner-design.md Repro / dev environment: docs/components/planner/dpp-dev-env.md What's in this PR ----------------- Phase 1 — infrastructure - Power Agent DaemonSet: components/power_agent/{power_agent.py,tests/} Watches pod annotations, maps cgroup→pod_uid→GPU via NVML, calls nvmlDeviceSetPowerManagementLimit on a 15 s reconciliation loop. Multi-pod-per-GPU policy, fail-closed safe-default, SIGTERM restore. - Pod annotation actuation: connectors/kubernetes.py writes dynamo.nvidia.com/gpu-power-limit on worker pods. - K8s manifests: deploy/power_agent/{daemonset,rbac}.yaml. - Operator chart fix: ClusterRole now grants `pods` permissions (deploy/helm/charts/platform/components/operator/templates/planner.yaml, chart bumped to 1.2.1). Older clusters can apply deploy/planner-pod-rbac-dev.yaml as a temporary patch. Phase 2 — budget enforcement - _apply_power_budget() in core/state_machine.py clamps replica scaling within a watt budget. New PlannerConfig fields: enable_power_awareness, total_gpu_power_limit (required when awareness enabled — validator enforces, no silent default), prefill/decode_engine_gpu_power_limit, power_agent_safe_default_watts. Phase 3 — AIC optimizer - monitoring/aic_power_optimizer.py: AIConfigurator sweep picks (replica, power) configs that maximise throughput within the watt budget. Per-component EMA correction (c_ttft, c_itl, c_power_p, c_power_d for disagg; c_power_agg for agg) closes the loop on silicon-vs-AIC drift. SLA gate, budget gate, capacity-exceeded re-optimize, auto-disable on consecutive failures. - core/base.py: NativePlannerBase plumbing (admission threshold push, named-port resolution, AIC integration). Phase 4 — pipeclean (using existing single-TDP data) - tools/integrate_aic_power_data.py: copies measured H200/B200 power data into an AIC checkout and reinstalls. - tools/validate_aic_power_integration.py: asserts estimate_perf() returns power_w in [100, 710] W (not 0.0, not 700.0 TDP fallback). - tools/compute_power_budget.py: rack-capacity → safe-budget helper. - examples/deployments/powerplanner/: PIPECLEAN.md runbook, disagg-{power-aware,conservative-cold-start}.yaml, verify_poweraware.bash (8-section health check inc. Phase 4 preview), MULTI_DGD.md operator playbook, README.md. Bug fixes (found by live integration tests, fixed in this patchset) ------------------------------------------------------------------- - Operator chart 1.2.0 ClusterRole had no `pods` rules → planner couldn't list workers or patch annotations. Fixed in chart template; workaround manifest provided for clusters not yet upgraded. - Stale label selectors in connectors/kubernetes.py: `dynamo-graph-deployment` → `dynamo-graph-deployment-name`, `dynamo-service` → `dynamo-component`/`dynamo-component-type`. - DCGM queries used `pod=~` (matched the DCGM exporter pod) and the wrong namespace; corrected to `exported_pod=~` + `exported_namespace=<k8s-namespace>` and pod regex now matches the operator's `<dgd>-<replica-idx>-<service-key>-...` format (monitoring/traffic_metrics.py). - Frontend-source metric methods returned NaN on quiet clusters (0/0 in `increase(_sum)/increase(_count)`); router path had a NaN guard, frontend didn't — added matching `math.isnan` filter. Tests ----- Verified 2026-05-10 on a clean from-scratch repro on a live Azure AKS dev cluster (qwen3-quickstart DGD, single dev namespace): Suite Pass Skip Fail ---------------------------------------------------- ---- ---- ---- planner/tests/unit/ 465 0 0 planner/tests/integration/test_aic_power_e2e_sim 15 0 0 planner/tests/integration/test_aic_power_optimizer 34 0 0 planner/tests/integration/test_metric_paths_live 22-23 3-2 0 planner/tests/integration/test_actuation_knobs_live 10-11 1-0 0 power_agent/tests/ 43 0 0 ---------------------------------------------------- ---- ---- ---- Total (cold deploy / after sanity chat completion) 590-591 4-3 0 Documented skips: - test_frontend_metric_series_exists — passes after any traffic. - TestDirectRouterMetricsClientLive::* — LocalRouter doesn't expose dynamo_frontend_worker_* gauges (open-question #14 in design doc). - TestScalingRealMutation — opt-in disruptive test, gated by RUN_DISRUPTIVE_TESTS=1 (passes when enabled). Repro recipe: docs/components/planner/dpp-dev-env.md "From-Scratch Repro Script" section. Compatibility ------------- Power awareness is opt-in (enable_power_awareness=False by default). AIC optimizer is opt-in (enable_aic_optimizer=False by default). No existing planner behavior changes when both flags are off. Phase 4 (multi-power-level AIC sweep) and Phase 5 (silicon validation) remain follow-up work; this PR delivers Phases 1–3 plus the Phase 4 pipeclean code path so it can be exercised before AIC ships the full sweep API. Signed-off-by: Kai Ma <kaim@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>

Adds power-aware intelligence to the Dynamo planner across three layers, plus tooling that pipecleans the Phase 4 selection path with currently available single-TDP AIC data. Reimplementation of unmerged PR #5280 against post-refactor ToT (planner/utils/ no longer exists; configuration is Pydantic; layout is config/, connectors/, core/, monitoring/). Full design: docs/design-docs/powerplanner-design.md Repro / dev environment: docs/components/planner/dpp-dev-env.md What's in this PR ----------------- Phase 1 — infrastructure - Power Agent DaemonSet: components/power_agent/{power_agent.py,tests/} Watches pod annotations, maps cgroup→pod_uid→GPU via NVML, calls nvmlDeviceSetPowerManagementLimit on a 15 s reconciliation loop. Multi-pod-per-GPU policy, fail-closed safe-default, SIGTERM restore. - Pod annotation actuation: connectors/kubernetes.py writes dynamo.nvidia.com/gpu-power-limit on worker pods. - K8s manifests: deploy/power_agent/{daemonset,rbac}.yaml. - Operator chart fix: ClusterRole now grants `pods` permissions (deploy/helm/charts/platform/components/operator/templates/planner.yaml, chart bumped to 1.2.1). Older clusters can apply deploy/planner-pod-rbac-dev.yaml as a temporary patch. Phase 2 — budget enforcement - _apply_power_budget() in core/state_machine.py clamps replica scaling within a watt budget. New PlannerConfig fields: enable_power_awareness, total_gpu_power_limit (required when awareness enabled — validator enforces, no silent default), prefill/decode_engine_gpu_power_limit, power_agent_safe_default_watts. Phase 3 — AIC optimizer - monitoring/aic_power_optimizer.py: AIConfigurator sweep picks (replica, power) configs that maximise throughput within the watt budget. Per-component EMA correction (c_ttft, c_itl, c_power_p, c_power_d for disagg; c_power_agg for agg) closes the loop on silicon-vs-AIC drift. SLA gate, budget gate, capacity-exceeded re-optimize, auto-disable on consecutive failures. - core/base.py: NativePlannerBase plumbing (admission threshold push, named-port resolution, AIC integration). Phase 4 — pipeclean (using existing single-TDP data) - tools/integrate_aic_power_data.py: copies measured H200/B200 power data into an AIC checkout and reinstalls. - tools/validate_aic_power_integration.py: asserts estimate_perf() returns power_w in [100, 710] W (not 0.0, not 700.0 TDP fallback). - tools/compute_power_budget.py: rack-capacity → safe-budget helper. - examples/deployments/powerplanner/: PIPECLEAN.md runbook, disagg-{power-aware,conservative-cold-start}.yaml, verify_poweraware.bash (8-section health check inc. Phase 4 preview), MULTI_DGD.md operator playbook, README.md. Bug fixes (found by live integration tests, fixed in this patchset) ------------------------------------------------------------------- - Operator chart 1.2.0 ClusterRole had no `pods` rules → planner couldn't list workers or patch annotations. Fixed in chart template; workaround manifest provided for clusters not yet upgraded. - Stale label selectors in connectors/kubernetes.py: `dynamo-graph-deployment` → `dynamo-graph-deployment-name`, `dynamo-service` → `dynamo-component`/`dynamo-component-type`. - DCGM queries used `pod=~` (matched the DCGM exporter pod) and the wrong namespace; corrected to `exported_pod=~` + `exported_namespace=<k8s-namespace>` and pod regex now matches the operator's `<dgd>-<replica-idx>-<service-key>-...` format (monitoring/traffic_metrics.py). - Frontend-source metric methods returned NaN on quiet clusters (0/0 in `increase(_sum)/increase(_count)`); router path had a NaN guard, frontend didn't — added matching `math.isnan` filter. Tests ----- Verified 2026-05-10 on a clean from-scratch repro on a live Azure AKS dev cluster (qwen3-quickstart DGD, single dev namespace): Suite Pass Skip Fail ---------------------------------------------------- ---- ---- ---- planner/tests/unit/ 465 0 0 planner/tests/integration/test_aic_power_e2e_sim 15 0 0 planner/tests/integration/test_aic_power_optimizer 34 0 0 planner/tests/integration/test_metric_paths_live 22-23 3-2 0 planner/tests/integration/test_actuation_knobs_live 10-11 1-0 0 power_agent/tests/ 43 0 0 ---------------------------------------------------- ---- ---- ---- Total (cold deploy / after sanity chat completion) 590-591 4-3 0 Documented skips: - test_frontend_metric_series_exists — passes after any traffic. - TestDirectRouterMetricsClientLive::* — LocalRouter doesn't expose dynamo_frontend_worker_* gauges (open-question #14 in design doc). - TestScalingRealMutation — opt-in disruptive test, gated by RUN_DISRUPTIVE_TESTS=1 (passes when enabled). Repro recipe: docs/components/planner/dpp-dev-env.md "From-Scratch Repro Script" section. Compatibility ------------- Power awareness is opt-in (enable_power_awareness=False by default). AIC optimizer is opt-in (enable_aic_optimizer=False by default). No existing planner behavior changes when both flags are off. Phase 4 (multi-power-level AIC sweep) and Phase 5 (silicon validation) remain follow-up work; this PR delivers Phases 1–3 plus the Phase 4 pipeclean code path so it can be exercised before AIC ships the full sweep API. Signed-off-by: Kai Ma <kaim@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Kai Ma <kaim@nvidia.com>

…verlay Replaces the 4+4 PD-disaggregated TP=4 DP=4 layout with the single-role TP=8 EP=8 decode-only topology that 001 actually runs. Per Spencer 2026-05-30: the infrastructure deploy script should match what we run on 001 (so production = 001 verbatim), with SMC-SD enabled as the production default. Changes vs the prior 4+4 PD manifest: - Drop prefill service entirely (no disaggregation; NIXL/UCX transport on 002 was the failure mode anyway) - Decode: TP=8 EP=8, single 8-GPU role - moe-runner-backend flashinfer_trtllm, moe-a2a-backend none - --dsa-indexer-mode indexcache-hisa + --dsa-indexer-quantization nvfp4_e2m1_ue8m0 + --hisa-min-seq-len 1 - --json-model-override-args (index_topk=1024, freq=4, hisa params) - --enable-flashinfer-allreduce-fusion (auto on for DSV3.2 + TP>1 + no DP + single-node + moe_a2a_backend=none) - --num-continuous-decode-steps 2 - Drop --enable-dp-attention (the forward_idle DP path was the source of the ai-dynamo#14 launch-config bug chain; TP-only on SM_100 avoids it entirely) - Drop --enable-hisparse (extra path cost; HISA already in stack) - Drop nsa-prefill/decode-backend (replaced by dsa-*) Optimization env vars (top-level spec.envs, fully armed): SGLANG_NSA_NVFP4_HISA=1 SGLANG_NSA_NVFP4_HISA_CAND_SCORE_WMMA=1 SGLANG_NSA_NVFP4_HISA_MEAN_POOL_PREDECODE=1 SGLANG_NSA_NVFP4_HISA_MEAN_POOL_TRANSP=1 SGLANG_NSA_NVFP4_HISA_MEAN_POOL_FMA2=1 SGLANG_NSA_NVFP4_HISA_COMPRESSION_RATIO=4.0 SGLANG_HIGGS_DSA_TRTLLM_FP8=1 (only fires when HIGGS enabled) SGLANG_HIGGS_DSA_TRTLLM_DEQUANT_STREAM=1 SGLANG_HIGGS_DSA_TRTLLM_PINGPONG=1 SGLANG_HIGGS_DSA_TRTLLM_DEDICATED_STREAM=1 SGLANG_USE_SGL_NVFP4_FUSED_RMSNORM=1 SGLANG_MOE_NVFP4_DISPATCH=1 SGLANG_ENABLE_JIT_DEEPGEMM=1 SGLANG_ENABLE_SPEC_V2=1 SGLANG_SMC_DRAFT_FP8_GEMM_BACKEND=cutlass SGLANG_SMC_TARGET_VERIFY_GRAPH=0 (and the NCCL / UCX / HF defaults) SMC-SD overlay (decode args): --speculative-algorithm SMC --speculative-draft-model-path /models/smcsd/GLM-4-9B-0414-FP8-DeepSeekV32-OMP --speculative-draft-model-quantization fp8 --speculative-draft-attention-backend triton --smc-draft-kv-cache-dtype fp8_e4m3 --smc-n-particles 4 --smc-gamma 6 --smc-draft-temperature 0.7 --smc-target-temperature 1.0 --smc-resample-threshold 0.5 --smc-resample-method systematic + smcsd-draft-model volume + mount For ablation (002 iteration testing): strip --speculative-* args + smcsd-draft-model volume/mount. Production default keeps SMC-SD on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Comment #14 from peilii: when the unified path moved the `build_sglang_logprob_kwargs` gate check from the legacy handler into the engine's `generate()`, the resulting `ValueError` was no longer guaranteed to surface as a 4xx invalid-request error. Root cause (static + unit-test reproduction): - `lib/bindings/python/rust/backend.rs::py_err_to_dynamo` maps a Python `ValueError`/`TypeError` to `DynamoError { error_type: Backend(BackendError::InvalidArgument) }` — under the `Backend()` wrapper, not the top-level `ErrorType::InvalidArgument`. - `lib/llm/src/http/service/openai.rs::find_dynamo_error_in_chain` used strict-equality matching, so the existing `ErrorType::InvalidArgument` check missed `Backend(InvalidArgument)`. The engine-side ValueError fell through to the catch-all 500 path. Reproduced as a unit test that, against the unpatched mapping, asserted `StatusCode 500` instead of `400`: thread '…test_backend_invalid_argument_surfaces_as_400' panicked assertion `left == right` failed left: 500 right: 400 Fix: add `find_invalid_argument_in_chain` that matches both `ErrorType::InvalidArgument` AND `ErrorType::Backend(BackendError::InvalidArgument)`. Both classes mean "the request was malformed" — whether validation tripped at the frontend or inside the engine is an implementation detail the client shouldn't have to care about. Regression test `test_backend_invalid_argument_surfaces_as_400` locks in the behavior. All 59 existing `http::service::openai` tests still pass; rustfmt + clippy clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

tanmayv25 and others added 30 commits February 27, 2025 12:41

OpenAI API integration in llmapi

2d316bf

WIP

a775d46

WIP

8ba3326

Allow extra fields

bcb3097

Monolithic

072bb52

Merge branch 'main' into shreyasm/trtllm-http

eb11241

WIP: disagg re-architect

6a959c4

Merge remote-tracking branch 'origin/tanmayv-trtllm-http' into shreya…

e55f4d5

…sm/trtllm-http

Use the correct DisaggParams

29a008e

Fix

928cd61

Return raw response

c662b61

working completions for monolith

dff65fb

Fix the output

55e6399

wip

490636f

Fixes for id and chat processor

89aee6c

more fixes

7edbf0b

Fix disagg

096aae6

Merge branch 'tanmayv-trtllm-http' into shreyasm/trtllm-http

1687e43

add chat compl for disagg

59cf62c

make input config more flexible

8d8d047

working completions agg

3e02b6b

Another fix

ea0622d

Disagg chat completion fix

c5d7fbf

formatting

87c3190

Add documentation

04e534d

working disagg completions

9e910ef

fix llm api config

4463122

changes for ds

ef0e60f

Merge branch 'tanmayv-trtllm-http' into shreyasm/trtllm-http

1287080

cleanup

6c3b9cc

make vars private

4f1da3c

NVShreyas temporarily deployed to GITLAB March 4, 2025 22:14 — with GitHub Actions Inactive

Merge branch 'main' into shreyasm/trtllm-http

12d53bd

NVShreyas temporarily deployed to GITLAB March 4, 2025 22:15 — with GitHub Actions Inactive

NVShreyas temporarily deployed to GITLAB March 4, 2025 22:25 — with GitHub Actions Inactive

fix completion object name

3f2db02

NVShreyas temporarily deployed to GITLAB March 4, 2025 23:40 — with GitHub Actions Inactive

Merge branch 'main' into shreyasm/trtllm-http

e0a3ad3

NVShreyas temporarily deployed to GITLAB March 4, 2025 23:40 — with GitHub Actions Inactive

NVShreyas temporarily deployed to GITLAB March 4, 2025 23:46 — with GitHub Actions Inactive

rmccorm4 approved these changes Mar 4, 2025

View reviewed changes

more neutral prompt

7c6488a

NVShreyas temporarily deployed to GITLAB March 5, 2025 00:11 — with GitHub Actions Inactive

Merge branch 'main' into shreyasm/trtllm-http

fe62e8b

NVShreyas temporarily deployed to GITLAB March 5, 2025 00:12 — with GitHub Actions Inactive

NVShreyas temporarily deployed to GITLAB March 5, 2025 00:13 — with GitHub Actions Inactive

GuanLuo approved these changes Mar 5, 2025

View reviewed changes

NVShreyas merged commit b7cca38 into main Mar 5, 2025

NVShreyas deleted the shreyasm/trtllm-http branch March 5, 2025 01:24

tanmayv25 mentioned this pull request Apr 15, 2026

DEP: Backend Interface -- LLMEngine ABC and Worker #8251

Open

kaim-eng mentioned this pull request May 10, 2026

feat(planner): power-aware planner — Phase 1+2+3 + Phase 4 pipeclean #9369

Open

kaim-eng mentioned this pull request May 18, 2026

feat(planner): AIC closed-loop optimizer #9685

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: OAI compatible endpoints for TRTLLM#14

feat: OAI compatible endpoints for TRTLLM#14
NVShreyas merged 62 commits into
mainfrom
shreyasm/trtllm-http

NVShreyas commented Mar 4, 2025

Uh oh!

rmccorm4 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

NVShreyas commented Mar 4, 2025

What does the PR do?

Uh oh!

rmccorm4 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants