Skip to content

feat: OAI compatible endpoints for TRTLLM#14

Merged
NVShreyas merged 62 commits into
mainfrom
shreyasm/trtllm-http
Mar 5, 2025
Merged

feat: OAI compatible endpoints for TRTLLM#14
NVShreyas merged 62 commits into
mainfrom
shreyasm/trtllm-http

Conversation

@NVShreyas

Copy link
Copy Markdown
Contributor

What does the PR do?

  • Adds OAI compatible HTTP endpoints to the TRTLLM example.
    • Integrates chat completions from Tanmay's branch.
  • Refactor some percent of duplicate code
  • Refactor input of LLMAPI to make it more extensible

@rmccorm4 rmccorm4 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work 🚀

@NVShreyas NVShreyas merged commit b7cca38 into main Mar 5, 2025
@NVShreyas NVShreyas deleted the shreyasm/trtllm-http branch March 5, 2025 01:24
kylehh pushed a commit to kylehh/dynamo that referenced this pull request Apr 11, 2025
Co-authored-by: Tanmay Verma <tanmayv@nvidia.com>
Co-authored-by: Tanmay Verma <tanmayv@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Tanmay Verma <tanmay2592@gmail.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
kaim-eng added a commit that referenced this pull request May 12, 2026
Adds power-aware intelligence to the Dynamo planner across three layers,
plus tooling that pipecleans the Phase 4 selection path with currently
available single-TDP AIC data. Reimplementation of unmerged PR #5280
against post-refactor ToT (planner/utils/ no longer exists; configuration
is Pydantic; layout is config/, connectors/, core/, monitoring/).

Full design: docs/design-docs/powerplanner-design.md
Repro / dev environment: docs/components/planner/dpp-dev-env.md

What's in this PR
-----------------

Phase 1 — infrastructure
  - Power Agent DaemonSet: components/power_agent/{power_agent.py,tests/}
    Watches pod annotations, maps cgroup→pod_uid→GPU via NVML, calls
    nvmlDeviceSetPowerManagementLimit on a 15 s reconciliation loop.
    Multi-pod-per-GPU policy, fail-closed safe-default, SIGTERM restore.
  - Pod annotation actuation: connectors/kubernetes.py writes
    dynamo.nvidia.com/gpu-power-limit on worker pods.
  - K8s manifests: deploy/power_agent/{daemonset,rbac}.yaml.
  - Operator chart fix: ClusterRole now grants `pods` permissions
    (deploy/helm/charts/platform/components/operator/templates/planner.yaml,
    chart bumped to 1.2.1). Older clusters can apply
    deploy/planner-pod-rbac-dev.yaml as a temporary patch.

Phase 2 — budget enforcement
  - _apply_power_budget() in core/state_machine.py clamps replica
    scaling within a watt budget. New PlannerConfig fields:
    enable_power_awareness, total_gpu_power_limit (required when
    awareness enabled — validator enforces, no silent default),
    prefill/decode_engine_gpu_power_limit, power_agent_safe_default_watts.

Phase 3 — AIC optimizer
  - monitoring/aic_power_optimizer.py: AIConfigurator sweep picks
    (replica, power) configs that maximise throughput within the watt
    budget. Per-component EMA correction (c_ttft, c_itl, c_power_p,
    c_power_d for disagg; c_power_agg for agg) closes the loop on
    silicon-vs-AIC drift. SLA gate, budget gate, capacity-exceeded
    re-optimize, auto-disable on consecutive failures.
  - core/base.py: NativePlannerBase plumbing (admission threshold push,
    named-port resolution, AIC integration).

Phase 4 — pipeclean (using existing single-TDP data)
  - tools/integrate_aic_power_data.py: copies measured H200/B200 power
    data into an AIC checkout and reinstalls.
  - tools/validate_aic_power_integration.py: asserts estimate_perf()
    returns power_w in [100, 710] W (not 0.0, not 700.0 TDP fallback).
  - tools/compute_power_budget.py: rack-capacity → safe-budget helper.
  - examples/deployments/powerplanner/: PIPECLEAN.md runbook,
    disagg-{power-aware,conservative-cold-start}.yaml,
    verify_poweraware.bash (8-section health check inc. Phase 4 preview),
    MULTI_DGD.md operator playbook, README.md.

Bug fixes (found by live integration tests, fixed in this patchset)
-------------------------------------------------------------------
  - Operator chart 1.2.0 ClusterRole had no `pods` rules → planner
    couldn't list workers or patch annotations. Fixed in chart template;
    workaround manifest provided for clusters not yet upgraded.
  - Stale label selectors in connectors/kubernetes.py:
    `dynamo-graph-deployment` → `dynamo-graph-deployment-name`,
    `dynamo-service` → `dynamo-component`/`dynamo-component-type`.
  - DCGM queries used `pod=~` (matched the DCGM exporter pod) and the
    wrong namespace; corrected to `exported_pod=~` +
    `exported_namespace=<k8s-namespace>` and pod regex now matches the
    operator's `<dgd>-<replica-idx>-<service-key>-...` format
    (monitoring/traffic_metrics.py).
  - Frontend-source metric methods returned NaN on quiet clusters
    (0/0 in `increase(_sum)/increase(_count)`); router path had a NaN
    guard, frontend didn't — added matching `math.isnan` filter.

Tests
-----

Verified 2026-05-10 on a clean from-scratch repro on a live Azure AKS
dev cluster (qwen3-quickstart DGD, single dev namespace):

  Suite                                                  Pass  Skip  Fail
  ----------------------------------------------------   ----  ----  ----
  planner/tests/unit/                                     465     0     0
  planner/tests/integration/test_aic_power_e2e_sim       15     0     0
  planner/tests/integration/test_aic_power_optimizer     34     0     0
  planner/tests/integration/test_metric_paths_live       22-23  3-2    0
  planner/tests/integration/test_actuation_knobs_live    10-11  1-0    0
  power_agent/tests/                                      43     0     0
  ----------------------------------------------------   ----  ----  ----
  Total (cold deploy / after sanity chat completion)    590-591  4-3    0

Documented skips:
  - test_frontend_metric_series_exists — passes after any traffic.
  - TestDirectRouterMetricsClientLive::* — LocalRouter doesn't expose
    dynamo_frontend_worker_* gauges (open-question #14 in design doc).
  - TestScalingRealMutation — opt-in disruptive test, gated by
    RUN_DISRUPTIVE_TESTS=1 (passes when enabled).

Repro recipe: docs/components/planner/dpp-dev-env.md
              "From-Scratch Repro Script" section.

Compatibility
-------------
Power awareness is opt-in (enable_power_awareness=False by default).
AIC optimizer is opt-in (enable_aic_optimizer=False by default).
No existing planner behavior changes when both flags are off.

Phase 4 (multi-power-level AIC sweep) and Phase 5 (silicon validation)
remain follow-up work; this PR delivers Phases 1–3 plus the Phase 4
pipeclean code path so it can be exercised before AIC ships the full
sweep API.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
kaim-eng added a commit that referenced this pull request May 12, 2026
Adds power-aware intelligence to the Dynamo planner across three layers,
plus tooling that pipecleans the Phase 4 selection path with currently
available single-TDP AIC data. Reimplementation of unmerged PR #5280
against post-refactor ToT (planner/utils/ no longer exists; configuration
is Pydantic; layout is config/, connectors/, core/, monitoring/).

Full design: docs/design-docs/powerplanner-design.md
Repro / dev environment: docs/components/planner/dpp-dev-env.md

What's in this PR
-----------------

Phase 1 — infrastructure
  - Power Agent DaemonSet: components/power_agent/{power_agent.py,tests/}
    Watches pod annotations, maps cgroup→pod_uid→GPU via NVML, calls
    nvmlDeviceSetPowerManagementLimit on a 15 s reconciliation loop.
    Multi-pod-per-GPU policy, fail-closed safe-default, SIGTERM restore.
  - Pod annotation actuation: connectors/kubernetes.py writes
    dynamo.nvidia.com/gpu-power-limit on worker pods.
  - K8s manifests: deploy/power_agent/{daemonset,rbac}.yaml.
  - Operator chart fix: ClusterRole now grants `pods` permissions
    (deploy/helm/charts/platform/components/operator/templates/planner.yaml,
    chart bumped to 1.2.1). Older clusters can apply
    deploy/planner-pod-rbac-dev.yaml as a temporary patch.

Phase 2 — budget enforcement
  - _apply_power_budget() in core/state_machine.py clamps replica
    scaling within a watt budget. New PlannerConfig fields:
    enable_power_awareness, total_gpu_power_limit (required when
    awareness enabled — validator enforces, no silent default),
    prefill/decode_engine_gpu_power_limit, power_agent_safe_default_watts.

Phase 3 — AIC optimizer
  - monitoring/aic_power_optimizer.py: AIConfigurator sweep picks
    (replica, power) configs that maximise throughput within the watt
    budget. Per-component EMA correction (c_ttft, c_itl, c_power_p,
    c_power_d for disagg; c_power_agg for agg) closes the loop on
    silicon-vs-AIC drift. SLA gate, budget gate, capacity-exceeded
    re-optimize, auto-disable on consecutive failures.
  - core/base.py: NativePlannerBase plumbing (admission threshold push,
    named-port resolution, AIC integration).

Phase 4 — pipeclean (using existing single-TDP data)
  - tools/integrate_aic_power_data.py: copies measured H200/B200 power
    data into an AIC checkout and reinstalls.
  - tools/validate_aic_power_integration.py: asserts estimate_perf()
    returns power_w in [100, 710] W (not 0.0, not 700.0 TDP fallback).
  - tools/compute_power_budget.py: rack-capacity → safe-budget helper.
  - examples/deployments/powerplanner/: PIPECLEAN.md runbook,
    disagg-{power-aware,conservative-cold-start}.yaml,
    verify_poweraware.bash (8-section health check inc. Phase 4 preview),
    MULTI_DGD.md operator playbook, README.md.

Bug fixes (found by live integration tests, fixed in this patchset)
-------------------------------------------------------------------
  - Operator chart 1.2.0 ClusterRole had no `pods` rules → planner
    couldn't list workers or patch annotations. Fixed in chart template;
    workaround manifest provided for clusters not yet upgraded.
  - Stale label selectors in connectors/kubernetes.py:
    `dynamo-graph-deployment` → `dynamo-graph-deployment-name`,
    `dynamo-service` → `dynamo-component`/`dynamo-component-type`.
  - DCGM queries used `pod=~` (matched the DCGM exporter pod) and the
    wrong namespace; corrected to `exported_pod=~` +
    `exported_namespace=<k8s-namespace>` and pod regex now matches the
    operator's `<dgd>-<replica-idx>-<service-key>-...` format
    (monitoring/traffic_metrics.py).
  - Frontend-source metric methods returned NaN on quiet clusters
    (0/0 in `increase(_sum)/increase(_count)`); router path had a NaN
    guard, frontend didn't — added matching `math.isnan` filter.

Tests
-----

Verified 2026-05-10 on a clean from-scratch repro on a live Azure AKS
dev cluster (qwen3-quickstart DGD, single dev namespace):

  Suite                                                  Pass  Skip  Fail
  ----------------------------------------------------   ----  ----  ----
  planner/tests/unit/                                     465     0     0
  planner/tests/integration/test_aic_power_e2e_sim       15     0     0
  planner/tests/integration/test_aic_power_optimizer     34     0     0
  planner/tests/integration/test_metric_paths_live       22-23  3-2    0
  planner/tests/integration/test_actuation_knobs_live    10-11  1-0    0
  power_agent/tests/                                      43     0     0
  ----------------------------------------------------   ----  ----  ----
  Total (cold deploy / after sanity chat completion)    590-591  4-3    0

Documented skips:
  - test_frontend_metric_series_exists — passes after any traffic.
  - TestDirectRouterMetricsClientLive::* — LocalRouter doesn't expose
    dynamo_frontend_worker_* gauges (open-question #14 in design doc).
  - TestScalingRealMutation — opt-in disruptive test, gated by
    RUN_DISRUPTIVE_TESTS=1 (passes when enabled).

Repro recipe: docs/components/planner/dpp-dev-env.md
              "From-Scratch Repro Script" section.

Compatibility
-------------
Power awareness is opt-in (enable_power_awareness=False by default).
AIC optimizer is opt-in (enable_aic_optimizer=False by default).
No existing planner behavior changes when both flags are off.

Phase 4 (multi-power-level AIC sweep) and Phase 5 (silicon validation)
remain follow-up work; this PR delivers Phases 1–3 plus the Phase 4
pipeclean code path so it can be exercised before AIC ships the full
sweep API.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 18, 2026
Adds power-aware intelligence to the Dynamo planner across three layers,
plus tooling that pipecleans the Phase 4 selection path with currently
available single-TDP AIC data. Reimplementation of unmerged PR #5280
against post-refactor ToT (planner/utils/ no longer exists; configuration
is Pydantic; layout is config/, connectors/, core/, monitoring/).

Full design: docs/design-docs/powerplanner-design.md
Repro / dev environment: docs/components/planner/dpp-dev-env.md

What's in this PR
-----------------

Phase 1 — infrastructure
  - Power Agent DaemonSet: components/power_agent/{power_agent.py,tests/}
    Watches pod annotations, maps cgroup→pod_uid→GPU via NVML, calls
    nvmlDeviceSetPowerManagementLimit on a 15 s reconciliation loop.
    Multi-pod-per-GPU policy, fail-closed safe-default, SIGTERM restore.
  - Pod annotation actuation: connectors/kubernetes.py writes
    dynamo.nvidia.com/gpu-power-limit on worker pods.
  - K8s manifests: deploy/power_agent/{daemonset,rbac}.yaml.
  - Operator chart fix: ClusterRole now grants `pods` permissions
    (deploy/helm/charts/platform/components/operator/templates/planner.yaml,
    chart bumped to 1.2.1). Older clusters can apply
    deploy/planner-pod-rbac-dev.yaml as a temporary patch.

Phase 2 — budget enforcement
  - _apply_power_budget() in core/state_machine.py clamps replica
    scaling within a watt budget. New PlannerConfig fields:
    enable_power_awareness, total_gpu_power_limit (required when
    awareness enabled — validator enforces, no silent default),
    prefill/decode_engine_gpu_power_limit, power_agent_safe_default_watts.

Phase 3 — AIC optimizer
  - monitoring/aic_power_optimizer.py: AIConfigurator sweep picks
    (replica, power) configs that maximise throughput within the watt
    budget. Per-component EMA correction (c_ttft, c_itl, c_power_p,
    c_power_d for disagg; c_power_agg for agg) closes the loop on
    silicon-vs-AIC drift. SLA gate, budget gate, capacity-exceeded
    re-optimize, auto-disable on consecutive failures.
  - core/base.py: NativePlannerBase plumbing (admission threshold push,
    named-port resolution, AIC integration).

Phase 4 — pipeclean (using existing single-TDP data)
  - tools/integrate_aic_power_data.py: copies measured H200/B200 power
    data into an AIC checkout and reinstalls.
  - tools/validate_aic_power_integration.py: asserts estimate_perf()
    returns power_w in [100, 710] W (not 0.0, not 700.0 TDP fallback).
  - tools/compute_power_budget.py: rack-capacity → safe-budget helper.
  - examples/deployments/powerplanner/: PIPECLEAN.md runbook,
    disagg-{power-aware,conservative-cold-start}.yaml,
    verify_poweraware.bash (8-section health check inc. Phase 4 preview),
    MULTI_DGD.md operator playbook, README.md.

Bug fixes (found by live integration tests, fixed in this patchset)
-------------------------------------------------------------------
  - Operator chart 1.2.0 ClusterRole had no `pods` rules → planner
    couldn't list workers or patch annotations. Fixed in chart template;
    workaround manifest provided for clusters not yet upgraded.
  - Stale label selectors in connectors/kubernetes.py:
    `dynamo-graph-deployment` → `dynamo-graph-deployment-name`,
    `dynamo-service` → `dynamo-component`/`dynamo-component-type`.
  - DCGM queries used `pod=~` (matched the DCGM exporter pod) and the
    wrong namespace; corrected to `exported_pod=~` +
    `exported_namespace=<k8s-namespace>` and pod regex now matches the
    operator's `<dgd>-<replica-idx>-<service-key>-...` format
    (monitoring/traffic_metrics.py).
  - Frontend-source metric methods returned NaN on quiet clusters
    (0/0 in `increase(_sum)/increase(_count)`); router path had a NaN
    guard, frontend didn't — added matching `math.isnan` filter.

Tests
-----

Verified 2026-05-10 on a clean from-scratch repro on a live Azure AKS
dev cluster (qwen3-quickstart DGD, single dev namespace):

  Suite                                                  Pass  Skip  Fail
  ----------------------------------------------------   ----  ----  ----
  planner/tests/unit/                                     465     0     0
  planner/tests/integration/test_aic_power_e2e_sim       15     0     0
  planner/tests/integration/test_aic_power_optimizer     34     0     0
  planner/tests/integration/test_metric_paths_live       22-23  3-2    0
  planner/tests/integration/test_actuation_knobs_live    10-11  1-0    0
  power_agent/tests/                                      43     0     0
  ----------------------------------------------------   ----  ----  ----
  Total (cold deploy / after sanity chat completion)    590-591  4-3    0

Documented skips:
  - test_frontend_metric_series_exists — passes after any traffic.
  - TestDirectRouterMetricsClientLive::* — LocalRouter doesn't expose
    dynamo_frontend_worker_* gauges (open-question #14 in design doc).
  - TestScalingRealMutation — opt-in disruptive test, gated by
    RUN_DISRUPTIVE_TESTS=1 (passes when enabled).

Repro recipe: docs/components/planner/dpp-dev-env.md
              "From-Scratch Repro Script" section.

Compatibility
-------------
Power awareness is opt-in (enable_power_awareness=False by default).
AIC optimizer is opt-in (enable_aic_optimizer=False by default).
No existing planner behavior changes when both flags are off.

Phase 4 (multi-power-level AIC sweep) and Phase 5 (silicon validation)
remain follow-up work; this PR delivers Phases 1–3 plus the Phase 4
pipeclean code path so it can be exercised before AIC ships the full
sweep API.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
SpencerGarnets added a commit to ai-blaise/dynamo-prod-k8s that referenced this pull request May 30, 2026
…verlay

Replaces the 4+4 PD-disaggregated TP=4 DP=4 layout with the
single-role TP=8 EP=8 decode-only topology that 001 actually runs.
Per Spencer 2026-05-30: the infrastructure deploy script should
match what we run on 001 (so production = 001 verbatim), with
SMC-SD enabled as the production default.

Changes vs the prior 4+4 PD manifest:
  - Drop prefill service entirely (no disaggregation; NIXL/UCX
    transport on 002 was the failure mode anyway)
  - Decode: TP=8 EP=8, single 8-GPU role
  - moe-runner-backend flashinfer_trtllm, moe-a2a-backend none
  - --dsa-indexer-mode indexcache-hisa + --dsa-indexer-quantization
    nvfp4_e2m1_ue8m0 + --hisa-min-seq-len 1
  - --json-model-override-args (index_topk=1024, freq=4, hisa params)
  - --enable-flashinfer-allreduce-fusion (auto on for DSV3.2 +
    TP>1 + no DP + single-node + moe_a2a_backend=none)
  - --num-continuous-decode-steps 2
  - Drop --enable-dp-attention (the forward_idle DP path was the
    source of the ai-dynamo#14 launch-config bug chain; TP-only on SM_100
    avoids it entirely)
  - Drop --enable-hisparse (extra path cost; HISA already in stack)
  - Drop nsa-prefill/decode-backend (replaced by dsa-*)

Optimization env vars (top-level spec.envs, fully armed):
  SGLANG_NSA_NVFP4_HISA=1
  SGLANG_NSA_NVFP4_HISA_CAND_SCORE_WMMA=1
  SGLANG_NSA_NVFP4_HISA_MEAN_POOL_PREDECODE=1
  SGLANG_NSA_NVFP4_HISA_MEAN_POOL_TRANSP=1
  SGLANG_NSA_NVFP4_HISA_MEAN_POOL_FMA2=1
  SGLANG_NSA_NVFP4_HISA_COMPRESSION_RATIO=4.0
  SGLANG_HIGGS_DSA_TRTLLM_FP8=1 (only fires when HIGGS enabled)
  SGLANG_HIGGS_DSA_TRTLLM_DEQUANT_STREAM=1
  SGLANG_HIGGS_DSA_TRTLLM_PINGPONG=1
  SGLANG_HIGGS_DSA_TRTLLM_DEDICATED_STREAM=1
  SGLANG_USE_SGL_NVFP4_FUSED_RMSNORM=1
  SGLANG_MOE_NVFP4_DISPATCH=1
  SGLANG_ENABLE_JIT_DEEPGEMM=1
  SGLANG_ENABLE_SPEC_V2=1
  SGLANG_SMC_DRAFT_FP8_GEMM_BACKEND=cutlass
  SGLANG_SMC_TARGET_VERIFY_GRAPH=0
  (and the NCCL / UCX / HF defaults)

SMC-SD overlay (decode args):
  --speculative-algorithm SMC
  --speculative-draft-model-path /models/smcsd/GLM-4-9B-0414-FP8-DeepSeekV32-OMP
  --speculative-draft-model-quantization fp8
  --speculative-draft-attention-backend triton
  --smc-draft-kv-cache-dtype fp8_e4m3
  --smc-n-particles 4
  --smc-gamma 6
  --smc-draft-temperature 0.7
  --smc-target-temperature 1.0
  --smc-resample-threshold 0.5
  --smc-resample-method systematic
  + smcsd-draft-model volume + mount

For ablation (002 iteration testing): strip --speculative-* args +
smcsd-draft-model volume/mount. Production default keeps SMC-SD on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tanmayv25 added a commit that referenced this pull request Jun 4, 2026
Comment #14 from peilii: when the unified path moved the
`build_sglang_logprob_kwargs` gate check from the legacy handler into
the engine's `generate()`, the resulting `ValueError` was no longer
guaranteed to surface as a 4xx invalid-request error.

Root cause (static + unit-test reproduction):

- `lib/bindings/python/rust/backend.rs::py_err_to_dynamo` maps a
  Python `ValueError`/`TypeError` to
  `DynamoError { error_type: Backend(BackendError::InvalidArgument) }`
  — under the `Backend()` wrapper, not the top-level
  `ErrorType::InvalidArgument`.
- `lib/llm/src/http/service/openai.rs::find_dynamo_error_in_chain`
  used strict-equality matching, so the existing
  `ErrorType::InvalidArgument` check missed
  `Backend(InvalidArgument)`. The engine-side ValueError fell through
  to the catch-all 500 path.

Reproduced as a unit test that, against the unpatched mapping,
asserted `StatusCode 500` instead of `400`:

    thread '…test_backend_invalid_argument_surfaces_as_400' panicked
    assertion `left == right` failed
      left: 500
     right: 400

Fix: add `find_invalid_argument_in_chain` that matches both
`ErrorType::InvalidArgument` AND
`ErrorType::Backend(BackendError::InvalidArgument)`. Both classes
mean "the request was malformed" — whether validation tripped at the
frontend or inside the engine is an implementation detail the client
shouldn't have to care about.

Regression test `test_backend_invalid_argument_surfaces_as_400`
locks in the behavior. All 59 existing `http::service::openai`
tests still pass; rustfmt + clippy clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tanmayv25 added a commit that referenced this pull request Jun 4, 2026
Comment #14 from peilii: when the unified path moved the
`build_sglang_logprob_kwargs` gate check from the legacy handler into
the engine's `generate()`, the resulting `ValueError` was no longer
guaranteed to surface as a 4xx invalid-request error.

Root cause (static + unit-test reproduction):

- `lib/bindings/python/rust/backend.rs::py_err_to_dynamo` maps a
  Python `ValueError`/`TypeError` to
  `DynamoError { error_type: Backend(BackendError::InvalidArgument) }`
  — under the `Backend()` wrapper, not the top-level
  `ErrorType::InvalidArgument`.
- `lib/llm/src/http/service/openai.rs::find_dynamo_error_in_chain`
  used strict-equality matching, so the existing
  `ErrorType::InvalidArgument` check missed
  `Backend(InvalidArgument)`. The engine-side ValueError fell through
  to the catch-all 500 path.

Reproduced as a unit test that, against the unpatched mapping,
asserted `StatusCode 500` instead of `400`:

    thread '…test_backend_invalid_argument_surfaces_as_400' panicked
    assertion `left == right` failed
      left: 500
     right: 400

Fix: add `find_invalid_argument_in_chain` that matches both
`ErrorType::InvalidArgument` AND
`ErrorType::Backend(BackendError::InvalidArgument)`. Both classes
mean "the request was malformed" — whether validation tripped at the
frontend or inside the engine is an implementation detail the client
shouldn't have to care about.

Regression test `test_backend_invalid_argument_surfaces_as_400`
locks in the behavior. All 59 existing `http::service::openai`
tests still pass; rustfmt + clippy clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tanmayv25 added a commit that referenced this pull request Jun 4, 2026
Comment #14 from peilii: when the unified path moved the
`build_sglang_logprob_kwargs` gate check from the legacy handler into
the engine's `generate()`, the resulting `ValueError` was no longer
guaranteed to surface as a 4xx invalid-request error.

Root cause (static + unit-test reproduction):

- `lib/bindings/python/rust/backend.rs::py_err_to_dynamo` maps a
  Python `ValueError`/`TypeError` to
  `DynamoError { error_type: Backend(BackendError::InvalidArgument) }`
  — under the `Backend()` wrapper, not the top-level
  `ErrorType::InvalidArgument`.
- `lib/llm/src/http/service/openai.rs::find_dynamo_error_in_chain`
  used strict-equality matching, so the existing
  `ErrorType::InvalidArgument` check missed
  `Backend(InvalidArgument)`. The engine-side ValueError fell through
  to the catch-all 500 path.

Reproduced as a unit test that, against the unpatched mapping,
asserted `StatusCode 500` instead of `400`:

    thread '…test_backend_invalid_argument_surfaces_as_400' panicked
    assertion `left == right` failed
      left: 500
     right: 400

Fix: add `find_invalid_argument_in_chain` that matches both
`ErrorType::InvalidArgument` AND
`ErrorType::Backend(BackendError::InvalidArgument)`. Both classes
mean "the request was malformed" — whether validation tripped at the
frontend or inside the engine is an implementation detail the client
shouldn't have to care about.

Regression test `test_backend_invalid_argument_surfaces_as_400`
locks in the behavior. All 59 existing `http::service::openai`
tests still pass; rustfmt + clippy clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants