feat: add OpenTelemetry tracing to DiffGenerator by jh-nv · Pull Request #21254 · sgl-project/sglang

jh-nv · 2026-03-24T02:48:53Z

Motivation

Enable OpenTelemetry tracing for the diffusion/multimodal generation subsystem (multimodal_gen) to achieve feature parity with the SRT (LLM) runtime, which already has comprehensive OTel instrumentation. This is the initial scaffolding: downstream consumers (e.g., Dynamo) need tracing support in the diffusion path, so we want to land the foundation early. Fine-grained spans, rich span attributes (resolution, inference steps, guidance scale, etc.), and additional trace stages will be built on top in follow-up PRs.

Modifications

- Server args (multimodal_gen/runtime/server_args.py): Added --enable-trace and --otlp-traces-endpoint CLI flags, mirroring SRT's configuration.
- Process-level tracing init:
  - DiffGenerator.from_server_args(): Initializes OTel exporter with service name sglang-diffusion and thread label DiffGenerator.
  - gpu_worker.py (run_scheduler_process): Initializes tracing per GPU worker with thread label DiffWorker_rank{N}.
  - launch_server.py: Passes tracing config through to worker processes.
- Request-level tracing:
  - entrypoints/utils.py (prepare_request): Creates TraceReqContext per request when tracing is enabled, linking to external W3C trace headers.
  - schedule_batch.py (Req): Added trace_ctx field, defaults to TraceNullContext (no-op) when tracing is disabled.
- Trace header extraction:
  - openai/image_api.py and openai/video_api.py: Extract traceparent/tracestate headers from incoming HTTP requests.
- Span instrumentation:
  - managers/scheduler.py: scheduler_dispatch span (level 1) wrapping the forward dispatch.
  - managers/gpu_worker.py: gpu_forward span (level 2) wrapping the pipeline forward pass.
  - diffusion_generator.py: Calls trace_req_finish() in the finally block to close spans.
- Reuses sglang.srt.observability.trace — no new tracing framework; all imports are lazy to avoid dependency bloat when tracing is disabled.
- Unit tests (multimodal_gen/test/unit/test_tracing.py): 252-line test suite covering TraceNullContext, TraceReqContext lifecycle, pickle serialization, external header linking, abort handling, and API signature verification.
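
For readers unfamiliar with the header format mentioned under "Trace header extraction", here is a minimal sketch of parsing a W3C `traceparent` header. It is not the code in this PR: `ParentContext` and `extract_parent_context` are hypothetical names for illustration, and the parser is simplified to accept only the four-field layout defined for version `00`.

```python
import re
from typing import Mapping, NamedTuple, Optional

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class ParentContext(NamedTuple):
    """Hypothetical container for the upstream trace identity."""

    trace_id: str
    parent_span_id: str
    sampled: bool
    tracestate: Optional[str]


def extract_parent_context(headers: Mapping[str, str]) -> Optional[ParentContext]:
    """Return the upstream trace context, or None if absent or malformed."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    raw = lowered.get("traceparent")
    if raw is None:
        return None

    match = _TRACEPARENT_RE.fullmatch(raw.strip())
    if match is None:
        return None

    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None

    # Bit 0 of the flags byte is the "sampled" flag.
    sampled = bool(int(flags, 16) & 0x01)
    return ParentContext(trace_id, parent_id, sampled, lowered.get("tracestate"))
```

Returning `None` for a missing or malformed header lets the caller start a fresh root trace instead of failing the request.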

Known limitations (planned follow-ups)

- trace_req_finish() does not yet attach span attributes (model name, generation params, latency breakdown).
- No /set_trace_level endpoint for dynamic trace verbosity control.
- The Mesh API is not instrumented.

Accuracy Tests

N/A. This change adds observability instrumentation only, with no changes to model forward code or kernels. Tracing is off by default (--enable-trace opt-in) and uses lazy imports, so there is zero impact on the inference path when disabled.
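
The no-op default described above is an instance of the null-object pattern. The sketch below illustrates the idea only; it does not reproduce the API of `sglang.srt.observability.trace`. `NullTraceContext`, `RecordingTraceContext`, `make_trace_context`, and `handle_request` are hypothetical stand-ins, and the recording context just collects timings in memory rather than exporting spans.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


class NullTraceContext:
    """No-op context used when tracing is disabled (hypothetical name)."""

    @contextmanager
    def span(self, name: str) -> Iterator[None]:
        yield  # Nothing is recorded.

    def finish(self) -> None:
        pass


@dataclass
class RecordingTraceContext:
    """Stand-in for a real per-request trace context (hypothetical name)."""

    request_id: str
    spans: List[Tuple[str, float]] = field(default_factory=list)
    finished: bool = False

    @contextmanager
    def span(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # A span is recorded when it closes, so inner spans appear first.
            self.spans.append((name, time.perf_counter() - start))

    def finish(self) -> None:
        self.finished = True


def make_trace_context(request_id: str, enable_trace: bool):
    """Pick the real or no-op context once, at request creation."""
    return RecordingTraceContext(request_id) if enable_trace else NullTraceContext()


def handle_request(trace_ctx) -> str:
    """Call sites are identical whether or not tracing is enabled."""
    try:
        with trace_ctx.span("scheduler_dispatch"):
            with trace_ctx.span("gpu_forward"):
                result = "generated"
        return result
    finally:
        # Mirrors closing the request trace in a finally block.
        trace_ctx.finish()
```

Because the choice is made once per request, the hot path needs no `if tracing_enabled` branches.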

Benchmarking and Profiling

N/A — When tracing is disabled (default), no tracing code is executed. When enabled, overhead is limited to OTel span creation/export which is asynchronous and batched.

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

rmccorm4 · 2026-03-25T20:37:28Z

CC @KrishnanPrash @ishandhanani

ishandhanani · 2026-03-25T20:39:05Z

@sufeng-buaa can you take a look?

sufeng-buaa · 2026-03-26T01:50:21Z

> @sufeng-buaa can you take a look?

OK, I'll review it in the next few days.

mickqian · 2026-04-04T03:42:40Z

@@ -12,15 +12,12 @@
 os.environ.setdefault("SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS", "50")


If this is not related to sglang-diffusion, please split it into two PRs.

jh-nv · Apr 6, 2026

This is to reduce duplication in the tests, and I use it in the sglang-diffusion tracing test. To reduce review cycles, can we do this in one PR?

sufeng-buaa · Apr 7, 2026

@mickqian I previously added the tracing CI for the LLM section in this file. @jh-nv extracted some shared infrastructure from here for the diffusion tests. It might be better to keep it in a single PR?

sufeng-buaa · 2026-04-08T01:54:14Z

/tag-and-rerun-ci

jh-nv · 2026-04-17T16:41:47Z

/rerun-failed-ci

jh-nv · 2026-04-20T00:48:21Z

/rerun-failed-ci

jh-nv · 2026-04-21T01:46:02Z

/rerun-failed-ci

jh-nv · 2026-04-21T15:43:25Z

/rerun-failed-ci

jh-nv · 2026-04-22T15:29:52Z

/rerun-failed-ci

ishandhanani · 2026-04-23T16:25:12Z

All tests were run locally. This change does not affect CI.

jh-nv added 2 commits March 23, 2026 20:24

feat: add OpenTelemetry tracing to DiffGenerator

81522be

use same pattern as SRT to initialize trace_ctx

7fb2d95

jh-nv requested review from mickqian, ping1jing2 and yhyang201 as code owners March 24, 2026 02:48

github-actions Bot added the diffusion label Mar 24, 2026

mickqian reviewed Mar 26, 2026

Comment thread python/sglang/multimodal_gen/test/unit/test_tracing.py Outdated

sufeng-buaa reviewed Mar 30, 2026

Comment thread python/sglang/multimodal_gen/runtime/entrypoints/openai/utils.py

Comment thread python/sglang/multimodal_gen/runtime/entrypoints/diffusion_generator.py Outdated

Comment thread python/sglang/multimodal_gen/runtime/managers/gpu_worker.py Outdated

sufeng-buaa reviewed Mar 30, 2026

Comment thread python/sglang/multimodal_gen/runtime/managers/scheduler.py

refactor: consolidate diffusion tracing into context managers

33dbd1e

jh-nv requested review from mickqian and sufeng-buaa March 31, 2026 00:45

sufeng-buaa reviewed Apr 1, 2026

Comment thread python/sglang/multimodal_gen/runtime/utils/trace_wrapper.py Outdated

jh-nv added 4 commits April 2, 2026 11:50

consolidate DiffStageConfig

1b582f4

remove redundant test

e0b89b9

Merge remote-tracking branch 'origin/main' into feat/diffgen-otel-tracing

c997288

add diffusion tracing integration test and extract shared OTLP collector

316cc76

mickqian reviewed Apr 4, 2026

remove lazy import

69e43f9

jh-nv requested a review from mickqian April 6, 2026 17:12

Merge remote-tracking branch 'origin/main' into feat/diffgen-otel-tracing

2f4908f

yhyang201 approved these changes Apr 7, 2026

github-actions Bot added the run-ci label Apr 8, 2026

sufeng-buaa mentioned this pull request Apr 13, 2026

[Roadmap] roadmap of request tracing (2025 Q4 and 2026 Q1) #13511

Open

17 tasks

Merge remote-tracking branch 'origin/main' into feat/diffgen-otel-tracing

4b5e8f0

jh-nv requested a review from bingxche as a code owner April 17, 2026 01:21

github-actions Bot added the documentation, dependencies, hicache, and blackwell labels Apr 17, 2026

Merge remote-tracking branch 'upstream/main' into feat/diffgen-otel-tracing
Conflicts: python/sglang/multimodal_gen/runtime/managers/scheduler.py, python/sglang/multimodal_gen/runtime/server_args.py

3b88cb9

jh-nv force-pushed the feat/diffgen-otel-tracing branch from eeeddfd to 3b88cb9 Compare April 17, 2026 01:25

Merge branch 'main' into feat/diffgen-otel-tracing

5a407bb

jh-nv added 3 commits April 17, 2026 16:15

[diffusion] feat: propagate OTel trace context across disagg roles

5367841

update

4d3d951

Merge branch 'main' into feat/diffgen-otel-tracing

316cea5

jh-nv added 5 commits April 20, 2026 11:44

Merge branch 'main' into feat/diffgen-otel-tracing

19a3e99

Merge branch 'main' into feat/diffgen-otel-tracing

8c20183

Merge branch 'main' into feat/diffgen-otel-tracing

78ca164

add timeout limit for new test

c5f8c78

Merge remote-tracking branch 'fork/feat/diffgen-otel-tracing' into feat/diffgen-otel-tracing

f839232

jh-nv added 2 commits April 21, 2026 12:43

Merge branch 'main' into feat/diffgen-otel-tracing

90a665d

Merge branch 'main' into feat/diffgen-otel-tracing

9124fe3

jh-nv added 3 commits April 22, 2026 13:29

update

75ef2b4

update

119a6c1

Merge branch 'main' into feat/diffgen-otel-tracing

3a8ad52

ishandhanani merged commit 86ed068 into sgl-project:main Apr 23, 2026
96 of 129 checks passed

yichiche mentioned this pull request Apr 28, 2026

[AMD] Fix CI RuntimeError: opentelemetry package is not installed #23940

Merged

5 tasks

hnyls2002 mentioned this pull request May 7, 2026

propagate pytest exit code from test __main__ entries #24487

Merged
