feat: add OpenTelemetry tracing to DiffGenerator #21254

Merged
ishandhanani merged 27 commits into sgl-project:main from jh-nv:feat/diffgen-otel-tracing
Apr 23, 2026

Conversation

@jh-nv
Contributor

@jh-nv jh-nv commented Mar 24, 2026

Motivation

Enable OpenTelemetry tracing for the diffusion/multimodal generation subsystem (multimodal_gen) to achieve feature parity with the SRT (LLM) runtime, which already has comprehensive OTel instrumentation. This is the initial scaffolding: downstream consumers (e.g., Dynamo) need tracing support in the diffusion path, so we want to land the foundation early. Fine-grained spans, rich span attributes (resolution, inference steps, guidance scale, etc.), and additional trace stages will be built on top in follow-up PRs.

Modifications

  • Server args (multimodal_gen/runtime/server_args.py): Added --enable-trace and --otlp-traces-endpoint CLI flags, mirroring SRT's configuration.
  • Process-level tracing init:
    • DiffGenerator.from_server_args(): Initializes OTel exporter with service name sglang-diffusion and thread label DiffGenerator.
    • gpu_worker.py (run_scheduler_process): Initializes tracing per GPU worker with thread label DiffWorker_rank{N}.
    • launch_server.py: Passes tracing config through to worker processes.
  • Request-level tracing:
    • entrypoints/utils.py (prepare_request): Creates TraceReqContext per request when tracing is enabled, linking to external W3C trace headers.
    • schedule_batch.py (Req): Added trace_ctx field, defaults to TraceNullContext (no-op) when tracing is disabled.
  • Trace header extraction:
    • openai/image_api.py and openai/video_api.py: Extract traceparent/tracestate headers from incoming HTTP requests.
  • Span instrumentation:
    • managers/scheduler.py: scheduler_dispatch span (level 1) wrapping the forward dispatch.
    • managers/gpu_worker.py: gpu_forward span (level 2) wrapping the pipeline forward pass.
    • diffusion_generator.py: Calls trace_req_finish() in the finally block to close spans.
  • Reuses sglang.srt.observability.trace: no new tracing framework; all imports are lazy to avoid dependency bloat when tracing is disabled.
  • Unit tests (multimodal_gen/test/unit/test_tracing.py): 252-line test suite covering TraceNullContext, TraceReqContext lifecycle, pickle serialization, external header linking, abort handling, and API signature verification.
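
For readers unfamiliar with the header format mentioned under "Trace header extraction", here is a minimal standard-library sketch of parsing a W3C traceparent value. It is an illustration only and not the code in this PR: parse_traceparent and ParentSpan are hypothetical names, the header mapping is assumed to use lowercase keys, and only the four-field form (version, trace-id, parent-id, flags) is accepted.

```python
import re
from dataclasses import dataclass
from typing import Mapping, Optional

# W3C Trace Context layout: version "-" trace-id "-" parent-id "-" trace-flags,
# all lowercase hex with fixed widths of 2, 32, 16 and 2 characters.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


@dataclass(frozen=True)
class ParentSpan:
    """Hypothetical holder for the upstream span identity."""

    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(headers: Mapping[str, str]) -> Optional[ParentSpan]:
    """Return the upstream span identity, or None if absent or malformed."""
    raw = headers.get("traceparent")
    if raw is None:
        return None
    match = _TRACEPARENT_RE.match(raw.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec treats version "ff" and all-zero IDs as invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    # Bit 0 of trace-flags is the "sampled" flag.
    return ParentSpan(trace_id, span_id, bool(int(flags, 16) & 0x01))
```

A request without the header, or with a malformed one, simply yields None, so the caller can start a fresh trace instead of failing the request.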

Known limitations (planned follow-ups)

  • trace_req_finish() does not yet attach span attributes (model name, generation params, latency breakdown).
  • No /set_trace_level endpoint for dynamic trace verbosity control.
  • Mesh API is not yet instrumented.

Accuracy Tests

N/A — This change adds observability instrumentation only. No model forward code or kernel changes. Tracing is off by default (--enable-trace opt-in) and uses lazy imports, so there is zero impact on the inference
path when disabled.
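
The opt-in behaviour can be pictured as a null-object pattern: call sites always talk to a context object, and when tracing is off that object does nothing. The sketch below is a simplified illustration under stated assumptions, not the PR's implementation. NoOpTraceContext, RecordingTraceContext and make_trace_context are hypothetical names (the PR's actual classes are TraceNullContext and TraceReqContext), and the recording variant just appends to a list instead of emitting real spans.

```python
from typing import List


class NoOpTraceContext:
    """Hypothetical stand-in used when tracing is off: every call is a no-op."""

    def start_span(self, name: str, level: int = 1) -> None:
        pass

    def end_span(self, name: str) -> None:
        pass

    def finish(self) -> None:
        pass


class RecordingTraceContext:
    """Hypothetical enabled context; records events instead of exporting spans."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def start_span(self, name: str, level: int = 1) -> None:
        self.events.append(f"start:{name}:L{level}")

    def end_span(self, name: str) -> None:
        self.events.append(f"end:{name}")

    def finish(self) -> None:
        self.events.append("finish")


def make_trace_context(enable_trace: bool):
    """Pick the context once per request; call sites need no further checks."""
    if not enable_trace:
        return NoOpTraceContext()
    # A real implementation would import its tracing backend here, inside the
    # enabled branch, so the disabled path never loads those modules.
    return RecordingTraceContext()


def handle_request(ctx) -> str:
    """Same call sequence regardless of whether tracing is enabled."""
    try:
        ctx.start_span("scheduler_dispatch", level=1)
        ctx.start_span("gpu_forward", level=2)
        ctx.end_span("gpu_forward")
        ctx.end_span("scheduler_dispatch")
        return "done"
    finally:
        # Mirrors the idea of closing the request trace in a finally block.
        ctx.finish()
```

Because the choice is made once when the request is built, the dispatch and forward code paths contain no tracing conditionals of their own.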

Benchmarking and Profiling

N/A — When tracing is disabled (default), no tracing code is executed. When enabled, overhead is limited to OTel span creation/export which is asynchronous and batched.
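
To make "asynchronous and batched" concrete, the toy below shows the general shape of such an exporter: the request path only enqueues a finished span, and a background thread groups spans and hands them to the export callback. This is a standard-library illustration of the idea, not OpenTelemetry's implementation; ToyBatchExporter and its parameters are hypothetical.

```python
import queue
import threading
from typing import Callable, List

_STOP = object()  # sentinel telling the worker thread to flush and exit


class ToyBatchExporter:
    """Hypothetical batching exporter: enqueue on the hot path, export off it."""

    def __init__(
        self,
        export: Callable[[List[str]], None],
        max_batch: int = 4,
        delay_s: float = 0.05,
    ) -> None:
        self._export = export
        self._max_batch = max_batch
        self._delay_s = delay_s
        self._queue: queue.Queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def on_span_end(self, span_name: str) -> None:
        # Hot path: a single queue put, no network or serialization work.
        self._queue.put(span_name)

    def shutdown(self) -> None:
        # Ask the worker to export whatever is left, then wait for it.
        self._queue.put(_STOP)
        self._worker.join()

    def _run(self) -> None:
        batch: List[str] = []
        while True:
            try:
                item = self._queue.get(timeout=self._delay_s)
            except queue.Empty:
                # Delay elapsed with nothing new: flush a partial batch.
                if batch:
                    self._export(batch)
                    batch = []
                continue
            if item is _STOP:
                if batch:
                    self._export(batch)
                return
            batch.append(item)
            if len(batch) >= self._max_batch:
                self._export(batch)
                batch = []
```

The cost seen by the request is the enqueue; how spans are grouped into batches depends on timing, so only their order and the batch size limit are guaranteed.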

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the diffusion (SGLang Diffusion) label Mar 24, 2026
@rmccorm4

CC @KrishnanPrash @ishandhanani

@ishandhanani
Collaborator

@sufeng-buaa can you take a look?

@sufeng-buaa
Collaborator

@sufeng-buaa can you take a look?

ok, I'll review it in the next few days.

Comment thread python/sglang/multimodal_gen/test/unit/test_tracing.py Outdated
Comment thread python/sglang/multimodal_gen/runtime/entrypoints/openai/utils.py
Comment thread python/sglang/multimodal_gen/runtime/entrypoints/diffusion_generator.py Outdated
Comment thread python/sglang/multimodal_gen/runtime/managers/gpu_worker.py Outdated
Comment thread python/sglang/multimodal_gen/runtime/managers/scheduler.py
@jh-nv jh-nv requested review from mickqian and sufeng-buaa March 31, 2026 00:45
Comment thread python/sglang/multimodal_gen/runtime/utils/trace_wrapper.py Outdated
Comment thread python/sglang/multimodal_gen/runtime/entrypoints/openai/image_api.py Outdated
@@ -12,15 +12,12 @@
os.environ.setdefault("SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS", "50")
Collaborator

if this is not related to sglang-diffusion, please split into two PRs

Contributor Author

This reduces duplication in the tests, and I use it in the sglang-diffusion tracing test. To reduce review cycles, can we do this in one PR?

Collaborator

@mickqian I previously added the tracing CI for the LLM section in this file. @jh-nv extracted some shared infrastructure from here for the diffusion tests. It might be better to keep it in a single PR?

@jh-nv jh-nv requested a review from mickqian April 6, 2026 17:12
@sufeng-buaa
Collaborator

/tag-and-rerun-ci

@jh-nv jh-nv requested a review from bingxche as a code owner April 17, 2026 01:21
@github-actions github-actions Bot added the documentation (Improvements or additions to documentation), dependencies (Pull requests that update a dependency file), hicache (Hierarchical Caching for SGLang), and blackwell (SM100/SM120) labels Apr 17, 2026
…racing

# Conflicts:
#	python/sglang/multimodal_gen/runtime/managers/scheduler.py
#	python/sglang/multimodal_gen/runtime/server_args.py
@jh-nv jh-nv force-pushed the feat/diffgen-otel-tracing branch from eeeddfd to 3b88cb9 Compare April 17, 2026 01:25
@jh-nv
Contributor Author

jh-nv commented Apr 17, 2026

/rerun-failed-ci

@jh-nv
Contributor Author

jh-nv commented Apr 20, 2026

/rerun-failed-ci

@jh-nv
Contributor Author

jh-nv commented Apr 21, 2026

/rerun-failed-ci

@jh-nv
Contributor Author

jh-nv commented Apr 21, 2026

/rerun-failed-ci

@jh-nv
Contributor Author

jh-nv commented Apr 22, 2026

/rerun-failed-ci

@ishandhanani
Collaborator

All tests were run locally. This change does not affect CI.

@ishandhanani ishandhanani merged commit 86ed068 into sgl-project:main Apr 23, 2026
96 of 129 checks passed

Labels

blackwell (SM100/SM120), dependencies (Pull requests that update a dependency file), diffusion (SGLang Diffusion), documentation (Improvements or additions to documentation), hicache (Hierarchical Caching for SGLang), run-ci

6 participants