feat(observability): add OpenTelemetry tracing for pipeline parallelism #23169
ShangmingCai merged 1 commit into sgl-project:main from
Conversation
Code Review
This pull request introduces observability for pipeline parallelism by adding tracing for PP forward passes. It includes a new PP_FORWARD request stage and updates the scheduler to record timing statistics and metadata during batch execution. A potential issue was identified in set_time_batch where passing attributes as positional arguments could cause a TypeError for methods that do not support them.
```python
if attrs is None:
    method(ts)
else:
    method(ts, attrs)
```
Passing attrs as a positional argument to method is risky because many existing set_*_time methods in SchedulerReqTimeStats (like set_forward_entry_time) do not accept a second positional argument. If set_time_batch is called with attrs for one of those methods, it will raise a TypeError. It would be safer to pass attrs as a keyword argument, provided the target methods are updated to accept it.
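The keyword-argument approach can be illustrated with a minimal, self-contained sketch. Note that `SchedulerReqTimeStats`, `Req`, and this `set_time_batch` signature are simplified stand-ins for the real classes, not the actual sglang implementation:

```python
from typing import Optional


class SchedulerReqTimeStats:
    """Simplified stand-in for the real per-request time-stats class."""

    def __init__(self):
        self.forward_entry_time = 0.0
        self.forward_entry_attrs = None

    def set_forward_entry_time(self, ts: float, attrs: Optional[dict] = None):
        # attrs is an optional keyword, so existing call sites that pass
        # only a timestamp keep working unchanged.
        self.forward_entry_time = ts
        self.forward_entry_attrs = attrs


class Req:
    def __init__(self):
        self.time_stats = SchedulerReqTimeStats()


def set_time_batch(reqs, method_name: str, ts: float, attrs: Optional[dict] = None):
    for req in reqs:
        method = getattr(req.time_stats, method_name)
        if attrs is None:
            method(ts)
        else:
            # Keyword form: target methods opt in by declaring attrs=None,
            # rather than silently receiving a second positional argument.
            method(ts, attrs=attrs)


reqs = [Req(), Req()]
set_time_batch(reqs, "set_forward_entry_time", 1.5, attrs={"pp_mb_id": 0})
```

With this shape, a method that has not been updated to accept `attrs` still raises a `TypeError`, but the keyword in the traceback makes the cause obvious, and updated methods only need a defaulted parameter.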
Force-pushed from 3c8f2b8 to a16345c
ShangmingCai left a comment:
Looks good. cc: @sufeng-buaa please double-check.
```python
)

# pipeline parallelism
PP_FORWARD = RequestStageConfig(
```
The stage name is somewhat ambiguous. The segment you are tracing is actually the CPU-side run_batch, so run_batch_cpu would be a more appropriate name.
```python
last_decode_scheduled_time: float = 0.0
last_forward_entry_time: float = 0.0
last_prefill_finished_time: float = 0.0
pp_forward_start_time: float = 0.0
```
Similarly, please rename it accordingly.
```python
    "is_last_pp_rank": self.pp_group.is_last_rank,
}
if mb_id is not None:
    attrs["pp_mb_id"] = mb_id
```
I think we can keep mb_id, and move the other scheduler-related attributes to the thread span.
```python
with torch.profiler.record_function("run_batch"):
    with self.forward_stream_ctx:
        self.forward_stream.wait_stream(self.schedule_stream)
        if trace_enabled:
```
If the function parameters are simple, there’s no need to check trace_enabled here.
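The reviewer's point can be sketched as follows (the helper name `trace_slice_start` and the flag are illustrative, not the exact sglang API): when the trace helper is itself a cheap no-op while tracing is disabled, the call site does not need its own `trace_enabled` guard unless building the arguments is expensive.

```python
_trace_enabled = False  # toggled once at startup in this sketch


def trace_slice_start(name: str, ts: float):
    # The helper checks the flag internally, so callers may invoke it
    # unconditionally; with tracing off it returns immediately.
    if not _trace_enabled:
        return None
    return (name, ts)


# Call site: no `if trace_enabled:` wrapper needed when the
# parameters are simple to construct.
span = trace_slice_start("run_batch", 0.0)
```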
```python
# pipeline parallelism
PP_FORWARD = RequestStageConfig(
    "pp_forward",
    level=2,
```
Under prefill_forward and decode_forward, there may be chunked_prefill and decode_loop, so this span will be attached one level deeper. Let’s set the level to 4 for now. I’ll refactor the levels consistently later.
And please update scripts/convert_otel_2_perfetto.py:

```diff
diff --git a/scripts/convert_otel_2_perfetto.py b/scripts/convert_otel_2_perfetto.py
index 3a82969a4..89534a38f 100644
--- a/scripts/convert_otel_2_perfetto.py
+++ b/scripts/convert_otel_2_perfetto.py
@@ -237,6 +237,10 @@ def generate_perfetto_span(engine_root_spans, smg_otel_spans, thread_meta_data):
     pid = int(thread_span["attributes"]["pid"])
     host_id = thread_span["attributes"]["host_id"]
     thread_name = f'{thread_span["attributes"]["host_id"][:8]}:{thread_span["attributes"]["thread_label"]}'
+    if "pp_rank" in thread_span["attributes"]:
+        thread_name += f"-PP{thread_span['attributes']['pp_rank']}"
+    if "dp_rank" in thread_span["attributes"]:
+        thread_name += f"-DP{thread_span['attributes']['dp_rank']}"
     if "tp_rank" in thread_span["attributes"]:
         thread_name += f"-TP{thread_span['attributes']['tp_rank']}"
```
Force-pushed from a16345c to 358e3cd
Force-pushed from d1e2348 to a71c9e8
@sufeng-buaa I have resolved the reviews above. Could you please review again?
```python
    attrs=attrs,
)
result = self.run_batch(self.cur_batch, pp_proxy_tensors)
set_time_batch(
    self.cur_batch.reqs,
    "set_run_batch_cpu_start_time",
    trace_only=True,
    attrs=attrs,
```

Suggested change:

```python
set_time_batch(
    self.cur_batch.reqs,
    "set_run_batch_cpu_start_time",
    trace_only=True,
    attrs={"pp_mb_id": mb_id},
)
```
No attrs need to be passed in
```python
    mb_metadata: List[Optional[PPBatchMetadata]],
    last_rank_comm_queue: deque,
):
    attrs = (
```
Simple attributes can be passed directly to set_time_batch()
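As a sketch of the suggested simplification (with `set_time_batch` stubbed out to echo its arguments; names are illustrative): rather than building an `attrs` variable first, pass the small literal dict at the call site.

```python
def set_time_batch(reqs, method_name, ts, trace_only=False, attrs=None):
    # Stub of the real helper; it just echoes what it was called with
    # so the call shape is easy to verify.
    return (method_name, ts, attrs)


mb_id = 3
# Simple attributes passed inline, no intermediate variable:
result = set_time_batch([], "set_pp_forward_start_time", 1.0,
                        trace_only=True, attrs={"pp_mb_id": mb_id})
```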
Force-pushed from 7a5b731 to c974f66
@sufeng-buaa OK, the code has been simplified.
/tag-and-rerun-ci
ShangmingCai left a comment:
Please fix this test.
Begin (11/17):

```text
python3 /home/runner/work/sglang/sglang/test/registered/unit/observability/test_trace.py
.
.
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.9.1+cu130).
..E
======================================================================
ERROR: test_trace_thread_context (__main__.TestDataclasses)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/sglang/sglang/test/registered/unit/observability/test_trace.py", line 82, in test_trace_thread_context
    info = TraceThreadInfo("h", 1, "l", 0, 0)
TypeError: TraceThreadInfo.__init__() missing 1 required positional argument: 'pp_rank'
----------------------------------------------------------------------
Ran 3 tests in 0.000s

FAILED (errors=1)
```
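The failure indicates `TraceThreadInfo` gained a required `pp_rank` parameter. One way to avoid breaking existing five-argument call sites is a defaulted field, sketched below as a dataclass; field names other than `pp_rank` are guesses from the test call, not the actual class definition:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceThreadInfo:
    # Field names besides pp_rank are hypothetical.
    host_id: str
    pid: int
    thread_label: str
    tp_rank: int
    dp_rank: int
    pp_rank: Optional[int] = None  # default keeps old constructor calls valid


# The old five-argument construction from the failing test now succeeds:
info = TraceThreadInfo("h", 1, "l", 0, 0)
```

Alternatively, the test itself can be updated to pass `pp_rank` explicitly, which is what the follow-up commit appears to do.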
Signed-off-by: Yinzuo Jiang <jiangyinzuo@foxmail.com>
Force-pushed from c974f66 to 59d2aa2
@ShangmingCai This unit test has been fixed, but CI still has some errors; are they related to this PR?
/rerun-failed-ci
…sm (sgl-project#23169)
Signed-off-by: Yinzuo Jiang <jiangyinzuo@foxmail.com>
Motivation
Implement PP OpenTelemetry tracing as mentioned in roadmap #13511
Modifications
Add pp_forward metrics.
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci