[Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency

### Checklist

- [ ] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- [ ] 2. Please use English, otherwise it will be closed.

### Motivation

This issue addresses "fine-grained profiling for PD with EP/DP/PP", a sub-task of [#8210](https://github.com/sgl-project/sglang/issues/8210) "[Roadmap] Distributed Serving Enhancement on 2025 H2." cc @stmatengss

SGLang currently lacks fine-grained request tracing. After analyzing the performance of numerous SGLang-based LLM inference scenarios over the past few months, we found request tracing to be crucial. Although the PyTorch Profiler can also trace execution, it cannot collect performance data over extended periods and cannot observe parallelism.

With SGLang tracing, we can obtain the following information:
- Latency of each execution segment within a request
- Parallel execution status of requests (TP/DP/PP/EP)
- Interactions between requests across multiple nodes and parallel threads (PD-Disaggregation)
- Thread behavior in parallel execution (e.g., whether requests are backing up, whether resources are underutilized)

This FR is a preview intended to demonstrate our results and gather feedback for improvements.
We have implemented a PoC, but because of the following pending tasks, the official PR will be submitted in 2 to 3 weeks:
- Comprehensive stability testing
- Design document completion
- Additional instrumentation for DP/PP and error handling

**Proposed Solution Highlights**

1. Modular Tracing Framework
> - Implemented a complete tracing package that provides a set of simple APIs. Developers can use these APIs to customize their tracing workflows easily and flexibly. We have pre-instrumented key points in the request execution paths.

2. OpenTelemetry Integration
> - Generated spans via OpenTelemetry APIs to natively support OpenTelemetry Collector integration.
> - Resolved the single-context tracking limitation in OpenTelemetry, enabling simultaneous tracing of multiple request contexts even when continuous batching causes misaligned request execution.

3. Distributed System Support
> - Implemented multi-node tracking in PD-Disaggregation scenarios (mini-LB, prefill/decode nodes).
> - Implemented intra-request concurrency tracing for TP. Tracking for DP, PP, and EP is under development.

4. Multi-Format Visualization
> - Jaeger/Zipkin: Request-centric view
> - Perfetto: Thread-centric view. We even implemented the capability to merge trace data with PyTorch Profiler data.
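
The multi-context point under OpenTelemetry Integration can be illustrated without any tracing library. The sketch below is not the PoC's API: `Segment`, `RequestTraceTable`, and their methods are hypothetical names chosen for illustration. It assumes that the core of the problem is that a single implicit "current span" expects spans on a thread to close in last-in, first-out order, which continuous batching does not guarantee, and it shows one way around that by keying open segments on (request, thread).

```python
import time
from typing import Dict, List, Optional, Tuple


class Segment:
    """One timed execution segment of a request (hypothetical record type)."""

    def __init__(self, request_id: str, thread: str, name: str, start_ns: int):
        self.request_id = request_id
        self.thread = thread
        self.name = name
        self.start_ns = start_ns
        self.end_ns: Optional[int] = None


class RequestTraceTable:
    """Keeps one open segment per (request, thread), not one global current span."""

    def __init__(self) -> None:
        self._open: Dict[Tuple[str, str], Segment] = {}
        self.finished: List[Segment] = []

    def begin(self, request_id: str, thread: str, name: str,
              now_ns: Optional[int] = None) -> None:
        key = (request_id, thread)
        if key in self._open:
            raise RuntimeError(f"segment already open for {key}")
        start = time.monotonic_ns() if now_ns is None else now_ns
        self._open[key] = Segment(request_id, thread, name, start)

    def end(self, request_id: str, thread: str,
            now_ns: Optional[int] = None) -> Segment:
        # Raises KeyError if begin() was never called for this key.
        seg = self._open.pop((request_id, thread))
        seg.end_ns = time.monotonic_ns() if now_ns is None else now_ns
        self.finished.append(seg)
        return seg


# Two requests share one scheduler thread. req-1 starts first and also ends
# first, so the segments do not close in last-in, first-out order.
table = RequestTraceTable()
table.begin("req-1", "scheduler", "prefill", now_ns=0)
table.begin("req-2", "scheduler", "prefill", now_ns=10)
table.end("req-1", "scheduler", now_ns=30)
table.end("req-2", "scheduler", now_ns=50)

durations = {seg.request_id: seg.end_ns - seg.start_ns for seg in table.finished}
```

In OpenTelemetry terms, the analogous move is presumably to carry a context object per request and pass it explicitly when starting each span, instead of relying on the implicit current context.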

**Visualization Preview**
- Jaeger

Requests are organized as the first level, threads as the second level (for observing parallelism), and execution segments as the third level.
Below is an example for PD-Disaggregation with TP=1.

<img width="2553" height="1001" alt="Image" src="https://github.com/user-attachments/assets/418a847a-1536-443c-8d89-ea9db59e3145" />

Below is an example without PD-Disaggregation, with TP=2.

<img width="2552" height="792" alt="Image" src="https://github.com/user-attachments/assets/7ff93159-8b45-4a09-b454-c51ef8e9b67c" />

- Perfetto

Threads are organized as the first level, concurrently executing request segments are placed on second-level lines, and links interconnect all execution segments of a single request.
Below is an example for PD-Disaggregation with TP=1.
<img width="2306" height="770" alt="Image" src="https://github.com/user-attachments/assets/df41e88e-37ea-471c-8519-b78fa27ca2bf" />

<img width="2307" height="780" alt="Image" src="https://github.com/user-attachments/assets/02b79db0-a580-4262-a790-9b77d5a37f6a" />
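
The thread-centric layout described above can be approximated with the Chrome Trace Event JSON format, which Perfetto can import. The sketch below is an assumption about one possible encoding, not the PoC's exporter: `Segment` and `to_trace_events` are hypothetical names, each thread is mapped to a `pid`, each request on that thread gets its own `tid` lane so overlapping segments do not collide, and flow events (`"ph": "s"` and `"ph": "f"`) link consecutive segments of the same request.

```python
import json
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Segment(NamedTuple):
    """One finished execution segment (hypothetical record type)."""
    request_id: str
    thread: str
    name: str
    start_us: int
    dur_us: int


def to_trace_events(segments: List[Segment]) -> List[dict]:
    pids: Dict[str, int] = {}               # thread name -> first-level group
    lanes: Dict[Tuple[str, str], int] = {}  # (thread, request) -> second-level line
    events: List[dict] = []
    by_request = defaultdict(list)

    for seg in segments:
        if seg.thread not in pids:
            pids[seg.thread] = len(pids) + 1
            # Metadata event that labels the group with the thread name.
            events.append({"ph": "M", "name": "process_name",
                           "pid": pids[seg.thread], "args": {"name": seg.thread}})
        lane_key = (seg.thread, seg.request_id)
        if lane_key not in lanes:
            lanes[lane_key] = len(lanes) + 1
        # Complete event: one slice with a start timestamp and a duration (microseconds).
        events.append({"ph": "X", "name": seg.name, "cat": "request",
                       "pid": pids[seg.thread], "tid": lanes[lane_key],
                       "ts": seg.start_us, "dur": seg.dur_us,
                       "args": {"request_id": seg.request_id}})
        by_request[seg.request_id].append(seg)

    flow_id = 0
    for request_id, segs in by_request.items():
        segs.sort(key=lambda s: s.start_us)
        for prev, nxt in zip(segs, segs[1:]):
            flow_id += 1
            common = {"name": request_id, "cat": "request_flow", "id": flow_id}
            # Flow start inside the earlier slice, flow end inside the later one.
            events.append({**common, "ph": "s", "pid": pids[prev.thread],
                           "tid": lanes[(prev.thread, request_id)], "ts": prev.start_us})
            events.append({**common, "ph": "f", "bp": "e", "pid": pids[nxt.thread],
                           "tid": lanes[(nxt.thread, request_id)], "ts": nxt.start_us})
    return events


segments = [
    Segment("req-1", "prefill", "prefill_forward", 0, 400),
    Segment("req-2", "prefill", "prefill_forward", 100, 500),
    Segment("req-1", "decode", "decode_forward", 450, 300),
    Segment("req-2", "decode", "decode_forward", 650, 300),
]
events = to_trace_events(segments)
trace_json = json.dumps({"traceEvents": events})
```

Writing `trace_json` to a file and opening it in the Perfetto UI should show one group per thread with one line per request. Giving every request its own lane is a simplification; a real exporter would likely reuse lanes once a request finishes.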

**Real-World Impact**
Over the past few months, we have leveraged this functionality to address numerous challenges, such as resource capacity planning and long-tail latency analysis.


### Related resources

_No response_
