Checklist
Motivation
In the current multimodal generation (diffusion) pipeline, text encoders such as T5, UMT5, and CLIP are loaded as plain PyTorch modules and executed with a straightforward
forward pass (see TextEncodingStage in runtime/pipelines_core/stages/text_encoding.py). While functional, this approach does not leverage the highly optimized serving
capabilities that SGLang's LLM runtime (srt) already provides.
Many of these text encoders are, at their core, large language models. For the text encoding use case, they only require the prefill phase (i.e., a single forward pass
to produce hidden states / embeddings) — effectively acting as encoders. In some pipeline configurations, certain LLM components may additionally require
autoregressive decoding (e.g., for captioning or re-captioning stages).
By routing these LLM workloads through SGLang's srt runtime, we can unlock several performance benefits that would directly accelerate end-to-end generation latency and
throughput.
Proposed Feature
Allow the multimodal generation pipeline to use SGLang's LLM runtime (srt) to serve LLM-based components, where:
- Encoder-only models (e.g., T5 / UMT5 for text encoding) are served in prefill-only mode — no autoregressive decoding is needed; the runtime returns hidden
states or pooled embeddings after a single forward pass.
- Encoder-decoder or decoder models that require autoregressive generation are served in the standard decode mode.
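As a rough sketch of the distinction (the class and field names here are hypothetical, not an existing SGLang API), each LLM component could declare which serving mode it needs:

```python
from dataclasses import dataclass
from enum import Enum

class ServingMode(Enum):
    PREFILL_ONLY = "prefill_only"  # single forward pass, return hidden states / embeddings
    DECODE = "decode"              # standard autoregressive generation

@dataclass
class LLMComponentSpec:
    """Hypothetical per-component spec for routing an LLM workload through srt."""
    model_name: str
    mode: ServingMode

# Encoder-only text encoders never decode; captioning components may.
t5_encoder = LLMComponentSpec("umt5-xxl", ServingMode.PREFILL_ONLY)
captioner = LLMComponentSpec("llama-3-8b", ServingMode.DECODE)
```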
Expected Benefits
- RadixAttention / Prefix Caching: Repeated or partially overlapping prompts (common in batch generation or negative prompt reuse) can benefit from prefix cache hits,
avoiding redundant computation.
- Continuous Batching: Multiple encoding requests can be dynamically batched, improving GPU utilization and throughput under concurrent load.
- PagedAttention / Efficient KV Cache Management: Reduced memory fragmentation for large-batch encoding workloads.
- Quantization & Tensor Parallelism: Leverage srt's existing support for model quantization and multi-GPU tensor parallelism, enabling serving of larger text encoders with lower resource overhead.
- Unified Serving Infrastructure: A single SGLang deployment can serve both the LLM encoder components and the diffusion pipeline, simplifying operational complexity.
Current Behavior
Text encoders are loaded as raw torch.nn.Module instances inside GPUWorker and invoked directly in TextEncodingStage.encode_text():
# text_encoding.py L264-270
outputs: BaseEncoderOutput = text_encoder(
    input_ids=input_ids,
    attention_mask=attention_mask,
    output_hidden_states=True,
    use_cache=False,
)
This bypasses all srt runtime optimizations.
Possible Approach
- Introduce an option in PipelineConfig / ServerArgs to specify that certain LLM components should be served via srt (e.g., via a local or remote SGLang endpoint).
- The TextEncodingStage (and future stages requiring LLM inference) would call the srt runtime instead of directly invoking the Huggingface Transformers module.
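As a minimal sketch of what the remote path could look like, assuming a local srt server exposing its OpenAI-compatible embeddings endpoint (the URL, port, and model name below are placeholders, and the helper functions are hypothetical):

```python
import json
from urllib import request

SRT_URL = "http://localhost:30000/v1/embeddings"  # assumed local srt endpoint

def build_embedding_request(prompts, model="t5-xxl"):
    """Build the JSON payload for a batched embedding request."""
    return {"model": model, "input": prompts}

def encode_via_srt(prompts, url=SRT_URL):
    """Send prompts to the srt server; one embedding per prompt, in order."""
    payload = json.dumps(build_embedding_request(prompts)).encode()
    req = request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]
```

With this shape, TextEncodingStage would call `encode_via_srt` instead of invoking the module's forward pass, and repeated prompts (e.g., a shared negative prompt across a batch) would hit srt's prefix cache server-side.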
Related resources
No response