[RFC] Native gRPC Server for SGLang in Rust #22558

@alexnails

Description

Motivation

SGLang's primary API surface is a FastAPI/Uvicorn HTTP server. The existing smg-grpc-servicer package provides an alternative standalone gRPC server (via --grpc-mode), but it replaces the HTTP server entirely and still performs all request parsing and response serialization in Python under the GIL. Neither path addresses the fundamental bottleneck:

  • Serialization overhead. Every request/response traverses JSON serialization, Python dict construction, and Pydantic validation — all under the GIL.
  • No native streaming contract. HTTP SSE is a text-based protocol bolted onto HTTP/1.1. gRPC server-streaming over HTTP/2 provides multiplexed, binary-framed, flow-controlled streams with typed messages.
  • GIL contention at the API boundary. HTTP path acquires the GIL for request parsing, tokenization, argument normalization, and response serialization. SMG gRPC avoids server-side tokenization for Generate/Embed (expects tokenized input) but still uses Python for request conversion, sampling param normalization, and scheduler IPC serialization. Under high concurrency these become the bottleneck — not the GPU.
  • Dual-protocol gap. With smg-grpc-servicer, operators must choose HTTP or gRPC — they cannot serve both simultaneously from the same process. Production deployments (Kubernetes, Triton-adjacent stacks, microservice meshes) often need both.

Why Rust + In-Process?

An in-process Rust gRPC server (via PyO3/Maturin) serves both protocols simultaneously while progressively moving GIL-bound work out of Python:

| Concern | HTTP (current) | smg-grpc-servicer | Native Rust gRPC (this RFC) |
|---|---|---|---|
| Protocol | HTTP/1.1 + JSON | gRPC (standalone) | gRPC (alongside HTTP) |
| Process model | In-process | In-process (replaces HTTP) | In-process (additive) |
| Dual-protocol | HTTP only | gRPC only | HTTP + gRPC simultaneously |
| GIL for tokenize | Yes | No | No (Rust tokenizers crate) |
| GIL for request parse | Yes | Yes | No (proto → dict in Rust) |
| GIL for response delivery | Yes (SSE serialization) | Yes | Minimal (brief callback only) |

Proposed Change

Architecture Overview

A new sglang-grpc Rust crate (built with Maturin as a Python extension module) embeds a Tonic gRPC server that runs in a background thread with its own Tokio runtime. It communicates with the existing Python TokenizerManager through a thin RuntimeHandle bridge, progressively reducing GIL acquisition across five phases.

                    ┌──────────────────────────────────────────────────┐
                    │                  SGLang Process                   │
                    │                                                   │
   gRPC clients ───►│  ┌─────────────────┐    ┌─────────────────────┐  │
                    │  │  Rust gRPC       │    │  Python             │  │
                    │  │  (Tonic/Tokio)   │───►│  RuntimeHandle      │  │
                    │  │                  │    │  (grpc_bridge.py)   │  │
                    │  │  • Proto decode  │    │                     │  │
                    │  │  • Rust tokenize │    │  TokenizerManager   │  │
                    │  │  • Dict build    │    │  Scheduler          │  │
                    │  │  • Stream via    │    │  DetokenizerManager │  │
                    │  │    crossbeam ch. │◄───│                     │  │
                    │  └─────────────────┘    └─────────────────────┘  │
                    │                                                   │
   HTTP clients ───►│  FastAPI / Uvicorn (unchanged)                   │
                    └──────────────────────────────────────────────────┘

Proto Definition

A new proto/sglang/runtime/v1/sglang.proto defines the full service contract:

SGLang-native RPCs (typed proto messages):

  • TextGenerate / Generate — server-streaming text/token generation
  • TextEmbed / Embed — unary embedding
  • Classify — unary classification
  • Tokenize / Detokenize — local tokenization (no inference)
  • HealthCheck, GetModelInfo, GetServerInfo, ListModels, GetLoad
  • Abort, FlushCache, PauseGeneration, ContinueGeneration

OpenAI-compatible RPCs (JSON pass-through):

  • ChatComplete / Complete — server-streaming (SSE → gRPC stream)
  • OpenAIEmbed, OpenAIClassify, Score, Rerank — unary

Admin RPCs:

  • StartProfile / StopProfile, UpdateWeightsFromDisk

Directory Layout

proto/sglang/runtime/v1/
  sglang.proto                          # Service + message definitions

rust/sglang-grpc/
  Cargo.toml                            # Rust dependencies (tonic, pyo3, tokenizers, etc.)
  build.rs                              # tonic-build proto compilation
  pyproject.toml                        # Maturin build config
  src/
    lib.rs                              # PyO3 module: start_server(), GrpcServerHandle
    server.rs                           # Tonic service impl (all RPCs)
    bridge.rs                           # PyBridge: channels, callbacks, PyO3 ↔ Python
    tokenizer.rs                        # Rust-native tokenizer (HF tokenizers crate)
    sglang_grpc/__init__.py             # Python package re-export

python/sglang/srt/entrypoints/
  grpc_bridge.py                        # RuntimeHandle: async bridge to TokenizerManager
  http_server.py                        # Modified: starts gRPC alongside HTTP

test/registered/core/
  test_grpc_server.py                   # Integration tests (gRPC + HTTP coexistence)

Key Design Decisions

  1. Dual-protocol by default. When the sglang-grpc package is installed, the gRPC server starts automatically alongside HTTP (on --port + 10000). No flag required. Opt out with --disable-grpc. If the package is not installed, the server runs HTTP-only with an info-level log message.
  2. Callback-based response channel. Rust creates a per-request crossbeam::bounded(64) channel, passes a PyO3 callback object to Python. Python's async generators invoke the callback to push chunks into the channel. Rust's Tokio tasks drain chunks via spawn_blocking(|| receiver.recv()). The GIL is held by the Python thread only during the brief callback invocation — the Rust/Tokio side never acquires the GIL to read responses.
  3. Consolidated dict submission. Instead of per-field PyO3 kwargs, Rust builds a HashMap<String, serde_json::Value> in Rust, converts it to a Python dict in a single Python::with_gil block, and passes it to GenerateReqInput(**dict). Dict construction, callback creation, and the submit_request call all happen within one GIL acquisition rather than requiring separate GIL round-trips per field.
  4. Rust-native tokenization with Python fallback. The tokenizers crate (same Rust library underlying Python's tokenizers package) is loaded at startup from tokenizer.json. Tokenize/Detokenize RPCs execute entirely in Rust with zero GIL. Falls back to Python if tokenizer.json is unavailable.
  5. OpenAI pass-through. OpenAI-compat RPCs send raw JSON bytes to Python, which handles Pydantic parsing and template application. This preserves full compatibility without duplicating the complex OpenAI serving logic in Rust.
  6. ZMQ as Phase 4 transport with shared memory as a future option. Phase 4 uses ZMQ IPC for Rust→Scheduler communication. ZMQ adds ~5-10μs per request (kernel syscall + msgpack ser/de) vs ~200ns for shared memory ring buffers. This overhead is negligible against inference latency (10-100ms) and ZMQ is already in the codebase with native Python bindings. The transport layer is abstracted behind a serialization boundary — both sides produce/consume [u8] byte slices. Swapping to shared memory later requires only replacing the transport implementation, not the serialization format or request lifecycle. We start with ZMQ because: (a) Scheduler already speaks it, (b) it has built-in backpressure (HWM), observability, and error recovery, (c) the Python side needs zero new dependencies. Shared memory would require a custom ring buffer, a Python polling loop or ctypes wrapper, manual backpressure, and crash-safety logic for corrupted shared regions. This is a two-way door — if benchmarks show ZMQ syscall overhead matters at scale, we swap the transport without touching the protocol.
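
The callback-based response channel in decision 2 can be sketched in pure Python (a toy model: `queue.Queue` stands in for the Rust-side `crossbeam::bounded(64)` channel, a plain function stands in for the PyO3 callback, and the main loop stands in for the Tokio task draining via `spawn_blocking`; all names here are illustrative, not the actual API):

```python
import queue
import threading

# Stand-in for the Rust-side bounded crossbeam channel (capacity 64).
chunks: "queue.Queue[dict | None]" = queue.Queue(maxsize=64)

def rust_callback(chunk: dict) -> None:
    """What the PyO3 callback does: push one chunk into the channel.
    The GIL is held only for the duration of this brief call."""
    chunks.put(chunk)

def python_producer() -> None:
    # Models the Python async generator driving the response stream.
    for i in range(3):
        rust_callback({"text": f"tok{i}", "finished": i == 2})
    chunks.put(None)  # sentinel: stream closed

threading.Thread(target=python_producer).start()

# Models the Rust/Tokio side draining chunks without ever taking the GIL.
received = []
while (chunk := chunks.get()) is not None:
    received.append(chunk)
```

The bounded capacity gives natural backpressure: if the gRPC client falls behind, the Python producer blocks on `put()` rather than buffering unboundedly.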

Plan (5 Phases)

Phase 1: Rust-Native Tokenization + Consolidated Python Bridge

Goal: Ship a working gRPC server that runs alongside HTTP. Eliminate GIL for Tokenize/Detokenize RPCs. Consolidate per-field PyO3 calls into single-dict submission. Risk: Low.

What's implemented (main...alexnails:sglang:alexnails/grpc):

| Task | Status (PR) | Files |
|---|---|---|
| Define proto service contract (sglang.proto) | Done | proto/sglang/runtime/v1/sglang.proto |
| Rust crate scaffold (Cargo.toml, build.rs, pyproject.toml, Maturin config) | Done | rust/sglang-grpc/ |
| Tonic gRPC server with all RPCs implemented | Done | rust/sglang-grpc/src/server.rs |
| PyBridge with crossbeam channels + callback pattern | Done | rust/sglang-grpc/src/bridge.rs |
| Rust-native tokenizer with Python fallback | Done | rust/sglang-grpc/src/tokenizer.rs |
| PyO3 module entry point (start_server, GrpcServerHandle) | Done | rust/sglang-grpc/src/lib.rs |
| Python RuntimeHandle (async bridge to TokenizerManager) | Done | python/sglang/srt/entrypoints/grpc_bridge.py |
| HTTP server integration (auto-start gRPC alongside HTTP) | Done | python/sglang/srt/entrypoints/http_server.py |
| Server args (--grpc-port, --disable-grpc, --grpc-mode deprecation) | Done | python/sglang/srt/server_args.py |
| Integration tests (25 test methods across 2 test classes, gRPC + HTTP coexistence) | Done | test/registered/core/test_grpc_server.py |
| Optional dependency in pyproject.toml | Done | python/pyproject.toml |

Remaining Phase 1 tasks:

| Task | Status (PR) | Description |
|---|---|---|
| CI wheel build | | Add sglang-grpc to the CI build matrix (maturin build + publish) |
| Proto lint / buf.yaml | | Add buf linting config for proto style enforcement |
| gRPC reflection | | Enable gRPC server reflection for grpcurl / grpc_cli discoverability |
| TLS / mTLS support | | Add --grpc-tls-cert / --grpc-tls-key flags for encrypted transport |
| Connection keepalive tuning | | Expose Tonic keepalive settings via server args |
| Python stub generation | | Generate and package sglang_pb2.py / sglang_pb2_grpc.py from the proto, or automate via build step |
| Benchmarks | | Comparative latency/throughput benchmarks vs HTTP (tokenize, generate, streaming) |

Phase 2: Rust-Side Normalization and Validation

Goal: Move normalize_batch_and_arguments(), RID generation, field validation, and SamplingParams validation into Rust. Invalid requests never touch Python. Risk: Medium.

| Task | Status (PR) | Description |
|---|---|---|
| Port SamplingParams validation to Rust | | Validate ranges (temperature >= 0, top_p in [0,1], max_new_tokens > 0, etc.) in Rust before converting to dict. Return Status::INVALID_ARGUMENT for bad params. |
| Port RID generation to Rust | | Already partially done (uuid::Uuid::new_v4 in server.rs). Formalize as the canonical RID source for all gRPC requests. |
| Port normalize_batch_and_arguments() to Rust | | Handle batch/single normalization for text[], input_ids[], and mixed inputs. Rust returns a Vec of individual request dicts. |
| Field validation | | Validate required fields (text or input_ids present, not both empty), string length limits, token count limits against context_len. |
| Error mapping | | Map Rust validation errors to gRPC status codes: INVALID_ARGUMENT for bad params, RESOURCE_EXHAUSTED for context overflow, FAILED_PRECONDITION for server not ready. |
| Unit tests | | Rust-side unit tests for validation logic (no Python needed). |

GIL impact: After Phase 2, the GIL is only acquired for: (1) the single Python::with_gil block that builds the PyDict, creates GenerateReqInput(**dict), and calls into tokenizer_manager.generate_request(), and (2) Python-side tokenization and async dispatch within that coroutine. Invalid requests are rejected entirely in Rust before any GIL acquisition.


Phase 3: Rust Tokenization for Text-Input Generate

Goal: For the hot path (text-only, non-multimodal, non-LoRA generate requests), tokenize in Rust and call a Python fast-path that skips _tokenize_one_request(). GIL held only for ReqState + ZMQ send. Risk: Medium.

| Task | Status (PR) | Description |
|---|---|---|
| Detect fast-path eligibility | | In Rust, check: text input (not input_ids), no image/video/audio data, no LoRA adapter, no custom chat template. |
| Rust tokenization for generate | | Use the Rust tokenizers crate to tokenize the prompt text, producing input_ids and input_text in Rust. |
| Python fast-path entry point | | Add TokenizerManager.submit_pretokenized_request(input_ids, input_text, sampling_params_dict, rid, ...) that skips _tokenize_one_request() and goes directly to _send_one_request(). |
| Pad/truncate handling | | Handle context_len truncation in Rust (matching Python's behavior). |
| Special token handling | | Match Python's add_special_tokens behavior for the specific tokenizer (chat template BOS handling). |
| Equivalence tests | | Exhaustive tests comparing Rust fast-path output vs Python path for diverse inputs (unicode, long sequences, edge cases). |

GIL impact: After Phase 3, text-only generate requests acquire the GIL only for: (1) ReqState registration, (2) ZMQ send to scheduler.
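
The eligibility check from the Phase 3 task list can be sketched as a pure predicate (shown here in Python; the real check runs in Rust over the decoded proto, and the field names are illustrative):

```python
def is_fast_path_eligible(req: dict) -> bool:
    """Sketch of the Phase 3 fast-path gate: text-only, non-multimodal,
    non-LoRA generate requests may be tokenized in Rust."""
    return (
        req.get("text") is not None       # text input, not pre-tokenized ids
        and not req.get("input_ids")
        and not req.get("image_data")     # no multimodal inputs
        and not req.get("video_data")
        and not req.get("audio_data")
        and not req.get("lora_path")      # no LoRA adapter
        and not req.get("chat_template")  # no custom chat template
    )
```

Requests that fail the gate simply fall through to the existing Python path, so the fast path never has to handle the long tail of features.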


Phase 4: Direct ZMQ to Scheduler

Goal: For pre-tokenized input_ids requests, Rust sends directly to the Scheduler via ZMQ using msgpack serialization. GIL held only for ReqState registration. Transport is abstracted so ZMQ can be swapped for shared memory without protocol changes. Risk: High.

| Task | Status (PR) | Description |
|---|---|---|
| ZMQ socket in Rust | | Open a ZMQ PUSH socket in Rust connecting to the Scheduler's IPC endpoint. Use zmq.rs or zeromq Rust crate. |
| Msgpack serialization | | Serialize TokenizedGenerateReqInput equivalent in Rust using rmp-serde, matching the exact format Python's Scheduler expects. |
| Transport abstraction | | Define SchedulerTransport trait in Rust (send/recv over [u8]). ZMQ as first implementation. Shared memory ring buffer as future drop-in replacement if benchmarks justify it. |
| Dual-format recv loop in Scheduler | | Modify Scheduler's recv_requests() (currently uses recv_pyobj / pickle) to also accept msgpack-serialized requests from Rust. Discriminate via a 1-byte format tag prefix. |
| ReqState registration | | Still requires GIL: create ReqState object in Python, register in tokenizer_manager.rid_to_state. Explore moving to a Rust-side registry with Python callback for cleanup. |
| Backpressure | | Implement ZMQ HWM (high-water mark) and flow control to prevent Rust from overwhelming the Scheduler. |
| Fallback | | If ZMQ send fails, fall back to the Python path transparently. |
| Integration tests | | Test mixed workloads: some requests via Rust ZMQ, some via Python path, verify ordering and correctness. |

GIL impact: After Phase 4, pre-tokenized requests acquire the GIL only for ReqState registration (~1 μs).
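
The 1-byte format-tag discrimination for the dual-format recv loop can be sketched as follows (a toy model: JSON stands in for msgpack so the example is dependency-free, and the tag byte values are illustrative, not the actual wire format):

```python
import json
import pickle

TAG_PICKLE = b"\x00"   # legacy Python-originated requests (illustrative value)
TAG_MSGPACK = b"\x01"  # Rust-originated requests; JSON stands in for msgpack here

def encode(req: dict, from_rust: bool) -> bytes:
    """Prefix the serialized payload with a 1-byte format tag."""
    if from_rust:
        return TAG_MSGPACK + json.dumps(req).encode()
    return TAG_PICKLE + pickle.dumps(req)

def decode(frame: bytes) -> dict:
    """What the Scheduler's recv loop would do: branch on the format tag,
    then deserialize with the matching codec."""
    tag, payload = frame[:1], frame[1:]
    if tag == TAG_MSGPACK:
        return json.loads(payload)
    if tag == TAG_PICKLE:
        return pickle.loads(payload)
    raise ValueError(f"unknown format tag: {tag!r}")
```

Because both codecs sit behind the same tag dispatch, Python-originated pickle frames and Rust-originated frames can interleave on one socket without renegotiation.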


Phase 5: Rust Response Loop (Full Python Bypass)

Goal: Rust receives responses directly from DetokenizerManager via a second ZMQ PULL socket. Zero GIL acquisition for the full request-response path. Risk: Very high.

| Task | Status (PR) | Description |
|---|---|---|
| New IPC endpoint in DetokenizerManager | | Add a second ZMQ PUSH socket in DetokenizerManager that sends responses to a Rust-side PULL socket (in addition to the existing Python response path). |
| Response routing | | DetokenizerManager checks whether a RID was registered by Rust or Python and routes accordingly. Requires a shared RID registry or a routing tag in the request. |
| Rust-side response deserialization | | Deserialize DetokenizerManager's response format (msgpack or custom) in Rust. Extract text, meta_info, finish_reason. |
| Rust-side ReqState management | | Move ReqState tracking entirely to Rust. Python only needs to be notified for cleanup (LoRA adapter release, etc.) via a batched cleanup callback. |
| Streaming response assembly | | For streaming requests, Rust assembles incremental text deltas and pushes them directly to the gRPC stream via the crossbeam channel — no Python involved. |
| Metrics / observability | | Expose Rust-side request latency, queue depth, and throughput metrics via Prometheus endpoint or gRPC health service. |
| Graceful degradation | | If the Rust response loop encounters an error, fall back to the Python callback path for that request. |
| Full end-to-end tests | | Test the complete Rust-only path: gRPC request → Rust tokenize → Rust ZMQ to Scheduler → Scheduler → DetokenizerManager → Rust ZMQ response → gRPC stream. |

GIL impact: After Phase 5, the full request-response hot path acquires zero GIL. Python is only involved for cold-path operations (model loading, LoRA management, OpenAI template application, multimodal preprocessing).
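
The shared-RID-registry variant of Phase 5 response routing can be sketched like this (names and registry shape are illustrative; the real registry would be shared between the request and response paths, with cleanup handled separately once a request finishes):

```python
# RIDs whose requests entered via the Rust gRPC path.
rust_rids: set[str] = set()

def register_rust_request(rid: str) -> None:
    """Called when Rust submits a request directly to the Scheduler."""
    rust_rids.add(rid)

def route_response(rid: str) -> str:
    """What DetokenizerManager would do per response chunk: send
    Rust-originated RIDs to the Rust PULL socket, everything else
    down the existing Python path."""
    return "rust_zmq" if rid in rust_rids else "python"
```

The alternative mentioned in the task table, a routing tag carried in the request itself, avoids the shared registry at the cost of threading one extra field through the Scheduler.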


Phase Summary

"GIL-bound segments" counts the distinct stages of a request's lifecycle where Python holds the GIL, limiting concurrency with other requests. Rust-side Python::with_gil blocks count as one segment each; Python-internal work (tokenization, async dispatch) counts when it contends with other requests.

| Phase | GIL-Bound Segments (text generate hot path) | What Moves to Rust | Risk |
|---|---|---|---|
| 1 (current) | ~3 (1 Rust→Python submit with dict build + callback setup, 1 Python-side tokenize + async dispatch, 1 per response callback) | Tokenize/Detokenize RPCs, proto decode, dict construction | Low |
| 2 | ~3 (same as Phase 1, but invalid requests rejected before any GIL) | Validation, normalization, RID gen | Medium |
| 3 | ~2 (1 Rust→Python submit with pre-tokenized input, 1 per response callback) | Text tokenization for generate hot path | Medium |
| 4 | ~1 (ReqState registration only) | ZMQ send to Scheduler | High |
| 5 | 0 (hot path) | Response loop, ReqState management | Very High |

Usage

gRPC starts automatically alongside HTTP (default: port + 10000)

python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct

Custom gRPC port

python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --grpc-port 50051

Disable gRPC (HTTP only)

python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --disable-grpc

Client Example (Python + grpcio)

Generate Python stubs from the proto definition first:

python -m grpc_tools.protoc \
  -I proto \
  --python_out=. --grpc_python_out=. \
  proto/sglang/runtime/v1/sglang.proto

Then use the generated stubs:

import grpc
from sglang.runtime.v1 import sglang_pb2, sglang_pb2_grpc

channel = grpc.insecure_channel("localhost:40000")
stub = sglang_pb2_grpc.SglangServiceStub(channel)

TextGenerate (server-streaming)

request = sglang_pb2.TextGenerateRequest(
    text="Explain quantum computing in one sentence.",
    sampling_params=sglang_pb2.SamplingParams(
        temperature=0.7,
        max_new_tokens=64,
    ),
    stream=True,
)

for response in stub.TextGenerate(request):
    print(response.text, end="", flush=True)
    if response.finished:
        break

Tokenize (unary, Rust-native, zero GIL)

tok_resp = stub.Tokenize(sglang_pb2.TokenizeRequest(text="Hello, world!"))
print(f"\nTokens: {tok_resp.tokens}, Count: {tok_resp.count}")

Note: Python stub generation (`sglang_pb2.py`, `sglang_pb2_grpc.py`) is not yet automated. The integration tests use raw protobuf wire encoding via `grpcio` directly. Adding pre-generated or build-time-generated stubs is a remaining Phase 1 task.

Backward Compatibility

  • HTTP API is unchanged. All existing HTTP endpoints continue to work identically.
  • --grpc-mode is deprecated with a `DeprecationWarning`. It now sets `smg_grpc = True` internally, preserving the existing behavior of launching the standalone smg-grpc-servicer server instead of HTTP.
  • Port allocation. The gRPC port defaults to --port + 10000 (e.g., 30000 → 40000), avoiding conflicts with existing deployments.

Related Work

  • smg-grpc-servicer — Existing standalone gRPC server package (triggered via --grpc-mode / --smg-grpc). It runs in-process but replaces the HTTP server, so operators must choose one protocol or the other, and it offers no path to the tight coupling needed for future performance wins across the stack. This RFC's native server complements it by running gRPC alongside HTTP and progressively moving GIL-bound work to Rust, in line with the broader Rust migration. Backward compatibility is preserved via --smg-grpc.
  • sglang-router (Rust) — The existing Rust router crate demonstrates the PyO3/Maturin pattern in the SGLang ecosystem. This RFC follows the same build/packaging conventions.
  • vLLM gRPC — vLLM offers a gRPC server via grpc_server.py (Python grpcio). This RFC's Rust implementation provides lower per-request overhead by avoiding GIL acquisition for serialization and tokenization.
  • Transport alternatives considered — Shared memory ring buffers offer lower per-request latency (~200ns vs ~5-10μs for ZMQ) by eliminating kernel syscalls, but require custom infrastructure on the Python side and provide no built-in backpressure, observability, or crash recovery. Apache Arrow Flight was considered for batch-oriented zero-copy data exchange but is unnecessarily complex for single-request submission. The ZMQ→shared memory path is preserved as a two-way door via a transport abstraction layer.
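
The transport abstraction behind this two-way door can be rendered in Python for illustration (the Rust version would be a SchedulerTransport trait over [u8] slices; the class names and the toy in-memory implementation here are illustrative):

```python
from abc import ABC, abstractmethod
from collections import deque

class SchedulerTransport(ABC):
    """Python rendering of the proposed Rust SchedulerTransport trait:
    both sides exchange opaque byte payloads, so ZMQ can later be swapped
    for shared memory without touching the serialization format."""

    @abstractmethod
    def send(self, payload: bytes) -> None: ...

    @abstractmethod
    def recv(self) -> bytes: ...

class InMemoryTransport(SchedulerTransport):
    """Toy in-process implementation standing in for the ZMQ one."""

    def __init__(self) -> None:
        self._q: deque[bytes] = deque()

    def send(self, payload: bytes) -> None:
        self._q.append(payload)

    def recv(self) -> bytes:
        return self._q.popleft()

t = InMemoryTransport()
t.send(b"\x01{}")  # tagged, serialized request bytes pass through unchanged
```

Since the serialization boundary produces plain bytes, swapping implementations changes only where those bytes travel, not what they contain.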
