[RFC] Native gRPC Server for SGLang in Rust #22558

@alexnails

Description

Motivation

SGLang's primary API surface is a FastAPI/Uvicorn HTTP server. The existing smg-grpc-servicer package provides an alternative standalone gRPC server (via --grpc-mode), but it replaces the HTTP server entirely and still performs all request parsing and response serialization in Python under the GIL. Neither path addresses the fundamental bottleneck:

  • Serialization overhead. Every request/response traverses JSON serialization, Python dict construction, and Pydantic validation — all under the GIL.
  • No native streaming contract. HTTP SSE is a text-based protocol bolted onto HTTP/1.1. gRPC server-streaming over HTTP/2 provides multiplexed, binary-framed, flow-controlled streams with typed messages.
  • GIL contention at the API boundary. HTTP path acquires the GIL for request parsing, tokenization, argument normalization, and response serialization. SMG gRPC avoids server-side tokenization for Generate/Embed (expects tokenized input) but still uses Python for request conversion, sampling param normalization, and scheduler IPC serialization. Under high concurrency these become the bottleneck — not the GPU.
  • Dual-protocol gap. With smg-grpc-servicer, operators must choose HTTP or gRPC — they cannot serve both simultaneously from the same process. Production deployments (Kubernetes, Triton-adjacent stacks, microservice meshes) often need both.

Why Rust + In-Process?

An in-process Rust gRPC server (via PyO3/Maturin) serves both protocols simultaneously while progressively moving GIL-bound work out of Python:

| Concern | HTTP (current) | smg-grpc-servicer | Native Rust gRPC (this RFC) |
|---|---|---|---|
| Protocol | HTTP/1.1 + JSON | gRPC (standalone) | gRPC (alongside HTTP) |
| Process model | In-process | In-process (replaces HTTP) | In-process (additive) |
| Dual-protocol | HTTP only | gRPC only | HTTP + gRPC simultaneously |
| GIL for tokenize | Yes | No | No (Rust tokenizers crate) |
| GIL for request parse | Yes | Yes | No (proto → dict in Rust) |
| GIL for response delivery | Yes (SSE serialization) | Yes | Minimal (brief callback only) |

Proposed Change

Architecture Overview

A new sglang-grpc Rust crate (built with Maturin as a Python extension module) embeds a Tonic gRPC server that runs in a background thread with its own Tokio runtime. It communicates with the existing Python TokenizerManager through a thin RuntimeHandle bridge, progressively reducing GIL acquisition across five phases.

                    ┌──────────────────────────────────────────────────┐
                    │                  SGLang Process                   │
                    │                                                   │
   gRPC clients ───►│  ┌─────────────────┐    ┌─────────────────────┐  │
                    │  │  Rust gRPC       │    │  Python             │  │
                    │  │  (Tonic/Tokio)   │───►│  RuntimeHandle      │  │
                    │  │                  │    │  (grpc_bridge.py)   │  │
                    │  │  • Proto decode  │    │                     │  │
                    │  │  • Rust tokenize │    │  TokenizerManager   │  │
                    │  │  • Dict build    │    │  Scheduler          │  │
                    │  │  • Stream via    │    │  DetokenizerManager │  │
                    │  │    crossbeam ch. │◄───│                     │  │
                    │  └─────────────────┘    └─────────────────────┘  │
                    │                                                   │
   HTTP clients ───►│  FastAPI / Uvicorn (unchanged)                   │
                    └──────────────────────────────────────────────────┘

Proto Definition

A new proto/sglang/runtime/v1/sglang.proto defines the full service contract:

SGLang-native RPCs (typed proto messages):

  • TextGenerate / Generate — server-streaming text/token generation
  • TextEmbed / Embed — unary embedding
  • Classify — unary classification
  • Tokenize / Detokenize — local tokenization (no inference)
  • HealthCheck, GetModelInfo, GetServerInfo, ListModels, GetLoad
  • Abort, FlushCache, PauseGeneration, ContinueGeneration

OpenAI-compatible RPCs (JSON pass-through):

  • ChatComplete / Complete — server-streaming (SSE → gRPC stream)
  • OpenAIEmbed, OpenAIClassify, Score, Rerank — unary

Admin RPCs:

  • StartProfile / StopProfile, UpdateWeightsFromDisk

Directory Layout

proto/sglang/runtime/v1/
  sglang.proto                          # Service + message definitions

rust/sglang-grpc/
  Cargo.toml                            # Rust dependencies (tonic, pyo3, tokenizers, etc.)
  build.rs                              # tonic-build proto compilation
  pyproject.toml                        # Maturin build config
  src/
    lib.rs                              # PyO3 module: start_server(), GrpcServerHandle
    server.rs                           # Tonic service impl (all RPCs)
    bridge.rs                           # PyBridge: channels, callbacks, PyO3 ↔ Python
    tokenizer.rs                        # Rust-native tokenizer (HF tokenizers crate)
    sglang_grpc/__init__.py             # Python package re-export

python/sglang/srt/entrypoints/
  grpc_bridge.py                        # RuntimeHandle: async bridge to TokenizerManager
  http_server.py                        # Modified: starts gRPC alongside HTTP

test/registered/core/
  test_grpc_server.py                   # Integration tests (gRPC + HTTP coexistence)

Key Design Decisions

  1. Dual-protocol by default. When the sglang-grpc package is installed, the gRPC server starts automatically alongside HTTP (on --port + 10000). No flag required. Opt out with --disable-grpc. If the package is not installed, the server runs HTTP-only with an info-level log message.
  2. Callback-based response channel. Rust creates a per-request crossbeam::bounded(64) channel, passes a PyO3 callback object to Python. Python's async generators invoke the callback to push chunks into the channel. Rust's Tokio tasks drain chunks via spawn_blocking(|| receiver.recv()). The GIL is held by the Python thread only during the brief callback invocation — the Rust/Tokio side never acquires the GIL to read responses.
  3. Consolidated dict submission. Instead of per-field PyO3 kwargs, Rust builds a HashMap<String, serde_json::Value> in Rust, converts it to a Python dict in a single Python::with_gil block, and passes it to GenerateReqInput(**dict). Dict construction, callback creation, and the submit_request call all happen within one GIL acquisition rather than requiring separate GIL round-trips per field.
  4. Rust-native tokenization with Python fallback. The tokenizers crate (same Rust library underlying Python's tokenizers package) is loaded at startup from tokenizer.json. Tokenize/Detokenize RPCs execute entirely in Rust with zero GIL. Falls back to Python if tokenizer.json is unavailable.
  5. OpenAI pass-through. OpenAI-compat RPCs send raw JSON bytes to Python, which handles Pydantic parsing and template application. This preserves full compatibility without duplicating the complex OpenAI serving logic in Rust.
  6. ZMQ as Phase 4 transport with shared memory as a future option. Phase 4 uses ZMQ IPC for Rust→Scheduler communication. ZMQ adds ~5-10μs per request (kernel syscall + msgpack ser/de) vs ~200ns for shared memory ring buffers. This overhead is negligible against inference latency (10-100ms) and ZMQ is already in the codebase with native Python bindings. The transport layer is abstracted behind a serialization boundary — both sides produce/consume [u8] byte slices. Swapping to shared memory later requires only replacing the transport implementation, not the serialization format or request lifecycle. We start with ZMQ because: (a) Scheduler already speaks it, (b) it has built-in backpressure (HWM), observability, and error recovery, (c) the Python side needs zero new dependencies. Shared memory would require a custom ring buffer, a Python polling loop or ctypes wrapper, manual backpressure, and crash-safety logic for corrupted shared regions. This is a two-way door — if benchmarks show ZMQ syscall overhead matters at scale, we swap the transport without touching the protocol.
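
The callback-based response channel in decision 2 can be sketched in pure Python (a toy model: `queue.Queue` stands in for the Rust-side `crossbeam::bounded(64)` channel, a plain function stands in for the PyO3 callback, and the main loop stands in for the Tokio task draining via `spawn_blocking`; all names here are illustrative, not the actual API):

```python
import queue
import threading

# Stand-in for the Rust-side bounded crossbeam channel (capacity 64).
chunks: "queue.Queue[dict | None]" = queue.Queue(maxsize=64)

def rust_callback(chunk: dict) -> None:
    """What the PyO3 callback does: push one chunk into the channel.
    The GIL is held only for the duration of this brief call."""
    chunks.put(chunk)

def python_producer() -> None:
    # Models the Python async generator driving the response stream.
    for i in range(3):
        rust_callback({"text": f"tok{i}", "finished": i == 2})
    chunks.put(None)  # sentinel: stream closed

threading.Thread(target=python_producer).start()

# Models the Rust/Tokio side draining chunks without ever taking the GIL.
received = []
while (chunk := chunks.get()) is not None:
    received.append(chunk)
```

The bounded capacity gives natural backpressure: if the gRPC client falls behind, the Python producer blocks on `put()` rather than buffering unboundedly.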

Plan (5 Phases)

Phase 1: Rust-Native Tokenization + Consolidated Python Bridge

Goal: Ship a working gRPC server that runs alongside HTTP. Eliminate GIL for Tokenize/Detokenize RPCs. Consolidate per-field PyO3 calls into single-dict submission. Risk: Low.

What's implemented (main...alexnails:sglang:alexnails/grpc):

| Task | Status (PR) | Files |
|---|---|---|
| Define proto service contract (sglang.proto) | Done | proto/sglang/runtime/v1/sglang.proto |
| Rust crate scaffold (Cargo.toml, build.rs, pyproject.toml, Maturin config) | Done | rust/sglang-grpc/ |
| Tonic gRPC server with all RPCs implemented | Done | rust/sglang-grpc/src/server.rs |
| PyBridge with crossbeam channels + callback pattern | Done | rust/sglang-grpc/src/bridge.rs |
| Rust-native tokenizer with Python fallback | Done | rust/sglang-grpc/src/tokenizer.rs |
| PyO3 module entry point (start_server, GrpcServerHandle) | Done | rust/sglang-grpc/src/lib.rs |
| Python RuntimeHandle (async bridge to TokenizerManager) | Done | python/sglang/srt/entrypoints/grpc_bridge.py |
| HTTP server integration (auto-start gRPC alongside HTTP) | Done | python/sglang/srt/entrypoints/http_server.py |
| Server args (--grpc-port, --disable-grpc, --grpc-mode deprecation) | Done | python/sglang/srt/server_args.py |
| Integration tests (25 test methods across 2 test classes, gRPC + HTTP coexistence) | Done | test/registered/core/test_grpc_server.py |
| Optional dependency in pyproject.toml | Done | python/pyproject.toml |

Remaining Phase 1 tasks:

| Task | Status (PR) | Description |
|---|---|---|
| CI wheel build | | Add sglang-grpc to the CI build matrix (maturin build + publish) |
| Proto lint / buf.yaml | | Add buf linting config for proto style enforcement |
| gRPC reflection | | Enable gRPC server reflection for grpcurl / grpc_cli discoverability |
| TLS / mTLS support | | Add --grpc-tls-cert / --grpc-tls-key flags for encrypted transport |
| Connection keepalive tuning | | Expose Tonic keepalive settings via server args |
| Python stub generation | | Generate and package sglang_pb2.py / sglang_pb2_grpc.py from the proto, or automate via build step |
| Benchmarks | | Comparative latency/throughput benchmarks vs HTTP (tokenize, generate, streaming) |

Phase 2: Rust-Side Normalization and Validation

Goal: Move normalize_batch_and_arguments(), RID generation, field validation, and SamplingParams validation into Rust. Invalid requests never touch Python. Risk: Medium.

| Task | Status (PR) | Description |
|---|---|---|
| Port SamplingParams validation to Rust | | Validate ranges (temperature >= 0, top_p in [0,1], max_new_tokens > 0, etc.) in Rust before converting to dict. Return Status::INVALID_ARGUMENT for bad params. |
| Port RID generation to Rust | | Already partially done (uuid::Uuid::new_v4 in server.rs). Formalize as the canonical RID source for all gRPC requests. |
| Port normalize_batch_and_arguments() to Rust | | Handle batch/single normalization for text[], input_ids[], and mixed inputs. Rust returns a Vec of individual request dicts. |
| Field validation | | Validate required fields (text or input_ids present, not both empty), string length limits, token count limits against context_len. |
| Error mapping | | Map Rust validation errors to gRPC status codes: INVALID_ARGUMENT for bad params, RESOURCE_EXHAUSTED for context overflow, FAILED_PRECONDITION for server not ready. |
| Unit tests | | Rust-side unit tests for validation logic (no Python needed). |

GIL impact: After Phase 2, the GIL is only acquired for: (1) the single Python::with_gil block that builds the PyDict, creates GenerateReqInput(**dict), and calls into tokenizer_manager.generate_request(), and (2) Python-side tokenization and async dispatch within that coroutine. Invalid requests are rejected entirely in Rust before any GIL acquisition.


Phase 3: Rust Tokenization for Text-Input Generate

Goal: For the hot path (text-only, non-multimodal, non-LoRA generate requests), tokenize in Rust and call a Python fast-path that skips _tokenize_one_request(). GIL held only for ReqState + ZMQ send. Risk: Medium.

| Task | Status (PR) | Description |
|---|---|---|
| Detect fast-path eligibility | | In Rust, check: text input (not input_ids), no image/video/audio data, no LoRA adapter, no custom chat template. |
| Rust tokenization for generate | | Use the Rust tokenizers crate to tokenize the prompt text, producing input_ids and input_text in Rust. |
| Python fast-path entry point | | Add TokenizerManager.submit_pretokenized_request(input_ids, input_text, sampling_params_dict, rid, ...) that skips _tokenize_one_request() and goes directly to _send_one_request(). |
| Pad/truncate handling | | Handle context_len truncation in Rust (matching Python's behavior). |
| Special token handling | | Match Python's add_special_tokens behavior for the specific tokenizer (chat template BOS handling). |
| Equivalence tests | | Exhaustive tests comparing Rust fast-path output vs Python path for diverse inputs (unicode, long sequences, edge cases). |

GIL impact: After Phase 3, text-only generate requests acquire the GIL only for: (1) ReqState registration, (2) ZMQ send to scheduler.
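
The eligibility check from the Phase 3 task list can be sketched as a pure predicate (shown here in Python; the real check runs in Rust over the decoded proto, and the field names are illustrative):

```python
def is_fast_path_eligible(req: dict) -> bool:
    """Sketch of the Phase 3 fast-path gate: text-only, non-multimodal,
    non-LoRA generate requests may be tokenized in Rust."""
    return (
        req.get("text") is not None       # text input, not pre-tokenized ids
        and not req.get("input_ids")
        and not req.get("image_data")     # no multimodal inputs
        and not req.get("video_data")
        and not req.get("audio_data")
        and not req.get("lora_path")      # no LoRA adapter
        and not req.get("chat_template")  # no custom chat template
    )
```

Requests that fail the gate simply fall through to the existing Python path, so the fast path never has to handle the long tail of features.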


Phase 4: Direct ZMQ to Scheduler

Goal: For pre-tokenized input_ids requests, Rust sends directly to the Scheduler via ZMQ using msgpack serialization. GIL held only for ReqState registration. Transport is abstracted so ZMQ can be swapped for shared memory without protocol changes. Risk: High.

| Task | Status (PR) | Description |
|---|---|---|
| ZMQ socket in Rust | | Open a ZMQ PUSH socket in Rust connecting to the Scheduler's IPC endpoint. Use zmq.rs or zeromq Rust crate. |
| Msgpack serialization | | Serialize TokenizedGenerateReqInput equivalent in Rust using rmp-serde, matching the exact format Python's Scheduler expects. |
| Transport abstraction | | Define SchedulerTransport trait in Rust (send/recv over [u8]). ZMQ as first implementation. Shared memory ring buffer as future drop-in replacement if benchmarks justify it. |
| Dual-format recv loop in Scheduler | | Modify Scheduler's recv_requests() (currently uses recv_pyobj / pickle) to also accept msgpack-serialized requests from Rust. Discriminate via a 1-byte format tag prefix. |
| ReqState registration | | Still requires GIL: create ReqState object in Python, register in tokenizer_manager.rid_to_state. Explore moving to a Rust-side registry with Python callback for cleanup. |
| Backpressure | | Implement ZMQ HWM (high-water mark) and flow control to prevent Rust from overwhelming the Scheduler. |
| Fallback | | If ZMQ send fails, fall back to the Python path transparently. |
| Integration tests | | Test mixed workloads: some requests via Rust ZMQ, some via Python path, verify ordering and correctness. |

GIL impact: After Phase 4, pre-tokenized requests acquire the GIL only for ReqState registration (~1 μs).
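
The 1-byte format-tag discrimination for the dual-format recv loop can be sketched as follows (a toy model: JSON stands in for msgpack so the example is dependency-free, and the tag byte values are illustrative, not the actual wire format):

```python
import json
import pickle

TAG_PICKLE = b"\x00"   # legacy Python-originated requests (illustrative value)
TAG_MSGPACK = b"\x01"  # Rust-originated requests; JSON stands in for msgpack here

def encode(req: dict, from_rust: bool) -> bytes:
    """Prefix the serialized payload with a 1-byte format tag."""
    if from_rust:
        return TAG_MSGPACK + json.dumps(req).encode()
    return TAG_PICKLE + pickle.dumps(req)

def decode(frame: bytes) -> dict:
    """What the Scheduler's recv loop would do: branch on the format tag,
    then deserialize with the matching codec."""
    tag, payload = frame[:1], frame[1:]
    if tag == TAG_MSGPACK:
        return json.loads(payload)
    if tag == TAG_PICKLE:
        return pickle.loads(payload)
    raise ValueError(f"unknown format tag: {tag!r}")
```

Because both codecs sit behind the same tag dispatch, Python-originated pickle frames and Rust-originated frames can interleave on one socket without renegotiation.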


Phase 5: Rust Response Loop (Full Python Bypass)

Goal: Rust receives responses directly from DetokenizerManager via a second ZMQ PULL socket. Zero GIL acquisition for the full request-response path. Risk: Very high.

| Task | Status (PR) | Description |
|---|---|---|
| New IPC endpoint in DetokenizerManager | | Add a second ZMQ PUSH socket in DetokenizerManager that sends responses to a Rust-side PULL socket (in addition to the existing Python response path). |
| Response routing | | DetokenizerManager checks whether a RID was registered by Rust or Python and routes accordingly. Requires a shared RID registry or a routing tag in the request. |
| Rust-side response deserialization | | Deserialize DetokenizerManager's response format (msgpack or custom) in Rust. Extract text, meta_info, finish_reason. |
| Rust-side ReqState management | | Move ReqState tracking entirely to Rust. Python only needs to be notified for cleanup (LoRA adapter release, etc.) via a batched cleanup callback. |
| Streaming response assembly | | For streaming requests, Rust assembles incremental text deltas and pushes them directly to the gRPC stream via the crossbeam channel — no Python involved. |
| Metrics / observability | | Expose Rust-side request latency, queue depth, and throughput metrics via Prometheus endpoint or gRPC health service. |
| Graceful degradation | | If the Rust response loop encounters an error, fall back to the Python callback path for that request. |
| Full end-to-end tests | | Test the complete Rust-only path: gRPC request → Rust tokenize → Rust ZMQ to Scheduler → Scheduler → DetokenizerManager → Rust ZMQ response → gRPC stream. |

GIL impact: After Phase 5, the full request-response hot path acquires zero GIL. Python is only involved for cold-path operations (model loading, LoRA management, OpenAI template application, multimodal preprocessing).
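
The shared-RID-registry variant of Phase 5 response routing can be sketched like this (names and registry shape are illustrative; the real registry would be shared between the request and response paths, with cleanup handled separately once a request finishes):

```python
# RIDs whose requests entered via the Rust gRPC path.
rust_rids: set[str] = set()

def register_rust_request(rid: str) -> None:
    """Called when Rust submits a request directly to the Scheduler."""
    rust_rids.add(rid)

def route_response(rid: str) -> str:
    """What DetokenizerManager would do per response chunk: send
    Rust-originated RIDs to the Rust PULL socket, everything else
    down the existing Python path."""
    return "rust_zmq" if rid in rust_rids else "python"
```

The alternative mentioned in the task table, a routing tag carried in the request itself, avoids the shared registry at the cost of threading one extra field through the Scheduler.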


Phase Summary

"GIL-bound segments" counts the distinct stages of a request's lifecycle where Python holds the GIL, limiting concurrency with other requests. Rust-side Python::with_gil blocks count as one segment each; Python-internal work (tokenization, async dispatch) counts when it contends with other requests.

| Phase | GIL-Bound Segments (text generate hot path) | What Moves to Rust | Risk |
|---|---|---|---|
| 1 (current) | ~3 (1 Rust→Python submit with dict build + callback setup, 1 Python-side tokenize + async dispatch, 1 per response callback) | Tokenize/Detokenize RPCs, proto decode, dict construction | Low |
| 2 | ~3 (same as Phase 1, but invalid requests rejected before any GIL) | Validation, normalization, RID gen | Medium |
| 3 | ~2 (1 Rust→Python submit with pre-tokenized input, 1 per response callback) | Text tokenization for generate hot path | Medium |
| 4 | ~1 (ReqState registration only) | ZMQ send to Scheduler | High |
| 5 | 0 (hot path) | Response loop, ReqState management | Very High |

Usage

gRPC starts automatically alongside HTTP (default: port + 10000)

python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct

Custom gRPC port

python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --grpc-port 50051

Disable gRPC (HTTP only)

python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --disable-grpc

Client Example (Python + grpcio)

Generate Python stubs from the proto definition first:

python -m grpc_tools.protoc \
  -I proto \
  --python_out=. --grpc_python_out=. \
  proto/sglang/runtime/v1/sglang.proto

Then use the generated stubs:

import grpc
from sglang.runtime.v1 import sglang_pb2, sglang_pb2_grpc

channel = grpc.insecure_channel("localhost:40000")
stub = sglang_pb2_grpc.SglangServiceStub(channel)

TextGenerate (server-streaming)

request = sglang_pb2.TextGenerateRequest(
    text="Explain quantum computing in one sentence.",
    sampling_params=sglang_pb2.SamplingParams(
        temperature=0.7,
        max_new_tokens=64,
    ),
    stream=True,
)

for response in stub.TextGenerate(request):
    print(response.text, end="", flush=True)
    if response.finished:
        break

Tokenize (unary, Rust-native, zero GIL)

tok_resp = stub.Tokenize(sglang_pb2.TokenizeRequest(text="Hello, world!"))
print(f"\nTokens: {tok_resp.tokens}, Count: {tok_resp.count}")

Note: Python stub generation (`sglang_pb2.py`, `sglang_pb2_grpc.py`) is not yet automated. The integration tests use raw protobuf wire encoding via `grpcio` directly. Adding pre-generated or build-time-generated stubs is a remaining Phase 1 task.

Backward Compatibility

  • HTTP API is unchanged. All existing HTTP endpoints continue to work identically.
  • --grpc-mode is deprecated with a `DeprecationWarning`. It now sets `smg_grpc = True` internally, preserving the existing behavior of launching the standalone smg-grpc-servicer server instead of HTTP.
  • Port allocation. The gRPC port defaults to --port + 10000 (e.g., 30000 → 40000), avoiding conflicts with existing deployments.

Related Work

  • smg-grpc-servicer — Existing standalone gRPC server package (triggered via --grpc-mode / --smg-grpc). It runs in-process but replaces the HTTP server, so operators must choose one protocol or the other, and it offers no path to the tight coupling needed for future performance wins across the stack. This RFC's native server complements it by running gRPC alongside HTTP and progressively moving GIL-bound work to Rust, in line with the broader Rust migration. Backward compatibility is preserved via --smg-grpc.
  • sglang-router (Rust) — The existing Rust router crate demonstrates the PyO3/Maturin pattern in the SGLang ecosystem. This RFC follows the same build/packaging conventions.
  • vLLM gRPC — vLLM offers a gRPC server via grpc_server.py (Python grpcio). This RFC's Rust implementation provides lower per-request overhead by avoiding GIL acquisition for serialization and tokenization.
  • Transport alternatives considered — Shared memory ring buffers offer lower per-request latency (~200ns vs ~5-10μs for ZMQ) by eliminating kernel syscalls, but require custom infrastructure on the Python side and provide no built-in backpressure, observability, or crash recovery. Apache Arrow Flight was considered for batch-oriented zero-copy data exchange but is unnecessarily complex for single-request submission. The ZMQ→shared memory path is preserved as a two-way door via a transport abstraction layer.
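
The transport abstraction behind this two-way door can be rendered in Python for illustration (the Rust version would be a SchedulerTransport trait over [u8] slices; the class names and the toy in-memory implementation here are illustrative):

```python
from abc import ABC, abstractmethod
from collections import deque

class SchedulerTransport(ABC):
    """Python rendering of the proposed Rust SchedulerTransport trait:
    both sides exchange opaque byte payloads, so ZMQ can later be swapped
    for shared memory without touching the serialization format."""

    @abstractmethod
    def send(self, payload: bytes) -> None: ...

    @abstractmethod
    def recv(self) -> bytes: ...

class InMemoryTransport(SchedulerTransport):
    """Toy in-process implementation standing in for the ZMQ one."""

    def __init__(self) -> None:
        self._q: deque[bytes] = deque()

    def send(self, payload: bytes) -> None:
        self._q.append(payload)

    def recv(self) -> bytes:
        return self._q.popleft()

t = InMemoryTransport()
t.send(b"\x01{}")  # tagged, serialized request bytes pass through unchanged
```

Since the serialization boundary produces plain bytes, swapping implementations changes only where those bytes travel, not what they contain.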
