
[Refactor] Benchmark Scripts Refactor #10177

@hnyls2002

Description


Motivation

The current benchmark script (bench_serving.py) is too complex, which makes it hard to maintain and to add new functionality. In addition, the different benchmark scripts (bench_serving, bench_one_batch, bench_one_batch_serving, etc.) share some code but lack a good abstraction for it.

Related resources

No response


Analysis & Phased Refactoring Plan

I did a thorough analysis of the current benchmark code duplication and also reviewed PR #10409. Here's a summary and a concrete phased plan.

Current State

bench_serving.py is 3415 lines — a monolith containing 5 distinct responsibilities:

| Responsibility | Approx. Lines | Examples |
| --- | --- | --- |
| Backend request functions | ~700 | async_request_sglang_generate, async_request_openai_completions, async_request_trt_llm, etc. |
| Dataset loading | ~800 | sample_sharegpt_requests, sample_random_requests, sample_image_requests, get_dataset, etc. |
| Benchmark orchestration | ~500 | benchmark(), run_benchmark(), warmup, profiling |
| Metrics & result reporting | ~200 | BenchmarkMetrics, calculate_metrics, result printing/saving |
| CLI arg parsing | ~400 | argparse definitions in __main__ |

Multiple other files import from this monolith to reuse dataset/utility functions:

# bench_offline_throughput.py
from sglang.bench_serving import DatasetRow, get_dataset, get_tokenizer, sample_random_requests, set_ulimit

# bench_one_batch_server_internal.py  
from sglang.bench_serving import get_processor, get_tokenizer, sample_mmmu_requests, sample_random_requests

# benchmark/hicache/bench_serving.py
from sglang.bench_serving import get_tokenizer, remove_prefix, set_ulimit

Key duplication:

  • bench_offline_throughput.py (476 lines) duplicates ~130 lines of dataset-related CLI arg definitions from bench_serving.py
  • benchmark/hicache/bench_serving.py (1029 lines) copy-pastes RequestFuncInput/RequestFuncOutput and async_request_openai_completions with modifications
  • bench_one_batch_server_internal.py (891 lines) has its own BenchArgs with overlapping fields

PR #10409 Review

PR #10409 created a python/sglang/benchmark/ package with good design decisions worth adopting:

Good ideas to keep:

  • Backend class hierarchy (BaseBackendClient + per-backend subclasses) — cleaner than a function dict
  • Dataset loader class hierarchy (BaseDatasetLoader + per-dataset subclasses) — easy to extend
  • Independent metrics.py for metrics calculation/printing/saving
  • BenchmarkRunner class encapsulating warmup/profiling/dispatch

Problems to avoid:

  • PR #10409 copies code instead of moving it — the old bench_serving.py is kept intact, meaning bug fixes need to be patched in two places
  • Uses the SGLANG_BENCHMARK_V2 env var as a feature flag — bench_serving.py actually gets larger (old code wrapped in an else branch)
  • Introduces click dependency — inconsistent with the rest of sglang which uses argparse (e.g., ServerArgs)
  • Missing custom and openai dataset loaders — incomplete migration
  • Doesn't address duplication in bench_offline_throughput.py or bench_one_batch_server_internal.py

Proposed Plan: 5 Phased PRs

Each PR is independently mergeable and never breaks existing CLI usage (python -m sglang.bench_serving) or import paths (from sglang.bench_serving import ...). The mechanism: move code from bench_serving.py to benchmark/ submodules, then add re-exports in bench_serving.py.

PR1 (lowest risk) ──► PR1.5 (low risk) ──► PR2 (medium risk) ──► PR3 (higher risk) ──► PR4 (low risk)
simple code move      API refactor + tests    backends               metrics + runner       dedup consumers

Target structure:

python/sglang/benchmark/
├── __init__.py
├── utils.py                       # get_tokenizer, download, set_ulimit, etc.
├── metrics.py                     # BenchmarkMetrics + calculate/print/save
├── runner.py                      # BenchmarkRunner (warmup/profiling/dispatch)
├── datasets/
│   ├── __init__.py                # DATASET_MAPPING + get_dataset()
│   ├── common.py                  # DatasetRow, BaseDatasetArgs, BaseDatasetLoader
│   ├── sharegpt.py, random.py, custom.py, openai_dataset.py,
│   ├── image.py, mmmu.py, mooncake.py, generated_shared_prefix.py
└── backends/
    ├── __init__.py                # BACKEND_MAPPING + get_backend_client()
    ├── base_client.py             # BaseBackendClient ABC, RequestFuncInput/Output
    ├── sglang_client.py, oai_client.py, oai_chat_client.py,
    ├── trt_client.py, truss_client.py, gserver_client.py

PR1 (#19077): Extract utils + datasets (lowest risk) — simple code movement only

  • Move utility functions (get_tokenizer, get_processor, download_and_cache_file, set_ulimit, remove_prefix, etc.) to benchmark/utils.py
  • Move DatasetRow and all sample_* functions to benchmark/datasets/ (one file per dataset)
  • Move get_dataset() as-is — keep the original if-elif dispatch logic, no new abstractions
  • Add re-exports in bench_serving.py — all existing from sglang.bench_serving import ... statements continue to work (see the sketch after this list)
  • Don't touch bench_offline_throughput.py or other consumers yet
  • Don't introduce class hierarchy (BaseDatasetLoader, *DatasetLoader classes, DATASET_MAPPING) — that belongs in PR1.5
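
A minimal sketch of what the PR1 re-export shim in bench_serving.py could look like (the module paths and the exact export list are assumptions until PR1 lands):

# bench_serving.py after PR1 (hypothetical re-export shim)
# Implementations move to the new package; old import paths keep working.
from sglang.benchmark.utils import (  # noqa: F401
    download_and_cache_file,
    get_processor,
    get_tokenizer,
    remove_prefix,
    set_ulimit,
)
from sglang.benchmark.datasets import (  # noqa: F401
    DatasetRow,
    get_dataset,
    sample_random_requests,
    sample_sharegpt_requests,
)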

PR1.5: Dataset API refactor + unit tests (low risk)

API refactor — introduce a three-layer architecture (DatasetArgs + Loader + Utils), sketched after the list below:

  • Argparse definitions in bench_serving.py stay unchanged (flat namespace)
  • Each dataset module defines a typed *Args dataclass with from_args(cls, args) to extract relevant fields, plus a *Dataset loader class that receives the typed config
  • datasets/__init__.py becomes a thin registry (DATASET_MAPPING + get_dataset() that calls args_class.from_args(args) then loader.load(config, tokenizer))
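
A rough sketch of the three layers, using the random dataset as an example (dataclass fields, flag names, and the loader internals are illustrative assumptions, not the final API):

# Hypothetical sketch of the three-layer dataset API; field names and
# loader internals are illustrative assumptions.
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple, Type

@dataclass
class RandomDatasetArgs:
    num_prompts: int = 1000
    random_input_len: int = 1024
    random_output_len: int = 128

    @classmethod
    def from_args(cls, args) -> "RandomDatasetArgs":
        # Pull only the fields this dataset needs out of the flat argparse
        # namespace; the argparse definitions themselves stay in bench_serving.py.
        return cls(
            num_prompts=args.num_prompts,
            random_input_len=args.random_input_len,
            random_output_len=args.random_output_len,
        )

class RandomDataset:
    def load(self, config: RandomDatasetArgs, tokenizer) -> List[Any]:
        # The sample_random_requests logic moved in PR1 would live here and
        # return a List[DatasetRow].
        raise NotImplementedError

# datasets/__init__.py becomes a thin registry:
DATASET_MAPPING: Dict[str, Tuple[Type, Type]] = {
    "random": (RandomDatasetArgs, RandomDataset),
}

def get_dataset(args, tokenizer):
    args_class, loader_class = DATASET_MAPPING[args.dataset_name]
    config = args_class.from_args(args)
    return loader_class().load(config, tokenizer)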

Unit tests — add CPU-only tests (stage-a-cpu-only) that call each dataset's sampling function with a lightweight tokenizer and validate the returned List[DatasetRow].
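
For example, a CPU-only test could look roughly like this (the tokenizer choice and the DatasetRow fields asserted on are assumptions about the post-refactor API):

# Hypothetical CPU-only test sketch for the random dataset loader.
import unittest

class TestRandomDataset(unittest.TestCase):
    def test_random_dataset_returns_rows(self):
        from transformers import AutoTokenizer
        from sglang.benchmark.datasets import DATASET_MAPPING

        tokenizer = AutoTokenizer.from_pretrained("gpt2")  # small, runs on CPU
        args_class, loader_class = DATASET_MAPPING["random"]
        config = args_class(num_prompts=4, random_input_len=32, random_output_len=16)
        rows = loader_class().load(config, tokenizer)

        self.assertEqual(len(rows), 4)
        for row in rows:
            self.assertGreater(row.prompt_len, 0)

if __name__ == "__main__":
    unittest.main()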

PR2: Extract backends (medium risk)

  • Move RequestFuncInput/RequestFuncOutput and all async_request_* functions to benchmark/backends/, wrapping each in a BaseBackendClient subclass (see the sketch after this list)
  • Keep ASYNC_REQUEST_FUNCS dict in bench_serving.py as a wrapper
  • Add re-exports for RequestFuncInput/RequestFuncOutput
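
A possible shape for the backend layer (class names and the compatibility wrapper are illustrative assumptions):

# Hypothetical sketch of the PR2 backend client hierarchy.
from abc import ABC, abstractmethod

class BaseBackendClient(ABC):
    @abstractmethod
    async def send_request(self, request_func_input):
        """Send one request and return a RequestFuncOutput."""

class SGLangClient(BaseBackendClient):
    async def send_request(self, request_func_input):
        # The body of the existing async_request_sglang_generate moves here.
        ...

# backends/__init__.py: registry plus factory
BACKEND_MAPPING = {
    "sglang": SGLangClient,
}

def get_backend_client(backend: str) -> BaseBackendClient:
    return BACKEND_MAPPING[backend]()

# bench_serving.py keeps ASYNC_REQUEST_FUNCS as a thin compatibility wrapper:
ASYNC_REQUEST_FUNCS = {name: cls().send_request for name, cls in BACKEND_MAPPING.items()}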

PR3: Extract metrics + runner (higher risk)

  • Move BenchmarkMetrics, calculate_metrics(), result printing/saving to benchmark/metrics.py
  • Create a BenchmarkRunner class in benchmark/runner.py to encapsulate the benchmark() async function logic (sketched after this list)
  • bench_serving.py becomes a thin wrapper (~800 lines: re-exports + argparse + run_benchmark entry point)
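
One possible shape for BenchmarkRunner (method names and the dispatch strategy shown here are assumptions, not the final design):

# Hypothetical shape of the PR3 BenchmarkRunner.
import asyncio

class BenchmarkRunner:
    def __init__(self, backend_client, input_requests, warmup_requests=1):
        self.backend_client = backend_client
        self.input_requests = input_requests
        self.warmup_requests = warmup_requests

    async def warmup(self):
        # Issue a few unmeasured requests first, as benchmark() does today.
        for req in self.input_requests[: self.warmup_requests]:
            await self.backend_client.send_request(req)

    async def run(self):
        await self.warmup()
        # Dispatch all requests; outputs are then fed into metrics.calculate_metrics().
        return await asyncio.gather(
            *(self.backend_client.send_request(req) for req in self.input_requests)
        )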

PR4: Simplify downstream consumers (low risk)

  • Update bench_offline_throughput.py to use benchmark/datasets directly, removing ~130 lines of duplicate dataset arg definitions (see the import sketch after this list)
  • Update bench_one_batch_server_internal.py imports to point to benchmark/ modules
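
After PR4, the consumer imports could look roughly like this (the exact import lists are assumptions):

# bench_offline_throughput.py after PR4 (hypothetical import list)
from sglang.benchmark.datasets import DatasetRow, get_dataset
from sglang.benchmark.utils import get_tokenizer, set_ulimit

# bench_one_batch_server_internal.py after PR4 (hypothetical import list)
from sglang.benchmark.datasets import sample_mmmu_requests, sample_random_requests
from sglang.benchmark.utils import get_processor, get_tokenizer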

Out of Scope

  • multimodal_gen/benchmarks/bench_serving.py — independent diffusion model benchmark, minimal overlap
  • Merging bench_one_batch.py with bench_one_batch_server_internal.py — fundamentally different (local model API vs HTTP server)
  • benchmark/hicache/bench_serving.py — has its own RequestFuncInput variant with extra fields; modifying it risks breaking the hicache benchmark flow

Design Decisions

  • Keep argparse — consistent with the rest of sglang (ServerArgs, etc.), no new dependencies
  • Move, don't copy — bench_serving.py code is deleted and replaced with imports + re-exports
  • No feature flags — direct migration, backward compat via re-exports
  • All dataset loaders migrated — including custom and openai, which PR #10409 missed

This proposal was generated with the assistance of Claude.
