
[Refactor] Benchmark Scripts Refactor #10177

@hnyls2002

Description


Motivation

The current benchmark script (bench_serving.py) is too complex, which makes it hard to maintain and to add new functionality. In addition, the different benchmark scripts (bench_serving, bench_one_batch, bench_one_batch_serving, etc.) share some code but lack a good abstraction for it.

Related resources

No response


Analysis & Phased Refactoring Plan

I did a thorough analysis of the current benchmark code duplication and also reviewed PR #10409. Here's a summary and a concrete phased plan.

Current State

bench_serving.py is 3415 lines — a monolith containing 5 distinct responsibilities:

| Responsibility | Approx. Lines | Examples |
| --- | --- | --- |
| Backend request functions | ~700 | async_request_sglang_generate, async_request_openai_completions, async_request_trt_llm, etc. |
| Dataset loading | ~800 | sample_sharegpt_requests, sample_random_requests, sample_image_requests, get_dataset, etc. |
| Benchmark orchestration | ~500 | benchmark(), run_benchmark(), warmup, profiling |
| Metrics & result reporting | ~200 | BenchmarkMetrics, calculate_metrics, result printing/saving |
| CLI arg parsing | ~400 | argparse definitions in __main__ |

Multiple other files import from this monolith to reuse dataset/utility functions:

# bench_offline_throughput.py
from sglang.bench_serving import DatasetRow, get_dataset, get_tokenizer, sample_random_requests, set_ulimit

# bench_one_batch_server_internal.py  
from sglang.bench_serving import get_processor, get_tokenizer, sample_mmmu_requests, sample_random_requests

# benchmark/hicache/bench_serving.py
from sglang.bench_serving import get_tokenizer, remove_prefix, set_ulimit

Key duplication:

  • bench_offline_throughput.py (476 lines) duplicates ~130 lines of dataset-related CLI arg definitions from bench_serving.py
  • benchmark/hicache/bench_serving.py (1029 lines) copy-pastes RequestFuncInput/RequestFuncOutput and async_request_openai_completions with modifications
  • bench_one_batch_server_internal.py (891 lines) has its own BenchArgs with overlapping fields

PR #10409 Review

PR #10409 created a python/sglang/benchmark/ package with good design decisions worth adopting:

Good ideas to keep:

  • Backend class hierarchy (BaseBackendClient + per-backend subclasses) — cleaner than a function dict
  • Dataset loader class hierarchy (BaseDatasetLoader + per-dataset subclasses) — easy to extend
  • Independent metrics.py for metrics calculation/printing/saving
  • BenchmarkRunner class encapsulating warmup/profiling/dispatch

Problems to avoid:

  • PR #10409 copies code instead of moving it — the old bench_serving.py is kept intact, meaning bug fixes need to be patched in two places
  • Uses the SGLANG_BENCHMARK_V2 env var as a feature flag — bench_serving.py actually gets larger (old code wrapped in an else branch)
  • Introduces click dependency — inconsistent with the rest of sglang which uses argparse (e.g., ServerArgs)
  • Missing custom and openai dataset loaders — incomplete migration
  • Doesn't address duplication in bench_offline_throughput.py or bench_one_batch_server_internal.py

Proposed Plan: 5 Phased PRs

Each PR is independently mergeable and never breaks existing CLI usage (python -m sglang.bench_serving) or import paths (from sglang.bench_serving import ...). The mechanism: move code from bench_serving.py to benchmark/ submodules, then add re-exports in bench_serving.py.

PR1 (lowest risk) ──► PR1.5 (low risk) ──► PR2 (medium risk) ──► PR3 (higher risk) ──► PR4 (low risk)
simple code move      API refactor + tests    backends               metrics + runner       dedup consumers

Target structure:

python/sglang/benchmark/
├── __init__.py
├── utils.py                       # get_tokenizer, download, set_ulimit, etc.
├── metrics.py                     # BenchmarkMetrics + calculate/print/save
├── runner.py                      # BenchmarkRunner (warmup/profiling/dispatch)
├── datasets/
│   ├── __init__.py                # DATASET_MAPPING + get_dataset()
│   ├── common.py                  # DatasetRow, BaseDatasetArgs, BaseDatasetLoader
│   ├── sharegpt.py, random.py, custom.py, openai_dataset.py,
│   ├── image.py, mmmu.py, mooncake.py, generated_shared_prefix.py
└── backends/
    ├── __init__.py                # BACKEND_MAPPING + get_backend_client()
    ├── base_client.py             # BaseBackendClient ABC, RequestFuncInput/Output
    ├── sglang_client.py, oai_client.py, oai_chat_client.py,
    ├── trt_client.py, truss_client.py, gserver_client.py

PR1 (#19077): Extract utils + datasets (lowest risk) — simple code movement only

  • Move utility functions (get_tokenizer, get_processor, download_and_cache_file, set_ulimit, remove_prefix, etc.) to benchmark/utils.py
  • Move DatasetRow and all sample_* functions to benchmark/datasets/ (one file per dataset)
  • Move get_dataset() as-is — keep the original if-elif dispatch logic, no new abstractions
  • Add re-exports in bench_serving.py — all existing from sglang.bench_serving import ... statements continue to work (see the sketch after this list)
  • Don't touch bench_offline_throughput.py or other consumers yet
  • Don't introduce class hierarchy (BaseDatasetLoader, *DatasetLoader classes, DATASET_MAPPING) — that belongs in PR1.5
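
A minimal sketch of what the PR1 re-export shim in bench_serving.py could look like (the module paths and the exact export list are assumptions until PR1 lands):

# bench_serving.py after PR1 (hypothetical re-export shim)
# Implementations move to the new package; old import paths keep working.
from sglang.benchmark.utils import (  # noqa: F401
    download_and_cache_file,
    get_processor,
    get_tokenizer,
    remove_prefix,
    set_ulimit,
)
from sglang.benchmark.datasets import (  # noqa: F401
    DatasetRow,
    get_dataset,
    sample_random_requests,
    sample_sharegpt_requests,
)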

PR1.5: Dataset API refactor + unit tests (low risk)

API refactor — introduce a three-layer architecture (DatasetArgs + Loader + Utils), sketched after the list below:

  • Argparse definitions in bench_serving.py stay unchanged (flat namespace)
  • Each dataset module defines a typed *Args dataclass with from_args(cls, args) to extract relevant fields, plus a *Dataset loader class that receives the typed config
  • datasets/__init__.py becomes a thin registry (DATASET_MAPPING + get_dataset() that calls args_class.from_args(args) then loader.load(config, tokenizer))
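
A rough sketch of the three layers, using the random dataset as an example (dataclass fields, flag names, and the loader internals are illustrative assumptions, not the final API):

# Hypothetical sketch of the three-layer dataset API; field names and
# loader internals are illustrative assumptions.
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple, Type

@dataclass
class RandomDatasetArgs:
    num_prompts: int = 1000
    random_input_len: int = 1024
    random_output_len: int = 128

    @classmethod
    def from_args(cls, args) -> "RandomDatasetArgs":
        # Pull only the fields this dataset needs out of the flat argparse
        # namespace; the argparse definitions themselves stay in bench_serving.py.
        return cls(
            num_prompts=args.num_prompts,
            random_input_len=args.random_input_len,
            random_output_len=args.random_output_len,
        )

class RandomDataset:
    def load(self, config: RandomDatasetArgs, tokenizer) -> List[Any]:
        # The sample_random_requests logic moved in PR1 would live here and
        # return a List[DatasetRow].
        raise NotImplementedError

# datasets/__init__.py becomes a thin registry:
DATASET_MAPPING: Dict[str, Tuple[Type, Type]] = {
    "random": (RandomDatasetArgs, RandomDataset),
}

def get_dataset(args, tokenizer):
    args_class, loader_class = DATASET_MAPPING[args.dataset_name]
    config = args_class.from_args(args)
    return loader_class().load(config, tokenizer)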

Unit tests — add CPU-only tests (stage-a-cpu-only) that call each dataset's sampling function with a lightweight tokenizer and validate the returned List[DatasetRow].
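
For example, a CPU-only test could look roughly like this (the tokenizer choice and the DatasetRow fields asserted on are assumptions about the post-refactor API):

# Hypothetical CPU-only test sketch for the random dataset loader.
import unittest

class TestRandomDataset(unittest.TestCase):
    def test_random_dataset_returns_rows(self):
        from transformers import AutoTokenizer
        from sglang.benchmark.datasets import DATASET_MAPPING

        tokenizer = AutoTokenizer.from_pretrained("gpt2")  # small, runs on CPU
        args_class, loader_class = DATASET_MAPPING["random"]
        config = args_class(num_prompts=4, random_input_len=32, random_output_len=16)
        rows = loader_class().load(config, tokenizer)

        self.assertEqual(len(rows), 4)
        for row in rows:
            self.assertGreater(row.prompt_len, 0)

if __name__ == "__main__":
    unittest.main()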

PR2: Extract backends (medium risk)

  • Move RequestFuncInput/RequestFuncOutput and all async_request_* functions to benchmark/backends/, wrapping each in a BaseBackendClient subclass (see the sketch after this list)
  • Keep ASYNC_REQUEST_FUNCS dict in bench_serving.py as a wrapper
  • Add re-exports for RequestFuncInput/RequestFuncOutput
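
A possible shape for the backend layer (class names and the compatibility wrapper are illustrative assumptions):

# Hypothetical sketch of the PR2 backend client hierarchy.
from abc import ABC, abstractmethod

class BaseBackendClient(ABC):
    @abstractmethod
    async def send_request(self, request_func_input):
        """Send one request and return a RequestFuncOutput."""

class SGLangClient(BaseBackendClient):
    async def send_request(self, request_func_input):
        # The body of the existing async_request_sglang_generate moves here.
        ...

# backends/__init__.py: registry plus factory
BACKEND_MAPPING = {
    "sglang": SGLangClient,
}

def get_backend_client(backend: str) -> BaseBackendClient:
    return BACKEND_MAPPING[backend]()

# bench_serving.py keeps ASYNC_REQUEST_FUNCS as a thin compatibility wrapper:
ASYNC_REQUEST_FUNCS = {name: cls().send_request for name, cls in BACKEND_MAPPING.items()}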

PR3: Extract metrics + runner (higher risk)

  • Move BenchmarkMetrics, calculate_metrics(), result printing/saving to benchmark/metrics.py
  • Create a BenchmarkRunner class in benchmark/runner.py to encapsulate the benchmark() async function logic (sketched after this list)
  • bench_serving.py becomes a thin wrapper (~800 lines: re-exports + argparse + run_benchmark entry point)
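
One possible shape for BenchmarkRunner (method names and the dispatch strategy shown here are assumptions, not the final design):

# Hypothetical shape of the PR3 BenchmarkRunner.
import asyncio

class BenchmarkRunner:
    def __init__(self, backend_client, input_requests, warmup_requests=1):
        self.backend_client = backend_client
        self.input_requests = input_requests
        self.warmup_requests = warmup_requests

    async def warmup(self):
        # Issue a few unmeasured requests first, as benchmark() does today.
        for req in self.input_requests[: self.warmup_requests]:
            await self.backend_client.send_request(req)

    async def run(self):
        await self.warmup()
        # Dispatch all requests; outputs are then fed into metrics.calculate_metrics().
        return await asyncio.gather(
            *(self.backend_client.send_request(req) for req in self.input_requests)
        )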

PR4: Simplify downstream consumers (low risk)

  • Update bench_offline_throughput.py to use benchmark/datasets directly, removing ~130 lines of duplicate dataset arg definitions (see the import sketch after this list)
  • Update bench_one_batch_server_internal.py imports to point to benchmark/ modules
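
After PR4, the consumer imports could look roughly like this (the exact import lists are assumptions):

# bench_offline_throughput.py after PR4 (hypothetical import list)
from sglang.benchmark.datasets import DatasetRow, get_dataset
from sglang.benchmark.utils import get_tokenizer, set_ulimit

# bench_one_batch_server_internal.py after PR4 (hypothetical import list)
from sglang.benchmark.datasets import sample_mmmu_requests, sample_random_requests
from sglang.benchmark.utils import get_processor, get_tokenizer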

Out of Scope

  • multimodal_gen/benchmarks/bench_serving.py — independent diffusion model benchmark, minimal overlap
  • Merging bench_one_batch.py with bench_one_batch_server_internal.py — fundamentally different (local model API vs HTTP server)
  • benchmark/hicache/bench_serving.py — has its own RequestFuncInput variant with extra fields; modifying it risks breaking the hicache benchmark flow

Design Decisions

  • Keep argparse — consistent with the rest of sglang (ServerArgs, etc.), no new dependencies
  • Move, don't copy — bench_serving.py code is deleted and replaced with imports + re-exports
  • No feature flags — direct migration, backward compat via re-exports
  • All dataset loaders migrated — including custom and openai, which PR #10409 missed

This proposal was generated with the assistance of Claude.
