Motivation
The current benchmark script (`bench_serving.py`) is too complex, which makes it hard to maintain and hard to extend with new functionality. In addition, the different benchmark scripts (`bench_serving`, `bench_one_batch`, `bench_one_batch_serving`, etc.) share some code but currently lack a good abstraction for it.
Related resources
No response
Analysis & Phased Refactoring Plan
I did a thorough analysis of the code duplication across the current benchmark scripts and also reviewed PR #10409. Here's a summary and a concrete phased plan.
Current State
`bench_serving.py` is 3415 lines — a monolith containing 5 distinct responsibilities:

| Responsibility | Approx. Lines | Examples |
| --- | --- | --- |
| Backend request functions | ~700 | `async_request_sglang_generate`, `async_request_openai_completions`, `async_request_trt_llm`, etc. |
| Dataset loading | ~800 | `sample_sharegpt_requests`, `sample_random_requests`, `sample_image_requests`, `get_dataset`, etc. |
| Benchmark orchestration | ~500 | `benchmark()`, `run_benchmark()`, warmup, profiling |
| Metrics & result reporting | ~200 | `BenchmarkMetrics`, `calculate_metrics`, result printing/saving |
| CLI arg parsing | ~400 | argparse definitions in `__main__` |
Multiple other files import from this monolith to reuse dataset/utility functions. Key duplication:

- `bench_offline_throughput.py` (476 lines) duplicates ~130 lines of dataset-related CLI arg definitions from `bench_serving.py`
- `benchmark/hicache/bench_serving.py` (1029 lines) copy-pastes `RequestFuncInput`/`RequestFuncOutput` and `async_request_openai_completions` with modifications
- `bench_one_batch_server_internal.py` (891 lines) has its own `BenchArgs` with overlapping fields

PR #10409 Review
PR #10409 created a `python/sglang/benchmark/` package with good design decisions worth adopting:
Good ideas to keep:
- Backend class hierarchy (`BaseBackendClient` + per-backend subclasses) — cleaner than a function dict
- Dataset loader class hierarchy (`BaseDatasetLoader` + per-dataset subclasses) — easy to extend
- Independent `metrics.py` for metrics calculation/printing/saving
- `BenchmarkRunner` class encapsulating warmup/profiling/dispatch
Problems to avoid:
- Copies code instead of moving it — the old `bench_serving.py` is kept intact, meaning bug fixes must be patched in two places
- Uses an `SGLANG_BENCHMARK_V2` env var as a feature flag — `bench_serving.py` actually gets larger (the old code is wrapped in an `else` branch)
- Introduces a `click` dependency — inconsistent with the rest of sglang, which uses `argparse` (e.g., `ServerArgs`)
- Misses the `custom` and `openai` dataset loaders — incomplete migration
- Doesn't address the duplication in `bench_offline_throughput.py` or `bench_one_batch_server_internal.py`
Proposed Plan: 5 Phased PRs
Each PR is independently mergeable and never breaks existing CLI usage (`python -m sglang.bench_serving`) or import paths (`from sglang.bench_serving import ...`). The mechanism: move code from `bench_serving.py` to `benchmark/` submodules, then add re-exports in `bench_serving.py`.

Target structure:
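A sketch of the layout, assembled from the module names used throughout this plan (illustrative, not a final tree):

```
python/sglang/
├── bench_serving.py       # thin wrapper: argparse, re-exports, run_benchmark entry point
└── benchmark/
    ├── utils.py           # get_tokenizer, download_and_cache_file, set_ulimit, ...
    ├── datasets/          # DatasetRow, one module per dataset, get_dataset() registry
    ├── backends/          # RequestFuncInput/RequestFuncOutput, BaseBackendClient subclasses
    ├── metrics.py         # BenchmarkMetrics, calculate_metrics, result printing/saving
    └── runner.py          # BenchmarkRunner (warmup/profiling/dispatch)
```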
PR1 (#19077): Extract utils + datasets (lowest risk) — simple code movement only

- Move shared utilities (`get_tokenizer`, `get_processor`, `download_and_cache_file`, `set_ulimit`, `remove_prefix`, etc.) to `benchmark/utils.py`
- Move `DatasetRow` and all `sample_*` functions to `benchmark/datasets/` (one file per dataset)
- Move `get_dataset()` as-is — keep the original if/elif dispatch logic, no new abstractions
- Add re-exports in `bench_serving.py` — all existing `from sglang.bench_serving import ...` statements continue to work (see the sketch after this list)
- Don't touch `bench_offline_throughput.py` or other consumers yet
- Don't introduce the class hierarchy (`BaseDatasetLoader`, `*DatasetLoader` classes, `DATASET_MAPPING`) — that belongs in PR1.5
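For illustration, a minimal sketch of the PR1 re-export shim, assuming the code lands under `sglang.benchmark` (the exact submodule paths are illustrative):

```python
# bench_serving.py after PR1 (sketch): implementations move to the benchmark
# package; these re-exports keep existing import paths working, e.g.
# `from sglang.bench_serving import sample_random_requests`.
from sglang.benchmark.datasets import (  # noqa: F401
    DatasetRow,
    get_dataset,
    sample_image_requests,
    sample_random_requests,
    sample_sharegpt_requests,
)
from sglang.benchmark.utils import (  # noqa: F401
    download_and_cache_file,
    get_processor,
    get_tokenizer,
    remove_prefix,
    set_ulimit,
)
```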
PR1.5: Dataset API refactor + unit tests (low risk)
API refactor — introduce a three-layer architecture (DatasetArgs + Loader + Utils):

- Argparse definitions in `bench_serving.py` stay unchanged (flat namespace)
- Each dataset module defines a typed `*Args` dataclass with `from_args(cls, args)` to extract the relevant fields, plus a `*Dataset` loader class that receives the typed config
- `datasets/__init__.py` becomes a thin registry (`DATASET_MAPPING` + a `get_dataset()` that calls `args_class.from_args(args)` and then `loader.load(config, tokenizer)`)

Unit tests — add CPU-only tests (`stage-a-cpu-only`) that call each dataset's sampling function with a lightweight tokenizer and validate the returned `List[DatasetRow]`. Sketches of both follow.
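A minimal sketch of the three-layer shape for the `random` dataset; everything beyond `DATASET_MAPPING`, `from_args`, and `load(config, tokenizer)` is illustrative:

```python
from dataclasses import dataclass
from typing import List

from sglang.benchmark.datasets import DatasetRow  # illustrative path, per PR1


@dataclass
class RandomDatasetArgs:
    """Typed view over the flat argparse namespace."""

    num_prompts: int
    random_input_len: int
    random_output_len: int

    @classmethod
    def from_args(cls, args) -> "RandomDatasetArgs":
        # Extract only the fields this dataset cares about.
        return cls(
            num_prompts=args.num_prompts,
            random_input_len=args.random_input_len,
            random_output_len=args.random_output_len,
        )


class RandomDataset:
    """Loader that receives the typed config instead of the raw namespace."""

    def load(self, config: RandomDatasetArgs, tokenizer) -> List[DatasetRow]:
        # Delegates to the existing sample_random_requests logic, unchanged.
        ...


# datasets/__init__.py stays a thin registry:
DATASET_MAPPING = {
    "random": (RandomDatasetArgs, RandomDataset),
    # "sharegpt", "image", "custom", "openai", ... registered the same way
}


def get_dataset(args, tokenizer):
    args_class, loader_class = DATASET_MAPPING[args.dataset_name]
    config = args_class.from_args(args)
    return loader_class().load(config, tokenizer)
```

The matching CPU-only unit test would be correspondingly small (the tokenizer choice here is arbitrary):

```python
from sglang.benchmark.utils import get_tokenizer  # illustrative path, per PR1


def test_random_dataset_returns_rows():
    tokenizer = get_tokenizer("gpt2")  # any lightweight tokenizer works
    config = RandomDatasetArgs(
        num_prompts=4, random_input_len=32, random_output_len=8
    )
    rows = RandomDataset().load(config, tokenizer)
    assert len(rows) == 4
    assert all(isinstance(row, DatasetRow) for row in rows)
```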
PR2: Extract backends (medium risk)
- Move `RequestFuncInput`/`RequestFuncOutput` and all `async_request_*` functions to `benchmark/backends/`, wrapping each in a `BaseBackendClient` subclass (see the sketch after this list)
- Keep the `ASYNC_REQUEST_FUNCS` dict in `bench_serving.py` as a wrapper
- Add re-exports for `RequestFuncInput`/`RequestFuncOutput`
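A sketch of the intended wrapper shape; `BaseBackendClient`, `RequestFuncInput`, and `RequestFuncOutput` come from PR #10409, while the `send_request` method name is illustrative:

```python
from sglang.benchmark.backends import (  # illustrative path, per PR1's convention
    RequestFuncInput,
    RequestFuncOutput,
    async_request_sglang_generate,  # moved here unchanged
)


class BaseBackendClient:
    """One subclass per backend, replacing the bare function dict internally."""

    async def send_request(self, request: RequestFuncInput) -> RequestFuncOutput:
        raise NotImplementedError


class SglangGenerateClient(BaseBackendClient):
    async def send_request(self, request: RequestFuncInput) -> RequestFuncOutput:
        # Thin wrapper over the existing request function.
        return await async_request_sglang_generate(request)


# bench_serving.py keeps the old name-to-function mapping as a wrapper:
ASYNC_REQUEST_FUNCS = {
    "sglang": SglangGenerateClient().send_request,
    # the openai / trt / ... backends get their own subclasses
}
```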
PR3: Extract metrics + runner (higher risk)
- Move `BenchmarkMetrics`, `calculate_metrics()`, and result printing/saving to `benchmark/metrics.py`
- Create a `BenchmarkRunner` class in `benchmark/runner.py` to encapsulate the `benchmark()` async function logic (a rough outline follows this list)
- `bench_serving.py` becomes a thin wrapper (~800 lines: re-exports + argparse + `run_benchmark` entry point)
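A rough outline of the runner, assuming the PR2 client shape; the method names and the simplified `calculate_metrics` call are illustrative:

```python
import asyncio


class BenchmarkRunner:
    """Encapsulates the orchestration currently inlined in benchmark()."""

    def __init__(self, client: BaseBackendClient, args):
        self.client = client
        self.args = args

    async def warmup(self, request: RequestFuncInput) -> None:
        # Current warmup logic moves here.
        await self.client.send_request(request)

    def start_profile(self) -> None:
        ...  # hook into the existing profiling logic

    def stop_profile(self) -> None:
        ...

    async def run(self, input_requests) -> "BenchmarkMetrics":
        self.start_profile()  # optional profiling, as today
        tasks = [
            asyncio.create_task(self.client.send_request(req))
            for req in input_requests  # request-rate pacing elided for brevity
        ]
        outputs = await asyncio.gather(*tasks)
        self.stop_profile()
        # Metrics stay in benchmark/metrics.py (the real signature has more params).
        return calculate_metrics(input_requests, outputs)
```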
PR4: Simplify downstream consumers (low risk)
- Refactor `bench_offline_throughput.py` to use `benchmark/datasets` directly, removing ~130 lines of duplicate dataset arg definitions
- Update the `bench_one_batch_server_internal.py` imports to point at the `benchmark/` modules

Out of Scope
- `multimodal_gen/benchmarks/bench_serving.py` — independent diffusion model benchmark, minimal overlap
- Merging `bench_one_batch.py` with `bench_one_batch_server_internal.py` — the two are fundamentally different (local model API vs. HTTP server)
- `benchmark/hicache/bench_serving.py` — has its own `RequestFuncInput` variant with extra fields; modifying it risks breaking the hicache benchmark flow

Design Decisions
- Stick with `argparse` (consistent with `ServerArgs`, etc.), no new dependencies
- Move code rather than copying it — the old `bench_serving.py` code is deleted and replaced with imports + re-exports
- Include the `custom` and `openai` dataset loaders that PR #10409 ([benchmark] refactor bench (part 1)) missed

This proposal was generated with the assistance of Claude.