Problem
SGLang currently has 8 separate evaluation systems that have grown organically over time, resulting in significant code duplication and maintenance burden. Several benchmarks (GSM8K, MMLU, MMMU) have 3-5 independent implementations.
Current Evaluation Systems
| # | System | Location | Interface | Purpose |
|---|--------|----------|-----------|---------|
| 1 | Simple-Evals | `python/sglang/test/run_eval.py` + `simple_eval_*.py` | OpenAI Chat API | Primary CI eval |
| 2 | Few-Shot GSM8K | `python/sglang/test/few_shot_gsm8k.py` | SGL program (`@sgl.function`) | `GSM8KMixin` |
| 3 | Accuracy Test Runner | `python/sglang/test/accuracy_test_runner.py` | Orchestration layer over #1 and #2 | Nightly multi-model |
| 4 | lm-eval harness | `python/sglang/test/kits/lm_eval_kit.py` | `lm_eval.simple_evaluate()` | YAML-driven baseline comparison |
| 5 | lmms-eval | `python/sglang/test/kits/mmmu_vlm_kit.py` | subprocess `lmms_eval` CLI | VLM MMMU |
| 6 | `benchmark/` scripts | `benchmark/{gsm8k,mmlu,mmmu,...}/bench_sglang.py` | SGL program (`@sgl.function`) | Early-stage accuracy + throughput benchmarks |
| 7 | sgl-model-gateway E2E | `sgl-model-gateway/e2e_test/infra/run_eval.py` | Forked simple-evals (MMLU only) | Gateway deployment verification |
| 8 | Ascend GSM8K Mixin | `python/sglang/test/ascend/gsm8k_ascend_mixin.py` | few-shot gsm8k + NPU-specific env | Huawei Ascend NPU |
Duplication Analysis
| Dataset | # of Implementations | Systems |
|---------|----------------------|---------|
| GSM8K | 5 | simple-eval, few-shot SGL program, `benchmark/`, AMD copy-paste (20+ files), Ascend mixin |
| MMLU | 4 | simple-eval, `benchmark/`, sgl-model-gateway fork, lm-eval (partial) |
| MMMU | 3 | simple-eval, lmms-eval subprocess, `benchmark/` |
| hellaswag, boolq, ceval | 1 | `benchmark/` only |
| mtbench, llm_judge | 1 | `benchmark/` only |
Key Issues
- GSM8K has 5 implementations with different interfaces (`{"score"}` vs `{"accuracy"}`), different backends (OpenAI API vs SGL program vs raw HTTP), and different answer extraction logic.
- AMD eval tests copy-paste `run_gsm8k_benchmark()` and helper functions across 20+ files (`test/registered/amd/accuracy/mi30x/` and `mi35x/`), each with minor per-model variations.
- `benchmark/` directory scripts use the legacy SGL program API (`@sgl.function` + `RuntimeEndpoint`), which overlaps with simple-evals for GSM8K, MMLU, and MMMU but is the only implementation for hellaswag, boolq, ceval, etc.
- sgl-model-gateway has a forked copy of `run_eval.py` + `simple_eval_common.py` + `simple_eval_mmlu.py` that can drift from the main implementation.
- Return value naming is inconsistent: `run_eval()` returns `{"score": float}`, `few_shot_gsm8k` returns `{"accuracy": float}`, `accuracy_test_runner` wraps both into `AccuracyTestResult`.
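To make the last point concrete, this is roughly what any consumer of both backends has to do today. The helper below is a hypothetical illustration, not code from the repo; only the return shapes (`{"score": float}` vs `{"accuracy": float}`) come from the systems listed above.

```python
# Hypothetical illustration: the two backends report the same number under
# different keys, so every consumer (e.g. anything wrapping results into
# AccuracyTestResult) ends up normalizing the key itself.
def extract_metric(result: dict) -> float:
    if "score" in result:      # run_eval() / simple-evals convention
        return float(result["score"])
    if "accuracy" in result:   # few_shot_gsm8k convention
        return float(result["accuracy"])
    raise KeyError("eval result has neither 'score' nor 'accuracy'")


extract_metric({"score": 0.91})     # simple-evals-style result
extract_metric({"accuracy": 0.88})  # few-shot GSM8K-style result
```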
Suggested Directions
- Unify the return format (`score` vs `accuracy`) across all eval backends (see the first sketch after this list)
- Extract AMD copy-pasted helpers into a shared module (see the second sketch after this list)
- Evaluate whether `benchmark/` SGL program scripts are still needed given simple-evals coverage
- Consider having sgl-model-gateway import from the main eval package instead of forking
- Consolidate GSM8K into fewer implementations (ideally: one OpenAI-API-based, one SGL-program-based)
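For the first direction, one possible shape for a unified return format is a minimal sketch like the following; `EvalResult` and `from_legacy` are hypothetical names used only for illustration, not part of the current codebase.

```python
from dataclasses import dataclass, field


@dataclass
class EvalResult:
    """Hypothetical unified result type shared by all eval backends."""
    score: float                                  # single canonical metric name
    extras: dict = field(default_factory=dict)    # backend-specific details

    @classmethod
    def from_legacy(cls, result: dict) -> "EvalResult":
        # Accept either existing key name so callers can migrate gradually.
        value = result.get("score", result.get("accuracy"))
        if value is None:
            raise KeyError("expected 'score' or 'accuracy' in legacy result")
        return cls(score=float(value), extras=result)
```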
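For the second direction, the 20+ AMD files could shrink to one shared helper plus a small per-model config. The module layout, names, and parameters below are assumptions for illustration, not the existing API.

```python
# Hypothetical shared module (e.g. a single common.py under
# test/registered/amd/accuracy/) replacing the per-file copies of
# run_gsm8k_benchmark() in the mi30x/ and mi35x/ trees.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Gsm8kCase:
    model: str
    num_shots: int = 8            # per-model variations become fields...
    num_questions: int = 200
    min_accuracy: float = 0.0     # ...instead of edited copies of the code


def run_gsm8k_benchmark(case: Gsm8kCase, eval_fn: Callable[[Gsm8kCase], Dict]) -> float:
    """Run one GSM8K eval (delegated to eval_fn) and enforce the accuracy floor."""
    result = eval_fn(case)                    # reuse the existing GSM8K eval code
    accuracy = float(result["accuracy"])      # few_shot_gsm8k-style return shape
    assert accuracy >= case.min_accuracy, (
        f"{case.model}: accuracy {accuracy:.3f} below floor {case.min_accuracy:.3f}"
    )
    return accuracy
```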
Sub-Issues & TODO
{"score"}vs{"accuracy"}— low priority, do it opportunistically when touching related codebenchmark/scripts — determine which are still needed vs already covered by simple-evals