Problem
SGLang currently has 8 separate evaluation systems that have grown organically over time, resulting in significant code duplication and maintenance burden. Several benchmarks (GSM8K, MMLU, MMMU) have 3-5 independent implementations.
Current Evaluation Systems
| # | System | Location | Interface | Purpose |
|---|--------|----------|-----------|---------|
| 1 | Simple-Evals | `python/sglang/test/run_eval.py` + `simple_eval_*.py` | OpenAI Chat API | Primary CI eval |
| 2 | Few-Shot GSM8K | `python/sglang/test/few_shot_gsm8k.py` | SGL program (`@sgl.function`) | `GSM8KMixin` |
| 3 | Accuracy Test Runner | `python/sglang/test/accuracy_test_runner.py` | Orchestration layer over #1 and #2 | Nightly multi-model |
| 4 | lm-eval harness | `python/sglang/test/kits/lm_eval_kit.py` | `lm_eval.simple_evaluate()` | YAML-driven baseline comparison |
| 5 | lmms-eval | `python/sglang/test/kits/mmmu_vlm_kit.py` | subprocess `lmms_eval` CLI | VLM MMMU |
| 6 | `benchmark/` scripts | `benchmark/{gsm8k,mmlu,mmmu,...}/bench_sglang.py` | SGL program (`@sgl.function`) | Early-stage accuracy + throughput benchmarks |
| 7 | sgl-model-gateway E2E | `sgl-model-gateway/e2e_test/infra/run_eval.py` | Forked simple-evals (MMLU only) | Gateway deployment verification |
| 8 | Ascend GSM8K Mixin | `python/sglang/test/ascend/gsm8k_ascend_mixin.py` | few-shot gsm8k + NPU-specific env | Huawei Ascend NPU |
Duplication Analysis
| Dataset | # of Implementations | Systems |
|---------|----------------------|---------|
| GSM8K | 5 | simple-eval, few-shot SGL program, `benchmark/`, AMD copy-paste (20+ files), Ascend mixin |
| MMLU | 4 | simple-eval, `benchmark/`, sgl-model-gateway fork, lm-eval (partial) |
| MMMU | 3 | simple-eval, lmms-eval subprocess, `benchmark/` |
| hellaswag, boolq, ceval | 1 | `benchmark/` only |
| mtbench, llm_judge | 1 | `benchmark/` only |
Key Issues
- GSM8K has 5 implementations with different interfaces (`{"score"}` vs `{"accuracy"}`), different backends (OpenAI API vs SGL program vs raw HTTP), and different answer extraction logic.
- AMD eval tests copy-paste `run_gsm8k_benchmark()` and helper functions across 20+ files (`test/registered/amd/accuracy/mi30x/` and `mi35x/`), each with minor per-model variations.
- `benchmark/` directory scripts use the legacy SGL program API (`@sgl.function` + `RuntimeEndpoint`), which overlaps with simple-evals for GSM8K, MMLU, and MMMU but is the only implementation for hellaswag, boolq, ceval, etc.
- sgl-model-gateway has a forked copy of `run_eval.py` + `simple_eval_common.py` + `simple_eval_mmlu.py` that can drift from the main implementation.
- Return value naming is inconsistent: `run_eval()` returns `{"score": float}`, `few_shot_gsm8k` returns `{"accuracy": float}`, `accuracy_test_runner` wraps both into `AccuracyTestResult`.
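To make the last point concrete, this is roughly what any consumer of both backends has to do today. The helper below is a hypothetical illustration, not code from the repo; only the return shapes (`{"score": float}` vs `{"accuracy": float}`) come from the systems listed above.

```python
# Hypothetical illustration: the two backends report the same number under
# different keys, so every consumer (e.g. anything wrapping results into
# AccuracyTestResult) ends up normalizing the key itself.
def extract_metric(result: dict) -> float:
    if "score" in result:      # run_eval() / simple-evals convention
        return float(result["score"])
    if "accuracy" in result:   # few_shot_gsm8k convention
        return float(result["accuracy"])
    raise KeyError("eval result has neither 'score' nor 'accuracy'")


extract_metric({"score": 0.91})     # simple-evals-style result
extract_metric({"accuracy": 0.88})  # few-shot GSM8K-style result
```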
Suggested Directions
- Unify the return format (`score` vs `accuracy`) across all eval backends (see the first sketch after this list)
- Extract AMD copy-pasted helpers into a shared module (see the second sketch after this list)
- Evaluate whether `benchmark/` SGL program scripts are still needed given simple-evals coverage
- Consider having sgl-model-gateway import from the main eval package instead of forking
- Consolidate GSM8K into fewer implementations (ideally: one OpenAI-API-based, one SGL-program-based)
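For the first direction, one possible shape for a unified return format is a minimal sketch like the following; `EvalResult` and `from_legacy` are hypothetical names used only for illustration, not part of the current codebase.

```python
from dataclasses import dataclass, field


@dataclass
class EvalResult:
    """Hypothetical unified result type shared by all eval backends."""
    score: float                                  # single canonical metric name
    extras: dict = field(default_factory=dict)    # backend-specific details

    @classmethod
    def from_legacy(cls, result: dict) -> "EvalResult":
        # Accept either existing key name so callers can migrate gradually.
        value = result.get("score", result.get("accuracy"))
        if value is None:
            raise KeyError("expected 'score' or 'accuracy' in legacy result")
        return cls(score=float(value), extras=result)
```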
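For the second direction, the 20+ AMD files could shrink to one shared helper plus a small per-model config. The module layout, names, and parameters below are assumptions for illustration, not the existing API.

```python
# Hypothetical shared module (e.g. a single common.py under
# test/registered/amd/accuracy/) replacing the per-file copies of
# run_gsm8k_benchmark() in the mi30x/ and mi35x/ trees.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Gsm8kCase:
    model: str
    num_shots: int = 8            # per-model variations become fields...
    num_questions: int = 200
    min_accuracy: float = 0.0     # ...instead of edited copies of the code


def run_gsm8k_benchmark(case: Gsm8kCase, eval_fn: Callable[[Gsm8kCase], Dict]) -> float:
    """Run one GSM8K eval (delegated to eval_fn) and enforce the accuracy floor."""
    result = eval_fn(case)                    # reuse the existing GSM8K eval code
    accuracy = float(result["accuracy"])      # few_shot_gsm8k-style return shape
    assert accuracy >= case.min_accuracy, (
        f"{case.model}: accuracy {accuracy:.3f} below floor {case.min_accuracy:.3f}"
    )
    return accuracy
```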
Sub-Issues & TODO
{"score"}vs{"accuracy"}— low priority, do it opportunistically when touching related codebenchmark/scripts — determine which are still needed vs already covered by simple-evals