
Consolidate fragmented evaluation systems #21046

@hnyls2002

Description

Problem

SGLang currently has 8 separate evaluation systems that have grown organically over time, resulting in significant code duplication and maintenance burden. Several benchmarks (GSM8K, MMLU, MMMU) have 3-5 independent implementations.

Current Evaluation Systems

| # | System | Location | Interface | Purpose |
|---|--------|----------|-----------|---------|
| 1 | Simple-Evals | `python/sglang/test/run_eval.py` + `simple_eval_*.py` | OpenAI Chat API | Primary CI eval |
| 2 | Few-Shot GSM8K | `python/sglang/test/few_shot_gsm8k.py` | SGL program (`@sgl.function`) | `GSM8KMixin` |
| 3 | Accuracy Test Runner | `python/sglang/test/accuracy_test_runner.py` | Orchestration layer over #1 and #2 | Nightly multi-model |
| 4 | lm-eval harness | `python/sglang/test/kits/lm_eval_kit.py` | `lm_eval.simple_evaluate()` | YAML-driven baseline comparison |
| 5 | lmms-eval | `python/sglang/test/kits/mmmu_vlm_kit.py` | subprocess `lmms_eval` CLI | VLM MMMU |
| 6 | benchmark/ scripts | `benchmark/{gsm8k,mmlu,mmmu,...}/bench_sglang.py` | SGL program (`@sgl.function`) | Early-stage accuracy + throughput benchmarks |
| 7 | sgl-model-gateway E2E | `sgl-model-gateway/e2e_test/infra/run_eval.py` | Forked simple-evals (MMLU only) | Gateway deployment verification |
| 8 | Ascend GSM8K Mixin | `python/sglang/test/ascend/gsm8k_ascend_mixin.py` | few-shot gsm8k + NPU-specific env | Huawei Ascend NPU |
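
The "OpenAI Chat API" interface in row 1 simply means simple-evals drives a running SGLang server through its OpenAI-compatible endpoint. A minimal sketch of that call pattern follows; the endpoint URL, port, and model name are placeholders, not values taken from `run_eval.py`:

```python
# Hedged illustration of the simple-evals call pattern: plain Chat Completions
# requests against a locally running SGLang server's OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")  # placeholder port

resp = client.chat.completions.create(
    model="default",  # placeholder; the real harness passes the served model name
    messages=[{"role": "user", "content": "Question: 2 + 2 = ?\nAnswer:"}],
    temperature=0,
)
print(resp.choices[0].message.content)
```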

Duplication Analysis

| Dataset | # of Implementations | Systems |
|---------|----------------------|---------|
| GSM8K | 5 | simple-eval, few-shot SGL program, benchmark/, AMD copy-paste (20+ files), Ascend mixin |
| MMLU | 4 | simple-eval, benchmark/, sgl-model-gateway fork, lm-eval (partial) |
| MMMU | 3 | simple-eval, lmms-eval subprocess, benchmark/ |
| hellaswag, boolq, ceval | 1 | benchmark/ only |
| mtbench, llm_judge | 1 | benchmark/ only |

Key Issues

  1. GSM8K has 5 implementations with different interfaces ({"score"} vs {"accuracy"}), different backends (OpenAI API vs SGL program vs raw HTTP), and different answer extraction logic.

  2. AMD eval tests copy-paste run_gsm8k_benchmark() and helper functions across 20+ files (test/registered/amd/accuracy/mi30x/ and mi35x/), each with minor per-model variations.

  3. benchmark/ directory scripts use the legacy SGL program API (@sgl.function + RuntimeEndpoint), which overlaps with simple-evals for GSM8K, MMLU, and MMMU but is the only implementation for hellaswag, boolq, ceval, etc. (a sketch of this legacy style follows this list).

  4. sgl-model-gateway has a forked copy of run_eval.py + simple_eval_common.py + simple_eval_mmlu.py that can drift from the main implementation.

  5. Return value naming is inconsistent: run_eval() returns {"score": float}, few_shot_gsm8k returns {"accuracy": float}, accuracy_test_runner wraps both into AccuracyTestResult.
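
For issue 3, here is a minimal sketch of the legacy SGL program style the benchmark/ scripts use; the prompt, question, and endpoint are illustrative placeholders, not code copied from any `bench_sglang.py`:

```python
# Hedged sketch of the legacy frontend-language style: an @sgl.function program
# executed against a RuntimeEndpoint, as opposed to the OpenAI Chat API path.
import sglang as sgl

@sgl.function
def gsm8k_program(s, question):
    s += "Question: " + question + "\nAnswer:"
    s += sgl.gen("answer", max_tokens=256, stop="Question")

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))  # placeholder port

state = gsm8k_program.run(question="Janet has 3 apples and buys 2 more. How many does she have?")
print(state["answer"])  # answer text is then parsed by each script's own extraction logic
```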

Suggested Directions

  • Unify the return format (score vs accuracy) across all eval backends (a hypothetical sketch follows this list)
  • Extract AMD copy-pasted helpers into a shared module
  • Evaluate whether benchmark/ SGL program scripts are still needed given simple-evals coverage
  • Consider having sgl-model-gateway import from the main eval package instead of forking
  • Consolidate GSM8K into fewer implementations (ideally: one OpenAI-API-based, one SGL-program-based)
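
For the first two bullets, a hypothetical sketch of what a shared, normalized result type could look like; `EvalResult` and `normalize_result` are invented names for illustration, not existing SGLang APIs:

```python
# Hypothetical adapter that unifies the {"score": ...} / {"accuracy": ...} split
# so callers (CI mixins, accuracy_test_runner, AMD helpers) consume one shape.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalResult:
    benchmark: str
    accuracy: float                  # single canonical metric name
    num_samples: Optional[int] = None

def normalize_result(benchmark: str, raw: dict) -> EvalResult:
    # Accept either legacy key so run_eval() and few_shot_gsm8k call sites
    # can migrate incrementally instead of all at once.
    value = raw.get("accuracy", raw.get("score"))
    if value is None:
        raise KeyError("expected 'accuracy' or 'score' in eval result")
    return EvalResult(benchmark, float(value), raw.get("num_samples"))

# normalize_result("gsm8k", {"score": 0.82}) and
# normalize_result("gsm8k", {"accuracy": 0.82}) yield the same EvalResult.
```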

Sub-Issues & TODO
