
Conversation

@cquil11 (Collaborator) commented Dec 1, 2025

Add Eval Runs After Throughput Benchmarks

Tables for full run

https://github.com/InferenceMAX/InferenceMAX/actions/runs/19930433049

Eval Summary - dsr1_1k1k

| Model | Hardware | Framework | Precision | TP | EP | DPA | Task | EM Strict | EM Flexible | N (eff) |
|---|---|---|---|---|---|---|---|---|---|---|
| amd/DeepSeek-R1-0528-MXFP4-Preview | MI355X | SGLANG | FP4 | 4 | 1 | false | gsm8k | 94.16% ±0.65% | 94.39% ±0.63% | 1319 |
| amd/DeepSeek-R1-0528-MXFP4-Preview | MI355X | SGLANG | FP4 | 8 | 1 | false | gsm8k | 94.01% ±0.65% | 94.54% ±0.63% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | B200 | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.39% ±0.63% | 94.39% ±0.63% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | B200-TRT | TRT | FP8 | 8 | 8 | true | gsm8k | 86.28% ±0.95% | 86.81% ±0.93% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | H200 | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.77% ±0.61% | 94.84% ±0.61% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | H200 | TRT | FP8 | 8 | 8 | false | gsm8k | 86.20% ±0.95% | 85.97% ±0.96% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI300X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.69% ±0.62% | 94.84% ±0.61% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI325X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.62% ±0.62% | 95.00% ±0.60% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI355X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.54% ±0.63% | 94.69% ±0.62% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200 | SGLANG | FP4 | 4 | 4 | false | gsm8k | 94.77% ±0.61% | 94.84% ±0.61% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200 | SGLANG | FP4 | 8 | 8 | false | gsm8k | 94.47% ±0.63% | 94.92% ±0.60% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200-TRT | TRT | FP4 | 4 | 4 | true | gsm8k | 85.22% ±0.98% | 84.08% ±1.01% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200-TRT | TRT | FP4 | 8 | 8 | true | gsm8k | 85.22% ±0.98% | 84.53% ±1.00% | 1319 |

Eval Summary - gptoss_1k1k

| Model | Hardware | Framework | Precision | TP | EP | DPA | Task | EM Strict | EM Flexible | N (eff) |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-oss-120b | B200 | VLLM | FP4 | 1 | 1 | false | gsm8k | 94.84% ±0.61% | 94.77% ±0.61% | 1319 |
| gpt-oss-120b | B200 | VLLM | FP4 | 8 | 1 | false | gsm8k | 94.54% ±0.63% | 94.47% ±0.63% | 1319 |
| gpt-oss-120b | H100 | VLLM | FP4 | 2 | 1 | false | gsm8k | 95.00% ±0.60% | 95.00% ±0.60% | 1319 |
| gpt-oss-120b | H100 | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.15% ±0.59% | 95.07% ±0.60% | 1319 |
| gpt-oss-120b | H200 | VLLM | FP4 | 1 | 1 | false | gsm8k | 94.16% ±0.65% | 94.16% ±0.65% | 1319 |
| gpt-oss-120b | H200 | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.53% ±0.57% | 95.45% ±0.57% | 1319 |
| gpt-oss-120b | MI300X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.60% ±0.56% | 95.53% ±0.57% | 1319 |
| gpt-oss-120b | MI325X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.60% ±0.56% | 95.53% ±0.57% | 1319 |
| gpt-oss-120b | MI325X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.45% ±0.57% | 95.38% ±0.58% | 1319 |
| gpt-oss-120b | MI355X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.07% ±0.60% | 95.00% ±0.60% | 1319 |
| gpt-oss-120b | MI355X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.15% ±0.59% | 95.15% ±0.59% | 1319 |
| openai/gpt-oss-120b | B200-TRT | TRT | FP4 | 1 | 1 | false | gsm8k | 95.60% ±0.56% | 95.45% ±0.57% | 1319 |
| openai/gpt-oss-120b | B200-TRT | TRT | FP4 | 8 | 1 | false | gsm8k | 94.69% ±0.62% | 94.77% ±0.61% | 1319 |
| openai/gpt-oss-120b | MI300X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.22% ±0.59% | 95.15% ±0.59% | 1319 |

Eval Summary - dsr1_8k1k

| Model | Hardware | Framework | Precision | TP | EP | DPA | Task | EM Strict | EM Flexible | N (eff) |
|---|---|---|---|---|---|---|---|---|---|---|
| amd/DeepSeek-R1-0528-MXFP4-Preview | MI355X | SGLANG | FP4 | 8 | 1 | false | gsm8k | 94.39% ±0.63% | 94.62% ±0.62% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | B200 | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.62% ±0.62% | 95.15% ±0.59% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | B200-TRT | TRT | FP8 | 8 | 8 | false | gsm8k | 94.84% ±0.61% | 95.00% ±0.60% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | H200 | SGLANG | FP8 | 8 | 1 | false | gsm8k | 95.22% ±0.59% | 95.00% ±0.60% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | H200 | TRT | FP8 | 8 | 8 | true | gsm8k | 94.62% ±0.62% | 95.07% ±0.60% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI300X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.39% ±0.63% | 94.77% ±0.61% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI325X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.24% ±0.64% | 94.62% ±0.62% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI355X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.31% ±0.64% | 94.47% ±0.63% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200 | SGLANG | FP4 | 4 | 4 | false | gsm8k | 94.09% ±0.65% | 94.16% ±0.65% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200 | SGLANG | FP4 | 8 | 8 | false | gsm8k | 93.48% ±0.68% | 93.63% ±0.67% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200-TRT | TRT | FP4 | 4 | 4 | true | gsm8k | 94.16% ±0.65% | 94.47% ±0.63% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200-TRT | TRT | FP4 | 8 | 8 | true | gsm8k | 94.62% ±0.62% | 94.92% ±0.60% | 1319 |

Eval Summary - gptoss_8k1k

| Model | Hardware | Framework | Precision | TP | EP | DPA | Task | EM Strict | EM Flexible | N (eff) |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-oss-120b | B200 | VLLM | FP4 | 1 | 1 | false | gsm8k | 94.92% ±0.60% | 94.84% ±0.61% | 1319 |
| gpt-oss-120b | B200 | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.00% ±0.60% | 94.92% ±0.60% | 1319 |
| gpt-oss-120b | H100 | VLLM | FP4 | 2 | 1 | false | gsm8k | 95.15% ±0.59% | 95.15% ±0.59% | 1319 |
| gpt-oss-120b | H100 | VLLM | FP4 | 8 | 1 | false | gsm8k | 94.77% ±0.61% | 94.77% ±0.61% | 1319 |
| gpt-oss-120b | H200 | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.38% ±0.58% | 95.30% ±0.58% | 1319 |
| gpt-oss-120b | H200 | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.00% ±0.60% | 94.92% ±0.60% | 1319 |
| gpt-oss-120b | MI300X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.68% ±0.56% | 95.60% ±0.56% | 1319 |
| gpt-oss-120b | MI325X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.68% ±0.56% | 95.60% ±0.56% | 1319 |
| gpt-oss-120b | MI325X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.38% ±0.58% | 95.38% ±0.58% | 1319 |
| gpt-oss-120b | MI355X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.22% ±0.59% | 95.15% ±0.59% | 1319 |
| gpt-oss-120b | MI355X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.30% ±0.58% | 95.30% ±0.58% | 1319 |
| openai/gpt-oss-120b | B200-TRT | TRT | FP4 | 1 | 1 | false | gsm8k | 95.53% ±0.57% | 95.38% ±0.58% | 1319 |
| openai/gpt-oss-120b | B200-TRT | TRT | FP4 | 8 | 1 | false | gsm8k | 94.69% ±0.62% | 94.62% ±0.62% | 1319 |
| openai/gpt-oss-120b | MI300X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.45% ±0.57% | 95.45% ±0.57% | 1319 |

Eval Summary - dsr1_1k8k

| Model | Hardware | Framework | Precision | TP | EP | DPA | Task | EM Strict | EM Flexible | N (eff) |
|---|---|---|---|---|---|---|---|---|---|---|
| amd/DeepSeek-R1-0528-MXFP4-Preview | MI355X | SGLANG | FP4 | 8 | 1 | false | gsm8k | 95.15% ±0.59% | 95.15% ±0.59% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | B200 | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.84% ±0.61% | 95.07% ±0.60% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | B200-TRT | TRT | FP8 | 8 | 8 | false | gsm8k | 95.00% ±0.60% | 95.22% ±0.59% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | H200 | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.92% ±0.60% | 95.07% ±0.60% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | H200 | TRT | FP8 | 8 | 8 | false | gsm8k | 95.38% ±0.58% | 95.53% ±0.57% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI300X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.39% ±0.63% | 94.69% ±0.62% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI325X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.39% ±0.63% | 94.54% ±0.63% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI355X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.69% ±0.62% | 95.07% ±0.60% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200 | SGLANG | FP4 | 4 | 4 | false | gsm8k | 94.31% ±0.64% | 94.31% ±0.64% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200 | SGLANG | FP4 | 8 | 8 | false | gsm8k | 94.47% ±0.63% | 94.92% ±0.60% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200-TRT | TRT | FP4 | 4 | 4 | true | gsm8k | 94.09% ±0.65% | 94.39% ±0.63% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200-TRT | TRT | FP4 | 8 | 8 | true | gsm8k | 94.01% ±0.65% | 94.39% ±0.63% | 1319 |

Eval Summary - gptoss_1k8k

| Model | Hardware | Framework | Precision | TP | EP | DPA | Task | EM Strict | EM Flexible | N (eff) |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-oss-120b | B200 | VLLM | FP4 | 1 | 1 | false | gsm8k | 94.84% ±0.61% | 94.77% ±0.61% | 1319 |
| gpt-oss-120b | B200 | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.07% ±0.60% | 95.00% ±0.60% | 1319 |
| gpt-oss-120b | H100 | VLLM | FP4 | 2 | 1 | false | gsm8k | 95.53% ±0.57% | 95.53% ±0.57% | 1319 |
| gpt-oss-120b | H100 | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.00% ±0.60% | 94.92% ±0.60% | 1319 |
| gpt-oss-120b | H200 | VLLM | FP4 | 1 | 1 | false | gsm8k | 94.92% ±0.60% | 94.84% ±0.61% | 1319 |
| gpt-oss-120b | H200 | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.38% ±0.58% | 95.30% ±0.58% | 1319 |
| gpt-oss-120b | MI300X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.45% ±0.57% | 95.53% ±0.57% | 1319 |
| gpt-oss-120b | MI325X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.68% ±0.56% | 95.60% ±0.56% | 1319 |
| gpt-oss-120b | MI325X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.22% ±0.59% | 95.15% ±0.59% | 1319 |
| gpt-oss-120b | MI355X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.22% ±0.59% | 95.15% ±0.59% | 1319 |
| gpt-oss-120b | MI355X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.38% ±0.58% | 95.38% ±0.58% | 1319 |
| openai/gpt-oss-120b | B200-TRT | TRT | FP4 | 1 | 1 | false | gsm8k | 0.23% ±0.13% | 0.23% ±0.13% | 1319 |
| openai/gpt-oss-120b | B200-TRT | TRT | FP4 | 8 | 1 | false | gsm8k | 93.86% ±0.66% | 93.78% ±0.67% | 1319 |
| openai/gpt-oss-120b | MI300X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.68% ±0.56% | 95.60% ±0.56% | 1319 |

TL;DR

  • Adds optional eval runs (e.g. GSM8K) that execute immediately after the throughput benchmarks, reusing the same inference server to validate that it is serving correct answers.
  • Evals are plumbed into all throughput workflows but are opt-in (RUN_EVAL=false → no change in behavior).
  • When enabled, the default eval suite is gsm8k via lm-eval, with lighteval supported as an alternative.
  • To keep CI cost reasonable, evals run only for two representative points per config:
    • lowest TP per GPU at the highest concurrency, and
    • highest TP per GPU at the highest concurrency.
  • Evals use 2-shot prompting instead of the usual 5-shot or 8-shot to keep runs short.

Motivation

Throughput optimizations can quietly trade off accuracy. Without evals, a misconfigured server (aggressive truncation, bad decoding settings, wrong endpoint parameters) can still post great throughput numbers while returning garbage answers.

This PR wires evals directly into the benchmarking flow so that:

  • Each representative throughput config has an associated numerical accuracy check.
  • We can align throughput numbers with SLAs and avoid “gaming” (e.g. lowering max_new_tokens or silently dropping tokens).
  • Adding new eval suites in the future (beyond GSM8K) is straightforward and reuses the same plumbing.

What This PR Changes

1. Optional evals for all throughput workflows

  • All throughput workflows that call benchmarks/* can now run evals immediately after throughput.
  • This is controlled via the matrix and an environment flag:
    • Matrix sets a boolean FIELD_RUN_EVAL.
    • Workflows export this as RUN_EVAL for each matrix entry.
  • Behavior:
    • RUN_EVAL unset or false → only throughput runs (current behavior).
    • RUN_EVAL=true → throughput then evals on the same server.

By default, no evals are run (opt-in), but the plumbing exists for all throughput workflows.

When evals are enabled, the default task is GSM8K:

  • EVAL_TASK defaults to gsm8k.
  • EVAL_FRAMEWORK defaults to lm-eval.
  • Both can be overridden via env for future suites.

2. Representative eval selection via matrix generation

To balance coverage and cost, we only run evals for two key points per configuration.

For each unique group, the matrix helper mark_eval_entries does the following:

  • Group key: (model, runner, framework, precision, isl, osl).
  • Within each group:
    • Find min TP and max TP.
    • For max TP:
      • Identify entries with that TP.
      • Among them, pick the highest concurrency (FIELD_CONC) → mark as eval.
    • For min TP (if different from max TP):
      • Same logic: lowest TP + highest concurrency → mark as eval.

The selected entries get: entry[FIELD_RUN_EVAL] = True

This means evals run only at the highest concurrency for the lowest and highest TP per GPU, for each (model, runner, framework, precision, ISL, OSL) combination.

Everything else runs throughput-only.
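
The selection rule is simple enough to sketch. The snippet below is illustrative only — the real logic lives in the matrix helper (mark_eval_entries); matrix.json and its lowercase field names are assumptions, and ties on concurrency would print more than one entry per TP extreme here, whereas the helper marks a single one.

```bash
# Illustrative sketch: list which (tp, conc) points per group would be marked
# for eval, assuming matrix.json is an array of entries with
# model/runner/framework/precision/isl/osl/tp/conc fields (hypothetical schema).
jq -r '
  group_by([.model, .runner, .framework, .precision, .isl, .osl])[]
  | . as $grp
  | [($grp | map(.tp) | min), ($grp | map(.tp) | max)] | unique | .[]
  | . as $t
  | ($grp | map(select(.tp == $t)) | max_by(.conc))      # highest concurrency at this TP
  | "\(.model) \(.runner): mark tp=\(.tp) conc=\(.conc) for eval"
' matrix.json
```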


3. Eval integration in runner scripts (benchmarks/*)

All runner scripts follow the same pattern:

  1. Start the server
  2. Call wait_for_server_ready.
  3. Run throughput via run_benchmark_serving.
  4. Conditionally run evals:
    • Only when RUN_EVAL=true.
    • Use run_eval + append_lm_eval_summary
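
Put together, each runner script takes roughly the following shape (a minimal sketch; launch_server is a placeholder for the framework-specific server start command, not a real helper):

```bash
# Minimal sketch of the shared runner-script pattern.
launch_server &                               # 1. start the inference server (placeholder)
wait_for_server_ready                         # 2. block until the server answers
run_benchmark_serving                         # 3. throughput benchmark
if [ "${RUN_EVAL:-false}" = "true" ]; then    # 4. optional accuracy check on the same server
    run_eval --framework "${EVAL_FRAMEWORK:-lm-eval}" --port "$PORT"
    append_lm_eval_summary
fi
```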

4. Eval Frameworks

This PR supports two eval frameworks, with a unified entrypoint and local patching to handle reasoning tokens and OpenAI-compatible endpoints.

1. lm-eval (lm-evaluation-harness)

1.1 Installation & Prep

  • _install_lm_eval_deps:
    • Installs lm-eval[api].
    • Pulls lm-evaluation-harness.
  • _patch_lm_eval: injects a sitecustomize.py that:
    • Fixes LocalChatCompletion.parse_generations:
      • Handles responses where message.content is empty but reasoning_content contains the actual answer.
      • Avoids crashes and ensures text extraction works for reasoning-style models.
    • Fixes TemplateAPI.apply_chat_template:
      • Stops injecting {"type": "text"} into the payload when there is no tokenizer / a non-HF tokenizer.
      • This was breaking TRT endpoints with strict JSON schemas.

Patched behavior is wired by adding the generated directory to PYTHONPATH.
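
The wiring itself is plain shell plumbing; roughly (a sketch — the scratch-directory handling is an assumption and the Python patch bodies are elided):

```bash
# Write a sitecustomize.py into a scratch directory and prepend that directory
# to PYTHONPATH; Python imports sitecustomize automatically at startup, so the
# monkey-patches apply to every subsequent lm-eval invocation.
PATCH_DIR="$(mktemp -d)"
cat > "${PATCH_DIR}/sitecustomize.py" <<'EOF'
# patches for LocalChatCompletion.parse_generations and
# TemplateAPI.apply_chat_template go here (elided)
EOF
export PYTHONPATH="${PATCH_DIR}${PYTHONPATH:+:${PYTHONPATH}}"
```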

1.2 Running lm-eval (run_lm_eval)

run_lm_eval wraps the lm_eval CLI:

  • Defaults:
    • task = ${EVAL_TASK:-gsm8k}
    • num_fewshot = ${NUM_FEWSHOT:-2}
    • concurrent_requests = 32
    • gen_max_tokens = 4096
    • temperature = 0, top_p = 1.0
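
In terms of the underlying CLI, the wrapped call looks roughly like the following (a sketch built from standard lm-eval flags; the exact arguments assembled by run_lm_eval may differ, and MODEL_NAME/RESULTS_DIR are placeholders):

```bash
# Approximate shape of the lm_eval invocation against the local OpenAI-compatible server.
lm_eval \
    --model local-chat-completions \
    --model_args "base_url=http://localhost:${PORT}/v1/chat/completions,model=${MODEL_NAME},num_concurrent=32,max_retries=3" \
    --tasks "${EVAL_TASK:-gsm8k}" \
    --num_fewshot "${NUM_FEWSHOT:-2}" \
    --gen_kwargs "temperature=0,top_p=1.0,max_gen_toks=4096" \
    --apply_chat_template \
    --output_path "${RESULTS_DIR}"
```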

1.3 Summarizing lm-eval results (append_lm_eval_summary)

  • Writes meta_env.json describing:
    • framework
    • precision
    • tp
    • ep
    • dp_attention
    • model
  • Runs utils/lm_eval_to_md.py to convert raw lm-eval results into SUMMARY.md.
  • If running inside GitHub Actions:
    • Appends SUMMARY.md into $GITHUB_STEP_SUMMARY (in the same runner).
  • Raw eval outputs remain under /tmp (they are not copied back into the repo workspace).
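
Putting those steps together, the flow is roughly the following (a sketch; the /tmp/eval_results path, the variable names, and the exact CLI of utils/lm_eval_to_md.py are assumptions):

```bash
# Record run metadata, render a markdown summary from the raw lm-eval output,
# and surface it in the Actions job summary when available.
cat > /tmp/eval_results/meta_env.json <<EOF
{"framework": "${FRAMEWORK}", "precision": "${PRECISION}", "tp": ${TP},
 "ep": ${EP}, "dp_attention": "${DP_ATTENTION}", "model": "${MODEL}"}
EOF
python3 utils/lm_eval_to_md.py /tmp/eval_results > /tmp/eval_results/SUMMARY.md
if [ -n "${GITHUB_STEP_SUMMARY:-}" ]; then
    cat /tmp/eval_results/SUMMARY.md >> "${GITHUB_STEP_SUMMARY}"
fi
```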

2. lighteval + litellm

While lm-eval is the default, this PR also supports lighteval as an alternative backend via the unified run_eval wrapper.

2.1 Installation & patching

  • _install_lighteval_deps:
    • Installs lighteval and litellm.
  • _patch_lighteval_litellm via sitecustomize.py:
    • Disables sglang imports:
      • Some lighteval versions attempt to import sglang, which crashes with our version mismatches.
      • We patch lighteval.utils.imports.is_package_available("sglang") to always return False.
    • Patches LiteLLMClient to be OpenAI-server friendly:
      • Removes response_format={"type": "text"} which interferes with vLLM endpoints.
      • Handles reasoning-only responses via reasoning_content.
      • Adds retry/backoff logic around litellm completions.
    • Switches parallel evaluation to threads:
      • Replaces async concurrency with ThreadPoolExecutor(self.concurrent_requests) to avoid stalls under high load.
    • Returns ModelResponse with text and reasonings separated for downstream extraction.

2.2 Running lighteval (run_lighteval_eval)

  • Expects MODEL_NAME to be set (will error otherwise).
  • Wraps the model with an OpenAI-style prefix:
    • lite_model="openai/${MODEL_NAME}"
  • Builds MODEL_ARGS for lighteval:
    • model_name=${lite_model},base_url=${base_url},api_key=${OPENAI_API_KEY},generation_parameters={temperature:0.0,top_p=1,max_new_tokens:2048},concurrent_requests=${concurrent_requests}
  • Task specification:
    • TASK_SPEC="${task}|${num_fewshot}"

3. Unified eval entrypoint (run_eval)

run_eval abstracts over frameworks:

  • Defaults:
    • EVAL_FRAMEWORK=lm-eval
    • EVAL_TASK=gsm8k
  • Runner scripts can override via env or by passing --framework explicitly.
  • All additional arguments (e.g. --port, --concurrent-requests, --results-dir) are forwarded to the underlying framework-specific function.
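
A simplified view of the dispatch (a sketch — the real function's argument parsing and forwarding are more complete):

```bash
# Dispatch to the framework-specific runner based on --framework / EVAL_FRAMEWORK.
run_eval() {
    local framework="${EVAL_FRAMEWORK:-lm-eval}"
    if [ "${1:-}" = "--framework" ]; then
        framework="$2"
        shift 2
    fi
    case "${framework}" in
        lm-eval)   run_lm_eval "$@" ;;
        lighteval) run_lighteval_eval "$@" ;;
        *)         echo "Unknown eval framework: ${framework}" >&2; return 1 ;;
    esac
}
```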

Future Work / Notes

  • Currently the default behavior is unchanged for most users:
    • Evals are off by default (RUN_EVAL=false).
    • Only selected matrix entries (lowest & highest TP per GPU at max concurrency) enable RUN_EVAL=true.
  • The plumbing is now in place to:
    • Add more eval suites (e.g. MMLU, Math, custom internal tasks) via EVAL_TASK and utils/evals/*.
  • Token count optimizations.
  • GB200 multinode evals (#268)

Note

Adds an optional evaluation phase to throughput runs and aggregates results into a single summary.

  • CI: Add run-eval input/env to benchmark templates; append "eval" to job name; upload eval artifacts and clean up
  • New collect-evals.yml workflow to download eval artifacts, tabulate results via utils/collect_eval_results.py, and upload agg_eval_*.json
  • Bench scripts: add run_eval/append_lm_eval_summary in benchmarks/benchmark_lib.sh (with patches for lm-eval and lighteval/LiteLLM); call evals conditionally in all benchmarks/* scripts
  • Runner scripts: pass RUN_EVAL/RUNNER_TYPE; minor serving tweaks (served model name, tokenizer pass-through, disable request logs)
  • Matrix logic: mark eval subset (lowest/highest TP at max conc) via mark_eval_entries; support --run-evals/--evals-only; validation adds run-eval
  • Results: enhance utils/summarize.py; add eval task specs under utils/evals/* and new utils/collect_eval_results.py for summary generation

Written by Cursor Bugbot for commit c902545. This will update automatically on new commits.

Comment on lines +122 to +126
# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT" --concurrent-requests $(( $CONC * 2 ))
    append_lm_eval_summary
fi
Contributor

Is this easy to extend to multi-eval without needing to change the API contract again, @Oseltamivir?

Collaborator

Will be modified so that repeated calls:

run_eval
append_lm_eval_summary

run_eval
append_lm_eval_summary

will produce one JSON row per eval in the aggregate.

@Oseltamivir linked an issue Jan 15, 2026 that may be closed by this pull request
@cquil11 (Collaborator, Author) commented Jan 15, 2026

@claude review. If there are comments by other code reviewers, please only include them if they abide by our standards for a useful, high-signal comment.


claude bot commented Jan 15, 2026

Claude finished @cquil11's task in 4m 46s.


Code Review Complete

Reviewed PR #258: Adding evals after throughput benchmarks

Summary

Found 1 blocking issue that must be fixed before merge. The issue causes eval and benchmark failures to be silently masked, making CI jobs appear successful while data is missing.

Blocking Issues

I've posted inline comments for the critical bug:

  • Exit code masking in benchmarks/benchmark_lib.sh (lines 185, 315) - The set +x command returns 0, masking failures from eval and benchmark commands

Notes on Existing Review Comments

Most of the Cursor bot review comments reference issues in older commits that have since been fixed:

  • Hardcoded num-prompts values → Fixed (now uses $NUM_PROMPTS)
  • Hardcoded DP_ATTENTION → Fixed (now uses $DP_ATTENTION variable)
  • Missing MODEL fallback in lighteval → Not blocking (lighteval framework not currently used)

Verification

The PR correctly implements:

  • ✅ Conditional eval execution via RUN_EVAL flag
  • ✅ Matrix entry selection for eval runs (lowest/highest TP at max concurrency)
  • ✅ Eval artifact uploads with if-no-files-found: ignore
  • ✅ Integration with lm-eval framework with proper patching

Action Required: Fix the exit code masking bug before merge.



Labels: enhancement (New feature or request)

Projects: Status: In Progress

Linked issue (may be closed by this PR): add GSM8k eval quality benchmark CI

5 participants