Adding evals after throughput benchmarks #258
Conversation
```bash
# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
  run_eval --framework lm-eval --port "$PORT" --concurrent-requests "$(( CONC * 2 ))"
  append_lm_eval_summary
fi
```
Is this easy to extend to multi-eval without needing to change the API contract again? @Oseltamivir
Will modify so that:

run_eval
append_lm_eval_summary
run_eval
append_lm_eval_summary

will produce one JSON row per eval in the aggregate.
@claude review. if there are comments by other code reviewers, please only include them if they abide by our standards for a useful high signal comment
Claude finished @cquil11's task in 4m 46s.

Code Review Complete

Reviewed PR #258: Adding evals after throughput benchmarks

Summary

Found 1 blocking issue that must be fixed before merge. The issue causes eval and benchmark failures to be silently masked, making CI jobs appear successful while data is missing.

Blocking Issues

I've posted inline comments for the critical bug.

Notes on Existing Review Comments

Most of the Cursor bot review comments reference issues in older commits that have since been fixed.

Verification

The PR correctly implements:

Action Required: Fix the exit code masking bug before merge.
Add Eval Runs After Throughput Benchmarks
Tables for full run
https://github.com/InferenceMAX/InferenceMAX/actions/runs/19930433049
Eval Summary - dsr1_1k1k
Eval Summary - gptoss_1k1k
Eval Summary - dsr1_8k1k
Eval Summary - gptoss_8k1k
Eval Summary - dsr1_1k8k
Eval Summary - gptoss_1k8k
TL;DR

- Evals are opt-in (`RUN_EVAL=false` → no change in behavior).
- The default eval task is `gsm8k` via `lm-eval`, with support for lighteval as an alternative.

Motivation
Throughput optimizations can quietly trade off accuracy (e.g. via aggressive truncation, decoding tweaks, or endpoint misconfiguration). Without evals, a misconfigured server can still produce great throughput numbers but garbage answers.

This PR wires evals directly into the benchmarking flow so that such misconfigurations surface as accuracy regressions (e.g. a bad `max_new_tokens` or silently dropping tokens).

What This PR Changes
1. Optional evals for all throughput workflows
- Runner scripts under `benchmarks/*` now have the ability to run evals immediately after throughput.
- Matrix generation marks eval entries via `FIELD_RUN_EVAL`, which sets `RUN_EVAL` for each matrix entry.
- `RUN_EVAL` unset or `false` → only throughput runs (current behavior).
- `RUN_EVAL=true` → throughput, then evals on the same server.

By default, no evals are run (opt-in), but the plumbing exists for all throughput workflows.
When evals are enabled, the default task is GSM8K:
- `EVAL_TASK` defaults to `gsm8k`.
- `EVAL_FRAMEWORK` defaults to `lm-eval`.

2. Representative eval selection via matrix generation
To balance coverage and cost, we only run evals for two key points per configuration.
For each unique (model, runner, framework, precision, ISL, OSL) group, the matrix helper `mark_eval_entries` picks the entries to evaluate. The selected entries get `entry[FIELD_RUN_EVAL] = True`.

This means evals run only at the highest concurrency for the lowest and highest TP per GPU for each (model, runner, framework, precision, ISL, OSL) combo.
Everything else runs throughput-only.
3. Eval integration in runner scripts (benchmarks/*)

All runner scripts follow the same pattern: after the throughput run, call `run_eval` + `append_lm_eval_summary` when `RUN_EVAL=true`, as shown in the sketch below.
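A minimal sketch of that pattern (`run_throughput_benchmark` is a placeholder name, not the PR's actual helper). Note that the eval calls are not wrapped in `|| true`; masking their exit codes is exactly the concern raised in the review above.

```bash
# Sketch of a runner script's tail; run_throughput_benchmark is a
# placeholder for the existing throughput step.
run_throughput_benchmark

if [ "${RUN_EVAL:-false}" = "true" ]; then
  # Do not append `|| true` here: masking the eval exit code would make
  # CI look green while eval data is missing.
  run_eval --framework "${EVAL_FRAMEWORK:-lm-eval}" \
    --port "$PORT" \
    --concurrent-requests "$(( CONC * 2 ))"
  append_lm_eval_summary
fi
```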
4. Eval Frameworks
This PR supports two eval frameworks, with a unified entrypoint and local patching to handle reasoning tokens and OpenAI-compatible endpoints.
1. lm-eval (lm-evaluation-harness)
1.1 Installation & Prep
- `_install_lm_eval_deps` installs the lm-eval dependencies.
- `_patch_lm_eval` injects a `sitecustomize.py` that patches `LocalChatCompletion.parse_generations` and `TemplateAPI.apply_chat_template`, and adds `{"type": "text"}` into the payload when there is no tokenizer / non-HF tokenizer.
- Patched behavior is wired by adding the generated directory to `PYTHONPATH`.

1.2 Running lm-eval (run_lm_eval)

`run_lm_eval` wraps the lm_eval CLI with these defaults:

- task = `${EVAL_TASK:-gsm8k}`
- num_fewshot = `${NUM_FEWSHOT:-2}`
- concurrent_requests = 32
- gen_max_tokens = 4096
- temperature = 0, top_p = 1.0
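As a rough sketch of what such a wrapper can look like: the `local-chat-completions` model type and the CLI flags below come from the lm-evaluation-harness CLI, but the argument order, `LM_EVAL_PATCH_DIR`, and the `MODEL_NAME` plumbing are assumptions, not the PR's exact code.

```bash
# Hypothetical sketch of run_lm_eval; flag names follow the lm-eval CLI,
# the surrounding variable names are assumptions.
run_lm_eval() {
  local port="$1"
  local concurrent_requests="${2:-32}"
  local results_dir="${3:-/tmp/lm_eval_results}"

  # Pick up the local patches injected by _patch_lm_eval.
  PYTHONPATH="${LM_EVAL_PATCH_DIR:-/tmp/lm_eval_patch}:${PYTHONPATH:-}" \
  lm_eval \
    --model local-chat-completions \
    --model_args "base_url=http://localhost:${port}/v1/chat/completions,model=${MODEL_NAME},num_concurrent=${concurrent_requests}" \
    --tasks "${EVAL_TASK:-gsm8k}" \
    --num_fewshot "${NUM_FEWSHOT:-2}" \
    --gen_kwargs "temperature=0,top_p=1.0,max_gen_toks=4096" \
    --apply_chat_template \
    --output_path "${results_dir}"
}
```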
1.3 Summarizing lm-eval results (append_lm_eval_summary)

- Writes a `meta_env.json` describing the run: `framework`, `precision`, `tp`, `ep`, `dp_attention`, `model`.
- Calls `utils/lm_eval_to_md.py` to convert raw lm-eval results into `SUMMARY.md`.
- Appends `SUMMARY.md` into `$GITHUB_STEP_SUMMARY` (in the same runner).
- Raw results stay under `/tmp` (they are not copied back into the repo workspace).
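A minimal sketch of that flow, assuming env vars like `FRAMEWORK`, `PRECISION`, `TP`, `EP`, `DP_ATTENTION`, and `MODEL_NAME` carry the serving config and that `utils/lm_eval_to_md.py` takes the results directory as its argument (both are assumptions, not the PR's exact interface):

```bash
# Hypothetical sketch of append_lm_eval_summary; the CLI of
# utils/lm_eval_to_md.py and the env var names are assumptions.
append_lm_eval_summary() {
  local results_dir="${1:-/tmp/lm_eval_results}"

  # Record the serving configuration alongside the raw results.
  cat > "${results_dir}/meta_env.json" <<EOF
{
  "framework": "${FRAMEWORK}",
  "precision": "${PRECISION}",
  "tp": ${TP:-1},
  "ep": ${EP:-1},
  "dp_attention": ${DP_ATTENTION:-false},
  "model": "${MODEL_NAME}"
}
EOF

  # Convert raw lm-eval output into a markdown table.
  python3 utils/lm_eval_to_md.py "${results_dir}" > "${results_dir}/SUMMARY.md"

  # Surface the table in the job summary of the same runner.
  cat "${results_dir}/SUMMARY.md" >> "${GITHUB_STEP_SUMMARY}"
}
```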
2. lighteval + litellm

While `lm-eval` is the default, this PR also supports lighteval as an alternative backend via the unified `run_eval` wrapper.

2.1 Installation & patching

- `_install_lighteval_deps` installs `lighteval` and `litellm`.
- `_patch_lighteval_litellm` applies patches via `sitecustomize.py`:
  - Disables `sglang` imports: lighteval probes for `sglang`, which crashes with our version mismatches, so `lighteval.utils.imports.is_package_available("sglang")` is patched to always return `False`.
  - Patches `LiteLLMClient` to be OpenAI-server friendly:
    - Removes `response_format={"type": "text"}`, which interferes with vLLM endpoints.
    - Handles `reasoning_content` returned by `litellm` completions.
    - Uses `ThreadPoolExecutor(self.concurrent_requests)` to avoid stalls under high load.
    - Returns `ModelResponse` with `text` and `reasonings` separated for downstream extraction.

2.2 Running lighteval (run_lighteval_eval)
- Requires `MODEL_NAME` to be set (will error otherwise).
- Builds `lite_model="openai/${MODEL_NAME}"`.
- Builds `MODEL_ARGS` for lighteval:
  model_name=${lite_model},base_url=${base_url},api_key=${OPENAI_API_KEY},generation_parameters={temperature:0.0,top_p=1,max_new_tokens:2048},concurrent_requests=${concurrent_requests}
- Builds `TASK_SPEC="${task}|${num_fewshot}"`.
3. Unified eval entrypoint (run_eval)

`run_eval` abstracts over frameworks:

- Defaults: `EVAL_FRAMEWORK=lm-eval`, `EVAL_TASK=gsm8k`.
- A different framework can be selected by passing `--framework` explicitly.
- Common arguments (`--port`, `--concurrent-requests`, `--results-dir`) are forwarded to the underlying framework-specific function.

A sketch of this dispatcher follows.
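A minimal sketch of the dispatcher, assuming the framework-specific functions take positional port/concurrency/results-dir arguments (an assumption; the description only specifies the option names):

```bash
# Hypothetical sketch of run_eval; option names match the description,
# the argument plumbing into the framework-specific functions is assumed.
run_eval() {
  local framework="${EVAL_FRAMEWORK:-lm-eval}"
  local port="" concurrent_requests="32" results_dir="/tmp/eval_results"

  while [ $# -gt 0 ]; do
    case "$1" in
      --framework)           framework="$2"; shift 2 ;;
      --port)                port="$2"; shift 2 ;;
      --concurrent-requests) concurrent_requests="$2"; shift 2 ;;
      --results-dir)         results_dir="$2"; shift 2 ;;
      *) echo "run_eval: unknown option $1" >&2; return 1 ;;
    esac
  done

  case "${framework}" in
    lm-eval)   run_lm_eval "${port}" "${concurrent_requests}" "${results_dir}" ;;
    lighteval) run_lighteval_eval "${port}" "${concurrent_requests}" "${results_dir}" ;;
    *) echo "run_eval: unsupported framework '${framework}'" >&2; return 1 ;;
  esac
}
```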
Future Work / Notes

- Evals stay opt-in for now (default `RUN_EVAL=false`).
- More matrix entries can be opted in by setting `RUN_EVAL=true`.
- Additional eval tasks can be added via `EVAL_TASK` and `utils/evals/*`.

Note
Adds an optional evaluation phase to throughput runs and aggregates results into a single summary.
- Add a `run-eval` input/env to benchmark templates; append "eval" to the job name; upload eval artifacts and clean up.
- New `collect-evals.yml` workflow to download eval artifacts, tabulate results via `utils/collect_eval_results.py`, and upload `agg_eval_*.json`.
- Add `run_eval`/`append_lm_eval_summary` in `benchmarks/benchmark_lib.sh` (with patches for lm-eval and lighteval/LiteLLM); call evals conditionally in all `benchmarks/*` scripts.
- Plumb `RUN_EVAL`/`RUNNER_TYPE`; minor serving tweaks (served model name, tokenizer pass-through, disable request logs).
- Add `mark_eval_entries`; support `--run-evals`/`--evals-only`; validation adds `run-eval`.
- Update `utils/summarize.py`; add eval task specs under `utils/evals/*` and new `utils/collect_eval_results.py` for summary generation.

Written by Cursor Bugbot for commit c902545. This will update automatically on new commits.