
Conversation

@cquil11 (Collaborator) commented Dec 1, 2025

Add Eval Runs After Throughput Benchmarks

Tables for full run

https://github.com/InferenceMAX/InferenceMAX/actions/runs/19930433049

Eval Summary - dsr1_1k1k

| Model | Hardware | Framework | Precision | TP | EP | DPA | Task | EM Strict | EM Flexible | N (eff) |
|---|---|---|---|---|---|---|---|---|---|---|
| amd/DeepSeek-R1-0528-MXFP4-Preview | MI355X | SGLANG | FP4 | 4 | 1 | false | gsm8k | 94.16% ±0.65% | 94.39% ±0.63% | 1319 |
| amd/DeepSeek-R1-0528-MXFP4-Preview | MI355X | SGLANG | FP4 | 8 | 1 | false | gsm8k | 94.01% ±0.65% | 94.54% ±0.63% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | B200 | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.39% ±0.63% | 94.39% ±0.63% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | B200-TRT | TRT | FP8 | 8 | 8 | true | gsm8k | 86.28% ±0.95% | 86.81% ±0.93% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | H200 | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.77% ±0.61% | 94.84% ±0.61% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | H200 | TRT | FP8 | 8 | 8 | false | gsm8k | 86.20% ±0.95% | 85.97% ±0.96% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI300X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.69% ±0.62% | 94.84% ±0.61% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI325X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.62% ±0.62% | 95.00% ±0.60% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI355X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.54% ±0.63% | 94.69% ±0.62% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200 | SGLANG | FP4 | 4 | 4 | false | gsm8k | 94.77% ±0.61% | 94.84% ±0.61% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200 | SGLANG | FP4 | 8 | 8 | false | gsm8k | 94.47% ±0.63% | 94.92% ±0.60% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200-TRT | TRT | FP4 | 4 | 4 | true | gsm8k | 85.22% ±0.98% | 84.08% ±1.01% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200-TRT | TRT | FP4 | 8 | 8 | true | gsm8k | 85.22% ±0.98% | 84.53% ±1.00% | 1319 |

Eval Summary - gptoss_1k1k

| Model | Hardware | Framework | Precision | TP | EP | DPA | Task | EM Strict | EM Flexible | N (eff) |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-oss-120b | B200 | VLLM | FP4 | 1 | 1 | false | gsm8k | 94.84% ±0.61% | 94.77% ±0.61% | 1319 |
| gpt-oss-120b | B200 | VLLM | FP4 | 8 | 1 | false | gsm8k | 94.54% ±0.63% | 94.47% ±0.63% | 1319 |
| gpt-oss-120b | H100 | VLLM | FP4 | 2 | 1 | false | gsm8k | 95.00% ±0.60% | 95.00% ±0.60% | 1319 |
| gpt-oss-120b | H100 | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.15% ±0.59% | 95.07% ±0.60% | 1319 |
| gpt-oss-120b | H200 | VLLM | FP4 | 1 | 1 | false | gsm8k | 94.16% ±0.65% | 94.16% ±0.65% | 1319 |
| gpt-oss-120b | H200 | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.53% ±0.57% | 95.45% ±0.57% | 1319 |
| gpt-oss-120b | MI300X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.60% ±0.56% | 95.53% ±0.57% | 1319 |
| gpt-oss-120b | MI325X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.60% ±0.56% | 95.53% ±0.57% | 1319 |
| gpt-oss-120b | MI325X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.45% ±0.57% | 95.38% ±0.58% | 1319 |
| gpt-oss-120b | MI355X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.07% ±0.60% | 95.00% ±0.60% | 1319 |
| gpt-oss-120b | MI355X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.15% ±0.59% | 95.15% ±0.59% | 1319 |
| openai/gpt-oss-120b | B200-TRT | TRT | FP4 | 1 | 1 | false | gsm8k | 95.60% ±0.56% | 95.45% ±0.57% | 1319 |
| openai/gpt-oss-120b | B200-TRT | TRT | FP4 | 8 | 1 | false | gsm8k | 94.69% ±0.62% | 94.77% ±0.61% | 1319 |
| openai/gpt-oss-120b | MI300X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.22% ±0.59% | 95.15% ±0.59% | 1319 |

Eval Summary - dsr1_8k1k

| Model | Hardware | Framework | Precision | TP | EP | DPA | Task | EM Strict | EM Flexible | N (eff) |
|---|---|---|---|---|---|---|---|---|---|---|
| amd/DeepSeek-R1-0528-MXFP4-Preview | MI355X | SGLANG | FP4 | 8 | 1 | false | gsm8k | 94.39% ±0.63% | 94.62% ±0.62% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | B200 | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.62% ±0.62% | 95.15% ±0.59% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | B200-TRT | TRT | FP8 | 8 | 8 | false | gsm8k | 94.84% ±0.61% | 95.00% ±0.60% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | H200 | SGLANG | FP8 | 8 | 1 | false | gsm8k | 95.22% ±0.59% | 95.00% ±0.60% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | H200 | TRT | FP8 | 8 | 8 | true | gsm8k | 94.62% ±0.62% | 95.07% ±0.60% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI300X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.39% ±0.63% | 94.77% ±0.61% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI325X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.24% ±0.64% | 94.62% ±0.62% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI355X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.31% ±0.64% | 94.47% ±0.63% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200 | SGLANG | FP4 | 4 | 4 | false | gsm8k | 94.09% ±0.65% | 94.16% ±0.65% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200 | SGLANG | FP4 | 8 | 8 | false | gsm8k | 93.48% ±0.68% | 93.63% ±0.67% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200-TRT | TRT | FP4 | 4 | 4 | true | gsm8k | 94.16% ±0.65% | 94.47% ±0.63% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200-TRT | TRT | FP4 | 8 | 8 | true | gsm8k | 94.62% ±0.62% | 94.92% ±0.60% | 1319 |

Eval Summary - gptoss_8k1k

| Model | Hardware | Framework | Precision | TP | EP | DPA | Task | EM Strict | EM Flexible | N (eff) |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-oss-120b | B200 | VLLM | FP4 | 1 | 1 | false | gsm8k | 94.92% ±0.60% | 94.84% ±0.61% | 1319 |
| gpt-oss-120b | B200 | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.00% ±0.60% | 94.92% ±0.60% | 1319 |
| gpt-oss-120b | H100 | VLLM | FP4 | 2 | 1 | false | gsm8k | 95.15% ±0.59% | 95.15% ±0.59% | 1319 |
| gpt-oss-120b | H100 | VLLM | FP4 | 8 | 1 | false | gsm8k | 94.77% ±0.61% | 94.77% ±0.61% | 1319 |
| gpt-oss-120b | H200 | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.38% ±0.58% | 95.30% ±0.58% | 1319 |
| gpt-oss-120b | H200 | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.00% ±0.60% | 94.92% ±0.60% | 1319 |
| gpt-oss-120b | MI300X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.68% ±0.56% | 95.60% ±0.56% | 1319 |
| gpt-oss-120b | MI325X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.68% ±0.56% | 95.60% ±0.56% | 1319 |
| gpt-oss-120b | MI325X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.38% ±0.58% | 95.38% ±0.58% | 1319 |
| gpt-oss-120b | MI355X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.22% ±0.59% | 95.15% ±0.59% | 1319 |
| gpt-oss-120b | MI355X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.30% ±0.58% | 95.30% ±0.58% | 1319 |
| openai/gpt-oss-120b | B200-TRT | TRT | FP4 | 1 | 1 | false | gsm8k | 95.53% ±0.57% | 95.38% ±0.58% | 1319 |
| openai/gpt-oss-120b | B200-TRT | TRT | FP4 | 8 | 1 | false | gsm8k | 94.69% ±0.62% | 94.62% ±0.62% | 1319 |
| openai/gpt-oss-120b | MI300X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.45% ±0.57% | 95.45% ±0.57% | 1319 |

Eval Summary - dsr1_1k8k

| Model | Hardware | Framework | Precision | TP | EP | DPA | Task | EM Strict | EM Flexible | N (eff) |
|---|---|---|---|---|---|---|---|---|---|---|
| amd/DeepSeek-R1-0528-MXFP4-Preview | MI355X | SGLANG | FP4 | 8 | 1 | false | gsm8k | 95.15% ±0.59% | 95.15% ±0.59% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | B200 | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.84% ±0.61% | 95.07% ±0.60% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | B200-TRT | TRT | FP8 | 8 | 8 | false | gsm8k | 95.00% ±0.60% | 95.22% ±0.59% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | H200 | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.92% ±0.60% | 95.07% ±0.60% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | H200 | TRT | FP8 | 8 | 8 | false | gsm8k | 95.38% ±0.58% | 95.53% ±0.57% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI300X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.39% ±0.63% | 94.69% ±0.62% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI325X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.39% ±0.63% | 94.54% ±0.63% | 1319 |
| deepseek-ai/DeepSeek-R1-0528 | MI355X | SGLANG | FP8 | 8 | 1 | false | gsm8k | 94.69% ±0.62% | 95.07% ±0.60% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200 | SGLANG | FP4 | 4 | 4 | false | gsm8k | 94.31% ±0.64% | 94.31% ±0.64% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200 | SGLANG | FP4 | 8 | 8 | false | gsm8k | 94.47% ±0.63% | 94.92% ±0.60% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200-TRT | TRT | FP4 | 4 | 4 | true | gsm8k | 94.09% ±0.65% | 94.39% ±0.63% | 1319 |
| nvidia/DeepSeek-R1-0528-FP4-V2 | B200-TRT | TRT | FP4 | 8 | 8 | true | gsm8k | 94.01% ±0.65% | 94.39% ±0.63% | 1319 |

Eval Summary - gptoss_1k8k

| Model | Hardware | Framework | Precision | TP | EP | DPA | Task | EM Strict | EM Flexible | N (eff) |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-oss-120b | B200 | VLLM | FP4 | 1 | 1 | false | gsm8k | 94.84% ±0.61% | 94.77% ±0.61% | 1319 |
| gpt-oss-120b | B200 | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.07% ±0.60% | 95.00% ±0.60% | 1319 |
| gpt-oss-120b | H100 | VLLM | FP4 | 2 | 1 | false | gsm8k | 95.53% ±0.57% | 95.53% ±0.57% | 1319 |
| gpt-oss-120b | H100 | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.00% ±0.60% | 94.92% ±0.60% | 1319 |
| gpt-oss-120b | H200 | VLLM | FP4 | 1 | 1 | false | gsm8k | 94.92% ±0.60% | 94.84% ±0.61% | 1319 |
| gpt-oss-120b | H200 | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.38% ±0.58% | 95.30% ±0.58% | 1319 |
| gpt-oss-120b | MI300X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.45% ±0.57% | 95.53% ±0.57% | 1319 |
| gpt-oss-120b | MI325X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.68% ±0.56% | 95.60% ±0.56% | 1319 |
| gpt-oss-120b | MI325X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.22% ±0.59% | 95.15% ±0.59% | 1319 |
| gpt-oss-120b | MI355X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.22% ±0.59% | 95.15% ±0.59% | 1319 |
| gpt-oss-120b | MI355X | VLLM | FP4 | 8 | 1 | false | gsm8k | 95.38% ±0.58% | 95.38% ±0.58% | 1319 |
| openai/gpt-oss-120b | B200-TRT | TRT | FP4 | 1 | 1 | false | gsm8k | 0.23% ±0.13% | 0.23% ±0.13% | 1319 |
| openai/gpt-oss-120b | B200-TRT | TRT | FP4 | 8 | 1 | false | gsm8k | 93.86% ±0.66% | 93.78% ±0.67% | 1319 |
| openai/gpt-oss-120b | MI300X | VLLM | FP4 | 1 | 1 | false | gsm8k | 95.68% ±0.56% | 95.60% ±0.56% | 1319 |

TL;DR

  • Adds optional eval runs (e.g. GSM8K) that execute immediately after the throughput benchmarks, reusing the same inference server to validate that it is serving correct answers.
  • Evals are plumbed into all throughput workflows but are opt-in (RUN_EVAL=false → no change in behavior).
  • When enabled, the default eval suite is gsm8k via lm-eval, with lighteval supported as an alternative.
  • To keep CI cost reasonable, evals run only for two representative points per config:
    • lowest TP per GPU at the highest concurrency, and
    • highest TP per GPU at the highest concurrency.
  • Evals use 2-shot prompting instead of the usual 5-shot or 8-shot to keep runs short.

Motivation

Throughput optimizations can quietly trade off accuracy. Without evals, a misconfigured server (aggressive truncation, bad decoding settings, wrong endpoint parameters) can still post great throughput numbers while returning garbage answers.

This PR wires evals directly into the benchmarking flow so that:

  • Each representative throughput config has an associated numerical accuracy check.
  • We can align throughput numbers with SLAs and avoid “gaming” (e.g. lowering max_new_tokens or silently dropping tokens).
  • Adding new eval suites in the future (beyond GSM8K) is straightforward and reuses the same plumbing.

What This PR Changes

1. Optional evals for all throughput workflows

  • All throughput workflows that call benchmarks/* can now run evals immediately after throughput.
  • This is controlled via the matrix and an environment flag:
    • Matrix sets a boolean FIELD_RUN_EVAL.
    • Workflows export this as RUN_EVAL for each matrix entry.
  • Behavior:
    • RUN_EVAL unset or false → only throughput runs (current behavior).
    • RUN_EVAL=true → throughput then evals on the same server.

By default, no evals are run (opt-in), but the plumbing exists for all throughput workflows.

When evals are enabled, the default task is GSM8K:

  • EVAL_TASK defaults to gsm8k.
  • EVAL_FRAMEWORK defaults to lm-eval.
  • Both can be overridden via env for future suites.

2. Representative eval selection via matrix generation

To balance coverage and cost, we only run evals for two key points per configuration.

For each unique group, the matrix helper mark_eval_entries does the following:

  • Group key: (model, runner, framework, precision, isl, osl).
  • Within each group:
    • Find min TP and max TP.
    • For max TP:
      • Identify entries with that TP.
      • Among them, pick the highest concurrency (FIELD_CONC) → mark as eval.
    • For min TP (if different from max TP):
      • Same logic: lowest TP + highest concurrency → mark as eval.

The selected entries get: entry[FIELD_RUN_EVAL] = True

This means evals run only at the highest concurrency for the lowest and highest TP per GPU, for each (model, runner, framework, precision, ISL, OSL) combination.

Everything else runs throughput-only.
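
The selection rule is simple enough to sketch. The snippet below is illustrative only — the real logic lives in the matrix helper (mark_eval_entries); matrix.json and its lowercase field names are assumptions, and ties on concurrency would print more than one entry per TP extreme here, whereas the helper marks a single one.

```bash
# Illustrative sketch: list which (tp, conc) points per group would be marked
# for eval, assuming matrix.json is an array of entries with
# model/runner/framework/precision/isl/osl/tp/conc fields (hypothetical schema).
jq -r '
  group_by([.model, .runner, .framework, .precision, .isl, .osl])[]
  | . as $grp
  | [($grp | map(.tp) | min), ($grp | map(.tp) | max)] | unique | .[]
  | . as $t
  | ($grp | map(select(.tp == $t)) | max_by(.conc))      # highest concurrency at this TP
  | "\(.model) \(.runner): mark tp=\(.tp) conc=\(.conc) for eval"
' matrix.json
```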


3. Eval integration in runner scripts (benchmarks/*)

All runner scripts follow the same pattern:

  1. Start the server
  2. Call wait_for_server_ready.
  3. Run throughput via run_benchmark_serving.
  4. Conditionally run evals:
    • Only when RUN_EVAL=true.
    • Use run_eval + append_lm_eval_summary
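
Put together, each runner script takes roughly the following shape (a minimal sketch; launch_server is a placeholder for the framework-specific server start command, not a real helper):

```bash
# Minimal sketch of the shared runner-script pattern.
launch_server &                               # 1. start the inference server (placeholder)
wait_for_server_ready                         # 2. block until the server answers
run_benchmark_serving                         # 3. throughput benchmark
if [ "${RUN_EVAL:-false}" = "true" ]; then    # 4. optional accuracy check on the same server
    run_eval --framework "${EVAL_FRAMEWORK:-lm-eval}" --port "$PORT"
    append_lm_eval_summary
fi
```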

4. Eval Frameworks

This PR supports two eval frameworks, with a unified entrypoint and local patching to handle reasoning tokens and OpenAI-compatible endpoints.

1. lm-eval (lm-evaluation-harness)

1.1 Installation & Prep

  • _install_lm_eval_deps:
    • Installs lm-eval[api].
    • Pulls lm-evaluation-harness.
  • _patch_lm_eval: injects a sitecustomize.py that:
    • Fixes LocalChatCompletion.parse_generations:
      • Handles responses where message.content is empty but reasoning_content contains the actual answer.
      • Avoids crashes and ensures text extraction works for reasoning-style models.
    • Fixes TemplateAPI.apply_chat_template:
      • Stops injecting {"type": "text"} into the payload when there is no tokenizer / a non-HF tokenizer.
      • This was breaking TRT endpoints with strict JSON schemas.

Patched behavior is wired by adding the generated directory to PYTHONPATH.
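
The wiring itself is plain shell plumbing; roughly (a sketch — the scratch-directory handling is an assumption and the Python patch bodies are elided):

```bash
# Write a sitecustomize.py into a scratch directory and prepend that directory
# to PYTHONPATH; Python imports sitecustomize automatically at startup, so the
# monkey-patches apply to every subsequent lm-eval invocation.
PATCH_DIR="$(mktemp -d)"
cat > "${PATCH_DIR}/sitecustomize.py" <<'EOF'
# patches for LocalChatCompletion.parse_generations and
# TemplateAPI.apply_chat_template go here (elided)
EOF
export PYTHONPATH="${PATCH_DIR}${PYTHONPATH:+:${PYTHONPATH}}"
```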

1.2 Running lm-eval (run_lm_eval)

run_lm_eval wraps the lm_eval CLI:

  • Defaults:
    • task = ${EVAL_TASK:-gsm8k}
    • num_fewshot = ${NUM_FEWSHOT:-2}
    • concurrent_requests = 32
    • gen_max_tokens = 4096
    • temperature = 0, top_p = 1.0
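
In terms of the underlying CLI, the wrapped call looks roughly like the following (a sketch built from standard lm-eval flags; the exact arguments assembled by run_lm_eval may differ, and MODEL_NAME/RESULTS_DIR are placeholders):

```bash
# Approximate shape of the lm_eval invocation against the local OpenAI-compatible server.
lm_eval \
    --model local-chat-completions \
    --model_args "base_url=http://localhost:${PORT}/v1/chat/completions,model=${MODEL_NAME},num_concurrent=32,max_retries=3" \
    --tasks "${EVAL_TASK:-gsm8k}" \
    --num_fewshot "${NUM_FEWSHOT:-2}" \
    --gen_kwargs "temperature=0,top_p=1.0,max_gen_toks=4096" \
    --apply_chat_template \
    --output_path "${RESULTS_DIR}"
```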

1.3 Summarizing lm-eval results (append_lm_eval_summary)

  • Writes meta_env.json describing:
    • framework
    • precision
    • tp
    • ep
    • dp_attention
    • model
  • Runs utils/lm_eval_to_md.py to convert raw lm-eval results into SUMMARY.md.
  • If running inside GitHub Actions:
    • Appends SUMMARY.md into $GITHUB_STEP_SUMMARY (in the same runner).
  • Raw eval outputs remain under /tmp (they are not copied back into the repo workspace).
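
Putting those steps together, the flow is roughly the following (a sketch; the /tmp/eval_results path, the variable names, and the exact CLI of utils/lm_eval_to_md.py are assumptions):

```bash
# Record run metadata, render a markdown summary from the raw lm-eval output,
# and surface it in the Actions job summary when available.
cat > /tmp/eval_results/meta_env.json <<EOF
{"framework": "${FRAMEWORK}", "precision": "${PRECISION}", "tp": ${TP},
 "ep": ${EP}, "dp_attention": "${DP_ATTENTION}", "model": "${MODEL}"}
EOF
python3 utils/lm_eval_to_md.py /tmp/eval_results > /tmp/eval_results/SUMMARY.md
if [ -n "${GITHUB_STEP_SUMMARY:-}" ]; then
    cat /tmp/eval_results/SUMMARY.md >> "${GITHUB_STEP_SUMMARY}"
fi
```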

2. lighteval + litellm

While lm-eval is the default, this PR also supports lighteval as an alternative backend via the unified run_eval wrapper.

2.1 Installation & patching

  • _install_lighteval_deps:
    • Installs lighteval and litellm.
  • _patch_lighteval_litellm via sitecustomize.py:
    • Disables sglang imports:
      • Some lighteval versions attempt to import sglang, which crashes with our version mismatches.
      • We patch lighteval.utils.imports.is_package_available("sglang") to always return False.
    • Patches LiteLLMClient to be OpenAI-server friendly:
      • Removes response_format={"type": "text"} which interferes with vLLM endpoints.
      • Handles reasoning-only responses via reasoning_content.
      • Adds retry/backoff logic around litellm completions.
    • Switches parallel evaluation to threads:
      • Replaces async concurrency with ThreadPoolExecutor(self.concurrent_requests) to avoid stalls under high load.
    • Returns ModelResponse with text and reasonings separated for downstream extraction.

2.2 Running lighteval (run_lighteval_eval)

  • Expects MODEL_NAME to be set (will error otherwise).
  • Wraps the model with an OpenAI-style prefix:
    • lite_model="openai/${MODEL_NAME}"
  • Builds MODEL_ARGS for lighteval:
    • model_name=${lite_model},base_url=${base_url},api_key=${OPENAI_API_KEY},generation_parameters={temperature:0.0,top_p=1,max_new_tokens:2048},concurrent_requests=${concurrent_requests}
  • Task specification:
    • TASK_SPEC="${task}|${num_fewshot}"

3. Unified eval entrypoint (run_eval)

run_eval abstracts over frameworks:

  • Defaults:
    • EVAL_FRAMEWORK=lm-eval
    • EVAL_TASK=gsm8k
  • Runner scripts can override via env or by passing --framework explicitly.
  • All additional arguments (e.g. --port, --concurrent-requests, --results-dir) are forwarded to the underlying framework-specific function.
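
A simplified view of the dispatch (a sketch — the real function's argument parsing and forwarding are more complete):

```bash
# Dispatch to the framework-specific runner based on --framework / EVAL_FRAMEWORK.
run_eval() {
    local framework="${EVAL_FRAMEWORK:-lm-eval}"
    if [ "${1:-}" = "--framework" ]; then
        framework="$2"
        shift 2
    fi
    case "${framework}" in
        lm-eval)   run_lm_eval "$@" ;;
        lighteval) run_lighteval_eval "$@" ;;
        *)         echo "Unknown eval framework: ${framework}" >&2; return 1 ;;
    esac
}
```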

Future Work / Notes

  • Currently the default behavior is unchanged for most users:
    • Evals are off by default (RUN_EVAL=false).
    • Only selected matrix entries (lowest & highest TP per GPU at max concurrency) enable RUN_EVAL=true.
  • The plumbing is now in place to:
    • Add more eval suites (e.g. MMLU, Math, custom internal tasks) via EVAL_TASK and utils/evals/*.
  • Token count optimizations.
  • GB200 multinode evals (#268)

Note

Adds an optional evaluation phase to throughput runs and aggregates results into a single summary.

  • CI: Add run-eval input/env to benchmark templates; append "eval" to job name; upload eval artifacts and clean up
  • New collect-evals.yml workflow to download eval artifacts, tabulate results via utils/collect_eval_results.py, and upload agg_eval_*.json
  • Bench scripts: add run_eval/append_lm_eval_summary in benchmarks/benchmark_lib.sh (with patches for lm-eval and lighteval/LiteLLM); call evals conditionally in all benchmarks/* scripts
  • Runner scripts: pass RUN_EVAL/RUNNER_TYPE; minor serving tweaks (served model name, tokenizer pass-through, disable request logs)
  • Matrix logic: mark eval subset (lowest/highest TP at max conc) via mark_eval_entries; support --run-evals/--evals-only; validation adds run-eval
  • Results: enhance utils/summarize.py; add eval task specs under utils/evals/* and new utils/collect_eval_results.py for summary generation

Written by Cursor Bugbot for commit c902545. This will update automatically on new commits.

Comment on lines +122 to +126
# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT" --concurrent-requests $(( $CONC * 2 ))
    append_lm_eval_summary
fi
Contributor

Is this easy to extend to multi-eval without needing to change the API contract again, @Oseltamivir?

Collaborator

Will be modified so that repeated calls:

run_eval
append_lm_eval_summary

run_eval
append_lm_eval_summary

will produce one JSON row per eval in the aggregate.

@Oseltamivir linked an issue Jan 15, 2026 that may be closed by this pull request
@cquil11 (Collaborator, Author) commented Jan 15, 2026

@claude review. If there are comments by other code reviewers, please only include them if they abide by our standards for a useful, high-signal comment.


claude bot commented Jan 15, 2026

Claude finished @cquil11's task in 4m 46s.


Code Review Complete

Reviewed PR #258: Adding evals after throughput benchmarks

Summary

Found 1 blocking issue that must be fixed before merge. The issue causes eval and benchmark failures to be silently masked, making CI jobs appear successful while data is missing.

Blocking Issues

I've posted inline comments for the critical bug:

  • Exit code masking in benchmarks/benchmark_lib.sh (lines 185, 315) - The set +x command returns 0, masking failures from eval and benchmark commands

Notes on Existing Review Comments

Most of the Cursor bot review comments reference issues in older commits that have since been fixed:

  • Hardcoded num-prompts values → Fixed (now uses $NUM_PROMPTS)
  • Hardcoded DP_ATTENTION → Fixed (now uses $DP_ATTENTION variable)
  • Missing MODEL fallback in lighteval → Not blocking (lighteval framework not currently used)

Verification

The PR correctly implements:

  • ✅ Conditional eval execution via RUN_EVAL flag
  • ✅ Matrix entry selection for eval runs (lowest/highest TP at max concurrency)
  • ✅ Eval artifact uploads with if-no-files-found: ignore
  • ✅ Integration with lm-eval framework with proper patching

Action Required: Fix the exit code masking bug before merge.



Labels: enhancement (New feature or request)

Projects: Status: In Progress

Linked issue (may be closed by this PR): add GSM8k eval quality benchmark CI

5 participants