
Conversation


@cquil11 cquil11 commented Dec 17, 2025

Test run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20662552221

Summary

Note: This PR will be merged 10 days from the time it is opened. This gives ample time for all parties to review.

Warmup is standard practice in production deployments, typically via Kubernetes readiness probes and pre-traffic validation; this is how production systems operate. SGLang, OpenAI, and Together AI all warm up their inference engines before serving traffic, and the benchmarking tools of major frameworks such as vLLM and SGLang already support warmup, which is widely used in production.

Reference implementations:

Our benchmarking tool closely follows the vLLM benchmarking tool: benchmarks should reflect real-world production performance, not cold-start artifacts. Adding warmup aligns our methodology with industry-standard practice.

This implementation standardizes warmup across all single-node configurations: we will now send up to 2x the concurrency level in warmup requests across the board. Additionally, as part of these changes we will upstream our benchmark suite into InferenceMAX instead of keeping it in a separate side repo.

Many months ago we already said we would add warmup, and we are providing an additional 10 days' notice as of sending this document out.

#339

Most importantly, this is standard industry practice. Warmup is supported and on by default in the official SGLang and vLLM benchmarking scripts, and it is widely used throughout the SGLang/vLLM communities. Warmup is common in real-world production environments during container startup, and is used by firms such as OpenAI, TogetherAI, and xAI, as well as OSS projects such as KServe, NVIDIA aiperf, HuggingFace, Llama.cpp, and more. As always, InferenceMAX aims to benchmark "real-world" scenarios, so adding warmup is a natural step.


Changes

Warmup

We add a --num-warmups n argument to the benchmark_serving.py script, which takes a positive integer n. When specified, benchmark_serving.py runs a short loop inside the benchmark function (sketched below) that sends the first randomly generated request to the server n times before measurement begins. In terms of how benchmarks are launched, this is a one-line change: we simply add the --num-warmups flag.
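
A minimal sketch of that warmup loop, pieced together from the snippets reviewed later in this thread (the asyncio.gather call and the progress-bar cleanup at the end are assumptions for illustration, not a verbatim copy of the implementation):

if num_warmups > 0:
    print(f"Warming up with {num_warmups} requests...")
    warmup_pbar = None if disable_tqdm else tqdm(total=num_warmups)
    # Bound in-flight warmup requests by the same concurrency limit as the benchmark.
    warmup_semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else None

    async def warmup_limited_req_fn():
        # Re-send the same randomly generated request; warmup results are discarded.
        if warmup_semaphore is None:
            return await request_func(request_func_input=test_input, pbar=warmup_pbar, session=session)
        async with warmup_semaphore:
            return await request_func(request_func_input=test_input, pbar=warmup_pbar, session=session)

    # Assumed: fire all warmup requests and wait for them before the measured run starts.
    await asyncio.gather(*(warmup_limited_req_fn() for _ in range(num_warmups)))
    if warmup_pbar is not None:
        warmup_pbar.close()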

Before:

python3 benchmark_serving.py \
    --model "$model" \
    --backend "$backend" \
    --base-url "http://0.0.0.0:$port" \
    --dataset-name random \
    --random-input-len "$input_len" \
    --random-output-len "$output_len" \
    --random-range-ratio "$random_range_ratio" \
    --num-prompts "$num_prompts" \
    --max-concurrency "$max_concurrency" \
    --request-rate inf \
    --ignore-eos \
    --save-result \
    --percentile-metrics 'ttft,tpot,itl,e2el' \
    --result-dir "$result_dir" \
    --result-filename "$result_filename.json"

After:

# The vLLM & SGLang benchmark_serving.py scripts also support warmup, and the
# InferenceMAX benchmark_serving.py closely follows the vLLM script
python3 benchmark_serving.py \
    --model "$model" \
    --backend "$backend" \
    --base-url "http://0.0.0.0:$port" \
    --dataset-name random \
    --random-input-len "$input_len" \
    --random-output-len "$output_len" \
    --random-range-ratio "$random_range_ratio" \
    --num-prompts "$num_prompts" \
    --num-warmups "$num_warmups" \
    --max-concurrency "$max_concurrency" \
    --request-rate inf \
    --ignore-eos \
    --save-result \
    --percentile-metrics 'ttft,tpot,itl,e2el' \
    --result-dir "$result_dir" \
    --result-filename "$result_filename.json"

Each benchmark run will invoke this script with up to 2 * CONC warmup requests (--num-warmups), where CONC is the maximum number of concurrent requests that can be in flight at any single point in time.
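
In the launch script this amounts to something like the following sketch (the exact wiring is an assumption; only max_concurrency and num_warmups appear in the invocation above):

# Warm up with up to twice the concurrency level before the measured run (assumed wiring).
num_warmups=$((2 * max_concurrency))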

This is the standard implementation as seen in SGLang and vLLM, so InferenceMAX will follow suit. This matters because our goal is to remain consistent with the broader ML community.

Code Upstream

Currently, we utilize a benchmark suite in a downstream repository, which is a fork of another repository, which is a fork of the original vLLM benchmark serving suite (which has since been refactored). We will be upstreaming the relevant code from this repository into InferenceMAX for maximum transparency.

Note: Multinode runs will be upstreamed at a later date, because the multinode scripts are not yet fully upstreamed to the InferenceMAX repository.


Related PRs


@cquil11 cquil11 changed the title from "[WIP] Upstream bench serving and add warmup" to "chore: upstream bench serving and add warmup" on Jan 2, 2026
@cquil11 cquil11 marked this pull request as ready for review January 2, 2026 16:53
@cquil11 cquil11 requested a review from a team as a code owner January 2, 2026 16:53

github-actions bot commented Jan 2, 2026

📊 Line Count Report

File: utils/matrix_logic/generate_sweep_configs.py

Total Lines: 684

Base Lines: 682

Change: +2 lines 📈

@cquil11 cquil11 marked this pull request as ready for review January 14, 2026 00:06

@InferenceMAX InferenceMAX deleted several comments from cursor bot Jan 15, 2026

cquil11 commented Jan 15, 2026

@claude review


claude bot commented Jan 15, 2026

Claude finished @cquil11's task in 2m 34s


Code Review Complete

Reviewed PR #339: Adding warmup functionality and upstreaming benchmark serving code.

Summary: Found 1 warning-level issue that should be fixed for Python <3.10 compatibility. The goodput comparison logic flagged by the Copilot reviewer is actually correct.

Issues Found

🟡 WARNING: Warmup semaphore handling inconsistency (benchmark_serving.py:743)

The warmup code uses contextlib.nullcontext() when max_concurrency is None, but the main benchmark explicitly avoids this pattern because it doesn't work as an async context manager in Python <3.10. Should use the same None-checking pattern as the main benchmark (lines 795-804).

Notes

  1. Goodput logic is correct: The Copilot reviewer incorrectly flagged line 567 as having inverted comparison logic. The comparison s >= r (SLO >= actual latency) is correct for checking if a request meets the SLO threshold.

  2. contextlib IS imported: Copilot's claim about missing contextlib import was incorrect - it's imported on line 29.

  3. Empty except clause: Line 730 has an empty except clause that would benefit from a comment explaining it's intentionally ignoring connection errors during the ready check loop. Not blocking.

  4. MAX_MODEL_LEN calculation: The change from isl + osl + 200 to int((isl + osl) * 1.5) looks reasonable and is consistently applied (see the worked example after this list).

  5. Warmup implementation: The --num-warmups parameter is correctly set to 2 * max_concurrency as described in the PR.
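
For illustration, a quick comparison of the two MAX_MODEL_LEN formulas with hypothetical sequence lengths (isl = 1024, osl = 1024; values chosen only to show the arithmetic):

isl, osl = 1024, 1024
old_max_model_len = isl + osl + 200          # 2248: fixed 200-token headroom
new_max_model_len = int((isl + osl) * 1.5)   # 3072: headroom scales with sequence length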

Recommendation

Fix the warmup semaphore handling for Python <3.10 compatibility before merging. The rest looks good.


    MILLISECONDS_TO_SECONDS_CONVERSION)

for req_metric in zip(*valid_metrics):
    is_good_req = all([s >= r for s, r in zip(slo_values, req_metric)])

🔴 BLOCKING (resolved as a false positive): Suspected inverted goodput comparison logic

Why it would matter: if this comparison were inverted, the goodput calculation would count requests as "good" when their latency is higher than the SLO threshold. Lower latencies are better, so a request should only be considered "good" when its actual latency is at or below the SLO.

Current logic: s >= r reads as "SLO value is greater than or equal to the request's metric value". For latency metrics, where lower is better:

  • If SLO = 100ms (threshold) and request latency = 50ms, then 100 >= 50 = True ✓ (good request, met the SLO)
  • If SLO = 100ms and request latency = 150ms, then 100 >= 150 = False ✗ (bad request, exceeded the SLO)

This is actually correct: the Copilot reviewer's flag was a false positive, and the logic is fine as-is.

if num_warmups > 0:
    print(f"Warming up with {num_warmups} requests...")
    warmup_pbar = None if disable_tqdm else tqdm(total=num_warmups)
    warmup_semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else contextlib.nullcontext()

🟡 WARNING: Inconsistent semaphore handling between warmup and main benchmark

Why it matters: The warmup code uses contextlib.nullcontext() when max_concurrency is None, but the main benchmark code at lines 795-804 explicitly avoids this pattern because contextlib.nullcontext() doesn't work as an async context manager in Python versions below 3.10 (see comment at lines 791-794).

Fix: Use the same pattern as the main benchmark:

warmup_semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else None

async def warmup_limited_req_fn():
    if warmup_semaphore is None:
        return await request_func(request_func_input=test_input, pbar=warmup_pbar, session=session)
    async with warmup_semaphore:
        return await request_func(request_func_input=test_input, pbar=warmup_pbar, session=session)

    if test_output.success:
        break
except aiohttp.ClientConnectorError as e:
    pass

INFO: Empty except clause should have an explanatory comment

While this is acceptable in a retry loop waiting for the server to start, adding a comment would improve code clarity:

except aiohttp.ClientConnectorError as e:
    # Endpoint is not ready yet; ignore connection errors and retry after delay
    pass

Not blocking, but recommended for maintainability.


Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

upstream https://github.com/kimbochen/bench_serving/tree/warmup into main InferenceMAX repo rfc on warmup

2 participants