
Conversation


@cquil11 cquil11 commented Dec 17, 2025

Test run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20662552221

Summary

Note: This PR will be merged 10 days from the time it is opened. This gives ample time for all parties to review.

Warmup is standard practice in production deployments, typically via Kubernetes readiness probes and pre-traffic validation; this is how production systems operate. SGLang, OpenAI, and Together AI all warm up their inference engines before serving traffic, and the benchmarking tools of major frameworks such as vLLM and SGLang already support warmup, which is widely used in production.

Reference implementations:

Our benchmarking tool closely follows the vLLM benchmarking tool: benchmarks should reflect real-world production performance, not cold-start artifacts. Adding warmup aligns our methodology with industry-standard practice.

This implementation standardizes warmup across all single-node configurations: we will now send up to 2x the concurrency level in warmup requests across the board. Additionally, as part of these changes we will upstream our benchmark suite into InferenceMAX instead of keeping it in a separate side repo.

Many months ago we already said we would add warmup, and we are providing an additional 10 days' notice as of sending this document out.

#339

Most importantly, this is standard industry practice. Warmup is supported and on by default in the official SGLang and vLLM benchmarking scripts, and it is widely used throughout the SGLang/vLLM communities. Warmup is common in real-world production environments during container startup, and is used by firms such as OpenAI, TogetherAI, and xAI, as well as OSS projects such as KServe, NVIDIA aiperf, HuggingFace, Llama.cpp, and more. As always, InferenceMAX aims to benchmark "real-world" scenarios, so adding warmup is a natural step.


Changes

Warmup

We add a --num-warmups n argument to the benchmark_serving.py script, which takes a positive integer n. When specified, benchmark_serving.py runs a short loop inside the benchmark function (sketched below) that sends the first randomly generated request to the server n times before measurement begins. In terms of how benchmarks are launched, this is a one-line change: we simply add the --num-warmups flag.
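
A minimal sketch of that warmup loop, pieced together from the snippets reviewed later in this thread (the asyncio.gather call and the progress-bar cleanup at the end are assumptions for illustration, not a verbatim copy of the implementation):

if num_warmups > 0:
    print(f"Warming up with {num_warmups} requests...")
    warmup_pbar = None if disable_tqdm else tqdm(total=num_warmups)
    # Bound in-flight warmup requests by the same concurrency limit as the benchmark.
    warmup_semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else None

    async def warmup_limited_req_fn():
        # Re-send the same randomly generated request; warmup results are discarded.
        if warmup_semaphore is None:
            return await request_func(request_func_input=test_input, pbar=warmup_pbar, session=session)
        async with warmup_semaphore:
            return await request_func(request_func_input=test_input, pbar=warmup_pbar, session=session)

    # Assumed: fire all warmup requests and wait for them before the measured run starts.
    await asyncio.gather(*(warmup_limited_req_fn() for _ in range(num_warmups)))
    if warmup_pbar is not None:
        warmup_pbar.close()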

Before:

python3 benchmark_serving.py \
    --model "$model" \
    --backend "$backend" \
    --base-url "http://0.0.0.0:$port" \
    --dataset-name random \
    --random-input-len "$input_len" \
    --random-output-len "$output_len" \
    --random-range-ratio "$random_range_ratio" \
    --num-prompts "$num_prompts" \
    --max-concurrency "$max_concurrency" \
    --request-rate inf \
    --ignore-eos \
    --save-result \
    --percentile-metrics 'ttft,tpot,itl,e2el' \
    --result-dir "$result_dir" \
    --result-filename "$result_filename.json"

After:

# The vLLM & SGLang benchmark_serving.py scripts also support warmup, and the
# InferenceMAX benchmark_serving.py closely follows the vLLM script
python3 benchmark_serving.py \
    --model "$model" \
    --backend "$backend" \
    --base-url "http://0.0.0.0:$port" \
    --dataset-name random \
    --random-input-len "$input_len" \
    --random-output-len "$output_len" \
    --random-range-ratio "$random_range_ratio" \
    --num-prompts "$num_prompts" \
    --num-warmups "$num_warmups" \
    --max-concurrency "$max_concurrency" \
    --request-rate inf \
    --ignore-eos \
    --save-result \
    --percentile-metrics 'ttft,tpot,itl,e2el' \
    --result-dir "$result_dir" \
    --result-filename "$result_filename.json"

Each benchmark run will invoke this script with up to 2 * CONC warmup requests (--num-warmups), where CONC is the maximum number of concurrent requests that can be in flight at any single point in time.
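
In the launch script this amounts to something like the following sketch (the exact wiring is an assumption; only max_concurrency and num_warmups appear in the invocation above):

# Warm up with up to twice the concurrency level before the measured run (assumed wiring).
num_warmups=$((2 * max_concurrency))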

This is the standard implementation as seen in SGLang and vLLM, so InferenceMAX will follow suit. This matters because our goal is to remain consistent with the broader ML community.

Code Upstream

Currently, we utilize a benchmark suite in a downstream repository, which is a fork of another repository, which is a fork of the original vLLM benchmark serving suite (which has since been refactored). We will be upstreaming the relevant code from this repository into InferenceMAX for maximum transparency.

Note: Multinode runs will be upstreamed at a later date, because the multinode scripts are not yet fully upstreamed to the InferenceMAX repository.


Related PRs


@cquil11 cquil11 changed the title from "[WIP] Upstream bench serving and add warmup" to "chore: upstream bench serving and add warmup" on Jan 2, 2026
@cquil11 cquil11 marked this pull request as ready for review January 2, 2026 16:53
@cquil11 cquil11 requested a review from a team as a code owner January 2, 2026 16:53

github-actions bot commented Jan 2, 2026

📊 Line Count Report

File: utils/matrix_logic/generate_sweep_configs.py

Total Lines: 684

Base Lines: 682

Change: +2 lines 📈

@cquil11 cquil11 marked this pull request as ready for review January 14, 2026 00:06

@InferenceMAX InferenceMAX deleted several comments from cursor bot Jan 15, 2026

cquil11 commented Jan 15, 2026

@claude review


claude bot commented Jan 15, 2026

Claude finished @cquil11's task in 2m 34s


Code Review Complete

Reviewed PR #339: Adding warmup functionality and upstreaming benchmark serving code.

Summary: Found 1 warning-level issue that should be fixed for Python <3.10 compatibility. The goodput comparison logic flagged by the Copilot reviewer is actually correct.

Issues Found

🟡 WARNING: Warmup semaphore handling inconsistency (benchmark_serving.py:743)

The warmup code uses contextlib.nullcontext() when max_concurrency is None, but the main benchmark explicitly avoids this pattern because it doesn't work as an async context manager in Python <3.10. Should use the same None-checking pattern as the main benchmark (lines 795-804).

Notes

  1. Goodput logic is correct: The Copilot reviewer incorrectly flagged line 567 as having inverted comparison logic. The comparison s >= r (SLO >= actual latency) is correct for checking if a request meets the SLO threshold.

  2. contextlib IS imported: Copilot's claim about missing contextlib import was incorrect - it's imported on line 29.

  3. Empty except clause: Line 730 has an empty except clause that would benefit from a comment explaining it's intentionally ignoring connection errors during the ready check loop. Not blocking.

  4. MAX_MODEL_LEN calculation: The change from isl + osl + 200 to int((isl + osl) * 1.5) looks reasonable and is consistently applied (see the worked example after this list).

  5. Warmup implementation: The --num-warmups parameter is correctly set to 2 * max_concurrency as described in the PR.
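
For illustration, a quick comparison of the two MAX_MODEL_LEN formulas with hypothetical sequence lengths (isl = 1024, osl = 1024; values chosen only to show the arithmetic):

isl, osl = 1024, 1024
old_max_model_len = isl + osl + 200          # 2248: fixed 200-token headroom
new_max_model_len = int((isl + osl) * 1.5)   # 3072: headroom scales with sequence length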

Recommendation

Fix the warmup semaphore handling for Python <3.10 compatibility before merging. The rest looks good.


    MILLISECONDS_TO_SECONDS_CONVERSION)

for req_metric in zip(*valid_metrics):
    is_good_req = all([s >= r for s, r in zip(slo_values, req_metric)])

🔴 BLOCKING (resolved as a false positive): Suspected inverted goodput comparison logic

Why it would matter: if this comparison were inverted, the goodput calculation would count requests as "good" when their latency is higher than the SLO threshold. Lower latencies are better, so a request should only be considered "good" when its actual latency is at or below the SLO.

Current logic: s >= r reads as "SLO value is greater than or equal to the request's metric value". For latency metrics, where lower is better:

  • If SLO = 100ms (threshold) and request latency = 50ms, then 100 >= 50 = True ✓ (good request, met the SLO)
  • If SLO = 100ms and request latency = 150ms, then 100 >= 150 = False ✗ (bad request, exceeded the SLO)

This is actually correct: the Copilot reviewer's flag was a false positive, and the logic is fine as-is.

if num_warmups > 0:
    print(f"Warming up with {num_warmups} requests...")
    warmup_pbar = None if disable_tqdm else tqdm(total=num_warmups)
    warmup_semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else contextlib.nullcontext()

🟡 WARNING: Inconsistent semaphore handling between warmup and main benchmark

Why it matters: The warmup code uses contextlib.nullcontext() when max_concurrency is None, but the main benchmark code at lines 795-804 explicitly avoids this pattern because contextlib.nullcontext() doesn't work as an async context manager in Python versions below 3.10 (see comment at lines 791-794).

Fix: Use the same pattern as the main benchmark:

warmup_semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else None

async def warmup_limited_req_fn():
    if warmup_semaphore is None:
        return await request_func(request_func_input=test_input, pbar=warmup_pbar, session=session)
    async with warmup_semaphore:
        return await request_func(request_func_input=test_input, pbar=warmup_pbar, session=session)

    if test_output.success:
        break
except aiohttp.ClientConnectorError as e:
    pass

INFO: Empty except clause should have an explanatory comment

While this is acceptable in a retry loop waiting for the server to start, adding a comment would improve code clarity:

except aiohttp.ClientConnectorError as e:
    # Endpoint is not ready yet; ignore connection errors and retry after delay
    pass

Not blocking, but recommended for maintainability.


Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

upstream https://github.com/kimbochen/bench_serving/tree/warmup into main InferenceMAX repo rfc on warmup

2 participants