chore: upstream bench serving and add warmup #339
Conversation
📊 Line Count Report: Total Lines: 684, Base Lines: 682, Change: +2 lines 📈
@claude review
Claude finished @cquil11's task in 2m 34s.

Code Review Complete

Reviewed PR #339: adding warmup functionality and upstreaming benchmark serving code. Summary: found one warning-level issue that should be fixed for Python <3.10 compatibility. The goodput comparison logic flagged by the Copilot reviewer is actually correct.

Issues Found

🟡 WARNING: Warmup semaphore handling inconsistency (benchmark_serving.py:743). The warmup code uses `contextlib.nullcontext()` when `max_concurrency` is `None`, which does not work as an async context manager in Python versions below 3.10 (see the detailed comment below).

Recommendation

Fix the warmup semaphore handling for Python <3.10 compatibility before merging. The rest looks good.
```python
    MILLISECONDS_TO_SECONDS_CONVERSION)

for req_metric in zip(*valid_metrics):
    is_good_req = all([s >= r for s, r in zip(slo_values, req_metric)])
```
🔴 BLOCKING: Inverted goodput comparison logic
Why it matters: This bug would cause the goodput calculation to count requests as "good" when their latency is HIGHER than the SLO threshold, which is completely backwards. Lower latencies are better, so a request should only be considered "good" when its actual latency is less than or equal to the SLO.
Current logic: `s >= r` means "SLO is greater than or equal to the actual metric".
- Example: If the SLO is 100ms and the actual is 200ms, then `100 >= 200` is False (correctly marked as bad).
- Example: If the SLO is 100ms and the actual is 50ms, then `100 >= 50` is True (correctly marked as good).
Wait, actually, on second thought, let me reconsider this logic...
Actually, the comparison `s >= r` reads as "SLO value is greater than or equal to request metric value". For latency metrics where lower is better:
- If SLO = 100ms (threshold) and request latency = 50ms, then `100 >= 50` is True ✓ (good request, met SLO).
- If SLO = 100ms and request latency = 150ms, then `100 >= 150` is False ✗ (bad request, exceeded SLO).
This is actually CORRECT. The Copilot reviewer was wrong; the logic is fine as-is.
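To make the verdict above concrete, here is a small self-contained sketch of the same check; the SLO numbers, sample latencies, and the `good_completed` counter are illustrative and not taken from benchmark_serving.py:

```python
# Illustrative goodput check: a request is "good" only if every measured
# latency metric is at or below its SLO threshold (lower latency is better).
slo_values = [100.0, 10.0]       # hypothetical SLOs: TTFT <= 100 ms, TPOT <= 10 ms
valid_metrics = [
    [50.0, 150.0, 90.0],         # per-request TTFT values in ms
    [8.0, 12.0, 9.5],            # per-request TPOT values in ms
]

good_completed = 0
for req_metric in zip(*valid_metrics):
    # s is the SLO threshold, r is the request's measured value
    is_good_req = all([s >= r for s, r in zip(slo_values, req_metric)])
    if is_good_req:
        good_completed += 1

print(good_completed)  # -> 2: requests 1 and 3 meet both SLOs, request 2 misses both
```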
```python
if num_warmups > 0:
    print(f"Warming up with {num_warmups} requests...")
    warmup_pbar = None if disable_tqdm else tqdm(total=num_warmups)
    warmup_semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else contextlib.nullcontext()
```
🟡 WARNING: Inconsistent semaphore handling between warmup and main benchmark
Why it matters: The warmup code uses `contextlib.nullcontext()` when `max_concurrency` is `None`, but the main benchmark code at lines 795-804 explicitly avoids this pattern because `contextlib.nullcontext()` doesn't work as an async context manager in Python versions below 3.10 (see the comment at lines 791-794).
Fix: Use the same pattern as the main benchmark:
```python
warmup_semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else None

async def warmup_limited_req_fn():
    if warmup_semaphore is None:
        return await request_func(request_func_input=test_input, pbar=warmup_pbar, session=session)
    async with warmup_semaphore:
        return await request_func(request_func_input=test_input, pbar=warmup_pbar, session=session)
```

```python
        if test_output.success:
            break
    except aiohttp.ClientConnectorError as e:
        pass
```
INFO: Empty except clause should have an explanatory comment
While this is acceptable in a retry loop waiting for the server to start, adding a comment would improve code clarity:
```python
except aiohttp.ClientConnectorError as e:
    # Endpoint is not ready yet; ignore connection errors and retry after delay
    pass
```

Not blocking, but recommended for maintainability.
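For context, a minimal self-contained sketch of this kind of server-readiness retry loop; the `wait_for_server` name, the `/health` endpoint, and the 1-second delay are illustrative assumptions, not taken from the PR:

```python
import asyncio
import time

import aiohttp


async def wait_for_server(base_url: str, timeout_s: float = 300.0) -> None:
    """Poll base_url until the server accepts connections or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    async with aiohttp.ClientSession() as session:
        while time.monotonic() < deadline:
            try:
                # "/health" is an assumed readiness endpoint, not taken from the PR.
                async with session.get(f"{base_url}/health") as resp:
                    if resp.status == 200:
                        return
            except aiohttp.ClientConnectorError:
                # Endpoint is not ready yet; ignore connection errors and retry after a delay.
                pass
            await asyncio.sleep(1.0)
    raise TimeoutError(f"Server at {base_url} did not become ready within {timeout_s}s")

# Usage: asyncio.run(wait_for_server("http://0.0.0.0:8000"))
```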
Test run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20662552221
Summary
Warmup is standard practice in production deployments, typically via Kubernetes readiness probes and pre-traffic validation; this is how production systems operate. SGLang, OpenAI, and Together AI all warm up their inference engines before serving traffic. The benchmarking tools of major frameworks such as vLLM and SGLang already support warmup, and it is widely used in production.
Reference implementations:
Our benchmarking tool closely follows the vLLM benchmarking tool, and it should reflect real-world production performance, not cold-start artifacts. Adding warmup aligns our methodology with industry-standard practice.
This implementation seeks to standardize warmup across all single-node configurations. We will now send up to 2x the concurrency level in warmup requests across the board. Additionally, we will be upstreaming our benchmark suite into InferenceMAX as part of these changes instead of keeping it in a separate side repo.
Many months ago we already said we would add warmup, and we will provide an additional 10 days' notice as of sending this document out to y'all.
Most importantly, this is standard industry practice. Warmup is supported, and enabled by default, in the official SGLang and vLLM benchmarking scripts, and it is widely used throughout the SGLang/vLLM communities. Warmup is common in real-world production environments during container startup and is used by firms such as OpenAI, Together AI, and xAI, as well as OSS projects such as KServe, NVIDIA aiperf, HuggingFace, Llama.cpp, and more. As always, InferenceMAX aims to benchmark "real-world" scenarios, and as such, adding warmup is natural.
Changes
Warmup
We add the `--num-warmups n` argument to the `benchmark_serving.py` script, which takes a positive integer `n`. When specified, the `benchmark_serving.py` script invokes a warmup loop within the benchmark function that sends the first randomly generated request `n` times to the server. This ends up being a one-line change in terms of how benchmarks are launched: we just add the `--num-warmups` flag.

Before:
```bash
python3 benchmark_serving.py \
    --model "$model" \
    --backend "$backend" \
    --base-url "http://0.0.0.0:$port" \
    --dataset-name random \
    --random-input-len "$input_len" \
    --random-output-len "$output_len" \
    --random-range-ratio "$random_range_ratio" \
    --num-prompts "$num_prompts" \
    --max-concurrency "$max_concurrency" \
    --request-rate inf \
    --ignore-eos \
    --save-result \
    --percentile-metrics 'ttft,tpot,itl,e2el' \
    --result-dir "$result_dir" \
    --result-filename "$result_filename.json"
```

After:
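A sketch of the corresponding invocation with the new flag appended; everything else is unchanged, and the `$num_warmups` shell variable name is an illustrative assumption:

```bash
python3 benchmark_serving.py \
    --model "$model" \
    --backend "$backend" \
    --base-url "http://0.0.0.0:$port" \
    --dataset-name random \
    --random-input-len "$input_len" \
    --random-output-len "$output_len" \
    --random-range-ratio "$random_range_ratio" \
    --num-prompts "$num_prompts" \
    --max-concurrency "$max_concurrency" \
    --request-rate inf \
    --ignore-eos \
    --save-result \
    --percentile-metrics 'ttft,tpot,itl,e2el' \
    --result-dir "$result_dir" \
    --result-filename "$result_filename.json" \
    --num-warmups "$num_warmups"
```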
Each benchmark run will invoke this script with up to `--num-warmups=(2 * CONC)`, where `CONC` is the maximum number of concurrent requests that can be in flight at any single point in time. We note that this is the standard implementation as seen in SGLang and vLLM, hence InferenceMAX will be following suit. This is important, as it is our goal to remain consistent with the broader ML community.
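For instance, a run at a max concurrency of 64 would be preceded by up to 128 warmup requests; in shell terms (an illustrative sketch, not the PR's exact wiring):

```bash
# Send up to 2x the concurrency level as warmup requests.
num_warmups=$((2 * max_concurrency))   # e.g. max_concurrency=64 -> num_warmups=128
```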
Code Upstream
Currently, we utilize a benchmark suite in a downstream repository, which is a fork of another repository, which is a fork of the original vLLM benchmark serving suite (which has since been refactored). We will be upstreaming the relevant code from this repository into InferenceMAX for maximum transparency.
Related PRs