[Chore][CI]: K3 MP output token quantity tolerance#3030
Conversation
Code Review
This pull request introduces a 1% token tolerance in the vLLM benchmark verification scripts to handle minor drifts in token counts and refactors the verification logic into a helper function. A critical issue was identified in pyproject.toml where the torch version was updated to 2.11.0, a version that does not exist on PyPI, which will lead to build failures.
| "setuptools>=77.0.3,<81.0.0", | ||
| "setuptools_scm>=8", | ||
| "torch==2.10.0", | ||
| "torch==2.11.0", |
The version 2.11.0 of torch does not exist on PyPI (current stable releases are, e.g., 2.5.1 and 2.6.0). This appears to be a typo, likely intended to be 2.1.1, given that the previous value 2.10.0 was itself likely a typo for 2.1.0. Pinning a non-existent version will cause build failures when resolving the build system's dependencies.
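For reference, one way to confirm which torch versions are actually published on PyPI before pinning one (these commands are not part of the PR; `pip index versions` requires pip 21.2 or newer and is still marked experimental):

```bash
# List the torch releases published on PyPI.
pip index versions torch

# Alternative: query the PyPI JSON API directly and print the release numbers.
curl -s https://pypi.org/pypi/torch/json | python3 -c \
  "import json, sys; print(sorted(json.load(sys.stdin)['releases']))"
```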
Force-pushed from d8d6c5a to d421250.
vLLM's RandomDataset decodes and re-encodes generated token sequences (vllm/benchmarks/datasets.py) to avoid string-level drift, but the roundtrip is not guaranteed to preserve exact token counts — the benchmark itself only warns when token_mismatch != 0. The strict -eq assertion against NUM_PROMPTS * RANDOM_INPUT_LEN was failing with a 0.08% overage (500400 vs 500000) on Qwen3-14B after a vLLM upgrade. Switch to a ±1% tolerance check, which matches the benchmark's own semantics while still catching real workload-size regressions. Signed-off-by: Samuel Shen <slshen@uchciago.edu>
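To illustrate the change described above, here is a minimal sketch of a ±1% tolerance check in bash. The variable names (`NUM_PROMPTS`, `RANDOM_INPUT_LEN`, `total_input_tokens`) follow the commit message; the concrete values are only illustrative, chosen so the expected product matches the 500000 figure mentioned there, and the actual script in this PR may be structured differently.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative values: 500 prompts x 1000 input tokens = 500000 expected.
NUM_PROMPTS=500
RANDOM_INPUT_LEN=1000
total_input_tokens=500400   # e.g. parsed from the vllm bench output (0.08% overage)

expected=$(( NUM_PROMPTS * RANDOM_INPUT_LEN ))   # 500000
tol=$(( expected / 100 ))                        # 1% -> 5000
lower=$(( expected - tol ))                      # 495000
upper=$(( expected + tol ))                      # 505000

# The old strict check, [ "$total_input_tokens" -eq "$expected" ], fails on the overage.
if (( total_input_tokens < lower || total_input_tokens > upper )); then
  echo "total_input_tokens=${total_input_tokens} outside [${lower}, ${upper}]" >&2
  exit 1
fi
echo "total_input_tokens=${total_input_tokens} is within 1% of ${expected}"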
Force-pushed from d421250 to a9f407e.
[CI] Allow 1% tolerance on vllm_bench total_input_tokens check vLLM's RandomDataset decodes and re-encodes generated token sequences (vllm/benchmarks/datasets.py) to avoid string-level drift, but the roundtrip is not guaranteed to preserve exact token counts — the benchmark itself only warns when token_mismatch != 0. The strict -eq assertion against NUM_PROMPTS * RANDOM_INPUT_LEN was failing with a 0.08% overage (500400 vs 500000) on Qwen3-14B after a vLLM upgrade. Switch to a ±1% tolerance check, which matches the benchmark's own semantics while still catching real workload-size regressions. Signed-off-by: Samuel Shen <slshen@uchciago.edu> Co-authored-by: Samuel Shen <slshen@uchciago.edu>
Note: Low risk. CI-only bash script changes that relax an assertion to account for known vLLM random-dataset token-length drift; no production code or security-sensitive logic is touched.
Overview
Updates the Buildkite vLLM bench verification scripts to treat `total_input_tokens` as approximate rather than requiring an exact match. Both `run-vllm-bench.sh` variants now compute a 1% tolerance and use a shared `check_input_tokens` helper for the LMCache and baseline runs, reducing spurious CI failures caused by vLLM's random-dataset re-tokenization drift.
Reviewed by Cursor Bugbot for commit a9f407e.
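A rough sketch of what such a shared helper could look like. The function name `check_input_tokens` comes from the overview above, but its argument order, the run labels, and the variables holding the measured counts (`lmcache_input_tokens`, `baseline_input_tokens`) are assumptions, not the actual implementation in `run-vllm-bench.sh`.

```bash
# Hypothetical shape of the shared helper: verify a measured token count is
# within 1% of the expected total, labelled per benchmark run.
check_input_tokens() {
  local label="$1" measured="$2" expected="$3"
  local tol=$(( expected / 100 ))   # 1% tolerance
  if (( measured < expected - tol || measured > expected + tol )); then
    echo "[${label}] total_input_tokens=${measured} not within 1% of ${expected}" >&2
    return 1
  fi
  echo "[${label}] total_input_tokens=${measured} ok (expected ~${expected})"
}

expected_tokens=$(( NUM_PROMPTS * RANDOM_INPUT_LEN ))
# Invoked once per run, e.g. for the LMCache-enabled and baseline configurations.
check_input_tokens "lmcache"  "${lmcache_input_tokens}"  "${expected_tokens}"
check_input_tokens "baseline" "${baseline_input_tokens}" "${expected_tokens}"
```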