
[Chore][CI]: K3 MP output token quantity tolerance#3030

Merged

sammshen merged 1 commit into LMCache:dev from sammshen:k3-mp-tolerance on Apr 14, 2026

Conversation


@sammshen commented on Apr 14, 2026

What this PR does / why we need it:

Special notes for your reviewers:

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

Note

Low risk: CI-only bash script changes that relax an assertion to account for known vLLM random-dataset token length drift; no production code or security-sensitive logic touched.

Overview
Updates the Buildkite vLLM bench verification scripts to treat total_input_tokens as approximate rather than requiring an exact match.

Both run-vllm-bench.sh variants now compute a 1% tolerance and use a shared check_input_tokens helper for LMCache and baseline runs, reducing spurious CI failures caused by vLLM’s random dataset re-tokenization drift.
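The PR body names the shared helper check_input_tokens and the 1% band but does not reproduce its implementation. A minimal bash sketch of what such a check could look like, assuming the expected count is still NUM_PROMPTS * RANDOM_INPUT_LEN (argument order, variable names, and messages below are illustrative, not the actual script):

```bash
#!/usr/bin/env bash
# Sketch only: the helper name check_input_tokens and the 1% tolerance come from
# the PR description; everything else here is an assumed shape, not the real script.

check_input_tokens() {
  local label="$1"     # e.g. "lmcache" or "baseline"
  local actual="$2"    # total_input_tokens parsed from the vllm bench output
  local expected="$3"  # NUM_PROMPTS * RANDOM_INPUT_LEN

  local tolerance=$(( expected / 100 ))   # 1% band
  local diff=$(( actual - expected ))
  (( diff < 0 )) && diff=$(( -diff ))     # absolute deviation

  if (( diff > tolerance )); then
    echo "ERROR: ${label} total_input_tokens=${actual} deviates from ${expected} by more than 1%"
    return 1
  fi
  echo "OK: ${label} total_input_tokens=${actual} within 1% of ${expected}"
}

# Hypothetical usage for the two runs the description mentions:
# check_input_tokens "lmcache"  "${LMCACHE_TOTAL_INPUT_TOKENS}"  $(( NUM_PROMPTS * RANDOM_INPUT_LEN ))
# check_input_tokens "baseline" "${BASELINE_TOTAL_INPUT_TOKENS}" $(( NUM_PROMPTS * RANDOM_INPUT_LEN ))
```

Returning a status rather than exiting keeps one helper reusable for both the LMCache and baseline verification passes.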

Reviewed by Cursor Bugbot for commit a9f407e.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a 1% token tolerance in the vLLM benchmark verification scripts to handle minor drifts in token counts and refactors the verification logic into a helper function. A critical issue was identified in pyproject.toml where the torch version was updated to 2.11.0, a version that does not exist on PyPI, which will lead to build failures.

Comment thread on pyproject.toml (outdated):

```diff
     "setuptools>=77.0.3,<81.0.0",
     "setuptools_scm>=8",
-    "torch==2.10.0",
+    "torch==2.11.0",
```

Severity: high

The version 2.11.0 for torch does not exist on PyPI (the current stable versions are in the 2.x range, e.g., 2.5.1, 2.6.0). This appears to be a typo, likely intended to be 2.1.1 (given the previous value was 2.10.0, which was likely a typo for 2.1.0). Using a non-existent version will cause build failures when resolving dependencies for the build system.

[CI] Allow 1% tolerance on vllm_bench total_input_tokens check

vLLM's RandomDataset decodes and re-encodes generated token sequences
(vllm/benchmarks/datasets.py) to avoid string-level drift, but the
roundtrip is not guaranteed to preserve exact token counts — the
benchmark itself only warns when token_mismatch != 0. The strict -eq
assertion against NUM_PROMPTS * RANDOM_INPUT_LEN was failing with a
0.08% overage (500400 vs 500000) on Qwen3-14B after a vLLM upgrade.

Switch to a ±1% tolerance check, which matches the benchmark's own
semantics while still catching real workload-size regressions.

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
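To make the arithmetic concrete, here is the failing case from the commit message run through both checks (the check structure mirrors the sketch above, not the literal script; only the 500000 and 500400 figures come from the message):

```bash
# Qwen3-14B case cited above: expected 500000 input tokens, observed 500400 (0.08% over).
EXPECTED=500000   # NUM_PROMPTS * RANDOM_INPUT_LEN
ACTUAL=500400

# Old check: strict equality, so any drift at all fails the job.
[ "${ACTUAL}" -eq "${EXPECTED}" ] || echo "old -eq check: FAIL (400-token drift)"

# New check: a +/-1% band (5000 tokens here), so the 400-token drift passes.
TOLERANCE=$(( EXPECTED / 100 ))
DIFF=$(( ACTUAL - EXPECTED )); (( DIFF < 0 )) && DIFF=$(( -DIFF ))
(( DIFF <= TOLERANCE )) && echo "new tolerance check: PASS (${DIFF} <= ${TOLERANCE})"
```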

@ApostaC left a comment


LGTM!

@sammshen changed the title from [Chore][CI]: K3 MP tolerance to [Chore][CI]: K3 MP output token quantity tolerance on Apr 14, 2026
@sammshen enabled auto-merge (squash) on April 14, 2026 at 22:40
@github-actions bot added the "full: Run comprehensive tests on this PR" label on Apr 14, 2026
@sammshen merged commit 9cb6322 into LMCache:dev on Apr 14, 2026
38 of 39 checks passed
ekaynar pushed a commit to ekaynar/LMCache that referenced this pull request Apr 15, 2026
[CI] Allow 1% tolerance on vllm_bench total_input_tokens check

ftian1 pushed a commit to ftian1/LMCache that referenced this pull request Apr 20, 2026
[CI] Allow 1% tolerance on vllm_bench total_input_tokens check


Labels

full (Run comprehensive tests on this PR)

3 participants