fix(vlm): add max_tokens parameter to VLM completion calls to prevent vLLM rejection (#689)
Conversation
Without max_tokens, vLLM allocates all context space to input tokens and assigns 0 output tokens, rejecting requests with "You passed N input tokens and requested 0 output tokens." Even when prompts fit, the model has no guaranteed output space, leading to truncated or empty responses.

This adds max_tokens support across all VLM backends:

- `VLMConfig`: new `max_tokens` field (default 4096)
- `VLMBase`: reads `max_tokens` from config dict
- OpenAI, VolcEngine, LiteLLM backends: pass `max_tokens` in API calls
- Conditional inclusion (`if self.max_tokens`) so `None` disables the limit

Fixes volcengine#674

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;
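The rejection the commit message describes can be illustrated with a standalone arithmetic sketch (assumed numbers, not vLLM's actual scheduling code): without `max_tokens` reserving output space, a prompt at or over the context window leaves a zero-token output budget.

```python
# Illustrative only: why a request with no reserved output space fails.
CONTEXT_LEN = 65536    # assumed context window for this example
input_tokens = 65537   # the count from the error message quoted above

# With no max_tokens, nothing is held back for generation, so the
# remaining output budget is whatever the prompt leaves over (here, none).
output_budget = max(CONTEXT_LEN - input_tokens, 0)
print(output_budget)  # 0 -> "requested 0 output tokens"
```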
qin-ctx left a comment
Overall the fix is clean and well-scoped. One design concern about the default value, and a minor code robustness suggestion.
```python
max_tokens: Optional[int] = Field(
    default=4096, description="Maximum tokens for VLM completion output"
)
```
[Design] (blocking)
The default of 4096 changes behavior for all existing users, not just vLLM users. For OpenAI / VolcEngine native APIs, omitting max_tokens lets the server choose its own (typically generous) default. Forcing 4096 could silently truncate outputs that previously worked fine.
Suggestion: default to None so that max_tokens is only sent when the user explicitly configures it. vLLM users (who need this fix) can add "max_tokens": 4096 to their config; everyone else remains unaffected.
```python
max_tokens: Optional[int] = Field(
    default=None, description="Maximum tokens for VLM completion output"
)
```

You could also call this out more prominently in the config example in the PR description or docs, so vLLM users know to set it.
```python
    "messages": [{"role": "user", "content": prompt}],
    "temperature": self.temperature,
}
if self.max_tokens:
```
[Suggestion] (non-blocking)
Using if self.max_tokens treats 0 the same as None (both falsy). While max_tokens=0 is never a valid API value, if self.max_tokens is not None is semantically clearer and avoids any edge-case surprises. Same applies to all other backends.
```python
if self.max_tokens is not None:
    kwargs["max_tokens"] = self.max_tokens
```

Change default from 4096 to None so max_tokens is only sent when explicitly configured. Prevents silently truncating outputs on OpenAI/VolcEngine where omitting max_tokens lets the server choose. Also use `is not None` instead of truthiness for max_tokens guards.

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;
Fixed in 605e954: changed the default to None and switched the guards to `is not None`.
Thanks for the fast turnaround on review and merge.
Summary
VLM completion calls in all three backends (OpenAI, VolcEngine, LiteLLM) do not pass `max_tokens` to the API. This causes two failures:

- Without `max_tokens`, vLLM allocates all context space to input and assigns 0 output tokens, returning: `You passed 65537 input tokens and requested 0 output tokens.`
- Even when prompts fit, the model has no guaranteed output space, leading to truncated or empty responses.

This is separate from #529 (prompt budget guard, fixed in #683). That PR addresses prompt assembly size. This PR addresses the missing `max_tokens` in the API calls themselves, which affects all VLM usage.

Changes
- `VLMConfig`: added `max_tokens` field with default 4096
- `VLMBase.__init__()`: reads `max_tokens` from config dict
- `openai_vlm.py`: all 4 completion methods pass `max_tokens` when set
- `volcengine_vlm.py`: all 4 completion methods pass `max_tokens` when set
- `litellm_vlm.py`: `_build_kwargs()` passes `max_tokens` when set
- `_build_vlm_config_dict()`: includes `max_tokens` in the config dict

Config example:
```json
{
  "vlm": {
    "model": "gpt-4o-mini",
    "api_key": "...",
    "max_tokens": 4096
  }
}
```

Fixes #674
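As a rough sketch of the config-to-kwargs flow the change list describes (class and method names here are illustrative stand-ins, not the repo's actual code):

```python
class VLMStub:
    """Hypothetical stand-in for a VLM backend reading max_tokens from config."""

    def __init__(self, config: dict):
        self.model = config.get("model")
        self.temperature = config.get("temperature", 0.0)
        # An absent key (or explicit None) means "let the server decide".
        self.max_tokens = config.get("max_tokens")

    def build_kwargs(self, prompt: str) -> dict:
        kwargs = {
            "model": self.model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": self.temperature,
        }
        # Only send max_tokens when it was explicitly configured.
        if self.max_tokens is not None:
            kwargs["max_tokens"] = self.max_tokens
        return kwargs

cfg = {"model": "gpt-4o-mini", "api_key": "...", "max_tokens": 4096}
print(VLMStub(cfg).build_kwargs("describe this image")["max_tokens"])  # 4096
```

With `max_tokens` omitted from the config, the key never reaches the API call, so OpenAI/VolcEngine keep their server-side defaults, which is the behavior the review asked to preserve.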
This contribution was developed with AI assistance (Claude Code).
Test plan
- Set `max_tokens: null` in config to disable the limit
- `ruff format --check` and `ruff check` pass