
fix(vlm): add max_tokens parameter to VLM completion calls to prevent vLLM rejection#689

Merged
qin-ctx merged 2 commits into volcengine:main from mvanhorn:osc/674-vlm-max-tokens
Mar 18, 2026

Conversation

@mvanhorn
Contributor

Summary

VLM completion calls in all three backends (OpenAI, VolcEngine, LiteLLM) do not pass max_tokens to the API. This causes two failures:

  1. vLLM rejects the request - without max_tokens, vLLM allocates all context space to input and assigns 0 output tokens, returning: You passed 65537 input tokens and requested 0 output tokens.
  2. No output budget - even when prompts fit, the model has no guaranteed output space, leading to truncated or empty responses.

This is separate from #529 (prompt budget guard, fixed in #683). That PR addresses prompt assembly size. This PR addresses the missing max_tokens in the API calls themselves, which affects all VLM usage.

Changes

  • VLMConfig: added max_tokens field with default 4096
  • VLMBase.__init__(): reads max_tokens from config dict
  • openai_vlm.py: all 4 completion methods pass max_tokens when set
  • volcengine_vlm.py: all 4 completion methods pass max_tokens when set
  • litellm_vlm.py: _build_kwargs() passes max_tokens when set
  • _build_vlm_config_dict(): includes max_tokens in the config dict

Config example:

{
  "vlm": {
    "model": "gpt-4o-mini",
    "api_key": "...",
    "max_tokens": 4096
  }
}
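The wiring described in the change list can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `VLMBase` and `_build_kwargs` are taken from the PR description, but the real signatures may differ, and the `is not None` guard shown here reflects the form adopted later in review (the PR as first submitted used a truthiness check and a default of 4096).

```python
from typing import Any, Optional


class VLMBase:
    """Minimal sketch of a VLM backend base class (hypothetical)."""

    def __init__(self, config: dict):
        self.model: str = config["model"]
        self.temperature: float = config.get("temperature", 0.1)
        # None (or an absent key) disables the limit entirely.
        self.max_tokens: Optional[int] = config.get("max_tokens")

    def _build_kwargs(self, prompt: str) -> dict[str, Any]:
        kwargs: dict[str, Any] = {
            "model": self.model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": self.temperature,
        }
        # Only include max_tokens in the API call when explicitly configured.
        if self.max_tokens is not None:
            kwargs["max_tokens"] = self.max_tokens
        return kwargs
```

With `"max_tokens": 4096` in the config, the key is forwarded to the completion call; without it, the request omits the parameter and the server applies its own default.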

Fixes #674

This contribution was developed with AI assistance (Claude Code).

Test plan

  • Existing VLM tests pass (no breaking change - default is 4096)
  • Deploy with vLLM backend and verify large-directory overview generation succeeds
  • Set max_tokens: null in config to disable the limit
  • Verify ruff format --check and ruff check pass

… vLLM rejection

Without max_tokens, vLLM allocates all context space to input tokens and
assigns 0 output tokens, rejecting requests with "You passed N input
tokens and requested 0 output tokens." Even when prompts fit, the model
has no guaranteed output space, leading to truncated or empty responses.

This adds max_tokens support across all VLM backends:
- VLMConfig: new max_tokens field (default 4096)
- VLMBase: reads max_tokens from config dict
- OpenAI, VolcEngine, LiteLLM backends: pass max_tokens in API calls
- Conditional inclusion (if self.max_tokens) so None disables the limit

Fixes volcengine#674

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collaborator

@qin-ctx qin-ctx left a comment


Overall the fix is clean and well-scoped. One design concern about the default value, and a minor code robustness suggestion.


max_tokens: Optional[int] = Field(
    default=4096, description="Maximum tokens for VLM completion output"
)

[Design] (blocking)

The default of 4096 changes behavior for all existing users, not just vLLM users. For OpenAI / VolcEngine native APIs, omitting max_tokens lets the server choose its own (typically generous) default. Forcing 4096 could silently truncate outputs that previously worked fine.

Suggestion: default to None so that max_tokens is only sent when the user explicitly configures it. vLLM users (who need this fix) can add "max_tokens": 4096 to their config; everyone else remains unaffected.

max_tokens: Optional[int] = Field(
    default=None, description="Maximum tokens for VLM completion output"
)

You could also call this out more prominently in the config example in the PR description or docs, so vLLM users know to set it.

    "messages": [{"role": "user", "content": prompt}],
    "temperature": self.temperature,
}
if self.max_tokens:

[Suggestion] (non-blocking)

Using if self.max_tokens treats 0 the same as None (both falsy). While max_tokens=0 is never a valid API value, if self.max_tokens is not None is semantically clearer and avoids any edge-case surprises. Same applies to all other backends.

if self.max_tokens is not None:
    kwargs["max_tokens"] = self.max_tokens
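The distinction the reviewer raises can be demonstrated in a few standalone lines (an illustration only, not project code):

```python
# Truthiness ("if x:") conflates 0 with None, while "is not None"
# only skips the genuinely-unset case.
max_tokens = 0

kwargs_truthy = {}
if max_tokens:  # 0 is falsy, so the key is silently dropped
    kwargs_truthy["max_tokens"] = max_tokens

kwargs_explicit = {}
if max_tokens is not None:  # 0 passes this check and is forwarded
    kwargs_explicit["max_tokens"] = max_tokens
```

As the reviewer notes, `max_tokens=0` is never a valid API value, so the two guards behave identically in practice here; the explicit check simply states the intent (None means "not configured") more precisely.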

Change default from 4096 to None so max_tokens is only sent when
explicitly configured. Prevents silently truncating outputs on
OpenAI/VolcEngine where omitting max_tokens lets the server choose.

Also use `is not None` instead of truthiness for max_tokens guards.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mvanhorn
Contributor Author

Fixed in 605e954: changed max_tokens default from 4096 to None. Now only sent when explicitly configured, so OpenAI/VolcEngine native APIs use their own defaults. Also switched all guards to is not None per your suggestion.

@qin-ctx qin-ctx merged commit 985c60a into volcengine:main Mar 18, 2026
5 of 6 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 18, 2026
@mvanhorn
Contributor Author

Thanks for the fast turnaround on review and merge.



Development

Successfully merging this pull request may close these issues.

[Bug]: VLM call overflow: _generate_overview() lacks prompt truncation and a max_tokens parameter
