[Serve] Duplicated arguments for vLLM frontend and engine #58937

@soodoshll

Description

What happened + What you expected to happen

vLLM recently added an argument `tokens_only` to both the frontend (code) and the engine (code). Because the same argument is now registered twice, a problem occurs when Ray Serve creates the `argparse.Namespace` object here.
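A minimal sketch of the failure mode (argument names are from this issue, but the parser setup is hypothetical, not the actual vLLM/Ray Serve code): when the frontend and engine argument sets both register the same flag on one `argparse` parser, `argparse` raises an `ArgumentError` instead of producing a `Namespace`.

```python
import argparse

parser = argparse.ArgumentParser()

# Registered once by the frontend argument group...
parser.add_argument("--tokens-only", action="store_true")

# ...and again by the engine argument group, which argparse rejects.
try:
    parser.add_argument("--tokens-only", action="store_true")
except argparse.ArgumentError as e:
    print(f"conflict: {e}")
```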

Versions / Dependencies

Ray Serve nightly, Python 3.12

Reproduction script

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "Qwen/Qwen3-VL-235B-A22B-Instruct",
        "model_source": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    },
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 2,
        },
    },
    engine_kwargs={
        "tensor_parallel_size": 4,
        "max_model_len": 32768
    },
    runtime_env={"env_vars": {"VLLM_USE_V1": "1"}},
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)

Issue Severity

None


Labels

bug, community-backlog, llm, serve, stability, triage
