[CLI] Use streaming in CLI chat and completion commands (#23769)
Conversation
Code Review
This pull request introduces streaming support for the chat and complete CLI commands, which is a great enhancement for user experience. The implementation is straightforward, using helper functions to handle the streaming logic. My review includes suggestions to improve robustness by adding error handling around the streaming API calls and to enhance code clarity by adding type hints to the new helper functions. These changes will make the CLI tool more resilient and the code easier to maintain.
```python
return model_name, openai_client


def _print_chat_stream(stream) -> str:
```
To improve type safety and code readability, please add a type hint for the `stream` parameter. The openai client returns a `Stream` of `ChatCompletionChunk` objects. Using a string forward reference for the type hint is a good practice here.

You'll need to ensure the necessary types are imported within a `TYPE_CHECKING` block:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from openai import Stream
    from openai.types.chat import ChatCompletionChunk
```

Suggested change:

```python
def _print_chat_stream(stream: "Stream[ChatCompletionChunk]") -> str:
```
```python
return output


def _print_completion_stream(stream) -> str:
```
For consistency and to improve type safety, please add a type hint for the `stream` parameter. The openai client returns a `Stream` of `Completion` objects for completion requests.

You'll need to ensure the necessary types are imported within a `TYPE_CHECKING` block:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from openai import Stream
    from openai.types import Completion
```

Suggested change:

```python
def _print_completion_stream(stream: "Stream[Completion]") -> str:
```
vllm/entrypoints/cli/openai.py (outdated)

```python
stream = client.chat.completions.create(
    model=model_name, messages=conversation, stream=True)
output = _print_chat_stream(stream)
conversation.append({"role": "assistant", "content": output})
```
The streaming API call can raise exceptions (e.g., `openai.APIError`) if an issue occurs during generation. To prevent the CLI from crashing and to provide a better user experience, it's best to wrap the streaming logic in a try...except block to gracefully handle any potential errors.

```python
try:
    stream = client.chat.completions.create(
        model=model_name, messages=conversation, stream=True)
    output = _print_chat_stream(stream)
    conversation.append({"role": "assistant", "content": output})
except Exception as e:
    print(f"\nAn error occurred: {e}")
```
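One reason the whole consuming loop belongs inside the `try` block: with streaming, errors often surface while iterating over the response, not when `create()` returns. A small stand-alone illustration (a plain generator and `RuntimeError` stand in for the client stream and `openai.APIError`):

```python
def flaky_stream():
    # Yields one fragment, then fails mid-stream, the way a dropped
    # connection would during iteration rather than at create() time.
    yield "partial "
    raise RuntimeError("connection dropped mid-stream")


output = ""
try:
    for piece in flaky_stream():
        output += piece
        print(piece, end="", flush=True)
except RuntimeError as e:  # stand-in for openai.APIError
    print(f"\nAn error occurred: {e}")
```

The partial output accumulated before the failure is preserved, so the caller can still append it to the conversation or discard it deliberately.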
vllm/entrypoints/cli/openai.py (outdated)

```python
stream = client.chat.completions.create(
    model=model_name, messages=conversation, stream=True)
output = _print_chat_stream(stream)
conversation.append({"role": "assistant", "content": output})
```
The streaming API call can raise exceptions. To make the CLI more robust, please wrap this call in a try...except block to handle potential errors gracefully.

```python
try:
    stream = client.chat.completions.create(
        model=model_name, messages=conversation, stream=True)
    output = _print_chat_stream(stream)
    conversation.append({"role": "assistant", "content": output})
except Exception as e:
    print(f"\nAn error occurred: {e}")
```
vllm/entrypoints/cli/openai.py (outdated)

```python
stream = client.chat.completions.create(
    model=model_name, messages=conversation, stream=True)
output = _print_chat_stream(stream)
conversation.append({"role": "assistant", "content": output})
```
The streaming API call can raise exceptions. To make the CLI more robust, please wrap this call in a try...except block to handle potential errors gracefully.

```python
try:
    stream = client.chat.completions.create(
        model=model_name, messages=conversation, stream=True)
    output = _print_chat_stream(stream)
    conversation.append({"role": "assistant", "content": output})
except Exception as e:
    print(f"\nAn error occurred: {e}")
```

```python
stream = client.completions.create(model=model_name,
                                   prompt=args.quick,
                                   stream=True)
_print_completion_stream(stream)
```
The streaming API call can raise exceptions. To make the CLI more robust, please wrap this call in a try...except block to handle potential errors gracefully.

```python
try:
    stream = client.completions.create(model=model_name,
                                       prompt=args.quick,
                                       stream=True)
    _print_completion_stream(stream)
except Exception as e:
    print(f"\nAn error occurred: {e}")
```

```python
stream = client.completions.create(model=model_name,
                                   prompt=input_prompt,
                                   stream=True)
_print_completion_stream(stream)
```
The streaming API call can raise exceptions. To make the CLI more robust, please wrap this call in a try...except block to handle potential errors gracefully.

```python
try:
    stream = client.completions.create(model=model_name,
                                       prompt=input_prompt,
                                       stream=True)
    _print_completion_stream(stream)
except Exception as e:
    print(f"\nAn error occurred: {e}")
```
@chaunceyjiang sorry, just saw the comments. I was mostly looking at the CLI file, which doesn't really have many type hints and is designed to be simple demo code.
…#23769) Signed-off-by: simon-mo <simon.mo@hey.com>
Summary

- Use streaming responses in the `vllm chat` CLI command
- Use streaming responses in the `vllm complete` CLI command

Testing

- `ruff check vllm/entrypoints/cli/openai.py`
- `python -m py_compile vllm/entrypoints/cli/openai.py`
- `pre-commit run --files vllm/entrypoints/cli/openai.py` (fails: command not found)
- `pytest tests/v1/entrypoints/openai/test_completion.py::test_completion_streaming -q` (fails: ModuleNotFoundError: torch)

https://chatgpt.com/codex/tasks/task_e_68af54ff5ee88329b50c13bf46c0da0d