[Bugfix] Fix bench_serve UTF-8 decode crash on split multi-byte chars #38732
ywang96 merged 1 commit into vllm-project:main from
Conversation
When streaming responses containing multi-byte UTF-8 characters
(e.g. Chinese text), HTTP chunk boundaries can split a character
across two chunks. The direct `bytes.decode("utf-8")` call crashes:

```
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 65527-65528: unexpected end of data
```
Replace the one-shot decode with codecs.IncrementalDecoder which
buffers incomplete byte sequences across add_chunk() calls and
returns them once the next chunk completes the character.
Fixes vllm-project#38717
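The buffering behavior the fix relies on can be sketched with the standard-library incremental decoder; the sample text and split point here are illustrative, not taken from the PR:

```python
import codecs

# An incremental UTF-8 decoder buffers bytes that end mid-character
# instead of raising, and emits them once the character is complete.
decoder = codecs.getincrementaldecoder("utf-8")()

data = "你好".encode("utf-8")        # 6 bytes, 3 per character
first, second = data[:4], data[4:]  # split mid-character

# A one-shot decode of the first chunk would crash:
try:
    first.decode("utf-8")
except UnicodeDecodeError as e:
    print("one-shot decode fails:", e.reason)

# The incremental decoder buffers the trailing partial sequence.
print(repr(decoder.decode(first)))   # '你' — only the complete char
print(repr(decoder.decode(second)))  # '好' — buffered byte + rest
```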
Code Review
This pull request replaces direct UTF-8 decoding with an incremental decoder in the StreamedResponseHandler to correctly handle multi-byte characters split across chunks. However, the reviewer pointed out a potential data loss issue because the decoder is never finalized; a mechanism to flush the remaining buffer at the end of the stream is required.
| """Add a chunk of bytes to the buffer and return any complete | ||
| messages.""" | ||
| chunk_str = chunk_bytes.decode("utf-8") | ||
| chunk_str = self._decoder.decode(chunk_bytes) |
The IncrementalDecoder can buffer incomplete byte sequences. If the stream ends with such an incomplete sequence, it will remain in the decoder's buffer and will be lost because there is no final call to flush the decoder. This can lead to data loss.
To fix this, you should add a mechanism to finalize the decoding process after the last chunk has been processed. This typically involves calling self._decoder.decode(b'', final=True) to flush any buffered data.
This would likely require adding a new method to StreamedResponseHandler, for example finalize(), and calling it from the request handling functions (e.g., async_request_openai_completions) after the streaming loop is complete.
Example of a finalize method:
```python
def finalize(self) -> list[str]:
    """Flushes the decoder and processes any remaining buffered data."""
    final_chunk_str = self._decoder.decode(b"", final=True)
    if not final_chunk_str:
        return []
    self.buffer += final_chunk_str
    # It's best to refactor the message processing logic from add_chunk
    # into a private helper method to be reused here.
    messages = self._process_buffer()
    if self.buffer:
        # Handle or log any remaining incomplete message in the buffer
        # after final processing.
        pass
    return messages
```

The call sites in async_request_openai_completions, async_request_openai_chat_completions, and async_request_openai_audio would need to be updated to call this finalize method.
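One caveat worth noting for such a finalize method (a minimal sketch of standard `codecs` behavior, not part of the PR): with the default strict error handling, `decode(b"", final=True)` raises on a truncated trailing sequence rather than returning the partial bytes as text, so the flush should be prepared for that (or construct the decoder with `errors="replace"`).

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# Simulate a stream that ends mid-character: 2 of the 3 bytes of '好'.
assert decoder.decode("好".encode("utf-8")[:2]) == ""  # bytes stay buffered

# Flushing with final=True does not silently return the partial bytes:
# under strict error handling it raises UnicodeDecodeError instead.
try:
    decoder.decode(b"", final=True)
except UnicodeDecodeError as exc:
    print("truncated character at end of stream:", exc.reason)
```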
Actually @he-yufeng this looks like a valid point. Should we flush the incremental decoder?
ping — bench_serve crashes on CJK output, the fix is straightforward (IncrementalDecoder).
I think it's okay to have this in - merging
…vllm-project#38732) Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
Summary
`StreamedResponseHandler.add_chunk()` calls `chunk_bytes.decode("utf-8")` directly, which crashes when a multi-byte UTF-8 character (e.g. Chinese) is split across HTTP chunk boundaries. The reporter hit this at a ~0.4% rate (4/1000 requests) during bench_serve with H100x8.
Replace the one-shot decode with `codecs.IncrementalDecoder`, which buffers incomplete byte sequences across `add_chunk()` calls and decodes them once the next chunk completes the character. This is the standard Python approach for streaming byte-to-str conversion.

Fixes #38717
Test plan
- `ruff check` and `ruff format --check` pass
- `endpoint_request_func.py:32`
- `IncrementalDecoder` correctly handles split multi-byte sequences (tested locally with artificially split bytes of Chinese characters)
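The local check described above might look like this: split a multi-byte UTF-8 stream at every possible position and verify the incremental decoder reassembles the original text. This is an illustrative sketch, not the PR's actual test code:

```python
import codecs

text = "中文测试"  # sample CJK text; any multi-byte string works
data = text.encode("utf-8")

# Try every split point, including those inside a multi-byte character,
# and check the two-chunk incremental decode matches the original text.
for i in range(len(data) + 1):
    decoder = codecs.getincrementaldecoder("utf-8")()
    out = decoder.decode(data[:i]) + decoder.decode(data[i:], final=True)
    assert out == text, f"failed at split point {i}"
print("all split points decode correctly")
```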