issue: missing tokens when streaming on fast inference providers #15850

@kalebwalton

Description

Check Existing Issues

  • I have searched the existing issues and discussions.
  • I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

v0.6.16

Ollama Version (if applicable)

No response

Operating System

Windows Sequoia

Browser (if applicable)

No response

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

Streaming output provides all streamed content and does not miss any parts

Actual Behavior

Streaming output occasionally misses a stream chunk (a few characters). It is often unnoticeable, and you may assume it's a model issue or an inference provider issue; however, I have validated that the issue occurs with multiple models from multiple inference providers.

Steps to Reproduce

  1. Work around issue #15848 ("debug logging of streamed responses fails with TypeError: not all arguments converted during string formatting") by monkeypatching backend/open_webui/utils/middleware.py, replacing
    log.debug("Error: ", e)
    with log.debug(f"Error: {e}") (this lets debug logging print streaming errors properly)
  2. Run latest with docker run -d --name openwebui -p 3000:8080 -e GLOBAL_LOG_LEVEL=debug -v /path/to/monkeypatched_middleware.py:/app/backend/open_webui/utils/middleware.py -v openwebui-data:/app/backend/data --restart unless-stopped ghcr.io/open-webui/open-webui:latest
  3. Configure with any model like Cerebras qwen-3-235b-a22b or OpenAI gpt-4o-mini
  4. Run a prompt like 'print a bunch of stuff'
  5. Check the logs for errors like those indicated below
  6. If you don't see the error, rerun the prompt a few times (or try another prompt that outputs many tokens) and it will show up

**NOTE:** I believe this happens more frequently on faster streaming models like OpenAI gpt-4o-mini or Cerebras qwen-3-235b-a22b.
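The errors in the logs below are consistent with a single SSE `data:` event being delivered across two reads and each fragment being parsed as if it were a complete line. A minimal illustration of that failure mode (the event payload and split point here are made up, not taken from the actual middleware):

```python
import json

# Hypothetical SSE event whose JSON payload arrives split across two
# network chunks at an arbitrary byte boundary.
event = 'data: {"choices": [{"delta": {"content": "hello world"}}]}'
chunk_a, chunk_b = event[:45], event[45:]  # split lands mid-string

# A handler that treats each chunk as a complete line will fail on the
# first fragment with the same class of error seen in the logs below.
payload = chunk_a[len("data: "):]
try:
    json.loads(payload)
except json.JSONDecodeError as e:
    print(f"Error: {e}")  # e.g. "Unterminated string starting at: ..."
```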

Logs & Screenshots

2025-07-18 20:35:40.116 | DEBUG    | open_webui.utils.middleware:stream_body_handler:2058 - Error: Unterminated string starting at: line 1 column 139 (char 138) - {}
2025-07-18 20:35:40.137 | DEBUG    | open_webui.utils.middleware:stream_body_handler:2058 - Error: Unterminated string starting at: line 1 column 172 (char 171) - {}
2025-07-18 20:35:40.268 | DEBUG    | open_webui.utils.middleware:stream_body_handler:2058 - Error: Expecting ':' delimiter: line 1 column 171 (char 170) - {}
2025-07-18 20:35:40.306 | DEBUG    | open_webui.utils.middleware:stream_body_handler:2058 - Error: Unterminated string starting at: line 1 column 7 (char 6) - {}
2025-07-18 20:35:40.325 | DEBUG    | open_webui.utils.middleware:stream_body_handler:2058 - Error: Unterminated string starting at: line 1 column 173 (char 172) - {}

Additional Information

I have investigated this at length. If you add some debugging after line 1786, you'll find that (1) every so often a valid JSON event is split across two line iterations, where the first iteration contains part of the data including the beginning of the JSON string and the second contains the remainder, and (2) the individual lines do not contain line endings.
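One way such a split could be healed in the line loop is to accumulate fragments until the concatenated text parses as complete JSON. A hypothetical sketch, not the actual middleware code (the chunk values are invented):

```python
import json

def iter_events(chunks):
    """Hypothetical buffered parser: accumulate fragments until the
    concatenated buffer parses as complete JSON, then emit the event."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        try:
            yield json.loads(buffer)
            buffer = ""
        except json.JSONDecodeError:
            continue  # fragment: keep accumulating

# One event fragmented mid-string, followed by a whole event.
chunks = ['{"delta": {"content": "he', 'llo"}}',
          '{"delta": {"content": " world"}}']
events = list(iter_events(chunks))
print(events)  # [{'delta': {'content': 'hello'}}, {'delta': {'content': ' world'}}]
```

Note this simple version breaks if two complete events ever arrive concatenated in one chunk, since `json.loads` would then raise "Extra data" forever.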

I dug around and found the response-handling code, which seems to use aiohttp.ClientSession, but when I tried to follow it through I got a bit confused.

I don't know whether the correct solution is to buffer in Open WebUI's middleware.py where it processes the lines (which is awkward because the line endings never show up, so you can only key on something like }), or whether the fix belongs at a lower level so that SSE JSON lines are never fragmented in the first place.
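Since line endings can't be relied on, a middleware-side buffer could instead key on JSON completeness itself via `json.JSONDecoder.raw_decode`, which also copes with several complete events landing in one chunk. A sketch under those assumptions (the `feed` helper and chunk values are hypothetical, not Open WebUI code):

```python
import json

decoder = json.JSONDecoder()

def feed(buffer, chunk):
    """Hypothetical framer keyed on JSON completeness rather than line
    endings: returns (complete_events, leftover_buffer)."""
    buffer += chunk
    events = []
    while buffer:
        try:
            obj, end = decoder.raw_decode(buffer)
        except json.JSONDecodeError:
            break  # incomplete tail: wait for more data
        events.append(obj)
        buffer = buffer[end:].lstrip()  # raw_decode rejects leading whitespace
    return events, buffer

buf = ""
out = []
# Two events fused in one chunk, the second fragmented mid-string.
for chunk in ['{"a": 1}{"b": "x', 'y"}', '{"c": 3}']:
    events, buf = feed(buf, chunk)
    out.extend(events)
print(out)  # [{'a': 1}, {'b': 'xy'}, {'c': 3}]
```

`raw_decode` returns the first complete JSON value plus the index where it ended, so the loop drains every whole event and leaves any partial tail in the buffer for the next chunk.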

Metadata

Labels

    bug (Something isn't working)