Skip to content

[Bug]: OpenAI-completions prompt_cache_key regression — caching worked in 2026.3.x, broken in 2026.5.x #81281

@juaps

Description

@juaps

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

prompt_cache_key is no longer emitted on openai-completions requests in 2026.5.x; caching worked correctly on 2026.3.2 (confirmed on same hardware with same config).

Steps to reproduce

  1. Deploy OpenClaw with an openai-completions provider configured with compat.supportsPromptCacheKey: true and cacheRetention: "long".
  2. Send repeated identical prompts through OpenClaw → oMLX (or any completions backend with prefix caching).
  3. Observe cached_tokens in backend responses — always 0 on 2026.5.x.
  4. Downgrade to 2026.3.2 with identical config and hardware.
  5. Repeat same prompts — cached_tokens correctly populate from request 2 onward.

Expected behavior

As observed in 2026.3.2: outgoing openai-completions requests include prompt_cache_key, and the backend reports cached_tokens > 0 on repeated requests with identical prefixes.

Actual behavior

On 2026.5.x: prompt_cache_key is absent from outgoing requests. Backend reports cached_tokens: 0 on every request. Downgrading to 2026.3.2 restores caching immediately with no config changes.

OpenClaw version

2026.5.7

Operating system

Linux (Docker container/portainer on UGREEN NAS)

Install method

docker

Model

omlx/local_model (Qwen3.6-35B-A3B-RotorQuant-MLX-4bit via oMLX)

Provider / routing chain

openclaw → oMLX (openai-completions, http://cerebro-mac:8080/v1)

Additional provider/model setup details

Provider api: openai-completions
compat.supportsPromptCacheKey: true
cacheRetention: "long" (set at defaults, model, and per-model override levels)
contextInjection: "continuation-skip"
API keys and URLs redacted.

Same config file used on both 2026.3.2 (working) and 2026.5.7 (broken).
oMLX backend confirmed working: direct repeated requests to the same endpoint
produce cached_tokens > 0 from request 2 onward, bypassing OpenClaw entirely.

Logs, screenshots, and evidence

oMLX dashboard on 2026.5.7:
  Total Prefill Tokens: 748,811
  Cached Tokens: 0
  Cache Efficiency: 0.0%

Direct requests to same backend (bypassing OpenClaw):
  Request 1: cached_tokens = 0
  Request 2: cached_tokens = 71680
  Request 3: cached_tokens = 71680

Outgoing request keys observed from OpenClaw on 2026.5.7:
  model, messages, stream, max_completion_tokens, tools, reasoning_effort, metadata
  (prompt_cache_key absent)

2026.3.2 tested on same UGREEN NAS hardware: caching works correctly.
No config changes between versions.

Impact and severity

Affected: any user of openai-completions with a prefix-caching-capable backend (oMLX, llama.cpp, etc.)
Severity: high — defeats prefix caching entirely, causing full prefill on every request
Frequency: 100% reproducible on 2026.5.7, never occurs on 2026.3.2
Consequence: significantly increased latency and compute cost per request;
on local hardware this is the difference between ~3s and ~60s TTFT for long contexts.

Additional information

Last known good version: 2026.3.2
First known bad version: 2026.5.x (exact first-bad version not tested between 2026.3.2 and 2026.5.7)
No workaround found on 2026.5.x short of downgrading.
Related: #69272, PR #69411 — those addressed the transport condition;
this regression suggests something in that chain changed between 2026.3.2 and 2026.5.x.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions