Perf proposal: conversation-aware sliding cache breakpoints for explicit caching backends #12089

@MaxxxDong

Summary

I’d like to propose a small performance optimization for Hermes on backends that support explicit prompt caching.

The idea is simple:

  • keep 1 cache marker on the system message,
  • place the remaining markers (up to 3) on later, reusable conversation history rather than concentrating them near the beginning of the session.

In long, tool-heavy conversations, the most expensive part of the prompt is often no longer the earliest prefix. A front-loaded cache layout can end up protecting a small stable prefix while repeatedly recomputing a much larger reusable middle-to-late conversation backbone.
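To make the cost argument concrete, here is a rough back-of-the-envelope sketch. It assumes prefix-style caching, where a marker on message `i` lets the backend reuse everything from message 0 through `i`; the function name and the token counts are made up for illustration and are not measurements from Hermes.

```python
from typing import List


def cached_prefix_tokens(token_counts: List[int], breakpoints: List[int]) -> int:
    """Tokens covered by the deepest breakpoint.

    Assumption: a marker at index i caches messages 0..i inclusive,
    so only the deepest marker determines the best-case reuse.
    """
    if not breakpoints:
        return 0
    deepest = max(breakpoints)
    return sum(token_counts[: deepest + 1])


# Hypothetical session: a 2000-token system prompt plus 60 messages of 500 tokens.
token_counts = [2000] + [500] * 60

front_loaded = [0, 1, 2, 3]   # markers bunched at the start
sliding = [0, 20, 40, 56]     # system + n // 3, 2n // 3, n - 5 for n = 61

print(cached_prefix_tokens(token_counts, front_loaded))  # 3500
print(cached_prefix_tokens(token_counts, sliding))       # 30000
```

Under these assumed numbers, the front-loaded layout can reuse at most 3,500 of 32,000 prompt tokens, while the sliding layout can reuse up to 30,000, provided the messages before the deepest marker are unchanged between requests.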

Proposal

Use a conversation-aware sliding breakpoint strategy.

A simple first heuristic could target positions near:

  • len(messages) // 3
  • (len(messages) * 2) // 3
  • len(messages) - 5
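As a minimal sketch, the three targets above could be computed as follows. The function name, the `tail` parameter, and the clamping and de-duplication for short conversations are my own assumptions, as is the convention that index 0 holds the system message.

```python
from typing import List


def breakpoint_targets(n_messages: int, tail: int = 5) -> List[int]:
    """Target indices for the non-system cache markers.

    Assumption: index 0 is the system message, which has its own marker,
    so targets are clamped to the range 1..n_messages - 1.
    """
    if n_messages < 2:
        return []  # nothing besides the system message to mark
    raw = [n_messages // 3, (n_messages * 2) // 3, n_messages - tail]
    targets: List[int] = []
    for t in raw:
        t = max(1, min(t, n_messages - 1))
        if t not in targets:  # short conversations can collapse targets
            targets.append(t)
    return sorted(targets)


print(breakpoint_targets(30))  # [10, 20, 25]
print(breakpoint_targets(6))   # [1, 2, 4]
```

For short conversations the targets can collide or go negative, which is why this sketch clamps and de-duplicates; a real implementation might prefer to fall back to the existing layout below some minimum length.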

Around each target, Hermes should prefer messages that are:

  • user or assistant
  • text-bearing
  • relatively stable
  • not giant tool-output blocks
  • not inside the final highly volatile tail
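One possible way to turn these preferences into a selection step is sketched below. It assumes OpenAI-style message dicts with `role` and `content` keys; `MAX_CHARS`, `VOLATILE_TAIL`, `window`, and all function names are hypothetical. "Relatively stable" is only approximated here by excluding the tail, since stability cannot be read off a single message.

```python
from typing import Any, Dict, List, Optional

MAX_CHARS = 4000     # hypothetical cap to skip giant pasted tool output
VOLATILE_TAIL = 4    # hypothetical size of the unstable tail

Message = Dict[str, Any]


def text_of(msg: Message) -> str:
    """Extract plain text from a string or a list of content blocks."""
    content = msg.get("content")
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        return "".join(
            block.get("text", "")
            for block in content
            if isinstance(block, dict) and block.get("type") == "text"
        )
    return ""


def is_candidate(messages: List[Message], i: int) -> bool:
    """Apply the preferences: role, text-bearing, size, not in the tail."""
    if i >= len(messages) - VOLATILE_TAIL:
        return False
    if messages[i].get("role") not in ("user", "assistant"):
        return False
    text = text_of(messages[i])
    return bool(text.strip()) and len(text) <= MAX_CHARS


def pick_near(messages: List[Message], target: int, window: int = 3) -> Optional[int]:
    """Return the eligible index closest to target, preferring earlier on ties."""
    for offset in range(window + 1):
        for i in (target - offset, target + offset):
            if 1 <= i < len(messages) and is_candidate(messages, i):
                return i
    return None  # no eligible message near this target
```

Preferring the earlier message on ties is a deliberate choice in this sketch: an earlier breakpoint covers slightly less of the prompt but is less likely to sit in content that is still changing.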

So the strategy becomes:

Keep the system prompt cached, then place the remaining markers in the later reusable backbone of the conversation while avoiding the unstable tail.

Why this seems useful

This is not a proposal for a major architecture change.

It does not require:

  • changing providers or API families
  • introducing external memory
  • adding a summarization pipeline
  • mutating old messages
  • redesigning Hermes context management

It only changes where the same limited number of cache markers are placed.

That makes it complementary to existing Hermes caching/perf work:

  • issues about cache invalidation from message differences
  • issues about unstable system prompts
  • issues about cache-friendly tool result handling

Those are about preserving cache validity.
This proposal is about improving cache usefulness once validity already exists.

Open questions

  1. Should this only apply to providers/backends where explicit prompt caching is available?
  2. Should message selection use a simple heuristic or a more explicit scoring function?
  3. Should large tool outputs always be excluded from breakpoint placement?
  4. Should this be the default strategy or an optional alternative?

If this direction sounds reasonable, I’d be happy to help refine the heuristic into something easier to evaluate or implement.

Labels: P3 (Low: cosmetic, nice to have), type/perf (Performance improvement or optimization)