Perf proposal: conversation-aware sliding cache breakpoints for explicit caching backends #12089

## Summary

I’d like to propose a small performance optimization for Hermes on backends that support **explicit prompt caching**.

The idea is simple:

- keep **1 cache marker** on the `system` message,
- place the remaining **up to 3 markers** on later, reusable conversation history instead of concentrating them near the beginning of the session.

In long, tool-heavy conversations, the most expensive part of the prompt is often no longer the earliest prefix. A front-loaded cache layout can end up protecting a small stable prefix while repeatedly recomputing a much larger reusable middle-to-late conversation backbone.

## Proposal

Use a **conversation-aware sliding breakpoint strategy**.

A simple first heuristic could target positions near:

- `len(messages) // 3`
- `(len(messages) * 2) // 3`
- `len(messages) - 5`
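
To make the three targets concrete, here is a minimal sketch of how they might be computed. The function name `breakpoint_targets` and the `tail` parameter are hypothetical, and the sketch assumes index 0 holds the `system` message, which keeps its own marker. It also drops targets that fall outside the conversation, since `len(messages) - 5` goes negative for very short sessions.

```python
def breakpoint_targets(n_messages: int, tail: int = 5) -> list[int]:
    """Return target indices for the (up to 3) non-system cache markers.

    Assumption: index 0 is the system message and is handled separately.
    """
    raw = [
        n_messages // 3,
        (n_messages * 2) // 3,
        n_messages - tail,
    ]
    # Drop targets that collide with the system message or fall outside
    # the conversation, and de-duplicate targets that coincide.
    valid = {i for i in raw if 1 <= i < n_messages}
    return sorted(valid)
```

For a 30-message conversation this yields targets at indices 10, 20, and 25; for a 4-message conversation only indices 1 and 2 survive.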

Around each target, Hermes should prefer messages that are:

- `user` or `assistant`
- text-bearing
- relatively stable
- not giant tool-output blocks
- not inside the final highly volatile tail

So the strategy becomes:

> keep the system prompt cached, then place the remaining markers into the later reusable backbone of the conversation while avoiding the unstable tail.
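
The selection step could look roughly like the sketch below. Everything here is an assumption for illustration: messages are taken to be dicts with `role` and string `content` keys, `window`, `tail`, and `max_chars` are made-up tuning knobs, and the "volatile tail" is treated as every index after `len(messages) - tail`. "Relatively stable" is not checked, since how Hermes would measure stability is an open design question.

```python
def is_eligible(msg, index, n_messages, tail=5, max_chars=4000):
    """Check the placement preferences that can be tested mechanically."""
    if index > n_messages - tail:
        return False  # inside the volatile tail
    if msg.get("role") not in ("user", "assistant"):
        return False  # skips system and tool messages
    content = msg.get("content")
    if not isinstance(content, str) or not content.strip():
        return False  # not text-bearing
    return len(content) <= max_chars  # avoid very large blocks


def pick_breakpoints(messages, targets, window=3, tail=5, max_chars=4000):
    """For each target, pick the nearest eligible, not-yet-used index."""
    n = len(messages)
    chosen = []
    for target in targets:
        lo = max(1, target - window)  # never pick index 0 (system)
        hi = min(n, target + window + 1)
        # Closest first; ties go to the earlier (more stable) message.
        candidates = sorted(range(lo, hi), key=lambda i: (abs(i - target), i))
        for i in candidates:
            if i not in chosen and is_eligible(messages[i], i, n, tail, max_chars):
                chosen.append(i)
                break
    return sorted(chosen)
```

If no eligible message exists within the window, that marker is simply left unused in this sketch; falling back to the current front-loaded placement would be another reasonable choice.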

## Why this seems useful

This is **not** a proposal for a major architecture change.

It does **not** require:

- changing providers or API families
- introducing external memory
- adding a summarization pipeline
- mutating old messages
- redesigning Hermes context management

It only changes **where the same limited number of cache markers are placed**.
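
As a rough illustration of how narrow the change is, the sketch below attaches markers at chosen indices and returns a new list, leaving the stored history untouched. The helper name `with_cache_markers` and the message shape are hypothetical. The `cache_control` field with `{"type": "ephemeral"}` on a content block mirrors the shape used by Anthropic's Messages API; other explicit-caching backends may expect something different.

```python
import copy


def with_cache_markers(messages, indices, marker=None):
    """Return a copy of `messages` with cache markers at `indices`.

    The input list and its message dicts are not modified.
    """
    marker = marker or {"type": "ephemeral"}
    wanted = set(indices)
    out = []
    for i, msg in enumerate(messages):
        if i not in wanted:
            out.append(msg)  # unmarked messages are passed through as-is
            continue
        content = msg["content"]
        if isinstance(content, str):
            blocks = [{"type": "text", "text": content}]
        else:
            blocks = copy.deepcopy(content)
        if blocks:
            # The marker goes on the last content block of the message.
            blocks[-1]["cache_control"] = dict(marker)
        new_msg = dict(msg)
        new_msg["content"] = blocks
        out.append(new_msg)
    return out
```

Only the outgoing request payload differs; which indices are passed in is the single thing this proposal would change.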

That makes it complementary to existing Hermes caching/perf work:

- issues about cache invalidation from message differences
- issues about unstable system prompts
- issues about cache-friendly tool result handling

Those issues are about preserving cache validity.
This proposal is about improving cache usefulness once validity already exists.

## Open questions

1. Should this only apply to providers/backends where explicit prompt caching is available?
2. Should message selection use a simple heuristic or a more explicit scoring function?
3. Should large tool outputs always be excluded from breakpoint placement?
4. Should this be the default strategy or an optional alternative?

If this direction sounds reasonable, I’d be happy to help refine the heuristic into something easier to evaluate or implement.

