Summary
I’d like to propose a small performance optimization for Hermes on backends that support explicit prompt caching.
The idea is simple:
- keep 1 cache marker on the system message,
- place the remaining markers (up to 3) on later, reusable conversation history instead of concentrating them near the beginning of the session.
In long, tool-heavy conversations, the most expensive part of the prompt is often no longer the earliest prefix. A front-loaded cache layout can end up protecting a small stable prefix while repeatedly recomputing a much larger reusable middle-to-late conversation backbone.
Proposal
Use a conversation-aware sliding breakpoint strategy.
A simple first heuristic could target positions near:
- len(messages) // 3
- (len(messages) * 2) // 3
- len(messages) - 5
Around each target, Hermes should prefer messages that are:
- user or assistant
- text-bearing
- relatively stable
- not giant tool-output blocks
- not inside the final highly volatile tail
So the strategy becomes:
keep the system prompt cached, then place the remaining markers into the later reusable backbone of the conversation while avoiding the unstable tail.
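To make the heuristic easier to evaluate, here is a minimal sketch of how the selection could work. It is not Hermes code: the function names, the message shape (dicts with `role` and a string `content`), and the constants `TAIL_SIZE`, `MAX_TEXT_CHARS`, and `SEARCH_RADIUS` are all assumptions made for illustration. "Relatively stable" is not modelled here; only role, text content, size, and tail exclusion are checked.

```python
from typing import Dict, List

# All three constants are illustrative assumptions, not Hermes settings.
TAIL_SIZE = 4          # last N messages treated as the volatile tail
MAX_TEXT_CHARS = 4000  # anything longer is treated as a "giant" block
SEARCH_RADIUS = 3      # how far to look around each target position


def is_candidate(msg: Dict) -> bool:
    """True for user/assistant messages with non-empty, moderately sized text."""
    content = msg.get("content")
    return (
        msg.get("role") in ("user", "assistant")
        and isinstance(content, str)
        and 0 < len(content) <= MAX_TEXT_CHARS
    )


def pick_breakpoints(messages: List[Dict], max_markers: int = 4) -> List[int]:
    """Return sorted message indices that should carry a cache marker."""
    n = len(messages)
    chosen: List[int] = []

    # One marker stays on the system message, if there is one.
    has_system = n > 0 and messages[0].get("role") == "system"
    if has_system:
        chosen.append(0)

    first = 1 if has_system else 0
    tail_start = max(n - TAIL_SIZE, 0)  # indices >= tail_start are off limits
    targets = [n // 3, (n * 2) // 3, n - 5]

    # Try offsets 0, -1, +1, -2, +2, ... so the nearest candidate wins.
    offsets = sorted(range(-SEARCH_RADIUS, SEARCH_RADIUS + 1),
                     key=lambda d: (abs(d), d))

    for target in targets:
        if len(chosen) >= max_markers:
            break
        for offset in offsets:
            i = target + offset
            if (first <= i < tail_start
                    and i not in chosen
                    and is_candidate(messages[i])):
                chosen.append(i)
                break

    return sorted(chosen)
```

In this sketch, a 30-message conversation that starts with a system message and then alternates short user and assistant turns gets markers at indices 0, 10, 20, and 25. Very short conversations fall back to the system marker alone, because every other message is inside the tail.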
Why this seems useful
This is not a proposal for a major architecture change.
It does not require:
- changing providers or API families
- introducing external memory
- adding a summarization pipeline
- mutating old messages
- redesigning Hermes context management
It only changes where the same limited number of cache markers are placed.
That makes it complementary to existing Hermes caching/perf work:
- issues about cache invalidation from message differences
- issues about unstable system prompts
- issues about cache-friendly tool result handling
Those are about preserving cache validity.
This proposal is about improving cache usefulness once validity already exists.
Open questions
- Should this only apply to providers/backends where explicit prompt caching is available?
- Should message selection use a simple heuristic or a more explicit scoring function?
- Should large tool outputs always be excluded from breakpoint placement?
- Should this be the default strategy or an optional alternative?
If this direction sounds reasonable, I’d be happy to help refine the heuristic into something easier to evaluate or implement.