Decouple auto-memory recall from the main-agent request path (follow-up to #3759) #3761

@tanzhenxin

What would you like to be added?

Restructure the auto-memory recall flow so that the main-agent request never waits on the recall selector. Today the selector is fired off as a promise but is then unconditionally awaited just before the main request goes out. The proposal is to consume the recall result only when it is already available, and otherwise let the turn proceed and inject the memory later — or skip it entirely for that turn — without ever blocking the user-visible path.
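One way to picture "consume only when already available" is a small wrapper that records when the selector promise settles and exposes a synchronous read. This is a minimal sketch, not the project's code: `Prefetch`, `trackPrefetch`, and `buildRequestParts` are hypothetical names, and the memory payload is assumed to be a list of strings.

```typescript
// Hypothetical helper: none of these names come from the codebase.
type Prefetch<T> = {
  /** Value if the promise has already fulfilled, otherwise undefined. Never blocks. */
  peek(): T | undefined;
  /** True once the promise has fulfilled or rejected. */
  settled(): boolean;
};

function trackPrefetch<T>(promise: Promise<T>): Prefetch<T> {
  let done = false;
  let value: T | undefined;
  promise.then(
    (v) => {
      value = v;
      done = true;
    },
    () => {
      // A failed selector call counts as settled with nothing to surface.
      done = true;
    },
  );
  return { peek: () => value, settled: () => done };
}

// Assumed call site: the request is assembled from whatever is available right now.
function buildRequestParts(prompt: string, recall: Prefetch<string[]>): string[] {
  const memories = recall.peek() ?? []; // synchronous read, no await
  return [...memories, prompt];
}
```

The caller never awaits the selector promise, so a slow or failed selector call costs the turn nothing; the turn simply goes out without recalled memory.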

Why is this needed?

The current architecture couples the user-visible turn latency to the recall selector's response time. However long the selector takes to return (model latency, a network blip, transient API slowness), that time is added to the delay before the user's prompt reaches the main model.

Issue #3759 makes this concrete: when the selector consistently times out at 5 seconds, every turn is delayed by close to 5 seconds. Pulling the selector onto fastModel (the proposed fix in #3759) brings the typical case down to the prep-work window and removes the visible penalty most of the time. But it does not change the underlying coupling — any individual slow call still delays the turn.

A non-blocking consumption model removes that coupling entirely. The recall becomes a best-effort enrichment: when it lands in time, it shows up; when it does not, the turn proceeds without it and the user never notices. This matches how other "nice to have, not required" enrichments are typically wired in agent loops.

Additional context

Relationship to #3759. Issue #3759 is the immediate, narrow fix: route the selector through fastModel so the typical call completes quickly. This issue is the deeper architectural fix: regardless of how fast the selector is, the main-agent path should never await it. The two are complementary — #3759 should ship first as a low-risk patch; this issue can be picked up afterward if residual latency from the recall path is still observed, or as a hardening change so future regressions in selector latency cannot reintroduce the per-turn tax.

Reference implementation. The claude-code project handles this pattern by polling whether the prefetch has settled before consuming it, and injecting any returned memory into the agent loop as a tool-result-style attachment rather than as a system reminder prepended to the first model request. The injection model and the non-blocking consumption are linked: late arrival is only useful if there is a path to inject memory after the first model call has already started. This is one viable shape; alternatives (e.g., a short bounded wait followed by graceful skip) may also be acceptable depending on how invasive the implementor wants the refactor to be.
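The link between late arrival and the injection path can be sketched as an agent loop that polls the prefetch before each model call and appends any settled memory as an attachment message. This is an illustrative sketch only, not the claude-code implementation: `Message`, `MemoryPrefetch`, `runTurn`, and the `callModel` callback are all hypothetical, and the attachment shape is an assumption.

```typescript
// All names here are hypothetical illustrations.
type Message =
  | { role: "user" | "assistant"; text: string }
  | { role: "attachment"; kind: "memory"; text: string };

class MemoryPrefetch {
  private result: string[] | undefined;
  private consumed = false;

  constructor(selector: Promise<string[]>) {
    selector.then(
      (memories) => {
        this.result = memories;
      },
      () => {
        this.result = []; // selector failure: nothing to inject
      },
    );
  }

  /** Returns the memories at most once, and only if the selector has already settled. */
  takeIfSettled(): string[] | undefined {
    if (this.consumed || this.result === undefined) return undefined;
    this.consumed = true;
    return this.result;
  }
}

async function runTurn(
  prompt: string,
  prefetch: MemoryPrefetch,
  callModel: (history: Message[]) => Promise<{ text: string; done: boolean }>,
  maxSteps = 8,
): Promise<Message[]> {
  const history: Message[] = [{ role: "user", text: prompt }];
  for (let step = 0; step < maxSteps; step++) {
    // Poll before every model call; the selector promise itself is never awaited.
    const memories = prefetch.takeIfSettled();
    if (memories && memories.length > 0) {
      history.push({ role: "attachment", kind: "memory", text: memories.join("\n") });
    }
    const reply = await callModel(history);
    history.push({ role: "assistant", text: reply.text });
    if (reply.done) break;
  }
  return history;
}
```

If the selector settles after the first model call has started, the memory is picked up on the next loop iteration; if it never settles during the turn, the turn completes without it.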

Adjacent improvements worth considering in scope. The same prefetch path could pick up two cheap skip gates that reduce wasted side queries without changing UX in any case where the user actually wanted recall:

  • Skip recall when the user prompt is a single token / has no whitespace — too little context for meaningful selection.
  • Skip recall once a per-session budget of surfaced memory bytes has been reached — bounds runaway recall on long sessions.

These are independent of the non-blocking refactor and could land as separate small PRs, but this is the right time to consider them since the recall path is being touched.
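The two gates could be expressed as a single predicate checked before the selector is fired. The sketch below is only illustrative: `shouldSkipRecall`, `memoryBytes`, and the 64 KiB default budget are assumptions, since the issue does not specify a budget value or how surfaced bytes are counted (UTF-8 length is assumed here).

```typescript
// Hypothetical names and threshold; the budget value is a placeholder.
const SESSION_MEMORY_BYTE_BUDGET = 64 * 1024;

function shouldSkipRecall(
  prompt: string,
  surfacedBytes: number,
  budget: number = SESSION_MEMORY_BYTE_BUDGET,
): boolean {
  const trimmed = prompt.trim();
  // Gate 1: a prompt with no internal whitespace is a single token (or empty).
  if (!/\s/.test(trimmed)) return true;
  // Gate 2: the session has already surfaced its budget of memory bytes.
  if (surfacedBytes >= budget) return true;
  return false;
}

// Assumed accounting: UTF-8 byte length of every memory surfaced so far.
function memoryBytes(memories: string[]): number {
  const encoder = new TextEncoder();
  return memories.reduce((sum, m) => sum + encoder.encode(m).length, 0);
}
```

A caller would add `memoryBytes(...)` of each injected batch to a per-session counter and pass that counter as `surfacedBytes` on the next turn.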

Out of scope. This issue is specifically about how the recall result is consumed and injected. It is not about how the selector decides which memories are relevant, the selection model, or the recall feature being on/off — those are governed by existing settings and other issues.
