Episodic memory: track retrieval outcomes so the palace learns what's useful

### Problem

The palace is great at organizing *what you know* — facts, events, preferences, advice across wings and rooms. But it doesn't track *what happened when a memory was used*. If a closet or drawer gets retrieved during a conversation and the user says "no, that's wrong" or "yes, exactly", that signal disappears.

Over time this means retrieval can't distinguish between memories that consistently help and ones that consistently mislead.

### Proposal

Add an **episode** layer that records retrieval-and-outcome pairs:

```
situation  →  what memories were retrieved  →  what action/answer resulted  →  how the user responded
```

Each episode carries a **utility score** (0.0–1.0) that adjusts based on feedback:

```python
FEEDBACK_DELTAS = {
    "confirmed":  +0.2,   # user said this was right
    "corrected":  -0.1,   # user tweaked it — still partially useful
    "rejected":   -0.3,   # user said no
    "undone":     -0.4,   # action was taken and user had to reverse it
    "ignored":    -0.05,  # no engagement — weak signal
}

new_utility = clamp(old_utility + delta, 0.0, 1.0)
```

The key asymmetry: **undone > rejected**. If the system acted on a memory and the user had to clean up, that's worse than the user catching it before anything happened. This distinction matters for agentic use cases where specialist agents take actions.

### What an episode captures

- **Situation**: domain, topic, one-line summary
- **Context snapshot**: time of day, day of week, which preferences/patterns were active — because the same query at 9am vs 11pm may have different right answers
- **Retrieved memories**: which drawers/closets were pulled
- **Outcome**: what happened as a result
- **Feedback**: user response + utility delta

### How it helps retrieval

Episodes become a signal in retrieval ranking. When multiple drawers match a query, prefer ones linked to episodes with high utility. When a drawer is consistently linked to rejected episodes, demote it. A simple weighted score:

```python
relevance = (
    0.3 * domain_match
  + 0.3 * topic_match
  + 0.2 * utility_score      # from episode history
  + 0.1 * (created < 7 days)
  + 0.1 * (created < 24 hours)
)
```

This biases toward recent, topic-matched, proven-useful memories.

### Where it fits in the palace

Episodes aren't drawers — they don't contain facts. They sit alongside the palace as a parallel store, linked to drawers by ID. Think of them as the palace's journal: "I used this memory in this situation and it went well/badly."

### Reference

We built this in a [TypeScript port](https://github.com/jayzalowitz/skytwin/tree/main/packages/mempalace). The core is ~200 lines and storage-agnostic. Happy to help with a Python version if this direction interests you.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Episodic memory: track retrieval outcomes so the palace learns what's useful #10

Problem

Proposal

What an episode captures

How it helps retrieval

Where it fits in the palace

Reference

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Episodic memory: track retrieval outcomes so the palace learns what's useful #10

Description

Problem

Proposal

What an episode captures

How it helps retrieval

Where it fits in the palace

Reference

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions