Pinned
Reasoning models think hard — but all that thinking fills up your KV cache fast.
Memento fixes this: the model compresses its own chain-of-thought mid-generation, flushing old KV entries after each block. 2-3× less peak KV cache, ~2× throughput — accuracy largely preserved.
















