
Eval self.left_padding whenever it is updated in BatchRotatingKVCache #960

Open
rltakashige wants to merge 1 commit into ml-explore:main from rltakashige:leo/eval-left-padding-in-batched-rotation


@rltakashige rltakashige commented Mar 7, 2026

Motivation:

When batching with GPT OSS, I ran into the following error (see the attached log.txt): RuntimeError: [metal::malloc] Resource limit (499000) exceeded. On investigation, this happens for any model that uses a rotating KV cache.

Steps to Reproduce:
Run the attached reproduce_batch_kvcache_leak.py with any model that uses sliding window attention: python reproduce_batch_kvcache_leak.py --model <model path> --crash. The script runs two requests together through a batch generator for 50000 steps. I have primarily been using GPT OSS 120B MXFP4 Q8.

Passing --add-eval adds an eval of the left padding, which prevents the error from occurring.

Issue and Proposed Changes:
I think the issue is that the left padding is never evaluated, so buffers accumulate without bound.
I am not sure whether you'd prefer the evals to live outside this function. From testing, though, the left padding only needs to be evaluated when it is updated.
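
To illustrate the hypothesis above, here is a minimal toy model in plain Python. It is not MLX and not the code in this PR: LazyScalar, pending_depth, and run are hypothetical names invented for the sketch, and the "+ 1" update is only a stand-in for whatever per-step update the cache applies. It shows how a value that is updated every step but never evaluated can keep an ever-growing chain of pending work alive, while evaluating on each update keeps that chain bounded.

```python
class LazyScalar:
    """Toy stand-in for a lazily evaluated array (hypothetical, not an MLX type)."""

    def __init__(self, value=None, parent=None, delta=0):
        self.value = value    # concrete result, or None while still pending
        self.parent = parent  # upstream node kept alive until evaluation
        self.delta = delta    # amount to add to the parent's value

    def __add__(self, delta):
        # Record the operation instead of computing it.
        return LazyScalar(parent=self, delta=delta)

    def eval(self):
        # Walk back to the nearest evaluated ancestor, then compute forward.
        chain = []
        node = self
        while node.value is None:
            chain.append(node)
            node = node.parent
        total = node.value
        for pending in reversed(chain):
            total += pending.delta
            pending.value = total
            pending.parent = None  # release the upstream graph
        return self.value


def pending_depth(node):
    """Count the unevaluated nodes reachable from `node`."""
    depth = 0
    while node is not None and node.value is None:
        depth += 1
        node = node.parent
    return depth


def run(steps, eval_each_update):
    """Update a lazy value `steps` times; return (max pending depth, final value)."""
    left_padding = LazyScalar(value=0)
    max_depth = 0
    for _ in range(steps):
        left_padding = left_padding + 1  # stand-in for a per-step update
        if eval_each_update:
            left_padding.eval()
        max_depth = max(max_depth, pending_depth(left_padding))
    return max_depth, left_padding.eval()


if __name__ == "__main__":
    print("no eval:  ", run(1000, eval_each_update=False))
    print("with eval:", run(1000, eval_each_update=True))
```

In this toy, the run without evaluation holds one pending node per step, whereas the run that evaluates on every update never holds any, and both produce the same final value. Whether the real allocator behaves the same way is an assumption; the sketch only mirrors the reasoning given above.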

rltakashige added a commit to exo-explore/exo that referenced this pull request Mar 7, 2026