[BUGFIX] fix radix cache memory consumption to avoid OOM#17191

Merged
hnyls2002 merged 3 commits into main from yizhang/fix-radix-cache-memory-expand on Jan 17, 2026

Conversation

@yizhang2077
Collaborator

@yizhang2077 yizhang2077 commented Jan 16, 2026

Motivation

When the radix tree splits a node, new_node and child both hold views of the inserted value tensor. As a result, when child is evicted, the memory behind child.value does not return to the torch memory pool until every node along the path that holds a view of that value has been evicted. This can cause a dynamic CUDA OOM when the tree shares long prefixes.
The same bug may also affect other radix cache types.
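
The view behaviour described above can be checked in isolation. The following is a minimal sketch, not code from this PR: it uses a CPU tensor for portability, and the names `value`, `view`, and `copy` are illustrative only. It shows that a slice shares (and therefore pins) the whole original storage, while a cloned slice owns a storage sized for just its own elements.

```python
import torch

# Stand-in for a freshly inserted value covering a long prefix (illustrative size).
value = torch.arange(1024, dtype=torch.int64)

view = value[:4]          # slicing returns a view onto value's storage
copy = value[:4].clone()  # clone() allocates a separate storage for 4 elements

# A view reports the same underlying storage as its base tensor.
shares_storage = (
    view.untyped_storage().data_ptr() == value.untyped_storage().data_ptr()
)

# Bytes kept alive by holding only `view` versus only `copy`.
full_bytes = value.untyped_storage().nbytes()
view_pinned = view.untyped_storage().nbytes()
copy_pinned = copy.untyped_storage().nbytes()

print(shares_storage, full_bytes, view_pinned, copy_pinned)
```

As long as any such view is reachable, the allocator cannot reuse the full allocation, which is the retention pattern the motivation describes.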

Modifications

Use clone() when slicing so that the original value can be released. This may add some overhead, but I think it is small and can be hidden by the overlap schedule.

Accuracy Tests

See the unit test. Before this PR, torch_allocated_memory / tree total_size is more than 400, which is abnormal. After this PR, it is less than 10 (in theory it should equal the element size of the value dtype).
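
To make the ratio concrete, here is a toy model of the measurement. It is a sketch under stated assumptions, not the PR's unit test: `retained_bytes` and `surviving_values` are hypothetical helpers, tensors live on the CPU, and the sizes are arbitrary, so the resulting numbers are not the figures quoted above. Each simulated insert allocates a long value, and only a short prefix survives eviction.

```python
import torch


def retained_bytes(tensors):
    """Sum the bytes of every distinct storage kept alive by `tensors`."""
    seen = {}
    for t in tensors:
        storage = t.untyped_storage()
        seen[storage.data_ptr()] = storage.nbytes()
    return sum(seen.values())


def surviving_values(use_clone, num_inserts=64, insert_len=512, keep_len=1):
    """Hypothetical scenario: after splits and evictions, only a short
    prefix of each inserted value is still referenced by the tree."""
    kept = []
    for _ in range(num_inserts):
        value = torch.zeros(insert_len, dtype=torch.int64)
        prefix = value[:keep_len]
        kept.append(prefix.clone() if use_clone else prefix)
    return kept


def bytes_per_entry(tensors):
    """Retained bytes divided by the number of entries still in the tree."""
    total_entries = sum(t.numel() for t in tensors)
    return retained_bytes(tensors) / total_entries


ratio_with_views = bytes_per_entry(surviving_values(use_clone=False))
ratio_with_clone = bytes_per_entry(surviving_values(use_clone=True))
print(ratio_with_views, ratio_with_clone)
```

In this toy setup the cloned case comes out at the int64 element size, while the view case scales with the length of the original inserts.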

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@yizhang2077
Collaborator Author

/tag-run-ci-label

@yizhang2077 yizhang2077 changed the title from "[BUGFIX] fix radix cache memory consumption" to "[BUGFIX] fix radix cache memory consumption to avoid OOM" on Jan 16, 2026
@@ -654,10 +654,10 @@ def _split_node(self, key: RadixKey, child: TreeNode, split_len: int):
 new_node.parent = child.parent
 new_node.lock_ref = child.lock_ref
 new_node.key = child.key[:split_len]
-new_node.value = child.value[:split_len]
+new_node.value = child.value[:split_len].clone()
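
For readers who want to see the change in context, below is a minimal, self-contained sketch of a split that clones its slices. `Node` and `split_node` are hypothetical stand-ins for `TreeNode` and `_split_node`, reduced to the fields visible in the diff. Cloning the remainder kept by `child` is an assumption of this sketch; the diff above only shows the `new_node.value` line.

```python
import torch


class Node:
    """Hypothetical stand-in for TreeNode with only the fields used here."""

    def __init__(self, key, value, parent=None):
        self.key = key
        self.value = value
        self.parent = parent
        self.lock_ref = 0


def split_node(child, split_len):
    """Split `child` at split_len and return the new parent node."""
    new_node = Node(
        key=child.key[:split_len],
        # clone() gives new_node its own storage instead of a view.
        value=child.value[:split_len].clone(),
        parent=child.parent,
    )
    new_node.lock_ref = child.lock_ref
    child.key = child.key[split_len:]
    # Assumption (not shown in the diff): the remainder is cloned too,
    # so child no longer pins the original allocation either.
    child.value = child.value[split_len:].clone()
    child.parent = new_node
    return new_node


original = torch.arange(8, dtype=torch.int64)
child = Node(key=list(range(8)), value=original)
new_node = split_node(child, 3)
print(new_node.value.tolist(), child.value.tolist())
```

After the split, neither node references the original tensor, so evicting one node frees exactly that node's entries.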
Collaborator

We need to fix this for all radix cache variants.

Collaborator

Fixed.

@Swipe4057
Contributor

Can I test it?

@yizhang2077
Collaborator Author

yizhang2077 commented Jan 16, 2026

> Can I test it?

You can run the unit test and print the allocated memory. The bad case can be reproduced easily.

@ispobock
Collaborator

/tag-run-ci-label

@hnyls2002 hnyls2002 merged commit 737a118 into main Jan 17, 2026
294 of 313 checks passed
@hnyls2002 hnyls2002 deleted the yizhang/fix-radix-cache-memory-expand branch January 17, 2026 08:47
@Swipe4057
Contributor

Swipe4057 commented Jan 17, 2026

> Can I test it?

> you can run unit test and print allocated memory. Bad case can be reproduced easily

The test ran for 15 minutes (2x H100).
start:
image

end:
image

OOM still happened, but later.

5 participants