
make radix cache deterministic #10721

Merged
hnyls2002 merged 8 commits into sgl-project:main from skyzh:skyzh/det-radix
Oct 14, 2025

Conversation

@skyzh
Contributor

@skyzh skyzh commented Sep 22, 2025

Motivation

This patch tries to add determinism to the radix cache. Part of #10278.

Modifications

The patch mainly modifies the match_prefix logic:

  • If it's called from the scheduler, always truncate the returned value to a multiple of SPLIT_SIZE.
  • Otherwise, insert into the tree as normal.

The algorithm is two-pass:

  • In the first pass, do the prefix matching as before.
  • If we need to align to the split size, the second pass uses the nodes collected in the first pass to split exactly at the split-size boundary and returns that node.

Note that this can do two splits in one match_prefix call: one during the normal match (splitting wherever the prefix match ends), and another when we split at the split_size boundary. It seems to work as desired, but please let me know if it doesn't (memory leak?).

This patch also changes the prefix sizes in test_determinism so that prefix matches in the cache are more likely to occur.
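The alignment step described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `SPLIT_SIZE`, the function name, and the `from_scheduler` flag are all assumed names.

```python
# Hedged sketch of the scheduler-side alignment: matches coming from the
# scheduler are rounded down to a multiple of SPLIT_SIZE, so the prefill
# split point is always the same regardless of how the cache tree happens
# to be fragmented. Names here are illustrative, not SGLang's actual API.
SPLIT_SIZE = 1024

def align_matched_prefix(matched_len: int, from_scheduler: bool) -> int:
    """Return the prefix length the scheduler should actually reuse."""
    if not from_scheduler:
        # Normal insertion path: keep the full match.
        return matched_len
    # Round down so every scheduler-visible match is a multiple of SPLIT_SIZE.
    return (matched_len // SPLIT_SIZE) * SPLIT_SIZE

print(align_matched_prefix(5000, True))   # -> 4096
print(align_matched_prefix(5000, False))  # -> 5000
```

With this rounding, a request whose cached prefix happens to end at an odd position (say 5000 tokens) is treated as if only 4096 tokens were cached, which keeps the prefill/decode boundary deterministic.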

Accuracy Tests

python3 -m sglang.test.test_deterministic --test-mode prefix --temperature 0.7

gives 1 unique sample in all testable backends with Qwen-8B

Benchmarking and Profiling

Checklist

@skyzh skyzh mentioned this pull request Sep 22, 2025
4 tasks
@skyzh
Contributor Author

skyzh commented Sep 22, 2025

slightly updated the test to use larger prefixes so that we can trigger the code path that splits at SPLIT_SIZE=1024

Testing Trial 12 with batch size 12, # prefix length 1: 3, # prefix length 4097: 2, # prefix length 5120: 2, # prefix length 8191: 5,
Prompt 0 with prefix length 1: total samples: 22, Unique samples: 1
Prompt 1 with prefix length 4097: total samples: 11, Unique samples: 1
Prompt 2 with prefix length 5120: total samples: 19, Unique samples: 3
Prompt 3 with prefix length 8191: total samples: 26, Unique samples: 3

Now we're seeing some non-determinism, but it's still better than not having the SPLIT_SIZE code path at all.

@skyzh
Contributor Author

skyzh commented Sep 22, 2025

well, this non-determinism doesn't come from the radix cache: with --disable-radix-cache I can still get:

Testing Trial 12 with batch size 12, # prefix length 1: 4, # prefix length 4097: 2, # prefix length 5120: 2, # prefix length 8191: 4,
Prompt 0 with prefix length 1: total samples: 21, Unique samples: 1
Prompt 1 with prefix length 4097: total samples: 20, Unique samples: 1
Prompt 2 with prefix length 5120: total samples: 15, Unique samples: 3
Prompt 3 with prefix length 8191: total samples: 22, Unique samples: 2

with my slightly modified prefix lengths. I guess we have issues in other parts of the code.

@skyzh
Contributor Author

skyzh commented Sep 22, 2025

(or SGLANG_FLASHINFER_PREFILL_SPLIT_TILE_SIZE=1024 is not practical for testing this)

@skyzh
Contributor Author

skyzh commented Sep 23, 2025

This patch also includes #10637; I can close that one and keep this one.

Comment thread python/sglang/srt/mem_cache/radix_cache.py Outdated
Comment thread python/sglang/srt/mem_cache/radix_cache.py Outdated
Comment thread python/sglang/srt/managers/scheduler.py
Comment thread python/sglang/srt/mem_cache/radix_cache.py Outdated
Comment thread python/sglang/srt/mem_cache/radix_cache.py Outdated
Comment thread python/sglang/srt/mem_cache/radix_cache.py
Comment thread python/sglang/srt/mem_cache/radix_cache.py Outdated
@Fridge003
Collaborator

Also please post the result of

python3 -m sglang.test.test_deterministic --test-mode single
python3 -m sglang.test.test_deterministic --test-mode mixed
python3 -m sglang.test.test_deterministic --test-mode prefix

on both flashinfer and triton backends

When doing the testing, please make sure the align size is the same as prefill_split_size/prefill_truncation_size.

@Fridge003
Collaborator

Fridge003 commented Sep 24, 2025

@hnyls2002 @xiezhq-hermann
This code changes some of the logic in the radix cache; can you please help review it together, especially regarding potential memory leaks?

@hnyls2002 hnyls2002 self-assigned this Sep 24, 2025
@skyzh
Contributor Author

skyzh commented Sep 25, 2025

python3 -m sglang.launch_server --model-path Qwen/Qwen3-8B --attention-backend flashinfer --enable-deterministic-inference

Flashinfer:

single - Total samples: 50, Unique samples: 1; the logic added in this patch is not triggered because all sequences are < split size, so the cache always returns no match on prefill as part of the fast path (I assume this is expected behavior?)

mixed

Prompt 1: total samples: 583, Unique samples: 1
Prompt 2: total samples: 468, Unique samples: 1
Long prompt: total samples: 224, Unique samples: 1

same; I don't see the reconstruction logic being triggered, probably because the prompt length is too small

prefix

Prompt 0 with prefix length 1: total samples: 321, Unique samples: 1
Prompt 1 with prefix length 511: total samples: 332, Unique samples: 1
Prompt 2 with prefix length 2048: total samples: 311, Unique samples: 1
Prompt 3 with prefix length 4097: total samples: 311, Unique samples: 1

"prefix length 4097" does not yield a call to the radix cache with >= 4096 (split size) tokens, probably because of tokenization?

triton:

single - Total samples: 50, Unique samples: 1

Maybe it makes sense for me to increase the prompt length so that we can actually test the behavior of radix cache?


I'm resolving the comments meanwhile, thanks for reviews :)

@skyzh
Contributor Author

skyzh commented Sep 25, 2025

okay, I need to do the benchmark again 🤣 I lost the changes that removed the disable-radix-cache logic after rebasing

@skyzh
Contributor Author

skyzh commented Sep 25, 2025

^ still not hitting the radix cache because the prompt length is too small; should we update the tests?

@Fridge003
Collaborator

^ still not hitting the radix cache because the prompt length is too small; should we update the tests?

Yes, we should try some longer prefixes to test the radix cache logic.

@skyzh
Contributor Author

skyzh commented Sep 26, 2025

with e207d8c I can confirm the cache gets a lot of hits,

triton:

Testing Trial 50 with batch size 50, # prefix length 1: 13, # prefix length 2048: 11, # prefix length 10000: 10, # prefix length 20000: 16,
Prompt 0 with prefix length 1: total samples: 305, Unique samples: 1
Prompt 1 with prefix length 2048: total samples: 334, Unique samples: 1
Prompt 2 with prefix length 10000: total samples: 302, Unique samples: 1
Prompt 3 with prefix length 20000: total samples: 334, Unique samples: 1

flashinfer:

Testing Trial 50 with batch size 50, # prefix length 1: 13, # prefix length 2048: 10, # prefix length 10000: 13, # prefix length 20000: 14,
Prompt 0 with prefix length 1: total samples: 296, Unique samples: 1
Prompt 1 with prefix length 2048: total samples: 312, Unique samples: 1
Prompt 2 with prefix length 10000: total samples: 350, Unique samples: 1
Prompt 3 with prefix length 20000: total samples: 317, Unique samples: 1

@Fridge003
Collaborator

@skyzh Can you also provide some performance data, like how much it degrades from normal mode?

@xiezhq-hermann
Collaborator

Thank you @skyzh for the PR, it is indeed an interesting and important feature. But I am not sure whether enforcing a split is the best way of achieving this. Does it really guarantee determinism, or does it just reduce variance, especially if the split length is small? Would a large page size achieve similar results?

@skyzh
Contributor Author

skyzh commented Sep 26, 2025

Hi @xiezhq-hermann,

  • when the split size is small: yes, I tried a split size like 128 and it can reduce non-determinism compared with not considering the split size at all. However, other parts of the system are not deterministic when the split size is very small.
  • large page size: yes. The initial implementation of this patch simply set page_size to split_size, but that led to some memory mismatch (leak) errors. Before hitting those errors it gave good results, so I think it's better to consider both page_size and split_size in this patch.

@skyzh
Contributor Author

skyzh commented Sep 26, 2025

(still gathering perf data, but my impression is that there's no significant speed difference before/after enabling it when split_size is small; in other words, since we always align the result to the split size, when it's large we have to decode more)

@skyzh
Contributor Author

skyzh commented Sep 30, 2025

perf data: no significant difference with Qwen3-8B:

python3 -m sglang.test.test_deterministic --test-mode prefix --temperature 0.7
  • with radix cache, deterministic enabled, batch = 10: 3.3927013874053955 seconds
  • with radix cache, deterministic disabled, batch = 10: 3.073150634765625 seconds

@hebiao064
Collaborator

@Fridge003 @skyzh is it ready to merge?

@skyzh
Contributor Author

skyzh commented Oct 1, 2025

@hebiao064 not yet - I'm only moderately confident that this patch makes sense

@Fridge003
Collaborator

Fridge003 commented Oct 1, 2025

@hebiao064 Let's wait for opinions from @hnyls2002

Comment thread python/sglang/test/test_deterministic.py Outdated
Comment thread python/sglang/srt/mem_cache/radix_cache.py Outdated
Comment thread python/sglang/srt/mem_cache/radix_cache.py
Signed-off-by: Alex Chi Z <iskyzh@gmail.com>
skyzh added 4 commits October 13, 2025 22:44
@skyzh skyzh requested review from Fridge003 and hnyls2002 October 13, 2025 23:24
skyzh added 2 commits October 13, 2025 23:27
@skyzh
Contributor Author

skyzh commented Oct 13, 2025

addressed the comments and ready for review again :) thanks!

@hnyls2002 hnyls2002 merged commit dc965db into sgl-project:main Oct 14, 2025
39 of 68 checks passed
Comment on lines +242 to +244
def match_prefix(
self, key: RadixKey, is_cache_unfinished: bool = False, **kwargs
) -> MatchResult:
Collaborator

let's not break the API; use kwargs for is_cache_unfinished
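A minimal sketch of the kwargs-based approach the reviewer suggests. This is not the SGLang code; the function body and return value are placeholders for illustration:

```python
# Sketch: keep the public match_prefix signature unchanged and pull the
# new flag out of **kwargs, so callers that predate this PR keep working
# without any signature change.
def match_prefix(key, **kwargs):
    # Safe default: behave exactly as before when the flag is absent.
    is_cache_unfinished = kwargs.pop("is_cache_unfinished", False)
    # ... the real prefix-matching logic would run here ...
    return key, is_cache_unfinished

print(match_prefix("abc"))                            # -> ('abc', False)
print(match_prefix("abc", is_cache_unfinished=True))  # -> ('abc', True)
```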

disable=server_args.disable_radix_cache,
enable_kv_cache_events=self.enable_kv_cache_events,
eviction_policy=server_args.radix_eviction_policy,
enable_deterministic_inference=server_args.enable_deterministic_inference,
Collaborator

It would be great to add asserts for the other radix cache implementations.
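One way to read that suggestion, sketched with assumed class names (not SGLang's actual class hierarchy), is to fail fast when deterministic inference is requested on a cache variant that does not implement the alignment logic:

```python
# Sketch: guard unsupported cache variants when deterministic inference
# is enabled. Class names and the flag name are assumptions for
# illustration, not SGLang's real API.
class BaseRadixCache:
    supports_deterministic = False

    def __init__(self, enable_deterministic_inference=False):
        if enable_deterministic_inference:
            # Fail fast instead of silently running non-deterministically.
            assert self.supports_deterministic, (
                f"{type(self).__name__} does not support "
                "deterministic inference yet"
            )

class RadixCache(BaseRadixCache):
    supports_deterministic = True  # this PR adds the alignment logic here

RadixCache(enable_deterministic_inference=True)  # accepted
try:
    BaseRadixCache(enable_deterministic_inference=True)
except AssertionError as e:
    print("rejected:", e)
```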

return value, node

# use the access history to first find a split point at split_size and then return the value and node at that point.
def reconstruct_at_split_point(match_history, value_len):
Collaborator

let's move this definition outside of def _match_prefix_helper()
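A sketch of what that refactor might look like as a module-level function. The data shapes are assumptions: here `match_history` is taken to be a list of `(node, cumulative_length)` pairs recorded during the first matching pass, which may differ from the PR's actual bookkeeping.

```python
# Sketch of a module-level reconstruct_at_split_point, assuming
# match_history is a list of (node, cumulative_length) pairs recorded
# during the first matching pass. SPLIT_SIZE and the shapes are
# assumptions for illustration only.
SPLIT_SIZE = 1024

def reconstruct_at_split_point(match_history, value_len):
    """Find the last node boundary at or below the aligned length."""
    aligned = (value_len // SPLIT_SIZE) * SPLIT_SIZE
    last = (None, 0)
    for node, cum_len in match_history:
        if cum_len > aligned:
            break  # this node would need to be split at `aligned`
        last = (node, cum_len)
    return last, aligned

history = [("n1", 1024), ("n2", 3072), ("n3", 5000)]
print(reconstruct_at_split_point(history, 5000))  # -> (('n2', 3072), 4096)
```

Hoisting it to module level (as the reviewer asks) avoids re-creating the closure on every `_match_prefix_helper` call and makes the function independently testable.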

Fridge003 added a commit that referenced this pull request Oct 16, 2025