
make radix cache deterministic #10721

Merged
hnyls2002 merged 8 commits into sgl-project:main from skyzh:skyzh/det-radix
Oct 14, 2025

Conversation

@skyzh
Contributor

@skyzh skyzh commented Sep 22, 2025

Motivation

This patch tries to add determinism to the radix cache. Part of #10278.

Modifications

The patch mainly modifies the match_prefix logic:

  • If it's called from the scheduler, always truncate the returned value to a multiple of SPLIT_SIZE.
  • Otherwise, insert into the tree as normal.

The algorithm is two-pass:

  • In the first pass, do the prefix matching as before.
  • If we need to align to the split size, the second pass uses the nodes collected in the first pass to split exactly at the split-size boundary and returns that node.

Note that this can do two splits in one match_prefix call: one during the normal match (splitting wherever the prefix match ends), and another when we split at the split_size boundary. It seems to work as desired, but please let me know if it doesn't (memory leak?).

This patch also changes the prefix sizes in test_determinism so that prefix matches in the cache are more likely to occur.
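The alignment step described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `SPLIT_SIZE`, the function name, and the `from_scheduler` flag are all assumed names.

```python
# Hedged sketch of the scheduler-side alignment: matches coming from the
# scheduler are rounded down to a multiple of SPLIT_SIZE, so the prefill
# split point is always the same regardless of how the cache tree happens
# to be fragmented. Names here are illustrative, not SGLang's actual API.
SPLIT_SIZE = 1024

def align_matched_prefix(matched_len: int, from_scheduler: bool) -> int:
    """Return the prefix length the scheduler should actually reuse."""
    if not from_scheduler:
        # Normal insertion path: keep the full match.
        return matched_len
    # Round down so every scheduler-visible match is a multiple of SPLIT_SIZE.
    return (matched_len // SPLIT_SIZE) * SPLIT_SIZE

print(align_matched_prefix(5000, True))   # -> 4096
print(align_matched_prefix(5000, False))  # -> 5000
```

With this rounding, a request whose cached prefix happens to end at an odd position (say 5000 tokens) is treated as if only 4096 tokens were cached, which keeps the prefill/decode boundary deterministic.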

Accuracy Tests

python3 -m sglang.test.test_deterministic --test-mode prefix --temperature 0.7

gives 1 unique sample in all testable backends with Qwen-8B

Benchmarking and Profiling

Checklist

@skyzh skyzh mentioned this pull request Sep 22, 2025
4 tasks
@skyzh
Contributor Author

skyzh commented Sep 22, 2025

slightly updated the test to use larger prefixes so that we can trigger the code path that splits at SPLIT_SIZE=1024

Testing Trial 12 with batch size 12, # prefix length 1: 3, # prefix length 4097: 2, # prefix length 5120: 2, # prefix length 8191: 5,
Prompt 0 with prefix length 1: total samples: 22, Unique samples: 1
Prompt 1 with prefix length 4097: total samples: 11, Unique samples: 1
Prompt 2 with prefix length 5120: total samples: 19, Unique samples: 3
Prompt 3 with prefix length 8191: total samples: 26, Unique samples: 3

Now we're seeing some non-determinism, but it's still better than not having the SPLIT_SIZE code path at all.

@skyzh
Contributor Author

skyzh commented Sep 22, 2025

well, this non-determinism doesn't come from the radix cache: with --disable-radix-cache I can still get:

Testing Trial 12 with batch size 12, # prefix length 1: 4, # prefix length 4097: 2, # prefix length 5120: 2, # prefix length 8191: 4,
Prompt 0 with prefix length 1: total samples: 21, Unique samples: 1
Prompt 1 with prefix length 4097: total samples: 20, Unique samples: 1
Prompt 2 with prefix length 5120: total samples: 15, Unique samples: 3
Prompt 3 with prefix length 8191: total samples: 22, Unique samples: 2

with my slightly modified prefix lengths. I guess we have issues in other parts of the code.

@skyzh
Contributor Author

skyzh commented Sep 22, 2025

(or SGLANG_FLASHINFER_PREFILL_SPLIT_TILE_SIZE=1024 is not practical for testing this)

@skyzh
Contributor Author

skyzh commented Sep 23, 2025

This patch also includes #10637; I can close that one and keep this one.

Comment thread python/sglang/srt/mem_cache/radix_cache.py Outdated
Comment thread python/sglang/srt/mem_cache/radix_cache.py Outdated
Comment thread python/sglang/srt/managers/scheduler.py
Comment thread python/sglang/srt/mem_cache/radix_cache.py Outdated
Comment thread python/sglang/srt/mem_cache/radix_cache.py Outdated
Comment thread python/sglang/srt/mem_cache/radix_cache.py
Comment thread python/sglang/srt/mem_cache/radix_cache.py Outdated
@Fridge003
Collaborator

Also please post the result of

python3 -m sglang.test.test_deterministic --test-mode single
python3 -m sglang.test.test_deterministic --test-mode mixed
python3 -m sglang.test.test_deterministic --test-mode prefix

on both flashinfer and triton backends

When doing the testing, please make sure the align size is the same as prefill_split_size/prefill_truncation_size.

@Fridge003
Collaborator

Fridge003 commented Sep 24, 2025

@hnyls2002 @xiezhq-hermann
This code changes some of the logic in the radix cache; can you please help review it together, especially regarding potential memory leaks?

@hnyls2002 hnyls2002 self-assigned this Sep 24, 2025
@skyzh
Contributor Author

skyzh commented Sep 25, 2025

python3 -m sglang.launch_server --model-path Qwen/Qwen3-8B --attention-backend flashinfer --enable-deterministic-inference

Flashinfer:

single - Total samples: 50, Unique samples: 1; the logic added in this patch is not triggered because all sequences are < split size, so the cache always returns no match on prefill as part of the fast path (I assume this is expected behavior?)

mixed

Prompt 1: total samples: 583, Unique samples: 1
Prompt 2: total samples: 468, Unique samples: 1
Long prompt: total samples: 224, Unique samples: 1

same; I don't see the reconstruction logic being triggered, probably because the prompt length is too small

prefix

Prompt 0 with prefix length 1: total samples: 321, Unique samples: 1
Prompt 1 with prefix length 511: total samples: 332, Unique samples: 1
Prompt 2 with prefix length 2048: total samples: 311, Unique samples: 1
Prompt 3 with prefix length 4097: total samples: 311, Unique samples: 1

"prefix length 4097" does not yield a call to the radix cache with >= 4096 (split size) tokens, probably because of tokenization?

triton:

single - Total samples: 50, Unique samples: 1

Maybe it makes sense for me to increase the prompt length so that we can actually test the behavior of radix cache?


I'm resolving the comments meanwhile, thanks for reviews :)

@skyzh
Contributor Author

skyzh commented Sep 25, 2025

okay, I need to do the benchmark again 🤣 I lost the changes that removed the disable-radix-cache logic after rebasing

@skyzh
Contributor Author

skyzh commented Sep 25, 2025

^ still not hitting the radix cache because the prompt length is too small; should we update the tests?

@Fridge003
Collaborator

^ still not hitting the radix cache because the prompt length is too small; should we update the tests?

Yes, we should try some longer prefixes to test the radix cache logic.

@skyzh
Contributor Author

skyzh commented Sep 26, 2025

with e207d8c I can confirm the cache gets a lot of hits,

triton:

Testing Trial 50 with batch size 50, # prefix length 1: 13, # prefix length 2048: 11, # prefix length 10000: 10, # prefix length 20000: 16,
Prompt 0 with prefix length 1: total samples: 305, Unique samples: 1
Prompt 1 with prefix length 2048: total samples: 334, Unique samples: 1
Prompt 2 with prefix length 10000: total samples: 302, Unique samples: 1
Prompt 3 with prefix length 20000: total samples: 334, Unique samples: 1

flashinfer:

Testing Trial 50 with batch size 50, # prefix length 1: 13, # prefix length 2048: 10, # prefix length 10000: 13, # prefix length 20000: 14,
Prompt 0 with prefix length 1: total samples: 296, Unique samples: 1
Prompt 1 with prefix length 2048: total samples: 312, Unique samples: 1
Prompt 2 with prefix length 10000: total samples: 350, Unique samples: 1
Prompt 3 with prefix length 20000: total samples: 317, Unique samples: 1

@Fridge003
Collaborator

@skyzh Can you also provide some performance data, like how much it degrades from normal mode?

@xiezhq-hermann
Collaborator

Thank you @skyzh for the PR, it is indeed an interesting and important feature. But I am not sure whether enforcing a split is the best way of achieving this. Does it really guarantee determinism, or does it just reduce variance, especially if the split length is small? Would a large page size achieve similar results?

@skyzh
Contributor Author

skyzh commented Sep 26, 2025

Hi @xiezhq-hermann,

  • when the split size is small: yes, I tried a split size like 128 and it can reduce non-determinism compared with not considering the split size at all. However, other parts of the system are not deterministic when the split size is very small.
  • large page size: yes. The initial implementation of this patch simply set page_size to split_size, but that led to some memory mismatch (leak) errors. Before hitting those errors it gave good results, so I think it's better to consider both page_size and split_size in this patch.

@skyzh
Contributor Author

skyzh commented Sep 26, 2025

(still gathering perf data, but my impression is that there's no significant speed difference before/after enabling it when split_size is small; in other words, since we always align the result to the split size, when it's large we have to decode more)

@skyzh
Contributor Author

skyzh commented Sep 30, 2025

perf data: no significant difference with Qwen3-8B:

python3 -m sglang.test.test_deterministic --test-mode prefix --temperature 0.7
  • with radix cache, deterministic enabled, batch = 10: 3.3927013874053955 seconds
  • with radix cache, deterministic disabled, batch = 10: 3.073150634765625 seconds

@hebiao064
Collaborator

@Fridge003 @skyzh is it ready to merge?

@skyzh
Contributor Author

skyzh commented Oct 1, 2025

@hebiao064 not yet - I'm only moderately confident that this patch makes sense

@Fridge003
Collaborator

Fridge003 commented Oct 1, 2025

@hebiao064 Let's wait for opinions from @hnyls2002

Comment thread python/sglang/test/test_deterministic.py Outdated
Comment thread python/sglang/srt/mem_cache/radix_cache.py Outdated
Comment thread python/sglang/srt/mem_cache/radix_cache.py
Signed-off-by: Alex Chi Z <iskyzh@gmail.com>
skyzh added 4 commits October 13, 2025 22:44
@skyzh skyzh requested review from Fridge003 and hnyls2002 October 13, 2025 23:24
skyzh added 2 commits October 13, 2025 23:27
@skyzh
Contributor Author

skyzh commented Oct 13, 2025

addressed the comments and ready for review again :) thanks!

@hnyls2002 hnyls2002 merged commit dc965db into sgl-project:main Oct 14, 2025
39 of 68 checks passed
Comment on lines +242 to +244
def match_prefix(
self, key: RadixKey, is_cache_unfinished: bool = False, **kwargs
) -> MatchResult:
Collaborator

let's not break the API; use kwargs for is_cache_unfinished
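A minimal sketch of the kwargs-based approach the reviewer suggests. This is not the SGLang code; the function body and return value are placeholders for illustration:

```python
# Sketch: keep the public match_prefix signature unchanged and pull the
# new flag out of **kwargs, so callers that predate this PR keep working
# without any signature change.
def match_prefix(key, **kwargs):
    # Safe default: behave exactly as before when the flag is absent.
    is_cache_unfinished = kwargs.pop("is_cache_unfinished", False)
    # ... the real prefix-matching logic would run here ...
    return key, is_cache_unfinished

print(match_prefix("abc"))                            # -> ('abc', False)
print(match_prefix("abc", is_cache_unfinished=True))  # -> ('abc', True)
```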

disable=server_args.disable_radix_cache,
enable_kv_cache_events=self.enable_kv_cache_events,
eviction_policy=server_args.radix_eviction_policy,
enable_deterministic_inference=server_args.enable_deterministic_inference,
Collaborator

It would be great to add asserts for the other radix cache implementations.
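One way to read that suggestion, sketched with assumed class names (not SGLang's actual class hierarchy), is to fail fast when deterministic inference is requested on a cache variant that does not implement the alignment logic:

```python
# Sketch: guard unsupported cache variants when deterministic inference
# is enabled. Class names and the flag name are assumptions for
# illustration, not SGLang's real API.
class BaseRadixCache:
    supports_deterministic = False

    def __init__(self, enable_deterministic_inference=False):
        if enable_deterministic_inference:
            # Fail fast instead of silently running non-deterministically.
            assert self.supports_deterministic, (
                f"{type(self).__name__} does not support "
                "deterministic inference yet"
            )

class RadixCache(BaseRadixCache):
    supports_deterministic = True  # this PR adds the alignment logic here

RadixCache(enable_deterministic_inference=True)  # accepted
try:
    BaseRadixCache(enable_deterministic_inference=True)
except AssertionError as e:
    print("rejected:", e)
```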

return value, node

# use the access history to first find a split point at split_size and then return the value and node at that point.
def reconstruct_at_split_point(match_history, value_len):
Collaborator

let's move this definition outside of def _match_prefix_helper()
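A sketch of what that refactor might look like as a module-level function. The data shapes are assumptions: here `match_history` is taken to be a list of `(node, cumulative_length)` pairs recorded during the first matching pass, which may differ from the PR's actual bookkeeping.

```python
# Sketch of a module-level reconstruct_at_split_point, assuming
# match_history is a list of (node, cumulative_length) pairs recorded
# during the first matching pass. SPLIT_SIZE and the shapes are
# assumptions for illustration only.
SPLIT_SIZE = 1024

def reconstruct_at_split_point(match_history, value_len):
    """Find the last node boundary at or below the aligned length."""
    aligned = (value_len // SPLIT_SIZE) * SPLIT_SIZE
    last = (None, 0)
    for node, cum_len in match_history:
        if cum_len > aligned:
            break  # this node would need to be split at `aligned`
        last = (node, cum_len)
    return last, aligned

history = [("n1", 1024), ("n2", 3072), ("n3", 5000)]
print(reconstruct_at_split_point(history, 5000))  # -> (('n2', 3072), 4096)
```

Hoisting it to module level (as the reviewer asks) avoids re-creating the closure on every `_match_prefix_helper` call and makes the function independently testable.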

Fridge003 added a commit that referenced this pull request Oct 16, 2025