make radix cache deterministic#10721
Conversation
|
slightly updated the test to have larger prefix so that we can trigger the code path to split at SPLIT_SIZE=1024 Now we're getting a few non-determinism but still better than without the SPLIT_SIZE code path at all. |
|
well, this non-determinism doesn't come from the radix cache: with with my slightly modified prefix lengths. guess we have some issues with other parts of the code. |
|
(or |
b98aec2 to
f48bdf7
Compare
|
This patch also includes #10637; I can close that one and keep this |
|
Also please post the result of python3 -m sglang.test.test_deterministic --test-mode single
python3 -m sglang.test.test_deterministic --test-mode mixed
python3 -m sglang.test.test_deterministic --test-mode prefixon both flashinfer and triton backends When do the testing, please make sure the align size is the same as prefill_split_size/prefill_truncation_size |
|
@hnyls2002 @xiezhq-hermann |
Flashinfer: single - Total samples: 50, Unique samples: 1 ; the logic added in this patch is not triggered b/c all sequences are < split size, and therefore the cache always return un-match on prefill as part of the fast path (I assume this is expected behavior?) mixed same, I don't see the logic of reconstruction is triggered, probably due to the prompt length is too small prefix "prefix length 4097" does not yield a call to radix cache with >= 4096 (split size) requests, probably because of tokenization? triton: single - Total samples: 50, Unique samples: 1 Maybe it makes sense for me to increase the prompt length so that we can actually test the behavior of radix cache? I'm resolving the comments meanwhile, thanks for reviews :) |
|
okay I need to do a benchmark again 🤣 I lost the changes to remove disable radix cache logic after rebasing |
|
^ still not hitting the radix cache due to prompt length too small, should we update the tests? |
Yes, we should try some longer prefices to test radix cache logic |
|
with e207d8c I can confirm the cache gets a lot of hits, trition: flashinfer: |
e207d8c to
83188cf
Compare
|
@skyzh Can you also provide some data for performance? Like how much it degrades from normal mode? |
|
Thank you @skyzh for the PR, it is indeed an interesting and important feature. But I am not sure whether enforcing a split is the best way of achieving this. Does it really guarantee determinism or more like reducing variances, especially if the split length is small. Would a large page size somehow achieving similar results? |
|
Hi @xiezhq-hermann,
|
|
(still gathering perf data, but my impression was that there're no significant speed different before/after enable if split_size is small; in other words, as we always align the result with the split size, when it's large, we have to decode more) |
|
perf data: no significant difference with Qwen3-8B:
|
|
@Fridge003 @skyzh is it ready to merge? |
|
@hebiao064 not yet - I have mid-confidence on whether this patch makes sense |
|
@hebiao064 Let's wait for opinions from @hnyls2002 |
c60a7d1 to
35e6d24
Compare
Signed-off-by: Alex Chi Z <iskyzh@gmail.com>
35e6d24 to
f67b7e5
Compare
Signed-off-by: Alex Chi Z <iskyzh@gmail.com>
Signed-off-by: Alex Chi Z <iskyzh@gmail.com>
Signed-off-by: Alex Chi Z <iskyzh@gmail.com>
Signed-off-by: Alex Chi Z <iskyzh@gmail.com>
Signed-off-by: Alex Chi Z <iskyzh@gmail.com>
|
addressed the comments and ready for review again :) thanks! |
| def match_prefix( | ||
| self, key: RadixKey, is_cache_unfinished: bool = False, **kwargs | ||
| ) -> MatchResult: |
There was a problem hiding this comment.
let's not break API and use kwargs for is_cache_unfinished
| disable=server_args.disable_radix_cache, | ||
| enable_kv_cache_events=self.enable_kv_cache_events, | ||
| eviction_policy=server_args.radix_eviction_policy, | ||
| enable_deterministic_inference=server_args.enable_deterministic_inference, |
There was a problem hiding this comment.
It will be great to add asserts for other radix cache
| return value, node | ||
|
|
||
| # use the access history to first find a split point at split_size and then return the value and node at that point. | ||
| def reconstruct_at_split_point(match_history, value_len): |
There was a problem hiding this comment.
let's move this definition outside of def _match_prefix_helper()
This reverts commit dc965db.
Motivation
The patch (tries) adding determinism to the radix cache. Part of #10278.
Modifications
The patch mainly modifies the match_prefix logic:
The algorithm is two-pass:
Note that this would do two splits in one match_prefix: one during the normal match to split at any prefix match, and another during we split at the split_size point; but it seems to work as desired. Please let me know if this doesn't work (memory leak?)
This patch also changes the test size for test_determinism's prefix size so that it can create more possibility of doing prefix matches in the cache.
Accuracy Tests
gives 1 unique sample in all testable backends with Qwen-8B
Benchmarking and Profiling
Checklist