[Diffusion] Cache dit support parallel #15163
cc @DefTruth
    if _original_similarity is None:
        _original_similarity = cache_manager.CachedContextManager.similarity

    def patched_similarity(self, t1, t2, *, threshold, parallelized=False, prefix="Fn"):
Does this similarity apply to all cache algorithms?
Add an "adapted from" link to the code: `# Adapted from https://github.com/vipshop/cache-dit/blob/main/src/cache_dit/caching/cache_contexts/cache_manager.py#L495-L523`
> Does this similarity apply to all cache algorithms?

@BBuf @mickqian Sure! The similarity function applies to all cache algorithms. The implementation in cache-dit does not currently take hybrid parallelism into account, so I think this patch is a suitable modification. When hybrid parallelism is used, it is more appropriate to all-reduce the diff within the specified group.
I think the HACK scheme is as follows:
1. Patch the similarity function (to support hybrid parallelism and pass a specific group to the cache manager).
2. Build a FAKE cache-dit parallelism configuration to indicate that we are running in a distributed scenario (DO NOT pass it to cache-dit via the `parallelism_config` parameter in the `enable_cache` API).
3. Bind the SP group and TP group from SGLang to the cache_manager.
4. The patched similarity function uses the specific SP group and TP group.
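The scheme above can be sketched in plain Python without a real distributed backend. This is an illustration under stated assumptions, not the PR's implementation: `FakeCacheManager`, `make_mean_reducer`, `relative_diff`, and the `sp_all_reduce` / `tp_all_reduce` attributes are hypothetical stand-ins, and each "group" is simulated by a callable that averages the local value with values held by peers. In real code that callable would presumably be an all-reduce over the corresponding SGLang process group.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Stand-in for a group collective: takes this rank's scalar, returns the group mean.
AllReduceMean = Callable[[float], float]


@dataclass
class FakeCacheManager:
    """Hypothetical manager with the SP and TP groups bound to it."""
    sp_all_reduce: Optional[AllReduceMean] = None
    tp_all_reduce: Optional[AllReduceMean] = None


def make_mean_reducer(peer_values: Sequence[float]) -> AllReduceMean:
    """Fake collective: average the local value with the peers' values."""
    def reduce(local: float) -> float:
        values = [local, *peer_values]
        return sum(values) / len(values)
    return reduce


def relative_diff(t1: Sequence[float], t2: Sequence[float]) -> float:
    """Toy metric: mean |t1 - t2| relative to mean |t1|."""
    diff = sum(abs(a - b) for a, b in zip(t1, t2)) / len(t1)
    base = sum(abs(a) for a in t1) / len(t1)
    return diff / (base + 1e-8)


def patched_similarity(manager, t1, t2, *, threshold, parallelized=False):
    """Compare the (optionally group-reduced) diff against the threshold."""
    diff = relative_diff(t1, t2)
    if parallelized:
        # Reduce only within the groups bound to the manager, not the world group.
        for reduce in (manager.sp_all_reduce, manager.tp_all_reduce):
            if reduce is not None:
                diff = reduce(diff)
    return diff < threshold
```

With two simulated ranks whose local shards differ, the unreduced check can give different answers per rank, while the reduced check yields one shared decision, which is the property a cache hit/miss decision needs when ranks must stay in lockstep.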
Agree with it.
Co-authored-by: Mick <mickjagger19@icloud.com>
A_cat_walks_on_the_grass_realistic_20251215-054255_c162f999.mp4
A_cat_walks_on_the_grass_realistic_20251215-072921_c162f999.mp4
333.6s -> 244.27s (36%+ speedup, about 1.37x; roughly 27% less time)