Add install with pip #3
Merged
merrymercy merged 1 commit into main on Jan 9, 2024
Conversation
kbumsik referenced this pull request in DeepAuto-AI/sglang on Jan 23, 2025
zcnrex pushed a commit to zcnrex/sglang that referenced this pull request on Mar 5, 2025
…riton Optimize Triton Kernel of Group GEMM
timethink pushed a commit to timethink/sglang that referenced this pull request on Mar 9, 2025
cnwenf pushed a commit to cnwenf/sglang that referenced this pull request on Apr 10, 2025
NorthmanPKU pushed a commit to NorthmanPKU/sglang that referenced this pull request on May 16, 2025
[LoRA demo] Add the checkpoint file for the lora demo
wangqian108 pushed a commit to wangqian108/sglang that referenced this pull request on Jul 17, 2025
use npu_scatter_nd_update_
0xtoward pushed a commit to 0xtoward/sglang that referenced this pull request on Mar 25, 2026
[Fix] minicpm flashinfer backend: fix begin_forward args
apinge added a commit to apinge/sglang that referenced this pull request on Mar 31, 2026
* apply aiter.topk_softmax to dev branch
* remove white line
Signed-off-by: apinge <tong.qiu2@amd.com>
mmangkad pushed a commit to mmangkad-dev/sglang that referenced this pull request on Apr 3, 2026
Kp/gemma4 multimodal support
wisclmy0611 pushed a commit that referenced this pull request on Apr 7, 2026
feat: added xpu page and npu installation support page under hardware platforms
This was referenced Apr 8, 2026
YChange01 added a commit to YChange01/sglang that referenced this pull request on Apr 9, 2026
New cross-node load/store path that bypasses the closed libubsm_sdk.so
entirely and talks to the GPL kernel UAPI /usr/include/ub/obmm.h
directly via ioctls on /dev/obmm.
Background
----------
Scheme 6 (via libubsm_sdk.so) got stuck on daemon-internal error 800
returned from ubsmem_shmem_allocate. The SDK is a binary blob so we
couldn't see what it was actually sending to the kernel. Two earlier
fixes (4MB alignment, real cluster hostnames) were both correct in
isolation but did not unblock 800 — after three iterations it became
clear that the region-based allocation path in the current SDK build
is either broken or requires cluster-side configuration we can't see.
Scheme 7 side-steps the problem by calling the kernel UAPI directly.
obmm.h is 186 lines, GPL-2.0+, and documents exactly the export /
import / unimport / unexport ioctls we need. Corresponding kernel
source lives in openEuler/kernel (OLK-6.6, migrated to AtomGit).
What's in this commit
---------------------
benchmark/engram/scheme7_obmm/
obmm_rw.h/c — thin wrapper with four entry points:
obmm_rw_open/close
obmm_rw_export / obmm_rw_unexport
obmm_rw_import / obmm_rw_unimport
Plus an 80-byte packed handle struct that carries
(mem_id, tokenid, length, uba, seid, deid, scna,
pxm_numa, base_dist) across TCP for the cross-node
variant that will come next.
smoke_test.c — single-node loopback:
1. open /dev/obmm
2. mmap 4 MB anonymous buffer
3. write 1 KB pattern
4. EXPORT_PID with flags=ALLOW_MMAP
5. IMPORT back with flags=ALLOW_MMAP
6. read pattern through the imported VA, verify
7. quick loopback load-latency bench
All seid/deid left zero for the simplest first call.
Makefile — plain gcc, no link to libubsm_sdk.so (we verify via
ldd that nothing sdk-related sneaks in).
README.md — architecture diagram, how to run, expected output,
and a "what can go wrong" table tied to each likely
EINVAL / EPERM / ENOENT failure mode.
This is Task sgl-project#2 of a 5-task scheme7 plan tracked in the session.
Next tasks:
sgl-project#3 extend to cross-node via TCP handle exchange
sgl-project#4 bench scheme7 vs scheme5
sgl-project#5 integrate winner into SGLang Engram prefetcher
The deferred kernel URMA_SEG_MAPPED patch is documented in
memory/project_kernel_urma_mapped_stretch.md and will be revisited
later as an independent upstream-contribution track — it answers a
different question from scheme7 (API unification, not hardware
capability).
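The 80-byte packed handle described above can be sketched in Python. The field order follows the commit text, but the individual field widths and the trailing padding are assumptions (the commit only states the names and the 80-byte total), so this is an illustration of the wire format, not the actual obmm_rw.h layout.

```python
import struct

# Hypothetical layout for the cross-node handle. Field order follows the
# commit text; widths (u64 each) and the 8 trailing pad bytes are assumptions
# chosen to hit the stated 80-byte size.
HANDLE = struct.Struct("<9Q8x")  # 9 x 8-byte fields + 8 pad bytes = 80

def pack_handle(mem_id, tokenid, length, uba, seid, deid, scna, pxm_numa, base_dist):
    """Serialize the handle for TCP exchange between nodes."""
    return HANDLE.pack(mem_id, tokenid, length, uba,
                       seid, deid, scna, pxm_numa, base_dist)

# Values taken from the smoke-run log elsewhere in this thread.
wire = pack_handle(0x1, 0x4A9, 4 << 20, 0xFFFFFFC00000, 0, 0, 0, 0, 0)
assert len(wire) == 80                            # matches the stated size
assert HANDLE.unpack(wire)[3] == 0xFFFFFFC00000   # uba survives the round trip
```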
YChange01 added a commit to YChange01/sglang that referenced this pull request on Apr 9, 2026
…CMD_EXPORT
smoke_multi results from run sgl-project#3:
  [FAIL] v1-v5 EXPORT_PID (all backings) errno=95 ENOTSUP
  [PASS] v7  EXPORT size[0]=4MB, flags=0, pxm=0 → mem_id=0x1 tokenid=0x4a9 uba=0xffffffc00000
  [PASS] v8  flags=ALLOW_MMAP
  [PASS] v9  flags=FAST
  [PASS] v10 flags=ALLOW_MMAP|FAST
  [PASS] v11 fake eid
Two takeaways:
1. EXPORT_PID (kernel name export_user_addr) is ENOTSUP for user callers regardless of buffer backing (ANON/PRIVATE/memfd/LOCKED) and flags (0/FAST). This path is probably reserved for specific kernel-internal callers (qemu/KVM/DPU). We give up on it.
2. OBMM_CMD_EXPORT (kernel allocates) works with any flag combo including ALLOW_MMAP. The output uba=0xffffffc00000 looks like a 4MB-aligned UB fabric address in the top of a 48-bit addressing space — a clean handle we can pass to mmap if we find the right offset convention.
The remaining unknown: how to mmap the exported region so the local process can write into it. ioctl doesn't return a VA; mmap(2) on the /dev/obmm fd wants an offset.
smoke_export.c tries seven candidate offsets in sequence, each on a fresh EXPORT/UNEXPORT cycle:
  offset=0                     (default — probably picks by mem_id)
  offset=uba                   (raw fabric address)
  offset=uba / PAGE_SIZE       (paged fabric address)
  offset=mem_id                (handle as offset, tiny)
  offset=mem_id << 12          (paged handle)
  offset=mem_id << PAGE_SHIFT  (same as above, PAGE_SHIFT=12)
  offset=tokenid               (long shot)
For each that mmap()s successfully, we write + read a 16-word pattern to prove the VA is real and coherent. First hit wins. Once we know the offset convention, the obmm_rw.c wrapper will be rewritten to use EXPORT (not EXPORT_PID) as its primary path.
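The seven candidate offsets can be generated mechanically from the three values the EXPORT ioctl returns. The formulas below mirror the list in the commit; the sample mem_id/tokenid/uba are the ones from the run log above.

```python
PAGE_SIZE = 4096
PAGE_SHIFT = 12

def candidate_offsets(uba, mem_id, tokenid):
    """The seven mmap offset conventions to probe, in the order the commit tries them."""
    return [
        0,                        # default: kernel may pick by mem_id
        uba,                      # raw fabric address
        uba // PAGE_SIZE,         # paged fabric address
        mem_id,                   # handle as offset, tiny
        mem_id << 12,             # paged handle
        mem_id << PAGE_SHIFT,     # same as above with the symbolic shift
        tokenid,                  # long shot
    ]

offs = candidate_offsets(uba=0xFFFFFFC00000, mem_id=0x1, tokenid=0x4A9)
assert len(offs) == 7
assert offs[4] == offs[5] == 0x1000   # the two mem_id-shift variants coincide
```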
zbennett10 added a commit to WorldFlowAI/sglang that referenced this pull request on May 1, 2026
…roject#3)
Eliminates the "pool memory leak detected!" SIGQUIT that fired at the longeval->scbench dataset boundary in the 2026-04-29 A10G validation run, with evictable_size_=701881 > total=545241 (overcount of ~156k slots) and a precursor decode-batch line `#token: -129815`.
Root cause (verified against the actual leak-trigger.log, not the hand-wavy lock_ref-overlap hypothesis from skill v1.1): model_runner._correct_fuzzy_kv_rope_contiguous tried to allocate fresh pool slots AFTER match_prefix had already committed the fuzzy match (lock_ref'd the donor, set req.fuzzy_match_result, merged the donor's KV indices into the recipient's req_to_token_pool slice). When alloc failed under memory pressure, the function returned early without rolling back ANY of that state, leaving the recipient's req_to_token_pool[exact_matched_len:prefix_len] permanently pointing at the donor's slots.
cache_finished_req later read those entries from req_to_token_pool and freed them via token_to_kv_pool_allocator.free(...) — while they were still in the donor's TreeNode.value (still counted as evictable). The same physical slots ended up in BOTH the allocator's available pool AND the radix tree's evictable accounting, breaking the invariant `total = available + evictable + protected + ...`. Each subsequent failed-alloc fuzzy hit compounded the duplication.
Fix: pre-allocate the realization slots in RadixCache.match_prefix BEFORE committing any fuzzy state. If alloc fails, return exact-only with no state mutation. Pool capacity is checked at match time; _correct_fuzzy_kv_rope can no longer fail mid-request.
Surfaces:
* RadixCache.match_prefix: alloc fuzzy_matched_len slots before inc_lock_ref(donor) / fuzzy_match_result assignment / merged device_indices. On alloc==None, return an exact-only MatchResult. Skip the alloc when cached_start_pos == exact_matched_len AND segments is None (no RoPE delta needed; existing fast path).
* Req.fuzzy_realized_locs: new field carrying the pre-allocated slots tensor between match_prefix and _correct_fuzzy_kv_rope.
* model_runner._correct_fuzzy_kv_rope_contiguous: consume req.fuzzy_realized_locs instead of allocating; clear after use.
* model_runner._correct_fuzzy_kv_rope_segments: slice the pre-allocated block per segment as we iterate; free the unused tail explicitly when segments are skipped.
* RadixCache.cache_finished_req: defensive cleanup — if fuzzy_realized_locs is still set at finish time (request aborted before forward), free them so they don't leak.
* RadixCache.match_prefix: defensive dec_lock_ref of any previously-locked donor when overwriting fuzzy_donor_node (chunked-prefill / post-retraction re-entry). Closes a parallel lock_ref leak that was masked by Bug sgl-project#3.
Added diagnostic logging on the alloc result so the bench tells us whether the fix is exercised.
Validated by running existing semblend-release SGLang integration tests (44/44 pass). End-to-end bench validation pending.
Signed-off-by: Zach Bennett <zach@worldflowai.com>
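The allocate-before-commit pattern this fix describes can be sketched with a toy allocator. All names here are illustrative stand-ins, not the real SGLang API; the point is only the control flow: reserve the realization slots first, and on failure return an exact-only match with no state mutated.

```python
class ToyPool:
    """Toy slot allocator standing in for the KV pool allocator (illustrative)."""
    def __init__(self, total):
        self.available = list(range(total))

    def alloc(self, n):
        # Return None under pressure instead of partially mutating state.
        if n > len(self.available):
            return None
        slots, self.available = self.available[:n], self.available[n:]
        return slots

def match_prefix(pool, exact_len, fuzzy_len):
    """Pre-allocate realization slots BEFORE committing any fuzzy state.
    On allocation failure, fall back to exact-only with no mutation."""
    slots = pool.alloc(fuzzy_len)
    if slots is None:
        return {"matched_len": exact_len, "realized_locs": None}
    # Only now would lock_ref / fuzzy_match_result / index merging happen.
    return {"matched_len": exact_len + fuzzy_len, "realized_locs": slots}

pool = ToyPool(total=4)
assert match_prefix(pool, 2, 3)["matched_len"] == 5        # alloc succeeds
assert match_prefix(pool, 2, 3)["realized_locs"] is None   # pool exhausted: exact-only
```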
JohnQinAMD added a commit to JohnQinAMD/sglang-amd that referenced this pull request on May 3, 2026
…g is the cause
Adds SGLANG_FLASH_MLA_SHADOW_REF=<dir> hook that runs ref_sparse_attn_decode
on the SAME tensors the kernel just consumed and logs per-call cos_sim +
max_diff to a CSV. Samples 100% of single_shot calls during a live e2e run
(prior 8-capture saved-tensor test was 0.7% of production call space).
Live e2e run on chi2774 with tier0 (no cuda graph) + SGLANG_HIP_CK_V32_SINGLESHOT=1
+ shadow-ref enabled. Result across 560 production single_shot calls:
Relative diff (kernel vs torch ref):
min: 0.0000%
median: 0.2232%
mean: 0.2129%
p99: 0.4464%
max: 0.4464%
Calls with rel > 0.5%: 0/560
Calls with rel > 1.0%: 0/560
Calls with rel > 5.0%: 0/560
All 560 calls match torch's ref_sparse_attn_decode at sub-bf16-ULP
relative diff. NO single call has a catastrophic delta. Yet e2e
produces garbage tokens.
DEFINITIVE CONCLUSION on Layer-3:
The residual e2e regression is hypothesis sgl-project#1 from the user's list:
cumulative sub-bf16-ULP noise compounded across 60 layers × 30 tokens
= 1800 calls per generated sequence per worker. Each call is within
bf16 floor; the cumulative drift exceeds the model's training-time
robustness envelope.
Hypothesis sgl-project#2 (wrapper cache state corruption): RULED OUT — no outlier
calls in the 560-sample distribution.
Hypothesis sgl-project#3 (cuda graph stream ordering): RULED OUT — shadow-ref ran
in tier0 (no cuda graph) and still showed garbage e2e despite per-call
diffs being uniform and small.
Path forward (unchanged from previous commit):
(b) Model finetune with kernel's specific bit pattern — out of kernel
team scope.
(d) Accept the Layer-3 stopgap and pursue other path-to-1x-B200 levers.
Layer-3 stopgap (ca6f419) remains the production correctness fix.
The kernel + diagnostic infrastructure (now including shadow-ref) is at
its best-attainable bf16-precision-equivalent state.
This commit closes the Layer-3 root-cause investigation. The "why is it
still garbage" answer is now data-grounded: it's NOT a kernel bug, NOT
a cache bug, NOT a graph bug. It IS cumulative ULP compounding across
1800 dependent calls — fundamentally a model-tolerance issue against
the kernel's bit-equivalent-but-bit-different output bit pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
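The compounding argument in the commit above can be illustrated with a back-of-envelope random-walk model: if each of the ~1800 dependent calls contributes an independent relative error around the observed 0.21% mean, the expected cumulative drift scales as sqrt(N) times the per-call error. This is a toy estimate of why per-call diffs can be tiny while e2e output diverges, not the commit's measurement.

```python
import math

per_call_rel = 0.0021   # mean relative diff observed per call (0.21%)
calls = 60 * 30         # 60 layers x 30 tokens = 1800 dependent calls

# Random-walk model: independent per-call errors accumulate as sqrt(N) * eps.
expected_drift = math.sqrt(calls) * per_call_rel

assert calls == 1800
assert expected_drift > 0.05   # roughly 9% cumulative drift vs 0.21% per call
```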
JohnQinAMD added a commit to JohnQinAMD/sglang-amd that referenced this pull request on May 3, 2026
… default-on
Adds 5 default-on env knobs to the stacked-best preset:
1. SGLANG_FP8_PAGED_MQA_LOGITS_AITER=1 — routes c4 indexer scoring through
aiter's _gluon_deepgemm_fp8_paged_mqa_logits single-kernel path instead of
the torch fallback (GEMM + reduce + gather + multiple elementwise ops).
Per memory project_dsv4_phase23_aiter_paged_mqa.md, this knob shipped
2026-04-29 (TPOT -0.97 ms, +3.10% throughput) but the launch script was
never updated. C=4 aligned bench A/B verified 2026-05-03:
knob OFF (torch): TPOT 25.66 / total 287.96 tok/s / output 143.01 tok/s
knob ON (aiter): TPOT 24.60 / total 316.83 tok/s / output 157.35 tok/s
Recovers the documented TL;DR numbers (24.52/316.47/157.17) within ±0.13ms
noise. This single env knob accounts for the +1.06 ms TPOT regression
observed in the stacked-best preset since the doc snapshot.
2. SGLANG_FUSED_MHC_POST=1 (-0.42 ms TPOT, +1.57% throughput): replaces the
post-sinkhorn (pre*x_flat).sum(1).to(bf16) chain (3 launches) with one
Triton kernel at hc_pre@2806.
3. SGLANG_FUSED_RMSNORM_QUANT_PER1x128=1 (-0.10 ms TPOT, +0.37% throughput):
single-output rmsnorm + per-1x128 fp8 quant fusion for q_norm; only fires
for layers without indexer.
4. SGLANG_AITER_QK_RMSNORM_GROUP_QUANT=1 (-0.23 ms TPOT, +0.83% throughput):
stacked superset of sgl-project#3, fuses q_norm + kv_norm + per-128 fp8 quant on q
in ONE aiter HEAD launch with q_lora_bf16 materialised for indexer use.
5. SGLANG_FUSED_FREQS_IDX_GATHER=1 (-0.36 ms TPOT, +1.39% throughput):
B200-style page-table arithmetic absorption — fuses (seq_lens-1)//ratio*
ratio (3 launches) + freqs_cis[idx] (1 index_select + view + contig) into
one Triton kernel.
Combined with the D=512 megakernel patch and the previously-shipped fusions,
chi2774 Flash-Base FP8 hits TPOT 24.60 ms / 316.83 tok/s on the c=4 aligned
bench (40 prompts, ISL=OSL=1024, range_ratio=0.8) — matching the doc TL;DR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
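For reference, the five knobs above can be set before launch. The variable names below are verbatim from the commit; how the launch script consumes them is not shown there, so this is just a minimal sketch of turning them all on via the environment.

```python
import os

# Default-on knobs from the stacked-best preset (names verbatim from the commit).
STACKED_BEST_KNOBS = {
    "SGLANG_FP8_PAGED_MQA_LOGITS_AITER": "1",    # aiter single-kernel indexer scoring
    "SGLANG_FUSED_MHC_POST": "1",                # fused post-sinkhorn chain
    "SGLANG_FUSED_RMSNORM_QUANT_PER1x128": "1",  # rmsnorm + per-1x128 fp8 quant fusion
    "SGLANG_AITER_QK_RMSNORM_GROUP_QUANT": "1",  # q/kv norm + quant in one launch
    "SGLANG_FUSED_FREQS_IDX_GATHER": "1",        # page-table arith + freqs gather fusion
}
os.environ.update(STACKED_BEST_KNOBS)

assert all(os.environ[k] == "1" for k in STACKED_BEST_KNOBS)
```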
zbennett10 added a commit to WorldFlowAI/sglang that referenced this pull request on May 5, 2026
…ct#3 v3)
Closes the residual 4831-slot leak observed during the v5 a10g bench. v2 closed the chunked-prefill re-entry path; this closes the insert-walk-dedup path.
Root cause: match_prefix sets req.cache_protected_len = exact_matched_len + fuzzy_matched_len because at that point the fuzzy region in req_to_token_pool references the donor's slots (tree-owned; cache_finished_req must not free them). After _correct_fuzzy_kv_rope_{contiguous,segments} writes our freshly-allocated realized_locs into those positions, the slots are ours — not the donor's — but cache_protected_len was never updated. cache_finished_req's `free(kv_indices[cache_protected_len : new_prefix_len])` only reclaims slots BEYOND cache_protected_len; if the insert walk slices our realized values away (because the request's tokens incidentally duplicate an existing tree path) the orphaned realized_locs are at positions inside the protected region and leak silently.
v4 didn't surface this because the tree was small (4 hits in 30 min, no incidental token duplicates triggered the walk). v5's longer runtime grew the tree large enough that one hit's realized region matched an existing path, and exactly fuzzy_matched_len = 4831 slots vanished from accounting (sum=540410 vs total=545241).
Fix: After realization completes in either RoPE-correction path, decrement req.cache_protected_len by the count of realized positions. The protected range collapses back to just the exact-match prefix (which DOES still reference tree-owned slots). The free at cache_finished_req then correctly catches insert-time duplicates inside the realized region.
The segments path adds a safety guard: only decrement when total_realized == fuzzy_matched_len (full realization). For partial realization (segments skipped due to shape mismatch), cache_protected_len is a single integer and cannot selectively protect only the unrealized donor positions — stay conservative and accept a rare leak rather than risk freeing donor-owned slots.
Validated: 44/44 SGLang integration tests pass. End-to-end re-bench against the same a10g harness pending.
Signed-off-by: Zach Bennett <zach@worldflowai.com>
zbennett10 added a commit to WorldFlowAI/sglang that referenced this pull request on May 5, 2026
…ject#3 v4)
Closes the per-request pool-slot leak whose size equals the fuzzy_matched_len of each multi-segment hit.
Root cause: In the multi-segment fuzzy path, ``RadixCache.match_prefix`` returns ``device_indices`` covering only the exact-match prefix (no fuzzy slot list, because target positions are scattered and the contiguous-prefix model can't represent them). The scheduler therefore sets ``req.prefix_indices`` to length ``exact_matched_len`` and treats the fuzzy region as part of the extend window. ``alloc_for_extend`` then allocates fresh pool slots for the entire ``[exact_matched_len, total)`` span — including every position that the segments path is about to overwrite — and writes those slots into ``req_to_token_pool[req_idx, exact_matched_len:total)``.
``_correct_fuzzy_kv_rope_segments`` runs at the start of ``forward_extend`` and overwrites ``req_to_token_pool[req_idx, target_positions]`` with the realized slots produced by the donor-KV copy. The extend slots that previously occupied those positions are no longer referenced by any request or tree node, but they were allocated from the pool, so the allocator's ``available`` counter stays decremented while the slots never appear in any TreeNode's evictable value. The pool invariant ``total = available + evictable + protected + session_held + uncached`` drifts by exactly ``len(target_positions) == fuzzy_matched_len`` per multi-segment hit. The leak detector at ``on_idle`` eventually fires.
Fix: Before overwriting ``req_to_token_pool[req_idx, target_positions]`` with ``new_locs``, read the displaced indices from those positions and free them back to the allocator. The freed slots return to ``available``; the realized slots are written into the pool slice; ``cache_unfinished_req``'s subsequent insert places them into a new TreeNode, where they are correctly tracked as evictable.
Within a single batch, ``alloc_for_extend`` runs once at batch preparation time. ``forward_extend`` then writes K/V to the slots named in ``out_cache_loc``, which still includes the just-freed slots; that K/V is harmlessly discarded because the slots are no longer referenced by ``req_to_token_pool``. No concurrent allocator operation can hand the slots to another request before the batch finishes, so the early free is safe.
The contiguous (single-segment / TokenBlockMatch) path is unaffected: ``device_indices`` already includes the donor's KV indices, so ``alloc_for_extend`` skips the fuzzy region entirely and the overwrite there displaces donor-owned (tree-protected) slots, not request-owned extend slots.
Also reverts the v3 ``cache_protected_len -= num_fuzzy`` decrement in both contiguous and segments paths. v3 was a no-op against the actual leak: ``cache_unfinished_req`` resets ``cache_protected_len`` to ``len(new_indices)`` after every extend, undoing the decrement before ``cache_finished_req`` runs. The diagnostic [FUZZY DBG] logs added alongside v3 are removed for the same reason.
Validated: 44/44 SemBlend SGLang adapter tests pass. End-to-end re-bench against the a10g harness pending.
Signed-off-by: Zach Bennett <zach@worldflowai.com>
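The free-before-overwrite step this fix describes can be sketched against a toy pool. Names are illustrative, not the real SGLang API; the invariant checked at the end mirrors the accounting the commit describes (referenced slots plus the allocator's available counter must equal the pool total).

```python
class ToyAllocator:
    """Toy pool tracking only the available counter that the leak corrupts."""
    def __init__(self, total):
        self.total = total
        self.available = total
        self._next = 0

    def alloc(self, n):
        self.available -= n
        slots = list(range(self._next, self._next + n))
        self._next += n
        return slots

    def free(self, slots):
        self.available += len(slots)

def realize_segments(alloc, req_to_token, target_positions, new_locs):
    """Free the displaced extend slots BEFORE overwriting with realized slots,
    so the allocator's available counter stays consistent."""
    displaced = [req_to_token[p] for p in target_positions]
    alloc.free(displaced)                       # v4 fix: reclaim before overwrite
    for p, loc in zip(target_positions, new_locs):
        req_to_token[p] = loc

alloc = ToyAllocator(total=10)
row = alloc.alloc(6)        # alloc_for_extend covered the whole extend window
new_locs = alloc.alloc(2)   # realized slots from the donor-KV copy
realize_segments(alloc, row, [3, 4], new_locs)
# 8 slots allocated, 2 displaced ones freed back: 6 referenced + 4 available == 10
assert alloc.available == 4
```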
nvyutwu added a commit to nvyutwu/sglang that referenced this pull request on May 7, 2026
Two fixes to the existing Hook C (NSA forward_decode first-read
fingerprint), both forced by the post-Arm-B retract→resume framing.
Hook C-narrow (M6 promoted from REVIEW_DEFERRED → fixed):
* New _emit_topk_read_fingerprints(forward_batch, layer, physical_pages):
fingerprints only the unique pages NSA's indexer chose, dedup'd
via _FP_SEEN_PAGES_TOPK, emitted with is_topk: 1.
* Wired into forward_decode for both NSA backends:
- trtllm path: resolve position-indices → physical pages by
gathering metadata.page_table_1[:, topk_indices.clamp(min=0)]
before the kernel call (the kernel does this internally; we
replicate the lookup so the fp namespace matches Hook A/B/C
events).
- non-trtllm paths (flashmla_sparse / flashmla_kv / tilelang /
fa3 / aiter): call after page_table_1 = transform_index_…
(that result is already physical pages).
* Skipped during cuda-graph capture (same guard as superset hook).
* The original superset Hook C (is_topk: 0) is retained — superset
catches H1' / H2' (B≠C on any page in req_to_token); narrow catches
H6' (B=C on superset but B≠C-narrow on the topk-selected subset).
Why narrow matters now: Arm B confirmed retract→resume is the trigger.
SGLang's offload/load path covers the MLA kv_buffer but NOT the NSA
index_k_with_scale_buffer (NSATokenToKVPool inherits without override).
So post-resume the NSA index is stale → topk picks wrong-but-legitimate
pages → surrounding pages still match recv-fp → superset hook is blind.
Narrow hook fingerprints exactly the topk-chosen subset.
req_to_token slice fp (Fix sgl-project#3):
* Add rti_fp = blake2b-8(req_to_token[rpi, :seq_len].bytes) to every
superset-Hook-C read event. Lets the post-processor detect when
the indirection table itself changes between reads (resume rewrites
the row) and disambiguates H1' (block contents wrong, rti stable)
from H2' (block contents fine, rti rewritten).
REVIEW_DEFERRED.md updated:
* New 2026-05-06 banner explaining the Arm B retract→resume framing.
* M6 marked PROMOTED TO FIXED with implementation notes.
* forward_extend not yet wired with narrow hook — extends are short
and the H6'-via-retract surface is decode-side; flagged as a
follow-on if a trace ever localizes there.
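The rti_fp field described above (blake2b-8 of the req_to_token slice) can be computed with the standard library. A sketch assuming the row is a plain sequence of int32 indices; the serialization format is an assumption, since the commit only gives `rti_fp = blake2b-8(req_to_token[rpi, :seq_len].bytes)`.

```python
import hashlib
import struct

def rti_fp(req_to_token_row, seq_len):
    """blake2b-8 fingerprint of the first seq_len indirection entries,
    letting the post-processor detect a rewritten row between reads."""
    raw = struct.pack(f"<{seq_len}i", *req_to_token_row[:seq_len])
    return hashlib.blake2b(raw, digest_size=8).hexdigest()

row = [7, 7, 9, 12, 0, 0]
fp_before = rti_fp(row, 4)
row[2] = 99                           # a resume rewrites the indirection row
assert fp_before != rti_fp(row, 4)    # H2' now detectable: rti changed between reads
assert len(fp_before) == 16           # 8-byte digest, hex-encoded
```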
No description provided.