Add install with pip #3
Merged
merrymercy merged 1 commit into main on Jan 9, 2024
Conversation
kbumsik referenced this pull request in DeepAuto-AI/sglang on Jan 23, 2025
zcnrex pushed a commit to zcnrex/sglang that referenced this pull request on Mar 5, 2025
…riton Optimize Triton Kernel of Group GEMM
timethink pushed a commit to timethink/sglang that referenced this pull request on Mar 9, 2025
cnwenf pushed a commit to cnwenf/sglang that referenced this pull request on Apr 10, 2025
NorthmanPKU pushed a commit to NorthmanPKU/sglang that referenced this pull request on May 16, 2025
[LoRA demo] Add the checkpoint file for the lora demo
wangqian108 pushed a commit to wangqian108/sglang that referenced this pull request on Jul 17, 2025
use npu_scatter_nd_update_
0xtoward pushed a commit to 0xtoward/sglang that referenced this pull request on Mar 25, 2026
[Fix] minicpm flashinfer backend: fix begin_forward args
apinge added a commit to apinge/sglang that referenced this pull request on Mar 31, 2026
* apply aiter.topk_softmax to dev branch
* remove white line
Signed-off-by: apinge <tong.qiu2@amd.com>
mmangkad pushed a commit to mmangkad-dev/sglang that referenced this pull request on Apr 3, 2026
Kp/gemma4 multimodal support
wisclmy0611 pushed a commit that referenced this pull request on Apr 7, 2026
feat: added xpu page and npu installation support page under hardware platforms
This was referenced Apr 8, 2026
YChange01 added a commit to YChange01/sglang that referenced this pull request on Apr 9, 2026
New cross-node load/store path that bypasses the closed libubsm_sdk.so
entirely and talks to the GPL kernel UAPI /usr/include/ub/obmm.h
directly via ioctls on /dev/obmm.
Background
----------
Scheme 6 (via libubsm_sdk.so) got stuck on daemon-internal error 800
returned from ubsmem_shmem_allocate. The SDK is a binary blob so we
couldn't see what it was actually sending to the kernel. Two earlier
fixes (4MB alignment, real cluster hostnames) were both correct in
isolation but did not unblock 800 — after three iterations it became
clear that the region-based allocation path in the current SDK build
is either broken or requires cluster-side configuration we can't see.
Scheme 7 side-steps the problem by calling the kernel UAPI directly.
obmm.h is 186 lines, GPL-2.0+, and documents exactly the export /
import / unimport / unexport ioctls we need. Corresponding kernel
source lives in openEuler/kernel (OLK-6.6, migrated to AtomGit).
What's in this commit
---------------------
benchmark/engram/scheme7_obmm/
obmm_rw.h/c — thin wrapper with four entry points:
obmm_rw_open/close
obmm_rw_export / obmm_rw_unexport
obmm_rw_import / obmm_rw_unimport
Plus an 80-byte packed handle struct that carries
(mem_id, tokenid, length, uba, seid, deid, scna,
pxm_numa, base_dist) across TCP for the cross-node
variant that will come next.
smoke_test.c — single-node loopback:
1. open /dev/obmm
2. mmap 4 MB anonymous buffer
3. write 1 KB pattern
4. EXPORT_PID with flags=ALLOW_MMAP
5. IMPORT back with flags=ALLOW_MMAP
6. read pattern through the imported VA, verify
7. quick loopback load-latency bench
All seid/deid left zero for the simplest first call.
Makefile — plain gcc, no link to libubsm_sdk.so (we verify via
ldd that nothing sdk-related sneaks in).
README.md — architecture diagram, how to run, expected output,
and a "what can go wrong" table tied to each likely
EINVAL / EPERM / ENOENT failure mode.
This is Task sgl-project#2 of a 5-task scheme7 plan tracked in the session.
Next tasks:
sgl-project#3 extend to cross-node via TCP handle exchange
sgl-project#4 bench scheme7 vs scheme5
sgl-project#5 integrate winner into SGLang Engram prefetcher
The deferred kernel URMA_SEG_MAPPED patch is documented in
memory/project_kernel_urma_mapped_stretch.md and will be revisited
later as an independent upstream-contribution track — it answers a
different question from scheme7 (API unification, not hardware
capability).
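The 80-byte packed handle described above can be sketched in Python. The field order follows the commit text, but the individual field widths and the trailing padding are assumptions (the commit only states the names and the 80-byte total), so this is an illustration of the wire format, not the actual obmm_rw.h layout.

```python
import struct

# Hypothetical layout for the cross-node handle. Field order follows the
# commit text; widths (u64 each) and the 8 trailing pad bytes are assumptions
# chosen to hit the stated 80-byte size.
HANDLE = struct.Struct("<9Q8x")  # 9 x 8-byte fields + 8 pad bytes = 80

def pack_handle(mem_id, tokenid, length, uba, seid, deid, scna, pxm_numa, base_dist):
    """Serialize the handle for TCP exchange between nodes."""
    return HANDLE.pack(mem_id, tokenid, length, uba,
                       seid, deid, scna, pxm_numa, base_dist)

# Values taken from the smoke-run log elsewhere in this thread.
wire = pack_handle(0x1, 0x4A9, 4 << 20, 0xFFFFFFC00000, 0, 0, 0, 0, 0)
assert len(wire) == 80                            # matches the stated size
assert HANDLE.unpack(wire)[3] == 0xFFFFFFC00000   # uba survives the round trip
```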
YChange01 added a commit to YChange01/sglang that referenced this pull request on Apr 9, 2026
…CMD_EXPORT
smoke_multi results from run sgl-project#3:
  [FAIL] v1-v5 EXPORT_PID (all backings) errno=95 ENOTSUP
  [PASS] v7  EXPORT size[0]=4MB, flags=0, pxm=0 → mem_id=0x1 tokenid=0x4a9 uba=0xffffffc00000
  [PASS] v8  flags=ALLOW_MMAP
  [PASS] v9  flags=FAST
  [PASS] v10 flags=ALLOW_MMAP|FAST
  [PASS] v11 fake eid
Two takeaways:
1. EXPORT_PID (kernel name export_user_addr) is ENOTSUP for user callers regardless of buffer backing (ANON/PRIVATE/memfd/LOCKED) and flags (0/FAST). This path is probably reserved for specific kernel-internal callers (qemu/KVM/DPU). We give up on it.
2. OBMM_CMD_EXPORT (kernel allocates) works with any flag combo including ALLOW_MMAP. The output uba=0xffffffc00000 looks like a 4MB-aligned UB fabric address in the top of a 48-bit addressing space — a clean handle we can pass to mmap if we find the right offset convention.
The remaining unknown: how to mmap the exported region so the local process can write into it. ioctl doesn't return a VA; mmap(2) on the /dev/obmm fd wants an offset.
smoke_export.c tries seven candidate offsets in sequence, each on a fresh EXPORT/UNEXPORT cycle:
  offset=0                     (default — probably picks by mem_id)
  offset=uba                   (raw fabric address)
  offset=uba / PAGE_SIZE       (paged fabric address)
  offset=mem_id                (handle as offset, tiny)
  offset=mem_id << 12          (paged handle)
  offset=mem_id << PAGE_SHIFT  (same as above, PAGE_SHIFT=12)
  offset=tokenid               (long shot)
For each that mmap()s successfully, we write + read a 16-word pattern to prove the VA is real and coherent. First hit wins. Once we know the offset convention, the obmm_rw.c wrapper will be rewritten to use EXPORT (not EXPORT_PID) as its primary path.
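The seven candidate offsets can be generated mechanically from the three values the EXPORT ioctl returns. The formulas below mirror the list in the commit; the sample mem_id/tokenid/uba are the ones from the run log above.

```python
PAGE_SIZE = 4096
PAGE_SHIFT = 12

def candidate_offsets(uba, mem_id, tokenid):
    """The seven mmap offset conventions to probe, in the order the commit tries them."""
    return [
        0,                        # default: kernel may pick by mem_id
        uba,                      # raw fabric address
        uba // PAGE_SIZE,         # paged fabric address
        mem_id,                   # handle as offset, tiny
        mem_id << 12,             # paged handle
        mem_id << PAGE_SHIFT,     # same as above with the symbolic shift
        tokenid,                  # long shot
    ]

offs = candidate_offsets(uba=0xFFFFFFC00000, mem_id=0x1, tokenid=0x4A9)
assert len(offs) == 7
assert offs[4] == offs[5] == 0x1000   # the two mem_id-shift variants coincide
```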
zbennett10 added a commit to WorldFlowAI/sglang that referenced this pull request on May 1, 2026
…roject#3)
Eliminates the "pool memory leak detected!" SIGQUIT that fired at the longeval->scbench dataset boundary in the 2026-04-29 A10G validation run, with evictable_size_=701881 > total=545241 (overcount of ~156k slots) and a precursor decode-batch line `#token: -129815`.
Root cause (verified against the actual leak-trigger.log, not the hand-wavy lock_ref-overlap hypothesis from skill v1.1): model_runner._correct_fuzzy_kv_rope_contiguous tried to allocate fresh pool slots AFTER match_prefix had already committed the fuzzy match (lock_ref'd the donor, set req.fuzzy_match_result, merged the donor's KV indices into the recipient's req_to_token_pool slice). When alloc failed under memory pressure, the function returned early without rolling back ANY of that state, leaving the recipient's req_to_token_pool[exact_matched_len:prefix_len] permanently pointing at the donor's slots.
cache_finished_req later read those entries from req_to_token_pool and freed them via token_to_kv_pool_allocator.free(...) — while they were still in the donor's TreeNode.value (still counted as evictable). The same physical slots ended up in BOTH the allocator's available pool AND the radix tree's evictable accounting, breaking the invariant `total = available + evictable + protected + ...`. Each subsequent failed-alloc fuzzy hit compounded the duplication.
Fix: pre-allocate the realization slots in RadixCache.match_prefix BEFORE committing any fuzzy state. If alloc fails, return exact-only with no state mutation. Pool capacity is checked at match time; _correct_fuzzy_kv_rope can no longer fail mid-request.
Surfaces:
* RadixCache.match_prefix: alloc fuzzy_matched_len slots before inc_lock_ref(donor) / fuzzy_match_result assignment / merged device_indices. On alloc==None, return an exact-only MatchResult. Skip the alloc when cached_start_pos == exact_matched_len AND segments is None (no RoPE delta needed; existing fast path).
* Req.fuzzy_realized_locs: new field carrying the pre-allocated slots tensor between match_prefix and _correct_fuzzy_kv_rope.
* model_runner._correct_fuzzy_kv_rope_contiguous: consume req.fuzzy_realized_locs instead of allocating; clear after use.
* model_runner._correct_fuzzy_kv_rope_segments: slice the pre-allocated block per segment as we iterate; free the unused tail explicitly when segments are skipped.
* RadixCache.cache_finished_req: defensive cleanup — if fuzzy_realized_locs is still set at finish time (request aborted before forward), free them so they don't leak.
* RadixCache.match_prefix: defensive dec_lock_ref of any previously-locked donor when overwriting fuzzy_donor_node (chunked-prefill / post-retraction re-entry). Closes a parallel lock_ref leak that was masked by Bug sgl-project#3.
Added diagnostic logging on the alloc result so the bench tells us whether the fix is exercised.
Validated by running existing semblend-release SGLang integration tests (44/44 pass). End-to-end bench validation pending.
Signed-off-by: Zach Bennett <zach@worldflowai.com>
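The allocate-before-commit pattern this fix describes can be sketched with a toy allocator. All names here are illustrative stand-ins, not the real SGLang API; the point is only the control flow: reserve the realization slots first, and on failure return an exact-only match with no state mutated.

```python
class ToyPool:
    """Toy slot allocator standing in for the KV pool allocator (illustrative)."""
    def __init__(self, total):
        self.available = list(range(total))

    def alloc(self, n):
        # Return None under pressure instead of partially mutating state.
        if n > len(self.available):
            return None
        slots, self.available = self.available[:n], self.available[n:]
        return slots

def match_prefix(pool, exact_len, fuzzy_len):
    """Pre-allocate realization slots BEFORE committing any fuzzy state.
    On allocation failure, fall back to exact-only with no mutation."""
    slots = pool.alloc(fuzzy_len)
    if slots is None:
        return {"matched_len": exact_len, "realized_locs": None}
    # Only now would lock_ref / fuzzy_match_result / index merging happen.
    return {"matched_len": exact_len + fuzzy_len, "realized_locs": slots}

pool = ToyPool(total=4)
assert match_prefix(pool, 2, 3)["matched_len"] == 5        # alloc succeeds
assert match_prefix(pool, 2, 3)["realized_locs"] is None   # pool exhausted: exact-only
```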
JohnQinAMD added a commit to JohnQinAMD/sglang-amd that referenced this pull request on May 3, 2026
…g is the cause
Adds SGLANG_FLASH_MLA_SHADOW_REF=<dir> hook that runs ref_sparse_attn_decode
on the SAME tensors the kernel just consumed and logs per-call cos_sim +
max_diff to a CSV. Samples 100% of single_shot calls during a live e2e run
(prior 8-capture saved-tensor test was 0.7% of production call space).
Live e2e run on chi2774 with tier0 (no cuda graph) + SGLANG_HIP_CK_V32_SINGLESHOT=1
+ shadow-ref enabled. Result across 560 production single_shot calls:
Relative diff (kernel vs torch ref):
min: 0.0000%
median: 0.2232%
mean: 0.2129%
p99: 0.4464%
max: 0.4464%
Calls with rel > 0.5%: 0/560
Calls with rel > 1.0%: 0/560
Calls with rel > 5.0%: 0/560
All 560 calls match torch's ref_sparse_attn_decode at sub-bf16-ULP
relative diff. NO single call has a catastrophic delta. Yet e2e
produces garbage tokens.
DEFINITIVE CONCLUSION on Layer-3:
The residual e2e regression is hypothesis sgl-project#1 from the user's list:
cumulative sub-bf16-ULP noise compounded across 60 layers × 30 tokens
= 1800 calls per generated sequence per worker. Each call is within
bf16 floor; the cumulative drift exceeds the model's training-time
robustness envelope.
Hypothesis sgl-project#2 (wrapper cache state corruption): RULED OUT — no outlier
calls in the 560-sample distribution.
Hypothesis sgl-project#3 (cuda graph stream ordering): RULED OUT — shadow-ref ran
in tier0 (no cuda graph) and still showed garbage e2e despite per-call
diffs being uniform and small.
Path forward (unchanged from previous commit):
(b) Model finetune with kernel's specific bit pattern — out of kernel
team scope.
(d) Accept the Layer-3 stopgap and pursue other path-to-1x-B200 levers.
Layer-3 stopgap (ca6f419) remains the production correctness fix.
The kernel + diagnostic infrastructure (now including shadow-ref) is at
its best-attainable bf16-precision-equivalent state.
This commit closes the Layer-3 root-cause investigation. The "why is it
still garbage" answer is now data-grounded: it's NOT a kernel bug, NOT
a cache bug, NOT a graph bug. It IS cumulative ULP compounding across
1800 dependent calls — fundamentally a model-tolerance issue against
the kernel's bit-equivalent-but-bit-different output bit pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
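The compounding argument in the commit above can be illustrated with a back-of-envelope random-walk model: if each of the ~1800 dependent calls contributes an independent relative error around the observed 0.21% mean, the expected cumulative drift scales as sqrt(N) times the per-call error. This is a toy estimate of why per-call diffs can be tiny while e2e output diverges, not the commit's measurement.

```python
import math

per_call_rel = 0.0021   # mean relative diff observed per call (0.21%)
calls = 60 * 30         # 60 layers x 30 tokens = 1800 dependent calls

# Random-walk model: independent per-call errors accumulate as sqrt(N) * eps.
expected_drift = math.sqrt(calls) * per_call_rel

assert calls == 1800
assert expected_drift > 0.05   # roughly 9% cumulative drift vs 0.21% per call
```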
JohnQinAMD added a commit to JohnQinAMD/sglang-amd that referenced this pull request on May 3, 2026
… default-on
Adds 5 default-on env knobs to the stacked-best preset:
1. SGLANG_FP8_PAGED_MQA_LOGITS_AITER=1 — routes c4 indexer scoring through
aiter's _gluon_deepgemm_fp8_paged_mqa_logits single-kernel path instead of
the torch fallback (GEMM + reduce + gather + multiple elementwise ops).
Per memory project_dsv4_phase23_aiter_paged_mqa.md, this knob shipped
2026-04-29 (TPOT -0.97 ms, +3.10% throughput) but the launch script was
never updated. C=4 aligned bench A/B verified 2026-05-03:
knob OFF (torch): TPOT 25.66 / total 287.96 tok/s / output 143.01 tok/s
knob ON (aiter): TPOT 24.60 / total 316.83 tok/s / output 157.35 tok/s
Recovers the documented TL;DR numbers (24.52/316.47/157.17) within ±0.13ms
noise. This single env knob accounts for the +1.06 ms TPOT regression
observed in the stacked-best preset since the doc snapshot.
2. SGLANG_FUSED_MHC_POST=1 (-0.42 ms TPOT, +1.57% throughput): replaces the
post-sinkhorn (pre*x_flat).sum(1).to(bf16) chain (3 launches) with one
Triton kernel at hc_pre@2806.
3. SGLANG_FUSED_RMSNORM_QUANT_PER1x128=1 (-0.10 ms TPOT, +0.37% throughput):
single-output rmsnorm + per-1x128 fp8 quant fusion for q_norm; only fires
for layers without indexer.
4. SGLANG_AITER_QK_RMSNORM_GROUP_QUANT=1 (-0.23 ms TPOT, +0.83% throughput):
stacked superset of sgl-project#3, fuses q_norm + kv_norm + per-128 fp8 quant on q
in ONE aiter HEAD launch with q_lora_bf16 materialised for indexer use.
5. SGLANG_FUSED_FREQS_IDX_GATHER=1 (-0.36 ms TPOT, +1.39% throughput):
B200-style page-table arithmetic absorption — fuses (seq_lens-1)//ratio*
ratio (3 launches) + freqs_cis[idx] (1 index_select + view + contig) into
one Triton kernel.
Combined with the D=512 megakernel patch and the previously-shipped fusions,
chi2774 Flash-Base FP8 hits TPOT 24.60 ms / 316.83 tok/s on the c=4 aligned
bench (40 prompts, ISL=OSL=1024, range_ratio=0.8) — matching the doc TL;DR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
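For reference, the five knobs above can be set before launch. The variable names below are verbatim from the commit; how the launch script consumes them is not shown there, so this is just a minimal sketch of turning them all on via the environment.

```python
import os

# Default-on knobs from the stacked-best preset (names verbatim from the commit).
STACKED_BEST_KNOBS = {
    "SGLANG_FP8_PAGED_MQA_LOGITS_AITER": "1",    # aiter single-kernel indexer scoring
    "SGLANG_FUSED_MHC_POST": "1",                # fused post-sinkhorn chain
    "SGLANG_FUSED_RMSNORM_QUANT_PER1x128": "1",  # rmsnorm + per-1x128 fp8 quant fusion
    "SGLANG_AITER_QK_RMSNORM_GROUP_QUANT": "1",  # q/kv norm + quant in one launch
    "SGLANG_FUSED_FREQS_IDX_GATHER": "1",        # page-table arith + freqs gather fusion
}
os.environ.update(STACKED_BEST_KNOBS)

assert all(os.environ[k] == "1" for k in STACKED_BEST_KNOBS)
```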
zbennett10 added a commit to WorldFlowAI/sglang that referenced this pull request on May 5, 2026
…ct#3 v3)
Closes the residual 4831-slot leak observed during the v5 a10g bench. v2 closed the chunked-prefill re-entry path; this closes the insert-walk-dedup path.
Root cause: match_prefix sets req.cache_protected_len = exact_matched_len + fuzzy_matched_len because at that point the fuzzy region in req_to_token_pool references the donor's slots (tree-owned; cache_finished_req must not free them). After _correct_fuzzy_kv_rope_{contiguous,segments} writes our freshly-allocated realized_locs into those positions, the slots are ours — not the donor's — but cache_protected_len was never updated. cache_finished_req's `free(kv_indices[cache_protected_len : new_prefix_len])` only reclaims slots BEYOND cache_protected_len; if the insert walk slices our realized values away (because the request's tokens incidentally duplicate an existing tree path) the orphaned realized_locs are at positions inside the protected region and leak silently.
v4 didn't surface this because the tree was small (4 hits in 30 min, no incidental token duplicates triggered the walk). v5's longer runtime grew the tree large enough that one hit's realized region matched an existing path, and exactly fuzzy_matched_len = 4831 slots vanished from accounting (sum=540410 vs total=545241).
Fix: After realization completes in either RoPE-correction path, decrement req.cache_protected_len by the count of realized positions. The protected range collapses back to just the exact-match prefix (which DOES still reference tree-owned slots). The free at cache_finished_req then correctly catches insert-time duplicates inside the realized region.
The segments path adds a safety guard: only decrement when total_realized == fuzzy_matched_len (full realization). For partial realization (segments skipped due to shape mismatch), cache_protected_len is a single integer and cannot selectively protect only the unrealized donor positions — stay conservative and accept a rare leak rather than risk freeing donor-owned slots.
Validated: 44/44 SGLang integration tests pass. End-to-end re-bench against the same a10g harness pending.
Signed-off-by: Zach Bennett <zach@worldflowai.com>
zbennett10 added a commit to WorldFlowAI/sglang that referenced this pull request on May 5, 2026
…ject#3 v4)
Closes the per-request pool-slot leak whose size equals the fuzzy_matched_len of each multi-segment hit.
Root cause: In the multi-segment fuzzy path, ``RadixCache.match_prefix`` returns ``device_indices`` covering only the exact-match prefix (no fuzzy slot list, because target positions are scattered and the contiguous-prefix model can't represent them). The scheduler therefore sets ``req.prefix_indices`` to length ``exact_matched_len`` and treats the fuzzy region as part of the extend window. ``alloc_for_extend`` then allocates fresh pool slots for the entire ``[exact_matched_len, total)`` span — including every position that the segments path is about to overwrite — and writes those slots into ``req_to_token_pool[req_idx, exact_matched_len:total)``.
``_correct_fuzzy_kv_rope_segments`` runs at the start of ``forward_extend`` and overwrites ``req_to_token_pool[req_idx, target_positions]`` with the realized slots produced by the donor-KV copy. The extend slots that previously occupied those positions are no longer referenced by any request or tree node, but they were allocated from the pool, so the allocator's ``available`` counter stays decremented while the slots never appear in any TreeNode's evictable value. The pool invariant ``total = available + evictable + protected + session_held + uncached`` drifts by exactly ``len(target_positions) == fuzzy_matched_len`` per multi-segment hit. The leak detector at ``on_idle`` eventually fires.
Fix: Before overwriting ``req_to_token_pool[req_idx, target_positions]`` with ``new_locs``, read the displaced indices from those positions and free them back to the allocator. The freed slots return to ``available``; the realized slots are written into the pool slice; ``cache_unfinished_req``'s subsequent insert places them into a new TreeNode, where they are correctly tracked as evictable.
Within a single batch, ``alloc_for_extend`` runs once at batch preparation time. ``forward_extend`` then writes K/V to the slots named in ``out_cache_loc``, which still includes the just-freed slots; that K/V is harmlessly discarded because the slots are no longer referenced by ``req_to_token_pool``. No concurrent allocator operation can hand the slots to another request before the batch finishes, so the early free is safe.
The contiguous (single-segment / TokenBlockMatch) path is unaffected: ``device_indices`` already includes the donor's KV indices, so ``alloc_for_extend`` skips the fuzzy region entirely and the overwrite there displaces donor-owned (tree-protected) slots, not request-owned extend slots.
Also reverts the v3 ``cache_protected_len -= num_fuzzy`` decrement in both contiguous and segments paths. v3 was a no-op against the actual leak: ``cache_unfinished_req`` resets ``cache_protected_len`` to ``len(new_indices)`` after every extend, undoing the decrement before ``cache_finished_req`` runs. The diagnostic [FUZZY DBG] logs added alongside v3 are removed for the same reason.
Validated: 44/44 SemBlend SGLang adapter tests pass. End-to-end re-bench against the a10g harness pending.
Signed-off-by: Zach Bennett <zach@worldflowai.com>
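The free-before-overwrite step this fix describes can be sketched against a toy pool. Names are illustrative, not the real SGLang API; the invariant checked at the end mirrors the accounting the commit describes (referenced slots plus the allocator's available counter must equal the pool total).

```python
class ToyAllocator:
    """Toy pool tracking only the available counter that the leak corrupts."""
    def __init__(self, total):
        self.total = total
        self.available = total
        self._next = 0

    def alloc(self, n):
        self.available -= n
        slots = list(range(self._next, self._next + n))
        self._next += n
        return slots

    def free(self, slots):
        self.available += len(slots)

def realize_segments(alloc, req_to_token, target_positions, new_locs):
    """Free the displaced extend slots BEFORE overwriting with realized slots,
    so the allocator's available counter stays consistent."""
    displaced = [req_to_token[p] for p in target_positions]
    alloc.free(displaced)                       # v4 fix: reclaim before overwrite
    for p, loc in zip(target_positions, new_locs):
        req_to_token[p] = loc

alloc = ToyAllocator(total=10)
row = alloc.alloc(6)        # alloc_for_extend covered the whole extend window
new_locs = alloc.alloc(2)   # realized slots from the donor-KV copy
realize_segments(alloc, row, [3, 4], new_locs)
# 8 slots allocated, 2 displaced ones freed back: 6 referenced + 4 available == 10
assert alloc.available == 4
```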
nvyutwu added a commit to nvyutwu/sglang that referenced this pull request on May 7, 2026
Two fixes to the existing Hook C (NSA forward_decode first-read
fingerprint), both forced by the post-Arm-B retract→resume framing.
Hook C-narrow (M6 promoted from REVIEW_DEFERRED → fixed):
* New _emit_topk_read_fingerprints(forward_batch, layer, physical_pages):
fingerprints only the unique pages NSA's indexer chose, dedup'd
via _FP_SEEN_PAGES_TOPK, emitted with is_topk: 1.
* Wired into forward_decode for both NSA backends:
- trtllm path: resolve position-indices → physical pages by
gathering metadata.page_table_1[:, topk_indices.clamp(min=0)]
before the kernel call (the kernel does this internally; we
replicate the lookup so the fp namespace matches Hook A/B/C
events).
- non-trtllm paths (flashmla_sparse / flashmla_kv / tilelang /
fa3 / aiter): call after page_table_1 = transform_index_…
(that result is already physical pages).
* Skipped during cuda-graph capture (same guard as superset hook).
* The original superset Hook C (is_topk: 0) is retained — superset
catches H1' / H2' (B≠C on any page in req_to_token); narrow catches
H6' (B=C on superset but B≠C-narrow on the topk-selected subset).
Why narrow matters now: Arm B confirmed retract→resume is the trigger.
SGLang's offload/load path covers the MLA kv_buffer but NOT the NSA
index_k_with_scale_buffer (NSATokenToKVPool inherits without override).
So post-resume the NSA index is stale → topk picks wrong-but-legitimate
pages → surrounding pages still match recv-fp → superset hook is blind.
Narrow hook fingerprints exactly the topk-chosen subset.
req_to_token slice fp (Fix sgl-project#3):
* Add rti_fp = blake2b-8(req_to_token[rpi, :seq_len].bytes) to every
superset-Hook-C read event. Lets the post-processor detect when
the indirection table itself changes between reads (resume rewrites
the row) and disambiguates H1' (block contents wrong, rti stable)
from H2' (block contents fine, rti rewritten).
REVIEW_DEFERRED.md updated:
* New 2026-05-06 banner explaining the Arm B retract→resume framing.
* M6 marked PROMOTED TO FIXED with implementation notes.
* forward_extend not yet wired with narrow hook — extends are short
and the H6'-via-retract surface is decode-side; flagged as a
follow-on if a trace ever localizes there.
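The rti_fp field described above (blake2b-8 of the req_to_token slice) can be computed with the standard library. A sketch assuming the row is a plain sequence of int32 indices; the serialization format is an assumption, since the commit only gives `rti_fp = blake2b-8(req_to_token[rpi, :seq_len].bytes)`.

```python
import hashlib
import struct

def rti_fp(req_to_token_row, seq_len):
    """blake2b-8 fingerprint of the first seq_len indirection entries,
    letting the post-processor detect a rewritten row between reads."""
    raw = struct.pack(f"<{seq_len}i", *req_to_token_row[:seq_len])
    return hashlib.blake2b(raw, digest_size=8).hexdigest()

row = [7, 7, 9, 12, 0, 0]
fp_before = rti_fp(row, 4)
row[2] = 99                           # a resume rewrites the indirection row
assert fp_before != rti_fp(row, 4)    # H2' now detectable: rti changed between reads
assert len(fp_before) == 16           # 8-byte digest, hex-encoded
```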
No description provided.