
Add install with pip #3

Merged

merrymercy merged 1 commit into main from publish on Jan 9, 2024

Conversation

@merrymercy
Contributor

No description provided.

@merrymercy merrymercy merged commit 30720e7 into main Jan 9, 2024
@merrymercy merrymercy deleted the publish branch January 9, 2024 20:43
@Rookie-Kai Rookie-Kai mentioned this pull request Aug 14, 2024
Ying1123 pushed a commit that referenced this pull request Sep 13, 2024
* update layout

* bug fix
kbumsik referenced this pull request in DeepAuto-AI/sglang Jan 23, 2025
zcnrex pushed a commit to zcnrex/sglang that referenced this pull request Mar 5, 2025
…riton

Optimize Triton Kernel of Group GEMM
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025
cnwenf pushed a commit to cnwenf/sglang that referenced this pull request Apr 10, 2025
NorthmanPKU pushed a commit to NorthmanPKU/sglang that referenced this pull request May 16, 2025
[LoRA demo] Add the checkpoint file for the lora demo
chunyuan-w referenced this pull request in chunyuan-w/sglang May 28, 2025
wangqian108 pushed a commit to wangqian108/sglang that referenced this pull request Jul 17, 2025
0xtoward pushed a commit to 0xtoward/sglang that referenced this pull request Mar 25, 2026
[Fix] minicpm flashinfer backend: fix begin_forward args
apinge added a commit to apinge/sglang that referenced this pull request Mar 31, 2026
* apply aiter.topk_softmax to dev branch

Signed-off-by: apinge <tong.qiu2@amd.com>

* remove white line

Signed-off-by: apinge <tong.qiu2@amd.com>

---------

Signed-off-by: apinge <tong.qiu2@amd.com>
mmangkad pushed a commit to mmangkad-dev/sglang that referenced this pull request Apr 3, 2026
wisclmy0611 pushed a commit that referenced this pull request Apr 7, 2026
feat: added xpu page and npu installation support page under hardware platforms
YChange01 added a commit to YChange01/sglang that referenced this pull request Apr 9, 2026
New cross-node load/store path that bypasses the closed libubsm_sdk.so
entirely and talks to the GPL kernel UAPI /usr/include/ub/obmm.h
directly via ioctls on /dev/obmm.

Background
----------

Scheme 6 (via libubsm_sdk.so) got stuck on daemon-internal error 800
returned from ubsmem_shmem_allocate. The SDK is a binary blob so we
couldn't see what it was actually sending to the kernel. Two earlier
fixes (4MB alignment, real cluster hostnames) were both correct in
isolation but did not unblock 800 — after three iterations it became
clear that the region-based allocation path in the current SDK build
is either broken or requires cluster-side configuration we can't see.

Scheme 7 side-steps the problem by calling the kernel UAPI directly.
obmm.h is 186 lines, GPL-2.0+, and documents exactly the export /
import / unimport / unexport ioctls we need. Corresponding kernel
source lives in openEuler/kernel (OLK-6.6, migrated to AtomGit).

What's in this commit
---------------------

benchmark/engram/scheme7_obmm/
  obmm_rw.h/c  — thin wrapper with four entry points:
                   obmm_rw_open/close
                   obmm_rw_export / obmm_rw_unexport
                   obmm_rw_import / obmm_rw_unimport
                 Plus an 80-byte packed handle struct that carries
                 (mem_id, tokenid, length, uba, seid, deid, scna,
                 pxm_numa, base_dist) across TCP for the cross-node
                 variant that will come next.
  smoke_test.c — single-node loopback:
                   1. open /dev/obmm
                   2. mmap 4 MB anonymous buffer
                   3. write 1 KB pattern
                   4. EXPORT_PID with flags=ALLOW_MMAP
                   5. IMPORT back with flags=ALLOW_MMAP
                   6. read pattern through the imported VA, verify
                   7. quick loopback load-latency bench
                 All seid/deid left zero for the simplest first call.
  Makefile     — plain gcc, no link to libubsm_sdk.so (we verify via
                 ldd that nothing sdk-related sneaks in).
  README.md    — architecture diagram, how to run, expected output,
                 and a "what can go wrong" table tied to each likely
                 EINVAL / EPERM / ENOENT failure mode.

This is Task sgl-project#2 of a 5-task scheme7 plan tracked in the session.
Next tasks:
  sgl-project#3 extend to cross-node via TCP handle exchange
  sgl-project#4 bench scheme7 vs scheme5
  sgl-project#5 integrate winner into SGLang Engram prefetcher

The deferred kernel URMA_SEG_MAPPED patch is documented in
memory/project_kernel_urma_mapped_stretch.md and will be revisited
later as an independent upstream-contribution track — it answers a
different question from scheme7 (API unification, not hardware
capability).
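The 80-byte packed handle described above maps naturally onto Python's struct module. A minimal sketch, assuming a layout of nine u64 fields plus padding — the field names come from the commit, but the widths and order are assumptions, not the real obmm_rw.h definition:

```python
import struct

# Hypothetical wire layout: 9 x u64 + 8 pad bytes = 80 bytes packed.
# Field ORDER follows the commit text; widths are an assumption.
HANDLE_FMT = "<9Q8x"
FIELDS = ("mem_id", "tokenid", "length", "uba",
          "seid", "deid", "scna", "pxm_numa", "base_dist")

def pack_handle(**kw):
    """Serialize a handle for the TCP exchange in the cross-node variant."""
    return struct.pack(HANDLE_FMT, *(kw[f] for f in FIELDS))

def unpack_handle(buf):
    """Deserialize an 80-byte handle back into named fields."""
    return dict(zip(FIELDS, struct.unpack(HANDLE_FMT, buf)))
```

A fixed-size packed struct like this keeps the TCP exchange trivially framed: the receiver reads exactly 80 bytes per handle with no length prefix needed.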
YChange01 added a commit to YChange01/sglang that referenced this pull request Apr 9, 2026
…CMD_EXPORT

smoke_multi results from run sgl-project#3:

  [FAIL] v1-v5 EXPORT_PID (all backings)  errno=95 ENOTSUP
  [PASS] v7 EXPORT size[0]=4MB, flags=0, pxm=0
         → mem_id=0x1 tokenid=0x4a9 uba=0xffffffc00000
  [PASS] v8  flags=ALLOW_MMAP
  [PASS] v9  flags=FAST
  [PASS] v10 flags=ALLOW_MMAP|FAST
  [PASS] v11 fake eid

Two takeaways:

1. EXPORT_PID (kernel name export_user_addr) is ENOTSUP for user
   callers regardless of buffer backing (ANON/PRIVATE/memfd/LOCKED)
   and flags (0/FAST). This path is probably reserved for specific
   kernel-internal callers (qemu/KVM/DPU). We give up on it.

2. OBMM_CMD_EXPORT (kernel allocates) works with any flag combo
   including ALLOW_MMAP. The output uba=0xffffffc00000 looks like a
   4MB-aligned UB fabric address in the top of a 48-bit addressing
   space — a clean handle we can pass to mmap if we find the right
   offset convention.

The remaining unknown: how to mmap the exported region so the local
process can write into it. ioctl doesn't return a VA; mmap(2) on the
/dev/obmm fd wants an offset. smoke_export.c tries seven candidate
offsets in sequence, each on a fresh EXPORT/UNEXPORT cycle:

    offset=0                  (default — probably picks by mem_id)
    offset=uba                (raw fabric address)
    offset=uba / PAGE_SIZE    (paged fabric address)
    offset=mem_id             (handle as offset, tiny)
    offset=mem_id << 12       (paged handle)
    offset=mem_id << PAGE_SHIFT   (same as above, PAGE_SHIFT=12)
    offset=tokenid            (long shot)

For each that mmap()s successfully, we write + read a 16-word pattern
to prove the VA is real and coherent. First hit wins.
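Assuming the probe is simply a loop over these conventions, the candidate list can be sketched as follows (the real smoke_export.c mmaps the /dev/obmm fd at each offset; the example values below are from the run log above):

```python
PAGE_SIZE = 4096
PAGE_SHIFT = 12

def candidate_offsets(uba, mem_id, tokenid):
    # The seven mmap offset conventions probed in sequence, each on a
    # fresh EXPORT/UNEXPORT cycle; first successful mmap + write/read wins.
    return [
        0,                        # default -- driver may select by mem_id
        uba,                      # raw fabric address
        uba // PAGE_SIZE,         # paged fabric address
        mem_id,                   # handle as offset (tiny)
        mem_id << 12,             # paged handle
        mem_id << PAGE_SHIFT,     # same as above, PAGE_SHIFT = 12
        tokenid,                  # long shot
    ]
```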

Once we know the offset convention, the obmm_rw.c wrapper will be
rewritten to use EXPORT (not EXPORT_PID) as its primary path.
zbennett10 added a commit to WorldFlowAI/sglang that referenced this pull request May 1, 2026
…roject#3)

Eliminates the "pool memory leak detected!" SIGQUIT that fired at the
longeval->scbench dataset boundary in the 2026-04-29 A10G validation
run, with evictable_size_=701881 > total=545241 (overcount of ~156k
slots) and a precursor decode-batch line `#token: -129815`.

Root cause (verified against the actual leak-trigger.log, not the
hand-wavy lock_ref-overlap hypothesis from skill v1.1):

  model_runner._correct_fuzzy_kv_rope_contiguous tried to allocate
  fresh pool slots AFTER match_prefix had already committed the
  fuzzy match (lock_ref'd the donor, set req.fuzzy_match_result,
  merged donor's KV indices into the recipient's req_to_token_pool
  slice). When alloc failed under memory pressure the function
  returned early without rolling back ANY of that state, leaving
  the recipient's req_to_token_pool[exact_matched_len:prefix_len]
  permanently pointing at the donor's slots. cache_finished_req
  later read those entries from req_to_token_pool and freed them
  via token_to_kv_pool_allocator.free(...) — while they were still
  in the donor's TreeNode.value (still counted as evictable).
  Same physical slots ended up in BOTH the allocator's available
  pool AND the radix tree's evictable accounting, breaking the
  invariant `total = available + evictable + protected + ...`.
  Each subsequent failed-alloc fuzzy hit compounded the duplication.

Fix: pre-allocate the realization slots in RadixCache.match_prefix
BEFORE committing any fuzzy state. If alloc fails, return exact-only
with no state mutation. Pool capacity is checked at match time;
_correct_fuzzy_kv_rope can no longer fail mid-request.
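The shape of the fix — allocate first, commit state only on success — can be sketched with a toy allocator. All names here are illustrative, not the real RadixCache or allocator API:

```python
class ToyAllocator:
    def __init__(self, capacity):
        self.free_slots = list(range(capacity))

    def alloc(self, n):
        if n > len(self.free_slots):
            return None                      # memory pressure: alloc fails
        out, self.free_slots = self.free_slots[:n], self.free_slots[n:]
        return out

def match_prefix(alloc, exact_len, fuzzy_len):
    # Pre-allocate the realization slots BEFORE any fuzzy state is
    # committed; on failure return exact-only with nothing to roll back.
    slots = alloc.alloc(fuzzy_len)
    if slots is None:
        return exact_len, None               # exact-only result, no mutation
    # ...only now would the donor be lock_ref'd and fuzzy state recorded...
    return exact_len + fuzzy_len, slots
```

Because the failure branch runs before any mutation, there is no partially-committed state to unwind, which is exactly what the original early return could not guarantee.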

Surfaces:

  * RadixCache.match_prefix: alloc fuzzy_matched_len slots before
    inc_lock_ref(donor) / fuzzy_match_result assignment / merged
    device_indices. On alloc==None, return exact-only MatchResult.
    Skip the alloc when cached_start_pos == exact_matched_len AND
    segments is None (no RoPE delta needed; existing fast-path).
  * Req.fuzzy_realized_locs: new field carrying the pre-allocated
    slots tensor between match_prefix and _correct_fuzzy_kv_rope.
  * model_runner._correct_fuzzy_kv_rope_contiguous: consume
    req.fuzzy_realized_locs instead of allocating; clear after use.
  * model_runner._correct_fuzzy_kv_rope_segments: slice the
    pre-allocated block per segment as we iterate; free unused
    tail explicitly when segments are skipped.
  * RadixCache.cache_finished_req: defensive cleanup — if
    fuzzy_realized_locs is still set at finish time (request
    aborted before forward), free them so they don't leak.
  * RadixCache.match_prefix: defensive dec_lock_ref of any
    previously-locked donor when overwriting fuzzy_donor_node
    (chunked-prefill / post-retraction re-entry). Closes a
    parallel lock_ref leak that was masked by Bug sgl-project#3.

Added diagnostic logging on alloc result so the bench tells us
whether the fix is exercised.

Validated by running existing semblend-release SGLang integration
tests (44/44 pass). End-to-end bench validation pending.

Signed-off-by: Zach Bennett <zach@worldflowai.com>
JohnQinAMD added a commit to JohnQinAMD/sglang-amd that referenced this pull request May 3, 2026
…g is the cause

Adds SGLANG_FLASH_MLA_SHADOW_REF=<dir> hook that runs ref_sparse_attn_decode
on the SAME tensors the kernel just consumed and logs per-call cos_sim +
max_diff to a CSV. Samples 100% of single_shot calls during a live e2e run
(prior 8-capture saved-tensor test was 0.7% of production call space).

Live e2e run on chi2774 with tier0 (no cuda graph) + SGLANG_HIP_CK_V32_SINGLESHOT=1
+ shadow-ref enabled. Result across 560 production single_shot calls:

  Relative diff (kernel vs torch ref):
    min:    0.0000%
    median: 0.2232%
    mean:   0.2129%
    p99:    0.4464%
    max:    0.4464%
  Calls with rel > 0.5%: 0/560
  Calls with rel > 1.0%: 0/560
  Calls with rel > 5.0%: 0/560

All 560 calls match torch's ref_sparse_attn_decode at sub-bf16-ULP
relative diff. NO single call has a catastrophic delta. Yet e2e
produces garbage tokens.

DEFINITIVE CONCLUSION on Layer-3:
  The residual e2e regression is hypothesis sgl-project#1 from the user's list:
  cumulative sub-bf16-ULP noise compounded across 60 layers × 30 tokens
  = 1800 calls per generated sequence per worker. Each call is within
  bf16 floor; the cumulative drift exceeds the model's training-time
  robustness envelope.

  Hypothesis sgl-project#2 (wrapper cache state corruption): RULED OUT — no outlier
  calls in the 560-sample distribution.
  Hypothesis sgl-project#3 (cuda graph stream ordering): RULED OUT — shadow-ref ran
  in tier0 (no cuda graph) and still showed garbage e2e despite per-call
  diffs being uniform and small.

Path forward (unchanged from previous commit):
  (b) Model finetune with kernel's specific bit pattern — out of kernel
      team scope.
  (d) Accept the Layer-3 stopgap and pursue other path-to-1x-B200 levers.

Layer-3 stopgap (ca6f419) remains the production correctness fix.
The kernel + diagnostic infrastructure (now including shadow-ref) is at
its best-attainable bf16-precision-equivalent state.

This commit closes the Layer-3 root-cause investigation. The "why is it
still garbage" answer is now data-grounded: it's NOT a kernel bug, NOT
a cache bug, NOT a graph bug. It IS cumulative ULP compounding across
1800 dependent calls — fundamentally a model-tolerance issue against
the kernel's bit-equivalent-but-bit-different output bit pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JohnQinAMD added a commit to JohnQinAMD/sglang-amd that referenced this pull request May 3, 2026
… default-on

Adds 5 default-on env knobs to the stacked-best preset:

1. SGLANG_FP8_PAGED_MQA_LOGITS_AITER=1 — routes c4 indexer scoring through
   aiter's _gluon_deepgemm_fp8_paged_mqa_logits single-kernel path instead of
   the torch fallback (GEMM + reduce + gather + multiple elementwise ops).
   Per memory project_dsv4_phase23_aiter_paged_mqa.md, this knob shipped
   2026-04-29 (TPOT -0.97 ms, +3.10% throughput) but the launch script was
   never updated. C=4 aligned bench A/B verified 2026-05-03:
     knob OFF (torch): TPOT 25.66 / total 287.96 tok/s / output 143.01 tok/s
     knob ON (aiter):  TPOT 24.60 / total 316.83 tok/s / output 157.35 tok/s
   Recovers the documented TL;DR numbers (24.52/316.47/157.17) within ±0.13ms
   noise. This single env knob accounts for the +1.06 ms TPOT regression
   observed in the stacked-best preset since the doc snapshot.

2. SGLANG_FUSED_MHC_POST=1 (-0.42 ms TPOT, +1.57% throughput): replaces the
   post-sinkhorn (pre*x_flat).sum(1).to(bf16) chain (3 launches) with one
   Triton kernel at hc_pre@2806.

3. SGLANG_FUSED_RMSNORM_QUANT_PER1x128=1 (-0.10 ms TPOT, +0.37% throughput):
   single-output rmsnorm + per-1x128 fp8 quant fusion for q_norm; only fires
   for layers without indexer.

4. SGLANG_AITER_QK_RMSNORM_GROUP_QUANT=1 (-0.23 ms TPOT, +0.83% throughput):
   stacked superset of sgl-project#3, fuses q_norm + kv_norm + per-128 fp8 quant on q
   in ONE aiter HEAD launch with q_lora_bf16 materialised for indexer use.

5. SGLANG_FUSED_FREQS_IDX_GATHER=1 (-0.36 ms TPOT, +1.39% throughput):
   B200-style page-table arithmetic absorption — fuses (seq_lens-1)//ratio*
   ratio (3 launches) + freqs_cis[idx] (1 index_select + view + contig) into
   one Triton kernel.

Combined with the D=512 megakernel patch and the previously-shipped fusions,
chi2774 Flash-Base FP8 hits TPOT 24.60 ms / 316.83 tok/s on the c=4 aligned
bench (40 prompts, ISL=OSL=1024, range_ratio=0.8) — matching the doc TL;DR.
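A minimal sketch of the gating pattern, with the knob names taken from the commit. Treating an unset knob as enabled is an assumption about how the preset behaves; the real checks live inside sglang's model-runner and attention code paths:

```python
import os

# The five knobs added to the stacked-best preset, per the commit above.
PRESET_KNOBS = [
    "SGLANG_FP8_PAGED_MQA_LOGITS_AITER",
    "SGLANG_FUSED_MHC_POST",
    "SGLANG_FUSED_RMSNORM_QUANT_PER1x128",
    "SGLANG_AITER_QK_RMSNORM_GROUP_QUANT",
    "SGLANG_FUSED_FREQS_IDX_GATHER",
]

def enabled(knob, default="1"):
    # default-on (assumed): an unset knob counts as enabled;
    # exporting KNOB=0 opts out of that fusion.
    return os.environ.get(knob, default) == "1"
```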

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zbennett10 added a commit to WorldFlowAI/sglang that referenced this pull request May 5, 2026
…ct#3 v3)

Closes the residual 4831-slot leak observed during the v5 a10g
bench. v2 closed the chunked-prefill re-entry path; this closes
the insert-walk-dedup path.

Root cause:

  match_prefix sets req.cache_protected_len = exact_matched_len +
  fuzzy_matched_len because at that point the fuzzy region in
  req_to_token_pool references the donor's slots (tree-owned;
  cache_finished_req must not free them). After
  _correct_fuzzy_kv_rope_{contiguous,segments} writes our freshly-
  allocated realized_locs into those positions, the slots are
  ours — not the donor's — but cache_protected_len was never
  updated. cache_finished_req's
    `free(kv_indices[cache_protected_len : new_prefix_len])`
  only reclaims slots BEYOND cache_protected_len; if the insert
  walk slices our realized values away (because the request's
  tokens incidentally duplicate an existing tree path) the
  orphaned realized_locs are at positions inside the protected
  region and leak silently.

  v4 didn't surface this because the tree was small (4 hits in
  30 min, no incidental token duplicates triggered the walk).
  v5's longer runtime grew the tree large enough that one hit's
  realized region matched an existing path, and exactly
  fuzzy_matched_len = 4831 slots vanished from accounting
  (sum=540410 vs total=545241).

Fix:

  After realization completes in either RoPE-correction path,
  decrement req.cache_protected_len by the count of realized
  positions. The protected range collapses back to just the
  exact-match prefix (which DOES still reference tree-owned
  slots). The free at cache_finished_req then correctly catches
  insert-time duplicates inside the realized region.

  Segments path adds a safety guard: only decrement when
  total_realized == fuzzy_matched_len (full realization). For
  partial realization (segments skipped due to shape mismatch),
  cache_protected_len is a single integer and cannot
  selectively protect only the unrealized donor positions —
  stay conservative and accept a rare leak rather than risk
  freeing donor-owned slots.

Validated: 44/44 SGLang integration tests pass. End-to-end
re-bench against the same a10g harness pending.

Signed-off-by: Zach Bennett <zach@worldflowai.com>
zbennett10 added a commit to WorldFlowAI/sglang that referenced this pull request May 5, 2026
…ject#3 v4)

Closes the per-request pool-slot leak whose size equals the
fuzzy_matched_len of each multi-segment hit.

Root cause:

  In the multi-segment fuzzy path, ``RadixCache.match_prefix`` returns
  ``device_indices`` covering only the exact-match prefix (no fuzzy
  slot list, because target positions are scattered and the contiguous-
  prefix model can't represent them). The scheduler therefore sets
  ``req.prefix_indices`` to length ``exact_matched_len`` and treats the
  fuzzy region as part of the extend window. ``alloc_for_extend`` then
  allocates fresh pool slots for the entire ``[exact_matched_len, total)``
  span — including every position that the segments path is about to
  overwrite — and writes those slots into
  ``req_to_token_pool[req_idx, exact_matched_len:total)``.

  ``_correct_fuzzy_kv_rope_segments`` runs at the start of
  ``forward_extend`` and overwrites
  ``req_to_token_pool[req_idx, target_positions]`` with the realized
  slots produced by the donor-KV copy. The extend slots that previously
  occupied those positions are no longer referenced by any request or
  tree node, but they were allocated from the pool, so the allocator's
  ``available`` counter stays decremented while the slots never appear
  in any TreeNode's evictable value. The pool invariant
  ``total = available + evictable + protected + session_held + uncached``
  drifts by exactly ``len(target_positions) == fuzzy_matched_len`` per
  multi-segment hit. The leak detector at ``on_idle`` eventually fires.
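The accounting invariant the leak detector enforces can be written down directly as a drift check (a sketch; the component names come from the commit text):

```python
def pool_drift(total, available, evictable, protected,
               session_held=0, uncached=0):
    # Positive drift: slots vanished from accounting (a leak, as here).
    # Negative drift: slots double-counted, e.g. present in both the
    # allocator's `available` pool and a TreeNode's evictable value.
    return total - (available + evictable + protected
                    + session_held + uncached)
```

Plugging in the v5 run's numbers (components summing to 540410 against total=545241) reproduces the 4831-slot drift.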

Fix:

  Before overwriting ``req_to_token_pool[req_idx, target_positions]``
  with ``new_locs``, read the displaced indices from those positions
  and free them back to the allocator. The freed slots return to
  ``available``; the realized slots are written into the pool slice;
  ``cache_unfinished_req``'s subsequent insert places them into a new
  TreeNode, where they are correctly tracked as evictable.

  Within a single batch, ``alloc_for_extend`` runs once at batch
  preparation time. ``forward_extend`` then writes K/V to the slots
  named in ``out_cache_loc``, which still includes the just-freed
  slots; that K/V is harmlessly discarded because the slots are no
  longer referenced by ``req_to_token_pool``. No concurrent allocator
  operation can hand the slots to another request before the batch
  finishes, so the early free is safe.

  The contiguous (single-segment / TokenBlockMatch) path is unaffected:
  ``device_indices`` already includes the donor's KV indices, so
  ``alloc_for_extend`` skips the fuzzy region entirely and the
  overwrite there displaces donor-owned (tree-protected) slots, not
  request-owned extend slots.
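The free-before-overwrite step at the heart of the fix could be sketched as below. The names are illustrative; the real code operates on device tensors and frees through the token_to_kv_pool_allocator:

```python
def realize_targets(req_to_token_row, target_positions, new_locs, free_fn):
    # Read the extend slots currently occupying the fuzzy target
    # positions, return them to the allocator, then write the realized
    # slots in their place so nothing is orphaned.
    displaced = [req_to_token_row[p] for p in target_positions]
    free_fn(displaced)                        # back to `available`
    for p, loc in zip(target_positions, new_locs):
        req_to_token_row[p] = loc
    return displaced
```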

Also reverts the v3 ``cache_protected_len -= num_fuzzy`` decrement in
both contiguous and segments paths. v3 was a no-op against the actual
leak: ``cache_unfinished_req`` resets ``cache_protected_len`` to
``len(new_indices)`` after every extend, undoing the decrement before
``cache_finished_req`` runs. The diagnostic [FUZZY DBG] logs added
alongside v3 are removed for the same reason.

Validated: 44/44 SemBlend SGLang adapter tests pass. End-to-end re-bench
against the a10g harness pending.

Signed-off-by: Zach Bennett <zach@worldflowai.com>
nvyutwu added a commit to nvyutwu/sglang that referenced this pull request May 7, 2026
Two fixes to the existing Hook C (NSA forward_decode first-read
fingerprint), both forced by the post-Arm-B retract→resume framing.

Hook C-narrow (M6 promoted from REVIEW_DEFERRED → fixed):
  * New _emit_topk_read_fingerprints(forward_batch, layer, physical_pages):
    fingerprints only the unique pages NSA's indexer chose, dedup'd
    via _FP_SEEN_PAGES_TOPK, emitted with is_topk: 1.
  * Wired into forward_decode for both NSA backends:
      - trtllm path: resolve position-indices → physical pages by
        gathering metadata.page_table_1[:, topk_indices.clamp(min=0)]
        before the kernel call (the kernel does this internally; we
        replicate the lookup so the fp namespace matches Hook A/B/C
        events).
      - non-trtllm paths (flashmla_sparse / flashmla_kv / tilelang /
        fa3 / aiter): call after page_table_1 = transform_index_…
        (that result is already physical pages).
  * Skipped during cuda-graph capture (same guard as superset hook).
  * The original superset Hook C (is_topk: 0) is retained — superset
    catches H1' / H2' (B≠C on any page in req_to_token); narrow catches
    H6' (B=C on superset but B≠C-narrow on the topk-selected subset).

Why narrow matters now: Arm B confirmed retract→resume is the trigger.
SGLang's offload/load path covers the MLA kv_buffer but NOT the NSA
index_k_with_scale_buffer (NSATokenToKVPool inherits without override).
So post-resume the NSA index is stale → topk picks wrong-but-legitimate
pages → surrounding pages still match recv-fp → superset hook is blind.
Narrow hook fingerprints exactly the topk-chosen subset.

req_to_token slice fp (Fix sgl-project#3):
  * Add rti_fp = blake2b-8(req_to_token[rpi, :seq_len].bytes) to every
    superset-Hook-C read event. Lets the post-processor detect when
    the indirection table itself changes between reads (resume rewrites
    the row) and disambiguates H1' (block contents wrong, rti stable)
    from H2' (block contents fine, rti rewritten).

REVIEW_DEFERRED.md updated:
  * New 2026-05-06 banner explaining the Arm B retract→resume framing.
  * M6 marked PROMOTED TO FIXED with implementation notes.
  * forward_extend not yet wired with narrow hook — extends are short
    and the H6'-via-retract surface is decode-side; flagged as a
    follow-on if a trace ever localizes there.