
Update Readme #11

Merged
merrymercy merged 3 commits into main from fix on Jan 16, 2024

Conversation

@merrymercy
Contributor

No description provided.

merrymercy merged this pull request into main on Jan 16, 2024.
merrymercy deleted the fix branch on January 16, 2024 at 10:46.
merrymercy added a commit that referenced this pull request Jan 16, 2024
Ying1123 pushed a commit that referenced this pull request Sep 13, 2024
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request Mar 14, 2025
* switch to weight_packed_linear if cpu_has_amx_support

* add self.use_intel_amx_backend
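
The two bullets describe gating a linear layer on CPU capability and caching that choice on the module. A minimal sketch of the pattern, assuming a Linux-style /proc/cpuinfo probe and a placeholder matmul for the packed-weight kernel (only the names weight_packed_linear and use_intel_amx_backend come from the commit message; everything else here is illustrative):

```python
# Illustrative sketch only, not the actual sglang code.
import pathlib

import torch


def cpu_has_amx_support() -> bool:
    # Assumption: AMX is advertised as the "amx_tile" CPU flag on Linux;
    # the real sglang probe may differ.
    try:
        return "amx_tile" in pathlib.Path("/proc/cpuinfo").read_text()
    except OSError:
        return False


def weight_packed_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Placeholder for the AMX packed-weight kernel; functionally a plain
    # matmul in this sketch.
    return x @ w.t()


class AmxAwareLinear(torch.nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features))
        # Cache the backend decision once, as in the commit.
        self.use_intel_amx_backend = cpu_has_amx_support()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_intel_amx_backend:
            return weight_packed_linear(x, self.weight)
        return torch.nn.functional.linear(x, self.weight)
```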
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request Mar 14, 2025
* switch to weight_packed_linear if cpu_has_amx_support

* add self.use_intel_amx_backend
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request Mar 14, 2025
* switch to weight_packed_linear if cpu_has_amx_support

* add self.use_intel_amx_backend
yanbing-j pushed a commit to yanbing-j/sglang that referenced this pull request Mar 18, 2025
* switch to weight_packed_linear if cpu_has_amx_support

* add self.use_intel_amx_backend
NorthmanPKU added a commit to NorthmanPKU/sglang that referenced this pull request May 16, 2025
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request May 27, 2025
* switch to weight_packed_linear if cpu_has_amx_support

* add self.use_intel_amx_backend
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request May 28, 2025
* switch to weight_packed_linear if cpu_has_amx_support

* add self.use_intel_amx_backend
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request May 28, 2025
* switch to weight_packed_linear if cpu_has_amx_support

* add self.use_intel_amx_backend
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request Jun 3, 2025
* switch to weight_packed_linear if cpu_has_amx_support

* add self.use_intel_amx_backend
sleepcoo pushed a commit to shuaills/sglang that referenced this pull request Jun 24, 2025
siuhunh pushed a commit to xing-wenjin/sglang that referenced this pull request Jul 23, 2025
[bugfix] rotary_embedding: fix precision error
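
The usual shape of this kind of fix is to build the rotary sin/cos tables in float32 and cast only at the end, rather than doing the trig in half precision. A hedged sketch (names and defaults are illustrative, not the exact sglang code):

```python
import torch


def rotary_tables(seq_len: int, head_dim: int, base: float = 10000.0,
                  out_dtype: torch.dtype = torch.bfloat16):
    # Do all trig in float32; casting earlier is where precision
    # errors typically creep in.
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    inv_freq = 1.0 / (base ** exponents)
    t = torch.arange(seq_len, dtype=torch.float32)
    freqs = torch.outer(t, inv_freq)  # (seq_len, head_dim // 2)
    return freqs.cos().to(out_dtype), freqs.sin().to(out_dtype)
```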
yichiche pushed a commit to yichiche/sglang that referenced this pull request Jul 30, 2025
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
yichiche pushed a commit to yichiche/sglang that referenced this pull request Aug 7, 2025
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
yichiche pushed a commit to yichiche/sglang that referenced this pull request Aug 11, 2025
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Xia-Weiwen pushed a commit to Xia-Weiwen/sglang that referenced this pull request Sep 9, 2025
kalyank007 pushed a commit to kalyank007/sglang that referenced this pull request Nov 7, 2025
…gl-project#10739 (sgl-project#11)

* [Intel XPU]Add XPU device support to Triton attention kernel tests

* Update test_triton_attention_kernels.py

* Update test_triton_attention_kernels.py

---------

Co-authored-by: svc_repro_tool <svc_repro_tool@habana.ai>
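
A plausible pytest pattern for what this commit does, running the same kernel test on whichever accelerators the local PyTorch build exposes (the device list and test body are illustrative, not the actual test file):

```python
import pytest
import torch

# Collect available accelerators; torch.xpu exists only in newer builds.
DEVICES = [d for d in ("cuda", "xpu")
           if getattr(torch, d, None) is not None
           and getattr(torch, d).is_available()]


@pytest.mark.parametrize("device", DEVICES or ["cpu"])
def test_kernel_runs_on_device(device):
    x = torch.randn(16, 64, device=device)
    y = torch.softmax(x, dim=-1)  # stand-in for the Triton attention kernel
    assert torch.isfinite(y).all()
```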
amd-youchen pushed a commit to amd-youchen/sglang that referenced this pull request Nov 13, 2025
[script] add Qwen3-VL README and run scripts
yhyang201 pushed a commit that referenced this pull request Dec 13, 2025
* fix: skip embed init for mm_only mode

* fix: skip send health-check-req to encoder with epd mode
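
Both fixes amount to early-exit guards on a mode flag. A hypothetical sketch (mm_only, epd_mode, and the helper names are assumptions, not verified sglang identifiers):

```python
def init_embeddings(config, model):
    if config.mm_only:
        return  # multimodal-only mode: no text embedding table to initialize
    model.init_text_embeddings()


def send_health_checks(config, encoder, workers):
    for w in workers:
        w.send_health_check()
    if config.epd_mode:
        return  # EPD mode: the encoder takes no health-check requests
    encoder.send_health_check()
```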
triple-mu pushed a commit to triple-mu/sglang that referenced this pull request Jan 1, 2026
…3f8fc50101861f

parent 45081af
author Your Name <you@example.com> 1767269369 +0800
committer Your Name <you@example.com> 1767269369 +0800

rebase

# This is the commit message sgl-project#11:

clear cache once
triple-mu pushed a commit to triple-mu/sglang that referenced this pull request Jan 1, 2026
parent 45081af
author Your Name <you@example.com> 1767269369 +0800
committer Your Name <you@example.com> 1767269369 +0800

rebase

# This is the commit message sgl-project#11:

clear cache once

# This is the commit message sgl-project#12:

simplified VAE cache logic for qwenimage and wan

# This is the commit message sgl-project#14:

remove duplicated code
tpoisonooo pushed a commit to tpoisonooo/sglang that referenced this pull request Feb 12, 2026
MatejKosec added a commit to MatejKosec/sglang that referenced this pull request Feb 25, 2026
- Validate alloc reply_id matches request_id (sgl-project#3)
- Remove dead variable num_gen_tokens (sgl-project#4)
- Move inline imports to top level (sgl-project#5)
- Replace hasattr guards with proper None checks (sgl-project#6)
- Demote per-request logs to DEBUG, keep milestones at INFO (sgl-project#11)
- Remove unused tree_cache param from start_kv_return_receiver (sgl-project#14)
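
Two of the listed items are general patterns worth spelling out. A short sketch with illustrative call sites (the real sglang code differs):

```python
import logging

logger = logging.getLogger(__name__)


def release_tree_cache(req):
    # Item sgl-project#6: hasattr() guards hide typos and conflate
    # "attribute missing" with "attribute is None"; check None directly.
    if req.tree_cache is not None:
        req.tree_cache.release()


def on_kv_transfer_done(req_id):
    # Item sgl-project#11: per-request chatter goes to DEBUG...
    logger.debug("kv transfer finished for request %s", req_id)


def on_receiver_started(port):
    # ...while one-time milestones stay at INFO.
    logger.info("kv return receiver listening on port %d", port)
```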
MatejKosec added a commit to MatejKosec/sglang that referenced this pull request Feb 26, 2026
- Validate alloc reply_id matches request_id (sgl-project#3)
- Remove dead variable num_gen_tokens (sgl-project#4)
- Move inline imports to top level (sgl-project#5)
- Replace hasattr guards with proper None checks (sgl-project#6)
- Demote per-request logs to DEBUG, keep milestones at INFO (sgl-project#11)
- Remove unused tree_cache param from start_kv_return_receiver (sgl-project#14)
Estrella-xx added a commit to Estrella-xx/sglang that referenced this pull request Mar 13, 2026
wisclmy0611 pushed a commit that referenced this pull request Apr 7, 2026
Co-authored-by: longGGGGGG <553746008@qq.com>
rucnyz added a commit to rucnyz/sglang that referenced this pull request Apr 30, 2026
sgl-project#10 Sweep 1: 3 seeds × 5 ratios. Std 3-5% of mean across all ratios;
swing 1.71× (4711→8075) reproduces within noise of original 1.91×.
Variance bands now in paper Table 1.

sgl-project#11 Setting 4 fallback rule:
- Implementation: SGLANG_XPOOL_QDEPTH_TRIGGER added to
  cross_pool_planner.py (gated, legacy preserved).
- Unit tests: 5/5 PASS.
- E2E: both arms fired 21 transfers on Phase 1+2+3 (workload doesn't
  dual-saturate; KV stays <1%). Honest finding documented in §6.4.
- Deeper fix (per-pool admission signal everywhere) is follow-up.

SETTINGS.md scoreboard reflects both items DONE.
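
The gating described above is the standard env-var feature flag: legacy behavior by default, the new rule only when the variable is set. A sketch keeping the SGLANG_XPOOL_QDEPTH_TRIGGER name from the commit but assuming its threshold semantics:

```python
import os

_QDEPTH_TRIGGER = os.environ.get("SGLANG_XPOOL_QDEPTH_TRIGGER")


def qdepth_fallback_enabled(queue_depth: int) -> bool:
    if _QDEPTH_TRIGGER is None:
        return False  # gate closed: legacy planner path, unchanged
    return queue_depth >= int(_QDEPTH_TRIGGER)
```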
rucnyz added a commit to rucnyz/sglang that referenced this pull request Apr 30, 2026
…s 28 xfers

v9 pool-binding-shift trace produces real differentiation:
- Phase B (KV-bound 8K random): L1+L2 -37% mean TTFT vs stock
- Phase C (mixed 4K random):     L1+L2 -38% median E2E vs stock
- Cross-pool transfers: stock 0, L1-only 0, L2-only 0, L1+L2 28

Two surprising findings documented:
1. Layer 2 alone fires zero transfers — Layer 1 retention is what
   makes Layer 2 cross the firing threshold.
2. Phase A regresses with L1 (-20% TPS) because K_big=8192 hurts on
   prefix-friendly GSP. Consistent with A2's K_big=0-wins finding.
   Adaptive K_big control marked as follow-up.

Settings status: Setting 1 marked **DONE v6 NULL + v9 PASS**.
All 4 user-requested follow-ups (sgl-project#9 Q3.A 4-arm, sgl-project#10 Sweep 1
multi-seed, sgl-project#11 Setting 4 fallback rule, sgl-project#12 Setting 1 v9 trace)
now complete.
lujangus added a commit to tails-mpt/sglang that referenced this pull request May 1, 2026
Replaces the sparse_attn_v4 stub (which raised NotImplementedError)
with a correct Python reference implementation. Direct port of V4
reference inference/kernel.py:sparse_attn_kernel (lines 277-352) +
sparse_attn dispatcher (line 355).

What this enables:
- V4Attention forward path runs end-to-end (was raising at the
  sparse_attn call site)
- HCA layers (compress_ratio=128) use deterministic-stride topk
  (get_compress_topk_idxs) and now compute attention correctly via the
  Python reference
- Window-only layers (compress_ratio=0) compute attention correctly
- CSA layers (compress_ratio=4) currently fall through to the HCA path
  (deterministic stride) until the NSA Indexer is wired
  (TODO(phase1-nsa))

What this does NOT enable:
- Performance: Python reference is slow. NOT for production use.
  Phase 5 launches require the NSA tilelang kernel + attn_sink extension
  per architecture-notes.md "Open risks sgl-project#11"
- Numerical agreement vs V4 HF reference: that's a separate validation
  task on a real GPU with a loaded V4 checkpoint
- CSA quality: until NSAIndexer wiring lands, CSA layers use
  deterministic stride (HCA's behavior) which approximates but doesn't
  match the V4 reference's learned-index Indexer

Algorithm details (matches V4 reference exactly; a short sketch follows this list):
- Mask topk_idxs == -1 to -inf scores
- Compute scaled QK scores: einsum("bshd,bskd->bshk", q, kv)
- Numerically-stable softmax with attn_sink contribution to denominator:
    scores_max = scores.amax(dim=K)
    sum_exp = sum(exp(scores - scores_max))
    sink_term = exp(attn_sink - scores_max)  # per-head sink in denominator
    weights = exp(scores - scores_max) / (sum_exp + sink_term)
- Output = einsum("bshk,bskd->bshd", weights, kv) — sink slot has v=0
- Handles all-invalid-row case (output all zeros, sink absorbs mass)
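
A minimal PyTorch rendering of those steps, assuming q: (B, S, H, D), per-token gathered kv: (B, S, K, D), topk_idxs: (B, S, K), and a per-head attn_sink: (H,); the shapes and function name are illustrative, not the sglang signature:

```python
import torch


def sparse_attn_reference(q, kv, topk_idxs, attn_sink, scale):
    # Mask invalid (-1) slots to -inf so they get zero softmax weight.
    invalid = (topk_idxs == -1).unsqueeze(2)                 # (B, S, 1, K)
    scores = torch.einsum("bshd,bskd->bshk", q, kv) * scale  # (B, S, H, K)
    scores = scores.masked_fill(invalid, float("-inf"))
    # Stable softmax with the sink in the denominator. Taking the max
    # against attn_sink also handles the all-invalid row: every exp
    # underflows to 0, the sink term is 1, and the output is exactly zero.
    m = torch.maximum(scores.amax(dim=-1, keepdim=True),
                      attn_sink.view(1, 1, -1, 1))
    exp_scores = torch.exp(scores - m)
    sum_exp = exp_scores.sum(dim=-1, keepdim=True)
    sink_term = torch.exp(attn_sink.view(1, 1, -1, 1) - m)
    weights = exp_scores / (sum_exp + sink_term)
    # The sink slot has v = 0, so it never appears in the output sum.
    return torch.einsum("bshk,bskd->bshd", weights, kv)
```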

V4Attention.forward updated:
- The CSA Indexer branch now CALLS self.indexer(x, qr, start_pos, offset)
  when self.indexer is not None. Currently always None (TODO(phase1-nsa))
  so falls through to the deterministic-stride branch.
- Comments updated to make the CSA-quality fallback explicit and
  cross-reference architecture-notes.md "Open risks sgl-project#11".

Tests added:
- test_sparse_attn_v4_basic_shape: shape contract (B, S, H, D output;
  no NaN, no Inf)
- test_sparse_attn_v4_invalid_indices_zero_contribution: validates the
  -1 mask handling. Single-valid-idx case: output == that kv. All-
  invalid case: output == zeros (sink absorbs all softmax mass).

test_v4attention_forward_shape stays skipped (depends on
DeepseekV4ForCausalLM trunk + load_weights — separate from sparse_attn).
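
A sketch of the second test's two assertions, reusing the hypothetical sparse_attn_reference from the sketch above (sizes are arbitrary):

```python
import torch


def test_invalid_indices_zero_contribution():
    B, S, H, K, D = 1, 2, 2, 4, 8
    q, kv = torch.randn(B, S, H, D), torch.randn(B, S, K, D)
    sink = torch.full((H,), -1e9)  # negligible sink weight for this check
    # All indices invalid: the sink absorbs all mass, output is zero.
    idx = torch.full((B, S, K), -1, dtype=torch.long)
    out = sparse_attn_reference(q, kv, idx, sink, scale=D ** -0.5)
    assert torch.equal(out, torch.zeros_like(out))
    # Single valid index: every head's output equals that kv slot.
    idx[..., 0] = 0
    out = sparse_attn_reference(q, kv, idx, sink, scale=D ** -0.5)
    assert torch.allclose(out, kv[:, :, 0:1, :].expand(B, S, H, D))
```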