Improve docs #17

Merged

merrymercy merged 2 commits into main from doc
Jan 17, 2024

Conversation

@merrymercy
Contributor

No description provided.

merrymercy merged commit c4707f1 into main Jan 17, 2024
merrymercy deleted the doc branch January 17, 2024 03:53
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025
chunyuan-w pushed a commit to chunyuan-w/sglang that referenced this pull request Mar 24, 2025
* Use rms norm kernel instead of vllm

* update
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Apr 23, 2025
chunyuan-w pushed a commit to chunyuan-w/sglang that referenced this pull request May 28, 2025
* Use rms norm kernel instead of vllm

* update
yanbing-j added a commit to yanbing-j/sglang that referenced this pull request Jun 3, 2025
* Use rms norm kernel instead of vllm

* update
yanbing-j added a commit to yanbing-j/sglang that referenced this pull request Jun 4, 2025
* Use rms norm kernel instead of vllm

* update
yanbing-j added a commit to yanbing-j/sglang that referenced this pull request Jun 10, 2025
* Use rms norm kernel instead of vllm

* update
yanbing-j added a commit to yanbing-j/sglang that referenced this pull request Jun 18, 2025
* Use rms norm kernel instead of vllm

* update
pengxin99 pushed a commit to pengxin99/sglang that referenced this pull request Jun 19, 2025
yichiche pushed a commit to yichiche/sglang that referenced this pull request Jul 30, 2025
* fix decode

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

* fix

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
yichiche pushed a commit to yichiche/sglang that referenced this pull request Aug 7, 2025
* fix decode

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

* fix

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
yichiche pushed a commit to yichiche/sglang that referenced this pull request Aug 11, 2025
* fix decode

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

* fix

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
JustinTong0323 added a commit that referenced this pull request Oct 30, 2025
* Fix dtype mismatch in rotary embedding with FP8 KV cache

When using FP8 KV cache quantization (e.g., with ModelOpt FP8 models),
the query and key tensors may have different dtypes during CUDA graph
capture. The query tensor remains in bfloat16 for computation, while
the key tensor might need to be in FP8 format for KV cache storage.

The issue was in DeepseekScalingRotaryEmbedding.forward_native() which
only captured query's dtype and then converted both query and key to
that same dtype. This caused a dtype mismatch error during CUDA graph
capture: "query and key must have the same dtype".

The fix preserves the original dtypes of both query and key tensors
separately, ensuring they maintain their intended dtypes after the
rotary position embedding computation.

This resolves the CUDA graph capture failure with Qwen3MoE and other
models using FP8 KV cache quantization.

* Fix FA4 dtype mismatch with FP8 KV cache

When using FlashAttention 4 (FA4) with FP8 KV cache quantization,
there was a dtype mismatch between the query tensor (bfloat16) and
the cached key/value tensors (FP8). FA4 requires all input tensors
(q, k, v) to have the same dtype.

The previous code only converted the query to FP8 when NOT using FA4
(fa_impl_ver != 4). This was based on the assumption that FA4 doesn't
support FP8, but actually FA4 CAN work with FP8 tensors as long as
all tensors have matching dtypes.

The key difference is that FA4 doesn't support descale parameters for
on-the-fly dequantization (unlike FA3). So we:
1. Convert query to FP8 to match the KV cache dtype for both FA3 and FA4
2. Only set k_descale/v_descale for FA3 (FA4 doesn't support them)

This resolves the "query and key must have the same dtype" error when
using FP8 KV cache with FA4.

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
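
The two fixes above are essentially dtype bookkeeping. Below is a minimal, hypothetical sketch of the first idea (preserving each tensor's own dtype through the rotary embedding); apply_rotary is a stand-in helper, not the actual DeepseekScalingRotaryEmbedding code.

```python
import torch

def apply_rotary(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Stand-in rotation over the two halves of the last dimension.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)

def rotary_forward_preserving_dtypes(query, key, cos, sin):
    # Remember each tensor's own dtype instead of only query.dtype, so a
    # bfloat16 query and an FP8 key both keep their intended dtypes.
    q_dtype, k_dtype = query.dtype, key.dtype
    q = apply_rotary(query.float(), cos, sin)
    k = apply_rotary(key.float(), cos, sin)
    return q.to(q_dtype), k.to(k_dtype)
```

And a rough sketch of the second idea, the FA3/FA4 dispatch; the function shape and fa_impl_ver handling are assumptions about the surrounding attention-backend code, with only the k_descale/v_descale names taken from the description above.

```python
def prepare_fa_inputs(q, k_cache, v_cache, fa_impl_ver, k_scale=None, v_scale=None):
    extra_kwargs = {}
    if k_cache.dtype != q.dtype:
        # FA3 and FA4 both require q, k and v to share a dtype, so cast the
        # query to the (possibly FP8) KV-cache dtype in both cases.
        q = q.to(k_cache.dtype)
    if fa_impl_ver == 3:
        # Only FA3 accepts descale parameters for on-the-fly dequantization.
        extra_kwargs["k_descale"] = k_scale
        extra_kwargs["v_descale"] = v_scale
    return q, k_cache, v_cache, extra_kwargs
```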
JustinTong0323 added a commit that referenced this pull request Oct 30, 2025
key4ng pushed a commit to key4ng/sglang that referenced this pull request Nov 9, 2025
Muqi1029 pushed a commit to Muqi1029/sglang that referenced this pull request Jan 4, 2026
…tr-usage

Remove redundant getattr in gRPC request manager
Garrybest pushed a commit to Garrybest/sglang that referenced this pull request Jan 9, 2026
fix: update kernel benchmark to support paged attention
wzrf pushed a commit to wzrf/sglang-fusionrag that referenced this pull request Feb 8, 2026
alphabetc1 pushed a commit to alphabetc1/sglang that referenced this pull request Mar 14, 2026
[RFC] refactor: RESTful and sgl/smg compliant API
chx96642264 added a commit to chx96642264/sglang that referenced this pull request Mar 25, 2026
…n temperature is not 0 (sgl-project#17)

* adaptation to deterministic inference with operator mean on NPU

* adaptation to deterministic inference with operator log_softmax on NPU, new def adaptation

* adaptation to deterministic inference with keeping sample correct when temperature is not 0

---------

Co-authored-by: chx96642264 <chenhaoxuan2@h-partners.com>
mmangkad pushed a commit to mmangkad-dev/sglang that referenced this pull request Apr 3, 2026
wisclmy0611 pushed a commit that referenced this pull request Apr 7, 2026
Co-authored-by: Richard <richardchen@radixark.ai>
zhuzilin pushed a commit that referenced this pull request Apr 13, 2026
(cherry picked from commit 6a1a64f)

Fix #17 #19
rucnyz added a commit to rucnyz/sglang that referenced this pull request Apr 30, 2026
(1) Arena memory transparency (issue sgl-project#17). When SGLANG_ARENA_SHARED=1
    is set, MambaPool and KV pool each reserve 4 chunks
    (SGLANG_ARENA_*_HEADROOM_CHUNKS, default 4) of unmapped VA on top of
    the budget the user controls via mem_fraction_static. With
    chunk_bytes=1GB this is 8 GiB silently taken on top of the user
    budget, and was responsible for OOMs at mem_fraction=0.8 on
    H200 (Setting 1 v9 first run).

    Fix in `model_runner_kv_cache_mixin._profile_available_bytes`: when
    SGLANG_ARENA_SHARED=1, subtract the headroom from rest_memory before
    pool sizing. Arena's reservation = pool_size + headroom now fits in
    the original mem_fraction_static budget. Logs the subtraction
    (rest_memory: X → Y GiB) at boot so the change is visible. Verified
    on Qwen3.5-35B-A3B / H200 / TP=1: rest_memory 45.74 → 37.74 GiB
    with chunk_bytes=1GB, headroom=4+4 chunks. No more OOM at
    mem_fraction=0.8 with full Layer 2 stack.

(2) Suite job_dir fix. jobs.py uses the same `name` for both the
    baseline and prelude Jobs of a workload (by design — they share the
    same workload). run_job() built `out_dir / job.name` as the per-job
    output directory, so the two arms wrote to and read from the same
    metrics.json. Whichever arm finished last overwrote the other's
    metrics → suite v2 reported delta=0.0% across all rows.

    Fix: per-arm subdirectory `out_dir / "{name}__{arm}"`.
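
A tiny sketch of the job_dir fix described in (2); the function name and arm labels are assumptions inferred from the description, not the suite's actual code.

```python
from pathlib import Path

def arm_output_dir(out_dir: Path, job_name: str, arm: str) -> Path:
    # Before the fix both arms wrote to out_dir / job_name and clobbered each
    # other's metrics.json; one subdirectory per arm keeps them separate,
    # e.g. "<name>__baseline" and "<name>__prelude".
    d = out_dir / f"{job_name}__{arm}"
    d.mkdir(parents=True, exist_ok=True)
    return d
```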
rucnyz added a commit to rucnyz/sglang that referenced this pull request Apr 30, 2026
(1) Arena memory transparency, path A (issue sgl-project#17 follow-up). The
    earlier path-B fix subtracted arena headroom from rest_memory,
    which kept Layer 2 from OOM-ing but shrank KV+mamba budget below
    baseline at the same mem_fraction_static. Path A keeps KV+mamba at
    the baseline budget and draws arena headroom from the
    (1-mem_fraction)·pre activations/cuda-graph reserve band instead.
    This makes Layer 2 transparent to mem_fraction_static: same setting
    → same KV+mamba capacity → baseline-comparable max_total_num_tokens.

    Falls back to path B (with a logger.warning) when reserve band is
    too small (mem_fraction so high that pre × (1-mf) < arena_headroom).

(2) LoRA workload (R3) configuration. The original R3 inherited
    --enforce-piecewise-cuda-graph + --reasoning-parser qwen3 from
    _common.sh (mamba-only flags). On Qwen3-4B these break the LoRA
    Triton dispatch (assert x.shape[-1] == K in chunked_sgmv_shrink.py
    L154). Fix: introduce MAMBA_FLAGS env var in _common.sh; default to
    the mamba flags; r3_lora.sh sets MAMBA_FLAGS="" to skip them. This
    matches the flag set that the original Sweep 2 LoRA bench
    (dev/eval/05_sweep_lora.sh) used to score 192× TTFT swing — known
    to work end-to-end.

(3) Suite mem_fraction bumped to 0.8 for both arms now that path A
    makes Layer 2 + 0.8 work. R3 re-added to the manifest.
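
A rough sketch of the path-A budgeting decision above; the function and parameter names are hypothetical stand-ins for the logic in model_runner_kv_cache_mixin, shown only to make the fallback condition concrete.

```python
import logging

logger = logging.getLogger(__name__)

def kv_mamba_budget_bytes(rest_memory: int, pre_reserve: int,
                          mem_fraction_static: float, arena_headroom: int) -> int:
    # Path A: draw the arena headroom from the (1 - mem_fraction) reserve
    # band, so the KV+mamba budget matches the baseline at the same setting.
    reserve_band = int(pre_reserve * (1.0 - mem_fraction_static))
    if reserve_band >= arena_headroom:
        return rest_memory
    # Path B fallback: mem_fraction is so high that the reserve band cannot
    # absorb the headroom, so deduct it from the pool budget instead.
    logger.warning("arena headroom exceeds reserve band; falling back to path B")
    return rest_memory - arena_headroom
```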
rucnyz added a commit to rucnyz/sglang that referenced this pull request Apr 30, 2026
The root cause of issue sgl-project#17 was twofold:
1. The arena's max_tokens > init_tokens "growth window" was redundant.
   Cross-pool transfer is zero-sum on physical handles (mamba grows X iff
   KV releases X), so reserved-but-unmapped VA past init_tokens can never
   be backed. Removing the headroom collapses max_tokens == init_tokens.
2. Regression suite v6 still OOM'd in FLA on prelude jobs because
   chunk_bytes=1 GiB caused KV's 1.26M tokens to round up to 2.10M
   (n_subpools=20 → ~10 GiB excess physical) and mamba's 362 → 512
   (n_subpools=30 → ~8.7 GiB excess). With both pools allocated as
   pinned VMM partitions, PyTorch's caching allocator can't reclaim
   that excess for FLA temporaries. Diagnosed via available_gpu_mem
   delta (baseline 25.65 GB → prelude 0.97 GB) and arena log lines.

Engine changes:
- memory_pool.py (KV+mamba arenas): set max_tokens = init_tokens =
  tot_aligned. Drop SGLANG_ARENA_{KV,MAMBA}_HEADROOM_CHUNKS reads;
  the env vars are now no-ops.
- model_runner_kv_cache_mixin.py: remove path-B per-pool deduction.
  KV+mamba get the full baseline split of total_rest_memory at the
  same mem_fraction as baseline.

Suite changes (jobs.py + r3_lora.sh):
- PRELUDE_ENV: SGLANG_ARENA_CHUNK_BYTES 1 GiB → 256 MiB. Brings
  total chunk-rounding excess from ~25 GiB to ~3 GiB across both
  pools — well within the (1-mem_fraction)·pre activation reserve.
- r3_lora.sh: prelude arm overrides SGLANG_ARENA_SHARED=0. Qwen3-4B
  has 36 layers × 2 KV kinds = 72 sub-pools, exceeding
  arena_multi64.so's hardcoded 64 cap. Non-mamba models have no
  cross-pool semantics anyway, so disabling arena there still
  exercises the "L2 silent on non-mamba" property.

Suite v7 results (in flight): R1/R2/R3 within ±2.5% TPS of baseline.
B2 cold_burst prelude shows mean TTFT 280→206 ms (-26%) and p99
1083→418 ms (-61%) — the headline benefit case. B1 phase-shift still
running; full table to follow.
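
To make the chunk-rounding excess in point 2 concrete, here is a back-of-the-envelope helper; the independent per-sub-pool round-up model is an assumption made for illustration, not the arena's actual sizing code.

```python
def chunk_rounded_bytes(requested_bytes: int, n_subpools: int, chunk_bytes: int) -> int:
    # Each sub-pool is sized independently and rounded up to whole chunks, so
    # the worst-case excess approaches n_subpools * chunk_bytes; dropping
    # chunk_bytes from 1 GiB to 256 MiB shrinks that excess accordingly.
    per_pool = -(-requested_bytes // n_subpools)            # ceil division
    return -(-per_pool // chunk_bytes) * chunk_bytes * n_subpools
```

With the n_subpools=20 and n_subpools=30 figures quoted above, 1 GiB chunks allow many GiB of rounding waste per pool under this model, in line with the ~10 GiB and ~8.7 GiB excess reported in the commit message.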