
Json Decode && Multi-Turns #4

Merged
merrymercy merged 25 commits into main from json-mt-bench on Jan 15, 2024
Conversation

@hnyls2002 (Collaborator) commented Jan 10, 2024

Still in progress...

  • Multi-turns with uniformly sampled length.
  • Fix sglArgument.
  • New sleep policy (and it works).
  • Long and short multi-turns benchmark.
  • Mixtral on 8xA10.
  • Possible improvements to manager/model_rpc/backend_config.
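The first checklist item (multi-turn requests with uniformly sampled length) can be sketched as below. All names and ranges here are illustrative assumptions, not the PR's actual benchmark code:

```python
import random

def sample_multi_turn_lengths(num_turns_range=(2, 8), len_range=(16, 256), seed=0):
    """Build one synthetic multi-turn request whose per-turn token lengths
    are sampled uniformly.  The ranges and the function name are assumed
    for illustration only."""
    rng = random.Random(seed)
    num_turns = rng.randint(*num_turns_range)
    return [rng.randint(*len_range) for _ in range(num_turns)]

turns = sample_multi_turn_lengths()
```

A benchmark driver would generate many such requests (one seed per request) and replay them against the server.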

@merrymercy (Contributor) left a comment


Run this unit test

Comment thread benchmark/multi_turns/README.md
Comment thread python/sglang/srt/managers/router/manager.py
Comment thread python/sglang/srt/managers/router/model_rpc.py Outdated
@merrymercy merrymercy merged commit 08ab2a1 into main Jan 15, 2024
@merrymercy merrymercy deleted the json-mt-bench branch January 15, 2024 08:49
@Rookie-Kai Rookie-Kai mentioned this pull request Aug 14, 2024
tpoisonooo pushed a commit to tpoisonooo/sglang that referenced this pull request Feb 12, 2026
…e_batch

[jit kernel] add get_compressed_k triton kernel, now enabled for single-batch end-to-end inference
MatejKosec added a commit to MatejKosec/sglang that referenced this pull request Feb 25, 2026
- Validate alloc reply_id matches request_id (sgl-project#3)
- Remove dead variable num_gen_tokens (sgl-project#4)
- Move inline imports to top level (sgl-project#5)
- Replace hasattr guards with proper None checks (sgl-project#6)
- Demote per-request logs to DEBUG, keep milestones at INFO (sgl-project#11)
- Remove unused tree_cache param from start_kv_return_receiver (sgl-project#14)
MatejKosec added a commit to MatejKosec/sglang that referenced this pull request Feb 26, 2026
- Validate alloc reply_id matches request_id (sgl-project#3)
- Remove dead variable num_gen_tokens (sgl-project#4)
- Move inline imports to top level (sgl-project#5)
- Replace hasattr guards with proper None checks (sgl-project#6)
- Demote per-request logs to DEBUG, keep milestones at INFO (sgl-project#11)
- Remove unused tree_cache param from start_kv_return_receiver (sgl-project#14)
@alisonshao alisonshao mentioned this pull request Mar 1, 2026
AndyLi429 pushed a commit to AndyLi429/sglang that referenced this pull request Mar 10, 2026
lawrence-harmonic added a commit to lawrence-harmonic/sglang that referenced this pull request Mar 19, 2026
apinge added a commit to apinge/sglang that referenced this pull request Mar 31, 2026
* apply aiter gemma_rmsnorm

Signed-off-by: apinge <tong.qiu2@amd.com>

* remove comment

Signed-off-by: apinge <tong.qiu2@amd.com>

---------

Signed-off-by: apinge <tong.qiu2@amd.com>
wisclmy0611 pushed a commit that referenced this pull request Apr 7, 2026
…tion (#4)

* feat: Update documentation theme to Aspen, introduce custom fonts, and color scheme.

* feat: Getting Started Section

* feat: theme change

* feat: Supported models section, theme fixes

* feat: theme, features

* feat: Base for Supported Models Section

* feat: card assets, multi modal LM change to VLM

* doc structure fix

---------

Co-authored-by: Rishitshivam <164783543+Rishitshivam@users.noreply.github.com>
YChange01 added a commit to YChange01/sglang that referenced this pull request Apr 9, 2026
New cross-node load/store path that bypasses the closed libubsm_sdk.so
entirely and talks to the GPL kernel UAPI /usr/include/ub/obmm.h
directly via ioctls on /dev/obmm.

Background
----------

Scheme 6 (via libubsm_sdk.so) got stuck on daemon-internal error 800
returned from ubsmem_shmem_allocate. The SDK is a binary blob so we
couldn't see what it was actually sending to the kernel. Two earlier
fixes (4MB alignment, real cluster hostnames) were both correct in
isolation but did not unblock 800 — after three iterations it became
clear that the region-based allocation path in the current SDK build
is either broken or requires cluster-side configuration we can't see.

Scheme 7 side-steps the problem by calling the kernel UAPI directly.
obmm.h is 186 lines, GPL-2.0+, and documents exactly the export /
import / unimport / unexport ioctls we need. Corresponding kernel
source lives in openEuler/kernel (OLK-6.6, migrated to AtomGit).

What's in this commit
---------------------

benchmark/engram/scheme7_obmm/
  obmm_rw.h/c  — thin wrapper with four entry points:
                   obmm_rw_open/close
                   obmm_rw_export / obmm_rw_unexport
                   obmm_rw_import / obmm_rw_unimport
                 Plus an 80-byte packed handle struct that carries
                 (mem_id, tokenid, length, uba, seid, deid, scna,
                 pxm_numa, base_dist) across TCP for the cross-node
                 variant that will come next.
  smoke_test.c — single-node loopback:
                   1. open /dev/obmm
                   2. mmap 4 MB anonymous buffer
                   3. write 1 KB pattern
                   4. EXPORT_PID with flags=ALLOW_MMAP
                   5. IMPORT back with flags=ALLOW_MMAP
                   6. read pattern through the imported VA, verify
                   7. quick loopback load-latency bench
                 All seid/deid left zero for the simplest first call.
  Makefile     — plain gcc, no link to libubsm_sdk.so (we verify via
                 ldd that nothing sdk-related sneaks in).
  README.md    — architecture diagram, how to run, expected output,
                 and a "what can go wrong" table tied to each likely
                 EINVAL / EPERM / ENOENT failure mode.

This is Task sgl-project#2 of a 5-task scheme7 plan tracked in the session.
Next tasks:
  sgl-project#3 extend to cross-node via TCP handle exchange
  sgl-project#4 bench scheme7 vs scheme5
  sgl-project#5 integrate winner into SGLang Engram prefetcher

The deferred kernel URMA_SEG_MAPPED patch is documented in
memory/project_kernel_urma_mapped_stretch.md and will be revisited
later as an independent upstream-contribution track — it answers a
different question from scheme7 (API unification, not hardware
capability).
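The 80-byte packed handle described above must survive a TCP hop byte-for-byte. A minimal Python sketch of that serialization, assuming all nine fields are unsigned 64-bit plus 8 reserved bytes (the real field widths live in obmm_rw.h and may differ):

```python
import struct

# Assumed layout: nine u64 fields plus 8 reserved bytes = 80 bytes total.
# The actual obmm_rw.h struct may use narrower fields; this only
# illustrates exchanging a fixed-size handle over TCP.
HANDLE_FMT = "<9Q8x"
FIELDS = ("mem_id", "tokenid", "length", "uba", "seid",
          "deid", "scna", "pxm_numa", "base_dist")

def pack_handle(h: dict) -> bytes:
    """Serialize a handle dict to the 80-byte wire format."""
    return struct.pack(HANDLE_FMT, *(h[f] for f in FIELDS))

def unpack_handle(buf: bytes) -> dict:
    """Inverse of pack_handle: recover the field dict from 80 bytes."""
    return dict(zip(FIELDS, struct.unpack(HANDLE_FMT, buf)))

h = {f: i for i, f in enumerate(FIELDS)}
wire = pack_handle(h)
```

A fixed-size little-endian layout keeps the exchange trivial: the importer reads exactly 80 bytes from the socket and unpacks, with no framing protocol needed.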
rucnyz added a commit to rucnyz/sglang that referenced this pull request Apr 30, 2026
sgl-project#4 18_Q3A_host_tier.sh: adds the missing 4th arm to Q3.A — engine
default MambaRadixCache WITH HiMambaRadixCache host-DRAM tier on
(--enable-hierarchical-cache --hicache-ratio 2.0). Same GSP workload
as 09_setting3a v2 so results are directly comparable to default,
extra_buffer, layer1.

sgl-project#5 19_sweep1_multiseed.sh: single-(ratio, seed) launcher for Sweep 1.
Outer driver fans 3 seeds × 5 ratios across GPUs to characterize
run-to-run variance and put error bars on the 1.91× throughput swing
claim in paper Table 1.
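The outer driver's fan-out (3 seeds × 5 ratios across GPUs) can be sketched as follows; the concrete seed values, ratio points, and GPU count are assumptions, not taken from the scripts:

```python
from itertools import product

SEEDS = [0, 1, 2]                      # assumed seed values
RATIOS = [1.0, 1.5, 2.0, 2.5, 3.0]    # assumed --hicache-ratio sweep points
NUM_GPUS = 8                           # assumed GPU count

# One (ratio, seed) launcher invocation per job, round-robinned over GPUs,
# so every cell of the 3x5 grid runs and variance can be estimated per ratio.
jobs = [
    {"gpu": i % NUM_GPUS, "seed": s, "ratio": r}
    for i, (s, r) in enumerate(product(SEEDS, RATIOS))
]
```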
rucnyz added a commit to rucnyz/sglang that referenced this pull request Apr 30, 2026
sgl-project#4 Q3.A 4-arm: added host-tier-on row to RESULTS.md table, paper §6.3
tab:q3a updated. Default + HiMambaRadixCache costs 7-11% latency vs
default, reproducing the paper's offload-fetch tax claim.

sgl-project#2 Setting 4 saturation-blind fix:
- cross_pool_planner.py: new SGLANG_XPOOL_QDEPTH_TRIGGER env var (default
  0 = legacy behavior preserved). When >0, the planner ALSO fires a
  transfer when one pool is saturated (above its high watermark) AND
  queue_depth >= trigger — even if the other pool is above its low
  watermark. Recovers gradient information at saturation.
- agent.py: passes num_queue_reqs to planner.decide(); logs
  xpool_plan_queue_depth in the JSONL stream.
- 35_planner_qdepth_unit.py: 5/5 unit tests pass — qdepth=0 preserves
  legacy, qdepth>0 fires saturation+queue, queue_depth field
  populated.

The fix is gated so existing runs see no behavior change. Sweep 1
multi-seed re-run with the new mode pending (will compare proxy V_kv'
+ V_mamba' decisions across ratios with vs without queue signal).
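The gating described above (legacy behavior at trigger 0, an extra saturation-plus-queue firing condition when the trigger is positive) can be sketched as a pure predicate; the function name and boolean inputs are illustrative, not the planner's actual signature:

```python
import os

def should_transfer(src_above_high, dst_above_low, queue_depth, trigger=None):
    """Sketch of the saturation-blind fix (names assumed).
    Legacy rule: transfer only when the source pool is above its high
    watermark AND the destination is below its low watermark.  With
    SGLANG_XPOOL_QDEPTH_TRIGGER > 0, ALSO fire when the source is
    saturated and queue_depth >= trigger, even if the destination is
    above its low watermark."""
    if trigger is None:
        trigger = int(os.environ.get("SGLANG_XPOOL_QDEPTH_TRIGGER", "0"))
    legacy = src_above_high and not dst_above_low
    if trigger == 0:
        return legacy  # default 0 preserves legacy behavior exactly
    return legacy or (src_above_high and queue_depth >= trigger)
```

The point of the extra clause is that at saturation both pools sit above their low watermarks, so the legacy rule never fires and the planner loses its gradient; queue depth restores a usable signal.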
SammLSH added a commit to SammLSH/sglang that referenced this pull request May 4, 2026
Drops the sglang-native session.start/.end + binary-PCM-frame protocol
that landed in M1 and replaces it with the OpenAI Realtime
transcription-only spec (https://platform.openai.com/docs/guides/realtime-transcription).
Endpoint moves from /v1/audio/transcriptions/stream to /v1/realtime.

Wire protocol (JSON only, no binary frames):
  client -> session.update {session.type=transcription, audio.input.{format,
            sample_rate, transcription.{model,language}, noise_reduction,
            turn_detection}}
         -> input_audio_buffer.append {audio: base64-PCM16-LE}
         -> input_audio_buffer.commit
         -> input_audio_buffer.clear
  server -> session.created / session.updated
         -> input_audio_buffer.committed {item_id, previous_item_id}
         -> input_audio_buffer.cleared
         -> conversation.item.created {previous_item_id, item}
         -> conversation.item.input_audio_transcription.delta
         -> conversation.item.input_audio_transcription.completed
         -> conversation.item.input_audio_transcription.failed
         -> error {error: {type, code, message, param}}
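A minimal sketch of the client half of the wire protocol listed above: three JSON events carrying session config, a base64 PCM16 chunk, and a commit. Field names follow the listing; the model name and audio values are assumptions:

```python
import base64
import json
import struct

# 100 ms of 16 kHz mono silence as little-endian PCM16 (values assumed).
pcm16 = struct.pack("<1600h", *([0] * 1600))

events = [
    {"type": "session.update",
     "session": {"type": "transcription",
                 "audio": {"input": {"format": "audio/pcm",
                                     "sample_rate": 16000,  # sglang extension
                                     "transcription": {"model": "whisper",  # assumed name
                                                       "language": "en"},
                                     "noise_reduction": None,   # must be null here
                                     "turn_detection": None}}}},  # no server VAD
    {"type": "input_audio_buffer.append",
     "audio": base64.b64encode(pcm16).decode("ascii")},
    {"type": "input_audio_buffer.commit"},
]
wire = [json.dumps(e) for e in events]
```

Each string in `wire` would be sent as one WebSocket text frame to /v1/realtime; no binary frames are used.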

sglang-specific deltas vs the spec, all documented in the module docstring:
  * audio.input.sample_rate is a sglang extension; OpenAI's audio/pcm
    default is 24 kHz. We accept 16k/24k/48k and resample to 16 kHz
    internally via librosa before feeding the model.
  * Server-side VAD is not implemented; turn_detection != null is
    rejected with vad_not_supported. Clients must commit explicitly.
  * noise_reduction != null is rejected; include[] is silently dropped.
  * Deltas stream continuously as audio is appended (one inference per
    chunk_size_sec of new audio, anchored by the previously emitted
    prefix). Clients do not need to commit to start receiving deltas;
    commit only finalizes the turn and emits the committed/item.created/
    completed triplet, then resets state for the next turn within the
    same session.
  * audio.input.transcription.model stays echo-only per the existing
    sglang single-model design; multi-model routing belongs upstream.

Reviewer-requested changes also bundled in:
  * sgl-project#1 (encapsulation): handle_realtime_transcription now takes
    tokenizer_manager, adapter, server_args, and session_semaphore as
    explicit kwargs; the WS module never reaches into
    OpenAIServingTranscription privates.
  * sgl-project#4 (type hints): all new functions and dataclasses are fully
    annotated.
  * sgl-project#5 (concurrency cap): adds --asr-max-concurrent-sessions (default
    32). Excess connections are accepted, sent error{code:
    too_many_sessions}, and closed.
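The concurrency cap in sgl-project#5 (accept, send error{code: too_many_sessions}, close) could be enforced with a semaphore checked at admission; a sketch under assumed names, not the PR's actual handler:

```python
import asyncio

MAX_SESSIONS = 32  # --asr-max-concurrent-sessions default per the commit

async def try_admit(sem: asyncio.Semaphore):
    """Admit a WS session if capacity remains; otherwise return the error
    event to send before closing (error shape illustrative)."""
    if sem.locked():  # no permits left: over the cap
        return {"type": "error",
                "error": {"type": "session_error",
                          "code": "too_many_sessions",
                          "message": "concurrent session limit reached"}}
    await sem.acquire()  # caller must release on disconnect
    return None

async def demo():
    sem = asyncio.Semaphore(2)  # small cap to show rejection
    return [await try_admit(sem) for _ in range(3)]

results = asyncio.run(demo())
```

Note the connection is still accepted first, matching the commit: the error event must reach the client over the open socket before the server closes it.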

Out-of-scope follow-ups (TODO in module docstring):
  * sgl-project#2 (PCM round-trip): would require process_asr_chunk to accept
    pre-decoded ndarrays; punted to a separate PR.

Test refresh in test/manual/models/test_qwen3_asr.py:
  * _stream_websocket_async rewritten to drive the new protocol
    (session.update -> append events with base64 -> commit -> drain
    delta + committed + item.created + completed).
  * 19/19 tests pass, ~52.7s, stable across 5 consecutive runs
    (/tmp/asr_openai_run1..5.log).
SammLSH added a commit to SammLSH/sglang that referenced this pull request May 4, 2026
Move /v1/audio/transcriptions/stream to /v1/realtime and switch from
the M1 session.start/binary-PCM protocol to OpenAI's Realtime
transcription wire format. The shared inference driver is untouched,
so HTTP SSE and WS still produce byte-identical transcripts; this is
purely a transport rewrite.

sglang deviations from the spec live in the module docstring:
sample_rate is a sglang extension accepting 16/24/48 kHz with internal
resample (OpenAI fixes audio/pcm at 24 kHz), turn_detection and
noise_reduction must be null (no server-side VAD), include[] is
dropped, model is echo-only.

Addresses sgl-project#22848 review sgl-project#1 (decouple from OpenAIServingTranscription),
sgl-project#4 (type hints), sgl-project#5 (--asr-max-concurrent-sessions, default 32).
sgl-project#2 (skip PCM round trip) is deferred since it changes process_asr_chunk's
input contract.
