
Json Decode && Multi-Turns #4

Merged
merrymercy merged 25 commits into main from json-mt-bench on Jan 15, 2024
Conversation

@hnyls2002 (Collaborator) commented Jan 10, 2024

Still in progress...

  • Multi-turns with uniformly sampled length.
  • Fix sglArgument.
  • New sleep policy (and it works).
  • Long and short multi-turns benchmark.
  • Mixtral on 8xA10.
  • Possible improvements to manager/model_rpc/backend_config.
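The first checklist item (multi-turn requests with uniformly sampled length) can be sketched as below. All names and ranges here are illustrative assumptions, not the PR's actual benchmark code:

```python
import random

def sample_multi_turn_lengths(num_turns_range=(2, 8), len_range=(16, 256), seed=0):
    """Build one synthetic multi-turn request whose per-turn token lengths
    are sampled uniformly.  The ranges and the function name are assumed
    for illustration only."""
    rng = random.Random(seed)
    num_turns = rng.randint(*num_turns_range)
    return [rng.randint(*len_range) for _ in range(num_turns)]

turns = sample_multi_turn_lengths()
```

A benchmark driver would generate many such requests (one seed per request) and replay them against the server.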

@merrymercy (Contributor) left a comment


Run this unit test

Comment thread benchmark/multi_turns/README.md
Comment thread python/sglang/srt/managers/router/manager.py
Comment thread python/sglang/srt/managers/router/model_rpc.py Outdated
@merrymercy merrymercy merged commit 08ab2a1 into main Jan 15, 2024
@merrymercy merrymercy deleted the json-mt-bench branch January 15, 2024 08:49
@Rookie-Kai Rookie-Kai mentioned this pull request Aug 14, 2024
tpoisonooo pushed a commit to tpoisonooo/sglang that referenced this pull request Feb 12, 2026
…e_batch

[jit kernel] add get_compressed_k triton kernel, now enabled for single-batch end-to-end inference
MatejKosec added a commit to MatejKosec/sglang that referenced this pull request Feb 25, 2026
- Validate alloc reply_id matches request_id (sgl-project#3)
- Remove dead variable num_gen_tokens (sgl-project#4)
- Move inline imports to top level (sgl-project#5)
- Replace hasattr guards with proper None checks (sgl-project#6)
- Demote per-request logs to DEBUG, keep milestones at INFO (sgl-project#11)
- Remove unused tree_cache param from start_kv_return_receiver (sgl-project#14)
MatejKosec added a commit to MatejKosec/sglang that referenced this pull request Feb 26, 2026
- Validate alloc reply_id matches request_id (sgl-project#3)
- Remove dead variable num_gen_tokens (sgl-project#4)
- Move inline imports to top level (sgl-project#5)
- Replace hasattr guards with proper None checks (sgl-project#6)
- Demote per-request logs to DEBUG, keep milestones at INFO (sgl-project#11)
- Remove unused tree_cache param from start_kv_return_receiver (sgl-project#14)
@alisonshao alisonshao mentioned this pull request Mar 1, 2026
AndyLi429 pushed a commit to AndyLi429/sglang that referenced this pull request Mar 10, 2026
lawrence-harmonic added a commit to lawrence-harmonic/sglang that referenced this pull request Mar 19, 2026
apinge added a commit to apinge/sglang that referenced this pull request Mar 31, 2026
* apply aiter gemma_rmsnorm

Signed-off-by: apinge <tong.qiu2@amd.com>

* remove comment

Signed-off-by: apinge <tong.qiu2@amd.com>

---------

Signed-off-by: apinge <tong.qiu2@amd.com>
wisclmy0611 pushed a commit that referenced this pull request Apr 7, 2026
…tion (#4)

* feat: Update documentation theme to Aspen, introduce custom fonts, and color scheme.

* feat: Getting Started Section

* feat: theme change

* feat: Supported models section, theme fixes

* feat: theme, features

* feat: Base for Supported Models Section

* feat: card assets, multi modal LM change to VLM

* doc structure fix

---------

Co-authored-by: Rishitshivam <164783543+Rishitshivam@users.noreply.github.com>
YChange01 added a commit to YChange01/sglang that referenced this pull request Apr 9, 2026
New cross-node load/store path that bypasses the closed libubsm_sdk.so
entirely and talks to the GPL kernel UAPI /usr/include/ub/obmm.h
directly via ioctls on /dev/obmm.

Background
----------

Scheme 6 (via libubsm_sdk.so) got stuck on daemon-internal error 800
returned from ubsmem_shmem_allocate. The SDK is a binary blob so we
couldn't see what it was actually sending to the kernel. Two earlier
fixes (4MB alignment, real cluster hostnames) were both correct in
isolation but did not unblock 800 — after three iterations it became
clear that the region-based allocation path in the current SDK build
is either broken or requires cluster-side configuration we can't see.

Scheme 7 side-steps the problem by calling the kernel UAPI directly.
obmm.h is 186 lines, GPL-2.0+, and documents exactly the export /
import / unimport / unexport ioctls we need. Corresponding kernel
source lives in openEuler/kernel (OLK-6.6, migrated to AtomGit).

What's in this commit
---------------------

benchmark/engram/scheme7_obmm/
  obmm_rw.h/c  — thin wrapper with four entry points:
                   obmm_rw_open/close
                   obmm_rw_export / obmm_rw_unexport
                   obmm_rw_import / obmm_rw_unimport
                 Plus an 80-byte packed handle struct that carries
                 (mem_id, tokenid, length, uba, seid, deid, scna,
                 pxm_numa, base_dist) across TCP for the cross-node
                 variant that will come next.
  smoke_test.c — single-node loopback:
                   1. open /dev/obmm
                   2. mmap 4 MB anonymous buffer
                   3. write 1 KB pattern
                   4. EXPORT_PID with flags=ALLOW_MMAP
                   5. IMPORT back with flags=ALLOW_MMAP
                   6. read pattern through the imported VA, verify
                   7. quick loopback load-latency bench
                 All seid/deid left zero for the simplest first call.
  Makefile     — plain gcc, no link to libubsm_sdk.so (we verify via
                 ldd that nothing sdk-related sneaks in).
  README.md    — architecture diagram, how to run, expected output,
                 and a "what can go wrong" table tied to each likely
                 EINVAL / EPERM / ENOENT failure mode.

This is Task sgl-project#2 of a 5-task scheme7 plan tracked in the session.
Next tasks:
  sgl-project#3 extend to cross-node via TCP handle exchange
  sgl-project#4 bench scheme7 vs scheme5
  sgl-project#5 integrate winner into SGLang Engram prefetcher

The deferred kernel URMA_SEG_MAPPED patch is documented in
memory/project_kernel_urma_mapped_stretch.md and will be revisited
later as an independent upstream-contribution track — it answers a
different question from scheme7 (API unification, not hardware
capability).
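The 80-byte packed handle described above must survive a TCP hop byte-for-byte. A minimal Python sketch of that serialization, assuming all nine fields are unsigned 64-bit plus 8 reserved bytes (the real field widths live in obmm_rw.h and may differ):

```python
import struct

# Assumed layout: nine u64 fields plus 8 reserved bytes = 80 bytes total.
# The actual obmm_rw.h struct may use narrower fields; this only
# illustrates exchanging a fixed-size handle over TCP.
HANDLE_FMT = "<9Q8x"
FIELDS = ("mem_id", "tokenid", "length", "uba", "seid",
          "deid", "scna", "pxm_numa", "base_dist")

def pack_handle(h: dict) -> bytes:
    """Serialize a handle dict to the 80-byte wire format."""
    return struct.pack(HANDLE_FMT, *(h[f] for f in FIELDS))

def unpack_handle(buf: bytes) -> dict:
    """Inverse of pack_handle: recover the field dict from 80 bytes."""
    return dict(zip(FIELDS, struct.unpack(HANDLE_FMT, buf)))

h = {f: i for i, f in enumerate(FIELDS)}
wire = pack_handle(h)
```

A fixed-size little-endian layout keeps the exchange trivial: the importer reads exactly 80 bytes from the socket and unpacks, with no framing protocol needed.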
rucnyz added a commit to rucnyz/sglang that referenced this pull request Apr 30, 2026
sgl-project#4 18_Q3A_host_tier.sh: adds the missing 4th arm to Q3.A — engine
default MambaRadixCache WITH HiMambaRadixCache host-DRAM tier on
(--enable-hierarchical-cache --hicache-ratio 2.0). Same GSP workload
as 09_setting3a v2 so results are directly comparable to default,
extra_buffer, layer1.

sgl-project#5 19_sweep1_multiseed.sh: single-(ratio, seed) launcher for Sweep 1.
Outer driver fans 3 seeds × 5 ratios across GPUs to characterize
run-to-run variance and put error bars on the 1.91× throughput swing
claim in paper Table 1.
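The outer driver's fan-out (3 seeds × 5 ratios across GPUs) can be sketched as follows; the concrete seed values, ratio points, and GPU count are assumptions, not taken from the scripts:

```python
from itertools import product

SEEDS = [0, 1, 2]                      # assumed seed values
RATIOS = [1.0, 1.5, 2.0, 2.5, 3.0]    # assumed --hicache-ratio sweep points
NUM_GPUS = 8                           # assumed GPU count

# One (ratio, seed) launcher invocation per job, round-robinned over GPUs,
# so every cell of the 3x5 grid runs and variance can be estimated per ratio.
jobs = [
    {"gpu": i % NUM_GPUS, "seed": s, "ratio": r}
    for i, (s, r) in enumerate(product(SEEDS, RATIOS))
]
```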
rucnyz added a commit to rucnyz/sglang that referenced this pull request Apr 30, 2026
sgl-project#4 Q3.A 4-arm: added host-tier-on row to RESULTS.md table, paper §6.3
tab:q3a updated. Default + HiMambaRadixCache costs 7-11% latency vs
default, reproducing the paper's offload-fetch tax claim.

sgl-project#2 Setting 4 saturation-blind fix:
- cross_pool_planner.py: new SGLANG_XPOOL_QDEPTH_TRIGGER env var (default
  0 = legacy behavior preserved). When >0, the planner ALSO fires a
  transfer when one pool is saturated (above its high watermark) AND
  queue_depth >= trigger — even if the other pool is above its low
  watermark. Recovers gradient information at saturation.
- agent.py: passes num_queue_reqs to planner.decide(); logs
  xpool_plan_queue_depth in the JSONL stream.
- 35_planner_qdepth_unit.py: 5/5 unit tests pass — qdepth=0 preserves
  legacy, qdepth>0 fires saturation+queue, queue_depth field
  populated.

The fix is gated so existing runs see no behavior change. Sweep 1
multi-seed re-run with the new mode pending (will compare proxy V_kv'
+ V_mamba' decisions across ratios with vs without queue signal).
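The gating described above (legacy behavior at trigger 0, an extra saturation-plus-queue firing condition when the trigger is positive) can be sketched as a pure predicate; the function name and boolean inputs are illustrative, not the planner's actual signature:

```python
import os

def should_transfer(src_above_high, dst_above_low, queue_depth, trigger=None):
    """Sketch of the saturation-blind fix (names assumed).
    Legacy rule: transfer only when the source pool is above its high
    watermark AND the destination is below its low watermark.  With
    SGLANG_XPOOL_QDEPTH_TRIGGER > 0, ALSO fire when the source is
    saturated and queue_depth >= trigger, even if the destination is
    above its low watermark."""
    if trigger is None:
        trigger = int(os.environ.get("SGLANG_XPOOL_QDEPTH_TRIGGER", "0"))
    legacy = src_above_high and not dst_above_low
    if trigger == 0:
        return legacy  # default 0 preserves legacy behavior exactly
    return legacy or (src_above_high and queue_depth >= trigger)
```

The point of the extra clause is that at saturation both pools sit above their low watermarks, so the legacy rule never fires and the planner loses its gradient; queue depth restores a usable signal.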
SammLSH added a commit to SammLSH/sglang that referenced this pull request May 4, 2026
Drops the sglang-native session.start/.end + binary-PCM-frame protocol
that landed in M1 and replaces it with the OpenAI Realtime
transcription-only spec (https://platform.openai.com/docs/guides/realtime-transcription).
Endpoint moves from /v1/audio/transcriptions/stream to /v1/realtime.

Wire protocol (JSON only, no binary frames):
  client -> session.update {session.type=transcription, audio.input.{format,
            sample_rate, transcription.{model,language}, noise_reduction,
            turn_detection}}
         -> input_audio_buffer.append {audio: base64-PCM16-LE}
         -> input_audio_buffer.commit
         -> input_audio_buffer.clear
  server -> session.created / session.updated
         -> input_audio_buffer.committed {item_id, previous_item_id}
         -> input_audio_buffer.cleared
         -> conversation.item.created {previous_item_id, item}
         -> conversation.item.input_audio_transcription.delta
         -> conversation.item.input_audio_transcription.completed
         -> conversation.item.input_audio_transcription.failed
         -> error {error: {type, code, message, param}}
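A minimal sketch of the client half of the wire protocol listed above: three JSON events carrying session config, a base64 PCM16 chunk, and a commit. Field names follow the listing; the model name and audio values are assumptions:

```python
import base64
import json
import struct

# 100 ms of 16 kHz mono silence as little-endian PCM16 (values assumed).
pcm16 = struct.pack("<1600h", *([0] * 1600))

events = [
    {"type": "session.update",
     "session": {"type": "transcription",
                 "audio": {"input": {"format": "audio/pcm",
                                     "sample_rate": 16000,  # sglang extension
                                     "transcription": {"model": "whisper",  # assumed name
                                                       "language": "en"},
                                     "noise_reduction": None,   # must be null here
                                     "turn_detection": None}}}},  # no server VAD
    {"type": "input_audio_buffer.append",
     "audio": base64.b64encode(pcm16).decode("ascii")},
    {"type": "input_audio_buffer.commit"},
]
wire = [json.dumps(e) for e in events]
```

Each string in `wire` would be sent as one WebSocket text frame to /v1/realtime; no binary frames are used.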

sglang-specific deltas vs the spec, all documented in the module docstring:
  * audio.input.sample_rate is a sglang extension; OpenAI's audio/pcm
    default is 24 kHz. We accept 16k/24k/48k and resample to 16 kHz
    internally via librosa before feeding the model.
  * Server-side VAD is not implemented; turn_detection != null is
    rejected with vad_not_supported. Clients must commit explicitly.
  * noise_reduction != null is rejected; include[] is silently dropped.
  * Deltas stream continuously as audio is appended (one inference per
    chunk_size_sec of new audio, anchored by the previously emitted
    prefix). Clients do not need to commit to start receiving deltas;
    commit only finalizes the turn and emits the committed/item.created/
    completed triplet, then resets state for the next turn within the
    same session.
  * audio.input.transcription.model stays echo-only per the existing
    sglang single-model design; multi-model routing belongs upstream.

Reviewer-requested changes also bundled in:
  * sgl-project#1 (encapsulation): handle_realtime_transcription now takes
    tokenizer_manager, adapter, server_args, and session_semaphore as
    explicit kwargs; the WS module never reaches into
    OpenAIServingTranscription privates.
  * sgl-project#4 (type hints): all new functions and dataclasses are fully
    annotated.
  * sgl-project#5 (concurrency cap): adds --asr-max-concurrent-sessions (default
    32). Excess connections are accepted, sent error{code:
    too_many_sessions}, and closed.
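The concurrency cap in sgl-project#5 (accept, send error{code: too_many_sessions}, close) could be enforced with a semaphore checked at admission; a sketch under assumed names, not the PR's actual handler:

```python
import asyncio

MAX_SESSIONS = 32  # --asr-max-concurrent-sessions default per the commit

async def try_admit(sem: asyncio.Semaphore):
    """Admit a WS session if capacity remains; otherwise return the error
    event to send before closing (error shape illustrative)."""
    if sem.locked():  # no permits left: over the cap
        return {"type": "error",
                "error": {"type": "session_error",
                          "code": "too_many_sessions",
                          "message": "concurrent session limit reached"}}
    await sem.acquire()  # caller must release on disconnect
    return None

async def demo():
    sem = asyncio.Semaphore(2)  # small cap to show rejection
    return [await try_admit(sem) for _ in range(3)]

results = asyncio.run(demo())
```

Note the connection is still accepted first, matching the commit: the error event must reach the client over the open socket before the server closes it.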

Out-of-scope follow-ups (TODO in module docstring):
  * sgl-project#2 (PCM round-trip): would require process_asr_chunk to accept
    pre-decoded ndarrays; punted to a separate PR.

Test refresh in test/manual/models/test_qwen3_asr.py:
  * _stream_websocket_async rewritten to drive the new protocol
    (session.update -> append events with base64 -> commit -> drain
    delta + committed + item.created + completed).
  * 19/19 tests pass, ~52.7s, stable across 5 consecutive runs
    (/tmp/asr_openai_run1..5.log).
SammLSH added a commit to SammLSH/sglang that referenced this pull request May 4, 2026
Move /v1/audio/transcriptions/stream to /v1/realtime and switch from
the M1 session.start/binary-PCM protocol to OpenAI's Realtime
transcription wire format. The shared inference driver is untouched,
so HTTP SSE and WS still produce byte-identical transcripts; this is
purely a transport rewrite.

sglang deviations from the spec live in the module docstring:
sample_rate is a sglang extension accepting 16/24/48 kHz with internal
resample (OpenAI fixes audio/pcm at 24 kHz), turn_detection and
noise_reduction must be null (no server-side VAD), include[] is
dropped, model is echo-only.

Addresses sgl-project#22848 review sgl-project#1 (decouple from OpenAIServingTranscription),
sgl-project#4 (type hints), sgl-project#5 (--asr-max-concurrent-sessions, default 32).
sgl-project#2 (skip PCM round trip) is deferred since it changes process_asr_chunk's
input contract.
