JSON Decode && Multi-Turns #4
Merged
merrymercy merged 25 commits into main on Jan 15, 2024
Conversation
merrymercy requested changes on Jan 15, 2024
tpoisonooo pushed a commit to tpoisonooo/sglang that referenced this pull request on Feb 12, 2026:
…e_batch [jit kernel] add get_compressed_k triton kernel, now enabled for single-batch end-to-end inference
MatejKosec added a commit to MatejKosec/sglang that referenced this pull request on Feb 25, 2026:
- Validate alloc reply_id matches request_id (sgl-project#3)
- Remove dead variable num_gen_tokens (sgl-project#4)
- Move inline imports to top level (sgl-project#5)
- Replace hasattr guards with proper None checks (sgl-project#6)
- Demote per-request logs to DEBUG, keep milestones at INFO (sgl-project#11)
- Remove unused tree_cache param from start_kv_return_receiver (sgl-project#14)
MatejKosec added a commit to MatejKosec/sglang that referenced this pull request on Feb 26, 2026:
- Validate alloc reply_id matches request_id (sgl-project#3)
- Remove dead variable num_gen_tokens (sgl-project#4)
- Move inline imports to top level (sgl-project#5)
- Replace hasattr guards with proper None checks (sgl-project#6)
- Demote per-request logs to DEBUG, keep milestones at INFO (sgl-project#11)
- Remove unused tree_cache param from start_kv_return_receiver (sgl-project#14)
AndyLi429 pushed a commit to AndyLi429/sglang that referenced this pull request on Mar 10, 2026
lawrence-harmonic added a commit to lawrence-harmonic/sglang that referenced this pull request on Mar 19, 2026
apinge added a commit to apinge/sglang that referenced this pull request on Mar 31, 2026:
* apply aiter gemma_rmsnorm
* remove comment
Signed-off-by: apinge <tong.qiu2@amd.com>
wisclmy0611 pushed a commit that referenced this pull request on Apr 7, 2026:
…tion (#4)
* feat: update documentation theme to Aspen, introduce custom fonts and color scheme
* feat: Getting Started section
* feat: theme change
* feat: Supported Models section, theme fixes
* feat: theme, features
* feat: base for Supported Models section
* feat: card assets, multi modal LM change to VLM
* doc structure fix
Co-authored-by: Rishitshivam <164783543+Rishitshivam@users.noreply.github.com>
YChange01 added a commit to YChange01/sglang that referenced this pull request on Apr 9, 2026:
New cross-node load/store path that bypasses the closed libubsm_sdk.so
entirely and talks to the GPL kernel UAPI /usr/include/ub/obmm.h
directly via ioctls on /dev/obmm.
Background
----------
Scheme 6 (via libubsm_sdk.so) got stuck on daemon-internal error 800
returned from ubsmem_shmem_allocate. The SDK is a binary blob so we
couldn't see what it was actually sending to the kernel. Two earlier
fixes (4MB alignment, real cluster hostnames) were both correct in
isolation but did not unblock 800 — after three iterations it became
clear that the region-based allocation path in the current SDK build
is either broken or requires cluster-side configuration we can't see.
Scheme 7 side-steps the problem by calling the kernel UAPI directly.
obmm.h is 186 lines, GPL-2.0+, and documents exactly the export /
import / unimport / unexport ioctls we need. Corresponding kernel
source lives in openEuler/kernel (OLK-6.6, migrated to AtomGit).
What's in this commit
---------------------
benchmark/engram/scheme7_obmm/
obmm_rw.h/c — thin wrapper with four entry points:
obmm_rw_open/close
obmm_rw_export / obmm_rw_unexport
obmm_rw_import / obmm_rw_unimport
Plus an 80-byte packed handle struct that carries
(mem_id, tokenid, length, uba, seid, deid, scna,
pxm_numa, base_dist) across TCP for the cross-node
variant that will come next.
smoke_test.c — single-node loopback:
1. open /dev/obmm
2. mmap 4 MB anonymous buffer
3. write 1 KB pattern
4. EXPORT_PID with flags=ALLOW_MMAP
5. IMPORT back with flags=ALLOW_MMAP
6. read pattern through the imported VA, verify
7. quick loopback load-latency bench
All seid/deid left zero for the simplest first call.
Makefile — plain gcc, no link to libubsm_sdk.so (we verify via
ldd that nothing sdk-related sneaks in).
README.md — architecture diagram, how to run, expected output,
and a "what can go wrong" table tied to each likely
EINVAL / EPERM / ENOENT failure mode.
This is Task sgl-project#2 of a 5-task scheme7 plan tracked in the session.
Next tasks:
sgl-project#3 extend to cross-node via TCP handle exchange
sgl-project#4 bench scheme7 vs scheme5
sgl-project#5 integrate winner into SGLang Engram prefetcher
The deferred kernel URMA_SEG_MAPPED patch is documented in
memory/project_kernel_urma_mapped_stretch.md and will be revisited
later as an independent upstream-contribution track — it answers a
different question from scheme7 (API unification, not hardware
capability).
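The 80-byte packed handle described above can be sketched with a fixed-size binary layout. This is an illustrative layout only: the commit message names the nine fields but not their widths or ordering in obmm_rw.h, so nine little-endian u64 fields plus 8 reserved padding bytes are an assumption made here to reach 80 bytes.

```python
import struct

# Hypothetical layout for the 80-byte handle: nine u64 fields
# (mem_id, tokenid, length, uba, seid, deid, scna, pxm_numa,
# base_dist) plus 8 reserved bytes. The real obmm_rw.h layout
# may differ; this only illustrates fixed-size packing so the
# handle can be shipped across TCP for the cross-node variant.
HANDLE_FMT = "<9Q8x"  # 9 * 8 bytes + 8 pad bytes = 80 bytes
assert struct.calcsize(HANDLE_FMT) == 80

def pack_handle(mem_id, tokenid, length, uba, seid, deid,
                scna, pxm_numa, base_dist):
    """Serialize one export handle into 80 bytes."""
    return struct.pack(HANDLE_FMT, mem_id, tokenid, length, uba,
                       seid, deid, scna, pxm_numa, base_dist)

def unpack_handle(blob):
    """Recover the nine fields from an 80-byte handle blob."""
    return struct.unpack(HANDLE_FMT, blob)
```

A fixed-width packed struct keeps both peers free of any serialization dependency, which matches the plain-gcc, no-SDK constraint of the scheme7 smoke test.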
rucnyz added a commit to rucnyz/sglang that referenced this pull request on Apr 30, 2026:
sgl-project#4 18_Q3A_host_tier.sh: adds the missing 4th arm to Q3.A: engine-default MambaRadixCache WITH the HiMambaRadixCache host-DRAM tier on (--enable-hierarchical-cache --hicache-ratio 2.0). Same GSP workload as 09_setting3a v2, so results are directly comparable to default, extra_buffer, and layer1.
sgl-project#5 19_sweep1_multiseed.sh: single-(ratio, seed) launcher for Sweep 1. An outer driver fans 3 seeds × 5 ratios across GPUs to characterize run-to-run variance and put error bars on the 1.91× throughput-swing claim in paper Table 1.
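The seed × ratio fan-out the outer driver performs can be sketched as below. The concrete seed values, hicache ratios, and GPU count are placeholders chosen for illustration; the commit only states that 3 seeds × 5 ratios are spread across GPUs.

```python
from itertools import product

# Hypothetical values: the commit says "3 seeds x 5 ratios" but
# does not list them; these and NUM_GPUS are assumptions.
SEEDS = [0, 1, 2]
RATIOS = [1.0, 1.5, 2.0, 2.5, 3.0]
NUM_GPUS = 8

def plan_sweep(seeds=SEEDS, ratios=RATIOS, num_gpus=NUM_GPUS):
    """Assign each (seed, ratio) run to a GPU round-robin,
    yielding one launcher invocation per combination."""
    runs = []
    for i, (seed, ratio) in enumerate(product(seeds, ratios)):
        runs.append({"seed": seed, "ratio": ratio,
                     "gpu": i % num_gpus})
    return runs
```

Each dict maps directly onto one 19_sweep1_multiseed.sh invocation, so variance across seeds can be measured at every ratio independently.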
rucnyz added a commit to rucnyz/sglang that referenced this pull request on Apr 30, 2026:
sgl-project#4 Q3.A 4-arm: added the host-tier-on row to the RESULTS.md table; paper §6.3 tab:q3a updated. Default + HiMambaRadixCache costs 7-11% latency vs default, reproducing the paper's offload-fetch tax claim.
sgl-project#2 Setting 4 saturation-blind fix:
- cross_pool_planner.py: new SGLANG_XPOOL_QDEPTH_TRIGGER env var (default 0 = legacy behavior preserved). When >0, the planner ALSO fires a transfer when one pool is saturated (above its high watermark) AND queue_depth >= trigger, even if the other pool is above its low watermark. Recovers gradient information at saturation.
- agent.py: passes num_queue_reqs to planner.decide(); logs xpool_plan_queue_depth in the JSONL stream.
- 35_planner_qdepth_unit.py: 5/5 unit tests pass: qdepth=0 preserves legacy behavior, qdepth>0 fires on saturation+queue, and the queue_depth field is populated.
The fix is gated, so existing runs see no behavior change. The Sweep 1 multi-seed re-run with the new mode is pending (it will compare proxy V_kv' + V_mamba' decisions across ratios with vs without the queue signal).
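The gating described for cross_pool_planner.py can be sketched as follows. The function name, arguments, and watermark defaults are simplified stand-ins (the real planner.decide() signature is not shown in the commit); only the trigger logic mirrors the description: with the env var at 0 the legacy rule applies unchanged, and with it >0 a transfer also fires when a pool is saturated and the queue depth reaches the trigger.

```python
import os

def decide(src_util, dst_util, num_queue_reqs,
           high_wm=0.9, low_wm=0.5):
    """Return True when a cross-pool transfer should fire.

    Simplified stand-in for cross_pool_planner.decide(); the
    signature and watermark values are assumptions here.
    """
    qdepth_trigger = int(os.environ.get(
        "SGLANG_XPOOL_QDEPTH_TRIGGER", "0"))

    # Legacy rule (qdepth_trigger == 0 changes nothing): transfer
    # only when the source is saturated AND the destination still
    # has headroom below its low watermark.
    if src_util >= high_wm and dst_util < low_wm:
        return True

    # Gated addition: when the trigger is > 0, ALSO fire while the
    # source is above its high watermark and enough requests are
    # queued, even if the destination is above its low watermark.
    if (qdepth_trigger > 0 and src_util >= high_wm
            and num_queue_reqs >= qdepth_trigger):
        return True

    return False
```

Keeping the default at 0 is what makes the change safe to land: existing deployments that never set the env var take exactly the legacy branch.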
SammLSH added a commit to SammLSH/sglang that referenced this pull request on May 4, 2026:
Drops the sglang-native session.start/.end + binary-PCM-frame protocol that landed in M1 and replaces it with the OpenAI Realtime transcription-only spec (https://platform.openai.com/docs/guides/realtime-transcription). The endpoint moves from /v1/audio/transcriptions/stream to /v1/realtime.

Wire protocol (JSON only, no binary frames):

client ->
  session.update {session.type=transcription, audio.input.{format, sample_rate, transcription.{model, language}, noise_reduction, turn_detection}}
  input_audio_buffer.append {audio: base64-PCM16-LE}
  input_audio_buffer.commit
  input_audio_buffer.clear

server ->
  session.created / session.updated
  input_audio_buffer.committed {item_id, previous_item_id}
  input_audio_buffer.cleared
  conversation.item.created {previous_item_id, item}
  conversation.item.input_audio_transcription.delta
  conversation.item.input_audio_transcription.completed
  conversation.item.input_audio_transcription.failed
  error {error: {type, code, message, param}}

sglang-specific deltas vs the spec, all documented in the module docstring:
* audio.input.sample_rate is a sglang extension; OpenAI's audio/pcm default is 24 kHz. We accept 16k/24k/48k and resample to 16 kHz internally via librosa before feeding the model.
* Server-side VAD is not implemented; turn_detection != null is rejected with vad_not_supported. Clients must commit explicitly.
* noise_reduction != null is rejected; include[] is silently dropped.
* Deltas stream continuously as audio is appended (one inference per chunk_size_sec of new audio, anchored by the previously emitted prefix). Clients do not need to commit to start receiving deltas; commit only finalizes the turn and emits the committed/item.created/completed triplet, then resets state for the next turn within the same session.
* audio.input.transcription.model stays echo-only per the existing sglang single-model design; multi-model routing belongs upstream.
Reviewer-requested changes also bundled in:
* sgl-project#1 (encapsulation): handle_realtime_transcription now takes tokenizer_manager, adapter, server_args, and session_semaphore as explicit kwargs; the WS module never reaches into OpenAIServingTranscription privates.
* sgl-project#4 (type hints): all new functions and dataclasses are fully annotated.
* sgl-project#5 (concurrency cap): adds --asr-max-concurrent-sessions (default 32). Excess connections are accepted, sent error{code: too_many_sessions}, and closed.

Out-of-scope follow-ups (TODO in module docstring):
* sgl-project#2 (PCM round-trip): would require process_asr_chunk to accept pre-decoded ndarrays; punted to a separate PR.

Test refresh in test/manual/models/test_qwen3_asr.py:
* _stream_websocket_async rewritten to drive the new protocol (session.update -> append events with base64 -> commit -> drain delta + committed + item.created + completed).
* 19/19 tests pass, ~52.7s, stable across 5 consecutive runs (/tmp/asr_openai_run1..5.log).
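The client half of the wire protocol above can be sketched by constructing the event payloads directly; actually sending them over a websocket to /v1/realtime is omitted so the sketch stays dependency-free. The model name is a placeholder, and the exact JSON field nesting is an assumption based on the event list in the commit message, not a verified schema.

```python
import base64
import json
import struct

def make_session_update(sample_rate=16000, language="en"):
    """Build the opening session.update event.

    sample_rate is the sglang extension noted above (16k/24k/48k
    accepted); OpenAI's audio/pcm default is 24 kHz. turn_detection
    and noise_reduction must stay null per the sglang deltas.
    """
    return {
        "type": "session.update",
        "session": {
            "type": "transcription",
            "audio": {"input": {
                "format": "audio/pcm",
                "sample_rate": sample_rate,
                "transcription": {"model": "placeholder-model",
                                  "language": language},
                "noise_reduction": None,
                "turn_detection": None,
            }},
        },
    }

def make_append(samples):
    """Encode int16 samples as base64 PCM16-LE for an append event."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    return {"type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm).decode("ascii")}

# One turn: configure, stream audio, then commit to finalize.
events = [make_session_update(),
          make_append([0, 1000, -1000]),
          {"type": "input_audio_buffer.commit"}]
wire_frames = [json.dumps(e) for e in events]
```

Because deltas stream as audio is appended, a real client would interleave reads of conversation.item.input_audio_transcription.delta events between appends and only commit to close the turn.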
SammLSH added a commit to SammLSH/sglang that referenced this pull request on May 4, 2026:
Move /v1/audio/transcriptions/stream to /v1/realtime and switch from the M1 session.start/binary-PCM protocol to OpenAI's Realtime transcription wire format. The shared inference driver is untouched, so HTTP SSE and WS still produce byte-identical transcripts; this is purely a transport rewrite.

sglang deviations from the spec live in the module docstring: sample_rate is a sglang extension accepting 16/24/48 kHz with internal resampling (OpenAI fixes audio/pcm at 24 kHz); turn_detection and noise_reduction must be null (no server-side VAD); include[] is dropped; model is echo-only.

Addresses sgl-project#22848 review items sgl-project#1 (decouple from OpenAIServingTranscription), sgl-project#4 (type hints), and sgl-project#5 (--asr-max-concurrent-sessions, default 32). sgl-project#2 (skip PCM round trip) is deferred since it changes process_asr_chunk's input contract.