Multi-platform Plugin #21388
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request establishes a robust, multi-platform plugin architecture for SGLang. The primary goal is to enable external extensibility, allowing developers to integrate new hardware platforms or customize core behaviors through a non-invasive plugin system. This significantly enhances SGLang's adaptability by abstracting hardware-specific logic and providing flexible function hooking and class replacement mechanisms, all while maintaining a clean core codebase.
Code Review
This pull request introduces a comprehensive plugin-based architecture to SGLang, enabling "out-of-tree" (OOT) hardware platform support and general engine extensibility. It defines a Platform abstraction with a PlatformEnum and a CudaPlatform implementation, handling dynamic platform discovery and lazy initialization. This new abstraction is integrated across various core components, including compilation, engine entrypoints, multi-platform operations, memory management, model runner, and server argument processing, allowing OOT platforms to customize behavior and provide specific implementations for various subsystems. A review comment suggests refactoring the KV pool initialization logic for OOT platforms to reduce code duplication and improve maintainability.
```python
if current_platform.is_out_of_tree() and not self.mambaish_config:
    if self.use_mla_backend and is_nsa_model:
        PoolCls = current_platform.get_nsa_kv_pool_cls()
        self.token_to_kv_pool = PoolCls(
            self.max_total_num_tokens,
            page_size=self.page_size,
            dtype=self.kv_cache_dtype,
            kv_lora_rank=self.model_config.kv_lora_rank,
            qk_rope_head_dim=self.model_config.qk_rope_head_dim,
            layer_num=self.num_effective_layers,
            device=self.device,
            kv_cache_dim=self.calculate_mla_kv_cache_dim(),
            enable_memory_saver=self.server_args.enable_memory_saver,
            start_layer=self.start_layer,
            end_layer=self.end_layer,
            index_head_dim=get_nsa_index_head_dim(self.model_config.hf_config),
        )
    elif self.use_mla_backend:
        PoolCls = current_platform.get_mla_kv_pool_cls()
        self.token_to_kv_pool = PoolCls(
            self.max_total_num_tokens,
            page_size=self.page_size,
            dtype=self.kv_cache_dtype,
            kv_lora_rank=self.model_config.kv_lora_rank,
            qk_rope_head_dim=self.model_config.qk_rope_head_dim,
            index_head_dim=(
                self.model_config.index_head_dim if is_nsa_model else None
            ),
            layer_num=self.num_effective_layers,
            device=self.device,
            enable_memory_saver=self.server_args.enable_memory_saver,
            start_layer=self.start_layer,
            end_layer=self.end_layer,
        )
    else:
        PoolCls = current_platform.get_mha_kv_pool_cls()
        self.token_to_kv_pool = PoolCls(
            self.max_total_num_tokens,
            page_size=self.page_size,
            dtype=self.kv_cache_dtype,
            head_num=self.model_config.get_num_kv_heads(get_attention_tp_size()),
            head_dim=self.model_config.head_dim,
            layer_num=self.num_effective_layers,
            device=self.device,
            enable_memory_saver=self.server_args.enable_memory_saver,
            start_layer=self.start_layer,
            end_layer=self.end_layer,
        )
```
The logic for initializing the different KV pool types for out-of-tree platforms involves significant code duplication, especially for the constructor arguments. This can be refactored to improve readability and maintainability by extracting common arguments into a dictionary.
```python
if current_platform.is_out_of_tree() and not self.mambaish_config:
    pool_args = {
        "max_total_num_tokens": self.max_total_num_tokens,
        "page_size": self.page_size,
        "dtype": self.kv_cache_dtype,
        "layer_num": self.num_effective_layers,
        "device": self.device,
        "enable_memory_saver": self.server_args.enable_memory_saver,
        "start_layer": self.start_layer,
        "end_layer": self.end_layer,
    }
    if self.use_mla_backend and is_nsa_model:
        PoolCls = current_platform.get_nsa_kv_pool_cls()
        pool_args.update({
            "kv_lora_rank": self.model_config.kv_lora_rank,
            "qk_rope_head_dim": self.model_config.qk_rope_head_dim,
            "kv_cache_dim": self.calculate_mla_kv_cache_dim(),
            "index_head_dim": get_nsa_index_head_dim(self.model_config.hf_config),
        })
    elif self.use_mla_backend:
        PoolCls = current_platform.get_mla_kv_pool_cls()
        pool_args.update({
            "kv_lora_rank": self.model_config.kv_lora_rank,
            "qk_rope_head_dim": self.model_config.qk_rope_head_dim,
            "index_head_dim": (
                self.model_config.index_head_dim if is_nsa_model else None
            ),
        })
    else:
        PoolCls = current_platform.get_mha_kv_pool_cls()
        pool_args.update({
            "head_num": self.model_config.get_num_kv_heads(get_attention_tp_size()),
            "head_dim": self.model_config.head_dim,
        })
    self.token_to_kv_pool = PoolCls(**pool_args)
```
```python
# Apply worker-level platform patches (phase 2 monkey patching).
from sglang.srt.platforms import current_platform

current_platform.apply_worker_patches()

# Apply deferred hooks (phase 2, idempotent).
# Re-discover plugins in subprocess (spawn'd processes lose main-process state).
from sglang.srt.plugins import load_general_plugins
from sglang.srt.plugins.hook_registry import HookRegistry

load_general_plugins()
HookRegistry.apply_hooks()
```
We should call apply_hooks right after the process is created, so the hook can override anything including Scheduler and TpWorker.
A good place is run_scheduler_process
Thanks for the suggestion! Done — load_plugins() (which calls HookRegistry.apply_hooks() internally) is now called in run_scheduler_process() before Scheduler() construction. The call in tp_worker.py has been removed since TpModelWorker is always created inside the Scheduler process, so it was redundant.
```python
load_general_plugins()
HookRegistry.apply_hooks()
```
To simplify the API, we should reduce this to a single function call.
Done — load_plugins() is now the single entry point. It discovers plugins, executes them, and calls HookRegistry.apply_hooks() internally. Callers no longer need a separate apply_hooks() step.
| Plugin Type | Entry Point Group | Purpose |
|---|---|---|
| **Hardware Platform Plugin** | `sglang.platform_plugins` | Register a custom hardware platform (device operations, KV cache pools, attention backends, CUDA Graph, compilation backends, etc.) |
| **General Function Plugin** | `sglang.general_plugins` | Inject hooks (before/after/around/replace) into any function/method in sglang, or replace entire classes |
The name sglang.general_plugins is confusing. We do not have a folder with this name sglang.general_plugins. Can you improve the name?
Makes sense, thanks! Renamed to sglang.plugins — shorter and consistent with the actual package structure.
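As a minimal sketch (assuming Python 3.10+ `importlib.metadata`; the group name is the renamed one from this thread), a deployment could inspect what is registered under the new group like this:

```python
from importlib.metadata import entry_points

# List everything installed packages expose under the renamed "sglang.plugins" group.
for ep in entry_points(group="sglang.plugins"):
    print(f"{ep.name} -> {ep.value}")
```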
alexnails left a comment
I probably have some more comments but going to drop this now so there is stuff to go over. Please slack me if you have any questions as I want to help you as much as I can
```python
def is_musa(self) -> bool:
    return self._enum == PlatformEnum.MUSA

def is_cuda_alike(self) -> bool:
```
@yeahdongcn multimodal and SRT handle is_cuda_alike() differently wrt MUSA. Leave a note that this needs to be resolved later?
In DeviceMixin, we unify is_cuda_alike() as CUDA+ROCm+MUSA. Therefore, once MM inherits DeviceMixin, it will automatically obtain a consistent definition. The SRT prefix in SRTPlatform (as in the example you provided) is intended to distinguish it from the future MMPlatform (DeviceMixin); both share the same DeviceMixin foundation but carry subsystem-specific factory methods. If you have better naming suggestions, I would be happy to make adjustments.
Sorry, I missed that comment. Yes, the behavior differs between multimodal_gen and SRT. In multimodal_gen, current_platform.is_cuda_alike() is only used to determine graph capture behavior and the communication method.
In contrast, in SRT, it is used to decide whether certain kernels can be imported.
For example:
```python
if _is_cuda_alike:
    from sgl_kernel import (
        cutlass_w4a8_moe_mm,
        get_cutlass_w4a8_moe_mm_data,
    )
```
```python
class PlatformEnum(enum.Enum):
    """Enumeration of known platform types."""
```
This is missing hardware types (e.g. NPU is one off the top of my head), and if following my other PR comments, this should be a mixin.
Thanks, addressed! PlatformEnum now covers all current hardware: CUDA, ROCM, CPU, XPU, MUSA, NPU, TPU, MPS, OOT, UNSPECIFIED. All identity queries (is_cuda(), is_npu(), is_musa(), etc.) are defined in DeviceMixin and derived automatically from _enum.
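A self-contained sketch of the enum-driven identity queries described here; the member list follows this comment, but the method bodies and the exact derivation are illustrative rather than the PR's final code:

```python
import enum

class PlatformEnum(enum.Enum):
    CUDA = enum.auto()
    ROCM = enum.auto()
    CPU = enum.auto()
    XPU = enum.auto()
    MUSA = enum.auto()
    NPU = enum.auto()
    TPU = enum.auto()
    MPS = enum.auto()
    OOT = enum.auto()
    UNSPECIFIED = enum.auto()

class DeviceMixin:
    # Subclasses only set _enum; every identity query is derived from it.
    _enum: PlatformEnum = PlatformEnum.UNSPECIFIED

    def is_cuda(self) -> bool:
        return self._enum == PlatformEnum.CUDA

    def is_npu(self) -> bool:
        return self._enum == PlatformEnum.NPU

    def is_musa(self) -> bool:
        return self._enum == PlatformEnum.MUSA

    def is_out_of_tree(self) -> bool:
        return self._enum == PlatformEnum.OOT

    def is_cuda_alike(self) -> bool:
        # Unified definition discussed in this thread: CUDA + ROCm + MUSA.
        return self._enum in (PlatformEnum.CUDA, PlatformEnum.ROCM, PlatformEnum.MUSA)
```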
```
SGLang Hardware Platform Abstraction.

Defines the Platform base class and PlatformEnum. Each hardware backend
(CUDA, ROCm, NPU, XPU, etc.) implements a Platform subclass providing
```
this approach has a few things we need to address:
- We should have the multimodal and srt platforms share the same core functionalities as multimodal already has a platform object (which will at some point inherit from our base abstraction). We want to avoid diamond inheritance though, so we should move to a device mixin approach IMO.
- Future ModelRunner / SpecDec refactor: With a mixin, ModelRunner can compose with just DeviceMixin for device operations without needing the full SRTPlatform. This is cleaner than an ABC hierarchy for the planned SpecDec refactor.
- As far as I understand, the @classmethod decorators / implementation don't focus on failing fast. If a hardware platform does not support something, it should raise NotImplementedError.

Example:
```python
# Mixin -- device operations only, no ABC chain
class DeviceMixin:
    name: str
    _enum: PlatformEnum

    def get_device(self, local_rank) -> torch.device: ...
    def get_device_name(self, device_id=0) -> str: ...
    def get_distributed_backend(self) -> str: ...
    def get_available_memory(self, device_id=0) -> tuple[int, int]: ...
    # ... ~15 shared device/memory/distributed methods

class SRTPlatform(DeviceMixin):
    # SRT-specific: graph runners, KV pools, quant, compilation
    def get_graph_runner_class(self) -> type: ...
    def get_kv_pool_class(self, use_mla: bool) -> type: ...

class MMPlatform(DeviceMixin):
    # MM-specific methods
    def get_attn_backend_cls_str(self) -> str: ...

# External packages -- no diamond inheritance!
class NpuDeviceMixin(DeviceMixin):
    name = "npu"
    def get_device(self, local_rank): return torch.device("npu", local_rank)
    def get_distributed_backend(self): return "hccl"

class NpuSRTPlatform(SRTPlatform, NpuDeviceMixin):  # clean MRO
    ...

class NpuMMPlatform(MMPlatform, NpuDeviceMixin):  # clean MRO
    ...
```
Great suggestion, thank you for the detailed design! Implemented this pattern. Created DeviceMixin in platforms/device_mixin.py with identity queries + device operations (all raising NotImplementedError). SRTPlatform extends DeviceMixin for SRT-specific factory methods. OOT plugins compose via MySRTPlatform(SRTPlatform, MyDeviceMixin) — clean MRO, no diamond inheritance. The MMPlatform(DeviceMixin) slot is ready for when the multimodal subsystem migrates to this pattern.
```python
from sglang.srt.platforms import current_platform

if current_platform.is_out_of_tree():
    backend_cls = current_platform.get_piecewise_backend_cls()
elif is_npu():
    backend_cls = NPUPiecewiseBackend
else:
    backend_cls = CUDAPiecewiseBackend
```
Really, what this should be is just:

```python
from sglang.srt.platforms import current_platform

current_platform.get_piecewise_backend_cls()
```

We do not care that the platform is OOT; the hardware plugin itself should be able to implement a PiecewiseBackend class that we can run (and we should know from flags / server args / etc. wherever we already determine if the piecewise backend can be used).
Ideally, we have a unified platform dispatch
Completely agree with the direction! This is tracked as part of the future migration described in plugin.md under "Current Scope & Future Direction". Currently, in-tree platforms (CUDA/NPU) still use direct imports rather than the platform interface. Once each in-tree backend is migrated to its own SRTPlatform subclass, the if/elif chain here will collapse into a single current_platform.get_piecewise_backend_cls() call. For this PR, we took the minimal non-intrusive approach of adding the OOT branch alongside existing logic.
```python
@classmethod
def support_cuda_graph(cls) -> bool:
    """Whether this platform supports CUDA graph capture."""
    return True

@classmethod
def support_cublas(cls) -> bool:
    """Whether this platform supports cuBLAS initialization."""
    return False
```
Why are these in an interface and not part of the CUDA implementation? This comment can be seen as a paint brush for quite a few things in here. Let's chat more about this.
Cleaned this up — CUDA-specific methods have been removed from the base class. Methods now raise NotImplementedError (fail-fast). support_cublas() has been deleted entirely. support_cuda_graph() is kept — Many non-CUDA platforms (ROCm, MUSA, and potentially OOT devices) support a similar graph capture mechanism. In the codebase it gates init_device_graphs() and disable_piecewise_cuda_graph, which are hardware-agnostic graph capture paths. It defaults to False (conservative), so platforms opt in explicitly. We could consider renaming it to support_device_graph() in a follow-up if the naming feels misleading.
```python
from sglang.srt.platforms import current_platform

if current_platform.is_out_of_tree():
    mem_bytes = current_platform.get_device_total_memory()
```
Is this not a numerically safe call? (Will it default to 0?)
and same comment as https://github.com/sgl-project/sglang/pull/21388/changes#r2999364494.
the bottom of these should be pinned to their platform implementations
Thanks for looking closely. Without the OOT branch, the if/elif chain falls through with no return value (None), which causes a TypeError downstream when used in arithmetic. The OOT branch calls current_platform.get_device_total_memory() which OOT plugins must implement (raises NotImplementedError if not).
```
@@ -0,0 +1,197 @@
"""
Function-level hook registry for SGLang plugins.
```
I really like this! However, we can probably (and should) make this a decorator:

```python
@sglang_hook("sglang.srt.managers.scheduler.Scheduler.schedule", type="around")
def my_timer(original_fn, *args, **kwargs):
    ...
```

Makes things easier (especially if we ever expose this as some form of JIT to something being hooked).
Added the sglang_hook decorator as suggested. Both the imperative HookRegistry.register() API and the decorator @sglang_hook(target, type=HookType.AROUND) are now available, so plugin authors can pick whichever style they prefer.
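A toy, self-contained version of the "around" hook mechanics discussed here; it patches a stand-in class directly instead of resolving a dotted sglang target, so `sglang_hook_sketch` and the `Scheduler` stand-in are illustrative only, not the PR's actual API:

```python
import functools
import time

class Scheduler:  # stand-in target, not sglang's Scheduler
    def schedule(self, batch):
        return f"scheduled {batch}"

def sglang_hook_sketch(cls, method_name):
    """Toy decorator: wrap cls.<method_name> with an 'around' hook."""
    def decorator(hook_fn):
        original = getattr(cls, method_name)

        @functools.wraps(original)
        def wrapper(self, *args, **kwargs):
            # Bind the original to this instance and hand it to the hook.
            return hook_fn(original.__get__(self, cls), *args, **kwargs)

        setattr(cls, method_name, wrapper)
        return hook_fn
    return decorator

@sglang_hook_sketch(Scheduler, "schedule")
def my_timer(original_fn, *args, **kwargs):
    start = time.perf_counter()
    try:
        return original_fn(*args, **kwargs)
    finally:
        print(f"schedule took {time.perf_counter() - start:.6f}s")

print(Scheduler().schedule("batch-0"))  # prints the timing line, then "scheduled batch-0"
```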
```
Allows plugins to transparently replace classes in the sglang engine
with custom implementations. Similar to vLLM's CustomOp.register_oot pattern.
```
This entire class is also just:

```python
@hook("sglang.srt.some.Class", type="replace")
```

What do you think?
Done! ClassReplacer has been merged into HookRegistry and class_replacer.py is deleted.
Class replacement is now a special case of HookType.REPLACE within the unified hook system. The plugin_hook decorator handles both function hooks and class replacement
```python
# Entry point group names
PLATFORM_PLUGINS_GROUP = "sglang.platform_plugins"
GENERAL_PLUGINS_GROUP = "sglang.general_plugins"
```
+1 to @merrymercy comments. I do not like this name (slack me if u want and I can help workshop)
Renamed to sglang.plugins — shorter and consistent with the actual package structure.
```python
    Returns:
        Dictionary mapping plugin name to its loaded callable.
    """
    from importlib.metadata import entry_points
```
This logic can be simplified via some toml work:

```python
def discover_platforms() -> dict[str, type[SRTPlatform]]:
    platforms: dict[str, type[SRTPlatform]] = {}

    # 1. Built-in (always available)
    from sglang.srt.platforms.cuda import CUDASRTPlatform
    from sglang.srt.platforms.cpu import CPUSRTPlatform
    platforms["cuda"] = CUDASRTPlatform
    platforms["cpu"] = CPUSRTPlatform
    # rest of in-tree platforms

    # 2. entry_points from pip-installed packages
    for ep in importlib.metadata.entry_points(group="sglang.platforms"):
        try:
            cls = ep.load()
            platforms[ep.name] = cls
        except Exception as e:
            logger.warning(f"Failed to load platform plugin {ep.name}: {e}")

    # 3. SGLANG_PLATFORM_PLUGIN override (dev/testing)
    if plugin_spec := os.environ.get("SGLANG_PLATFORM_PLUGIN"):
        name, qualname = (
            plugin_spec.split(":", 1) if ":" in plugin_spec else (plugin_spec, plugin_spec)
        )
        cls = resolve_obj_by_qualname(qualname)
        platforms[name] = cls

    return platforms


def get_platform(device: str) -> SRTPlatform:
    """Return the platform for a given device string. Caches instances."""
    # Looks up from discovered platforms, instantiates, caches
    ...
```

Packages register via pyproject.toml:

```toml
[project.entry-points."sglang.platforms"]
npu = "sglang_npu.platform:NPUSRTPlatform"
```
Thank you for the thoughtful design! We simplified init.py to only do entry_points discovery for OOT platforms. Built-in platforms (CUDA/ROCm/NPU/XPU) are not registered as platform plugins yet — they continue to use the existing is_cuda() utility functions (432+ call sites across 195 files). This avoids a massive refactor in this PR while achieving the same goal of clean OOT extensibility. Once the interfaces stabilize, built-in platforms can be gradually migrated to the same plugin architecture, which aligns with your suggested end-state.
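For illustration, the OOT-only discovery described here can stay very small. In this sketch the env var name comes from this thread, but the tie-breaking rule is an assumption, not necessarily what the PR ships:

```python
import os
from importlib.metadata import entry_points

def discover_oot_platform_spec():
    """Return the qualified name of the OOT platform class to load, or None."""
    # Explicit override (e.g. for dev/testing) wins over discovery.
    if spec := os.environ.get("SGLANG_PLATFORM"):
        return spec
    eps = list(entry_points(group="sglang.platform_plugins"))
    # Illustrative tie-break: take the first registered OOT platform, if any.
    return eps[0].value if eps else None

print(discover_oot_platform_spec())
```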
```python
def is_out_of_tree(self) -> bool:
    """Returns True for externally-registered OOT platforms."""
    return self._enum == PlatformEnum.OOT
```
As a general comment, shouldn't quite a few of these be instance methods and not class methods?
Agreed, thanks! All methods in both DeviceMixin and SRTPlatform are now instance methods. current_platform is an instance (lazy singleton), so everything is accessed via current_platform.some_method().
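A small, self-contained sketch of the lazy-singleton idea: resolution is deferred until the first attribute access and then cached, so `current_platform.some_method()` works as an instance call. The proxy and class names here are illustrative, not the PR's actual implementation:

```python
import functools

class _CpuPlatform:  # stand-in platform implementation
    def get_device_name(self) -> str:
        return "cpu"

@functools.lru_cache(maxsize=1)
def _resolve_platform():
    # Real code would consult entry_points / env vars here; resolve once, cache per process.
    return _CpuPlatform()

class _CurrentPlatformProxy:
    """Defers platform resolution until the first attribute access."""
    def __getattr__(self, attr):
        return getattr(_resolve_platform(), attr)

current_platform = _CurrentPlatformProxy()
print(current_platform.get_device_name())  # resolves lazily on first use
```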
```python
"""
Discover and instantiate the active platform.

Priority: OOT plugins > builtin detection.
```
Do you think it's a good idea to have a class member for priority? @alexnails
```python
# OOT
class OOTPlatform(Platform):
    priority = 500

# built-in
class CUDAPlatform(Platform):
    priority = 100

class CPUPlatform(Platform):
    priority = 10
```

@merrymercy Added unit tests for the plugin system: test_hook_registry.py covers hook semantics (AROUND/BEFORE/AFTER/REPLACE), classmethod/staticmethod descriptor preservation, hook ordering, cross-target conflict detection, onion model, and idempotency; test_load_plugins.py covers plugin loading idempotency, exception resilience, SGLANG_PLUGINS whitelist filtering, and SGLANG_PLATFORM dist exclusion. 21 tests total.
…ormatting in test_platform_interface.py
hi @Baidu-AIAK, can you use
```python
def test_custom_supports_fp8(self):
    """Test platform can override supports_fp8."""

    class CustomPlatform(SRTPlatform):
```
there are many duplications of class CustomPlatform(SRTPlatform).
Can we extract it and make it shared across tests?
Done. Extracted a shared _StubPlatform subclass that provides minimal concrete implementations, replacing the repeated CustomPlatform(SRTPlatform) boilerplate across tests.
```python
    """Test is_out_of_tree returns False for non-OOT platform."""
    self.assertFalse(self.mixin.is_out_of_tree())

def test_empty_cache_noop(self):
```
Some *_noop and *_raises_not_implemented tests here are actually testing the "python language itself" instead of sglang logic.
The related methods of DeviceMixin are not overridden by a real device mixin class.
Maybe we should consider removing them.
Done. Removed all *_raises_not_implemented and *_noop tests, as well as other tests that only verify Python behavior (default return values, no-op methods, repr formatting, trivial overrides). The remaining tests focus on SGLang-specific logic: classification rules, validation boundaries, error paths, and branching.
We may need some platform discovery and _resolve_platform related tests, which are very important sglang logic.
Done. Added comprehensive tests for platform discovery: _resolve_platform (both SGLANG_PLATFORM env branch and auto-discover branch), _load_platform_class qualname resolution with type validation
merrymercy left a comment
Looks good! We can merge once CI is green
…d model_runner.py
@kjuuii

```python
# __init__.py
def activate():
    """Return FQN of platform class, or None to skip."""
    if _mlu_is_available():
        return "my_platform_plugin.platform.MluSRTPlatform"
    return None
```

And use a flat, self-contained package structure:
/rerun-failed-ci
@merrymercy Could we skip this stage?
/rerun-failed-ci |
Co-authored-by: root <root@tjzj-inf-sci-k8s-bzz2-0183.tjzj.baidu.com> Co-authored-by: Alex Nails <alex.nails@radixark.ai> Co-authored-by: Alex Nails <alexj.nails@gmail.com> Co-authored-by: root <root@tjzj-inf-sci-k8s-bzz2-0000.tjzj.baidu.com> Co-authored-by: Mick <mickjagger19@icloud.com>
Hi, do we have a plan to implement the same multi-platform plugin system for multimodal-gen too?
@afei6 The plugin itself will have an MMPlatform to compose with; it's just not in the current set of tasks we are working on, as the current priority is the SRTPlatform side. I will write a roadmap doc for tasks people can take.
Summary
Introduce a unified plugin framework for SGLang, inspired by vLLM's platform abstraction, enabling hardware vendors and advanced users to extend SGLang without forking or modifying the main repository.
The framework provides two plugin types, both discovered via Python's standard setuptools `entry_points` mechanism:

- **Hardware Platform Plugin** (`sglang.platform_plugins`): Register custom hardware platforms: device ops, KV cache pools, attention backends, CUDA Graph runners, compilation backends, multi-platform dispatch, etc.
- **General Function Plugin** (`sglang.plugins`): Inject hooks (BEFORE/AFTER/AROUND/REPLACE) into arbitrary functions/methods in SGLang, or replace entire classes, all managed by a single `HookRegistry`.

Key Design Principles

- In-tree platforms keep using the existing `is_cuda()`/`is_npu()` utility functions (432+ call sites across 195 files). The Platform system is exclusively for OOT discovery.
- OOT platforms register themselves via `pip install`; no SGLang code changes required.
- The `SGLANG_PLUGINS` env var provides comma-separated allowlist filtering.

Changes
New framework core files (5 files)
| File | Description |
|---|---|
| `srt/plugins/__init__.py` | `load_plugins()`, `SGLANG_PLUGINS` env var filtering |
| `srt/plugins/hook_registry.py` | `HookRegistry` with BEFORE/AFTER/AROUND/REPLACE support, `plugin_hook` decorator, class replacement via `setattr`, `resolve_obj()` |
| `srt/platforms/device_mixin.py` | `PlatformEnum` (10 members) + `DeviceMixin` base class (identity queries + device operations) |
| `srt/platforms/interface.py` | `SRTPlatform(DeviceMixin)`: factory methods, capability flags, lifecycle hooks |
| `srt/platforms/__init__.py` | `current_platform` singleton, pure entry_points OOT discovery + `SGLANG_PLATFORM` env var override |

Plugin loading integration (4 call sites)
| File | Call Site | Covers |
|---|---|---|
| `cli/serve.py` | `serve()` top | `sglang serve` CLI, before model type dispatch |
| `launch_server.py` | `__main__` | `python -m sglang.launch_server` entrypoint |
| `entrypoints/engine.py` | `_launch_subprocesses()` | `Engine(model_path=...)`, before `check_server_args()` |
| `managers/scheduler.py` | `run_scheduler_process()` | before `Scheduler` instantiation |

`load_plugins()` is idempotent (boolean guard). The subprocess call is necessary because `mp.Process(spawn)` creates a fresh Python interpreter that does not inherit main process memory state.
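A self-contained sketch of the loading pattern summarized above (idempotent boolean guard, entry-point discovery, optional `SGLANG_PLUGINS` allowlist); the function and variable names are illustrative, not the exact PR code:

```python
import logging
import os
from importlib.metadata import entry_points

logger = logging.getLogger(__name__)
_loaded = False

def load_plugins() -> None:
    """Discover and execute plugin entry points exactly once per process."""
    global _loaded
    if _loaded:  # boolean guard: repeated calls within the same process are no-ops
        return
    _loaded = True
    allow = os.environ.get("SGLANG_PLUGINS")
    allowed = set(allow.split(",")) if allow else None
    for ep in entry_points(group="sglang.plugins"):
        if allowed is not None and ep.name not in allowed:
            continue  # comma-separated allowlist filtering
        try:
            ep.load()()  # the entry point resolves to a callable that registers hooks
        except Exception:
            logger.exception("Failed to load plugin %s", ep.name)
```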
OOT code paths (7 files, all additive; no existing code removed)

| File | OOT Integration Points |
|---|---|
| `server_args.py` | `apply_server_args_defaults()`, disable piecewise CUDA Graph, default attention backend |
| `model_runner.py` | `init_backend()`, post-init initialization, graph recapture after weight update, graph log label, custom GraphRunner class |
| `model_runner_kv_cache_mixin.py` | OOT KV pool classes (NSA/MLA/MHA) |
| `memory_pool.py` | `is_cuda_alike()` for alt_stream |
| `multi_platform.py` | `forward_{key}` method lookup |
| `compilation/backend.py` | OOT piecewise compilation backend class |
| `utils/common.py` | `get_available_gpu_memory()` OOT fallback, `get_device_memory_capacity()`, `get_compiler_backend()` |

Architecture
Class hierarchy
Vendors implement `MyDeviceMixin(DeviceMixin)` once for device operations, then mix it into both `SRTPlatform` and future `MMPlatform` subclasses via Python MRO.

Hook system

The `plugin_hook` decorator and `HookRegistry.register()` provide a unified API for both function hooks and class replacement:
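A toy, self-contained illustration of the REPLACE-style class swap (the real hook resolves a dotted sglang path; the module and class names below are stand-ins):

```python
import types

engine = types.SimpleNamespace()  # stand-in for an sglang module

class TokenPool:  # stand-in core class
    def allocate(self, n):
        return list(range(n))

engine.TokenPool = TokenPool

class LoggingTokenPool(TokenPool):  # plugin-provided replacement
    def allocate(self, n):
        print(f"allocating {n} tokens")
        return super().allocate(n)

# REPLACE semantics: direct setattr on the owning module, no wrapper function.
setattr(engine, "TokenPool", LoggingTokenPool)

pool = engine.TokenPool()
assert isinstance(pool, TokenPool)  # inheritance/isinstance semantics preserved
print(pool.allocate(3))
```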
Class replacement uses direct `setattr` (not a `functools.wraps` wrapper), preserving isinstance/issubclass/inheritance semantics. A dual validation mechanism prevents misuse: classes can only use REPLACE, and function hooks cannot target class objects.

Platform discovery flow

Documentation

`docs/platforms/plugin.md`: a comprehensive guide covering both plugin types, architecture diagrams, API reference, and quickstart examples.