Ipv6 pr #19981
Closed
psaab wants to merge 20 commits into sgl-project:main from
Collaborator
/tag-and-rerun-ci
Collaborator
/tag-and-rerun-ci
- Prefer IPv6 sockets (try first, fall back to IPv4) in is_port_available, get_free_port, bind_port, and get_open_port
- Wrap IPv6 addresses with brackets in gRPC listen address and log messages
- Use format_tcp_address/parse_host_port for torch distributed init and DP controller endpoints instead of hardcoded 127.0.0.1
- Use maybe_wrap_ipv6_address for PortArgs fallback host
- Add ::1 to loopback exclusion checks in dumper.py and common.py
- Replace gethostbyname with getaddrinfo (supports both IPv4 and IPv6)
- Use parse_host_port in conn.py, server_args.py, data_parallel_controller.py
- Add IPv6 support to get_zmq_socket_on_host
- Fix normalize_base_url to handle bracketed IPv6 addresses
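The "try IPv6 first, fall back to IPv4" behavior in the first bullet can be sketched as follows. This is a minimal illustration using only the standard library, not the PR's actual code: the function name is borrowed from the commit message, but its signature and the choice to treat EADDRINUSE as "port taken" while any other error triggers the IPv4 fallback are assumptions.

```python
import errno
import socket


def is_port_available(port: int) -> bool:
    """Return True if `port` can be bound, preferring IPv6 over IPv4.

    Illustrative sketch only; the real SGLang helper may differ.
    """
    candidates = ((socket.AF_INET6, "::"), (socket.AF_INET, "0.0.0.0"))
    for family, bind_host in candidates:
        try:
            with socket.socket(family, socket.SOCK_STREAM) as sock:
                sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                sock.bind((bind_host, port))
                return True
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False  # the port is genuinely taken
            continue  # family unsupported or unusable here: try the next one
    return False
```

Distinguishing "address in use" from other errors matters here: a host with IPv6 disabled should fall through to the IPv4 attempt rather than report every port as unavailable.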
Change the default host from 127.0.0.1 to ::1 across all runtime code:
- ServerArgs default host in both srt and multimodal_gen
- All fallback loopback addresses in model_runner, data_parallel_controller, PortArgs, shm_broadcast, encode_server, encode_grpc_server, and gpu_worker
- Use format_tcp_address for proper bracket wrapping in tcp:// URIs
- Normalize localhost to ::1 in multimodal_gen scheduler_endpoint

Tests, eval scripts, docs, and loopback exclusion checks are left unchanged.
Include the prefill bootstrap address and room number in the error message when a kvcache transfer fails, making it easier to identify which prefill instance is unreachable.
Log all parameters passed to store.setup() and store.setup_dummy() before the call, making it easier to diagnose configuration issues.
Replace the temp bracket hack with proper maybe_wrap_ipv6_address on the local_hostname config value before passing to store.setup(). The shared engine path (get_session_id) already returns bracketed IPv6 so it needs no wrapping. Also fix the embedding store.
The bootstrap_addr was built as f"{host}:{port}" which breaks IPv6
addresses in health check URLs and mooncake session lookups. Wrap the
host with maybe_wrap_ipv6_address so IPv6 addresses get brackets,
producing correct URLs like http://[2803:...]:30000/health.
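To illustrate the bracket-wrapping fix described above, here is a small sketch. The helper name maybe_wrap_ipv6_address comes from the commit message, but this body is an assumption built on the standard ipaddress module, and build_health_url is a hypothetical name introduced only for the example.

```python
import ipaddress


def maybe_wrap_ipv6_address(host: str) -> str:
    """Return `host` in brackets if it is a bare IPv6 literal, else unchanged.

    Assumed behavior; the real SGLang helper may be implemented differently.
    """
    try:
        ipaddress.IPv6Address(host)
    except ValueError:
        # Hostname, IPv4 literal, or an already bracketed address.
        return host
    return f"[{host}]"


def build_health_url(host: str, port: int) -> str:
    """Hypothetical helper: build a health-check URL that is valid for IPv6."""
    return f"http://{maybe_wrap_ipv6_address(host)}:{port}/health"
```

Without the wrapping step, an IPv6 host would yield a URL in which the port cannot be told apart from the address groups.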
Wrap all remaining unprotected host:port string constructions with maybe_wrap_ipv6_address() or format_tcp_address() to prevent malformed URLs and TCP addresses when using IPv6 addresses.

Files fixed:
- grpc_server.py: gRPC warmup URL
- http_server_engine.py: log message and HTTP URL
- model_runner.py: transfer engine session_id, init_method tcp:// URIs, log messages
- remote_instance.py: init_method tcp:// URI
- encode_server.py: ZMQ tcp:// endpoint
- encode_grpc_server.py: gRPC listen address
- encode_receiver.py: receive_url host:port
- mindspore_runner.py: dist_init_method tcp:// URIs
- ascend/transfer_engine.py: session_id
- dumper.py: ZMQ local_addr
- loader.py: instance:// and http:// URLs for remote weight loading
- remote_instance_weight_loader_utils.py: seed instance service URLs
- mooncake_store.py: master_server_ip parsing and URL (fix broken split on IPv6)
- mini_3fs_metadata_server.py: log message
- elastic_ep/expert_backup_manager.py: ZMQ bind addresses
- elastic_ep/expert_backup_client.py: ZMQ connect addresses
- bench_serving.py: all http:// URL constructions
- compile_deep_gemm.py: base_url construction
- multimodal_gen/benchmarks/bench_serving.py: base_url construction
ZMQ requires zmq.IPV6=1 on each socket before bind/connect to IPv6 endpoints; otherwise the socket stays in IPv4-only mode and fails silently.
- Add info-level init/session_id logs to MooncakeTransferEngine
- Upgrade transfer failure logs from debug to error with full context (session_id, addresses, lengths, return codes)
- Upgrade memory registration failures from debug to warning
- Include return code, endpoint, room, and chunk info in conn.py session failure logs
- Wrap IPv6 endpoints in log messages with maybe_wrap_ipv6_address
- Log registered buffer ptrs and lengths at init time
- Log failing transfer block address ranges (src..src+len, dst..dst+len) to correlate with C++ "address not found" errors
- Add debug logging of base pointers and layer count in send_kvcache
- Fix kv_chunk_idx NameError in session failure log
Now logs: local_session_id, prefill/dst index counts, MLA/MHA backend, return code, and includes ret/session/indices in the record_failure message that propagates to KVTransferError.
Log endpoint, room, session_id, and payload metadata at each ZMQ send_multipart site: aux data sends, status syncs, decode registration, and decode init.
Wrap all 4 ZMQ send_multipart calls with try/except:
- Log debug before send (what we're about to send)
- Log debug after successful send
- Log error on exception with full context, then re-raise
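The log-before, log-after, log-and-re-raise pattern listed above might look roughly like this. It is a sketch under stated assumptions: send_with_logging and RecordingSocket are hypothetical names, and RecordingSocket merely stands in for a ZMQ socket so the example runs without pyzmq installed.

```python
import logging
from typing import List, Sequence

logger = logging.getLogger(__name__)


class RecordingSocket:
    """Stand-in for a ZMQ socket (hypothetical, for illustration only)."""

    def __init__(self, fail: bool = False) -> None:
        self.fail = fail
        self.sent: List[List[bytes]] = []

    def send_multipart(self, frames: Sequence[bytes]) -> None:
        if self.fail:
            raise ConnectionError("simulated send failure")
        self.sent.append(list(frames))


def send_with_logging(sock, frames: Sequence[bytes], *, endpoint: str, room: int) -> None:
    """Send frames, logging before and after, and on failure with context."""
    logger.debug("Sending %d frame(s) to %s (room=%s)", len(frames), endpoint, room)
    try:
        sock.send_multipart(frames)
    except Exception:
        # Record full context, then let the caller handle the original error.
        logger.error(
            "send_multipart failed: endpoint=%s room=%s frames=%d",
            endpoint, room, len(frames), exc_info=True,
        )
        raise
    logger.debug("Sent %d frame(s) to %s (room=%s)", len(frames), endpoint, room)
```

Re-raising after logging keeps the existing error handling intact while still leaving a record of which endpoint and room were involved.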
SGLang default log_level is 'warning', so info/debug logs are suppressed. Upgrade all new diagnostic logs to warning level so they appear with default configuration.
- Log why a session is being skipped (num_failures, endpoint, room) when hitting the "not alive" early exit
- Prefix session failure log with "Marking session as failed" for easy correlation with later "not alive" messages
- Log when a session is cleared from failed state on re-registration
- Upgrade KVArgs registration log from debug to warning
- isort: fix import ordering in gpu_worker.py
- ruff: remove 3 unused imports (socket, is_valid_ipv6_address, maybe_wrap_ipv6_address)
- black: reformat long lines and ternary expressions across 8 files
Add fallback in get_zmq_socket_on_host: if binding to a specific address fails (e.g., address from get_local_ip_by_remote() is not assigned to any local interface due to tunneling/NAT66), fall back to binding all interfaces with tcp://*.
…mote
- reserve_port(): try AF_INET6 first with AF_INET fallback
- get_local_ip_by_remote(): try IPv6 (2001:4860:4860::8888) first, then fall back to IPv4 (8.8.8.8), then hostname resolution
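A rough sketch of the IPv6-then-IPv4 probing order described for get_local_ip_by_remote() is shown below. It assumes the common technique of connecting a UDP socket (which sends no packets) and reading the local address the kernel selected. The name probe_local_ip, the port number, and the None return are assumptions made for the example; the final hostname-resolution fallback mentioned in the commit is deliberately left out.

```python
import socket
from typing import Optional

# Probe targets taken from the commit message; port 80 is an arbitrary choice.
PROBES = (
    (socket.AF_INET6, "2001:4860:4860::8888"),
    (socket.AF_INET, "8.8.8.8"),
)


def probe_local_ip() -> Optional[str]:
    """Return the local address used to reach a public host, IPv6 first.

    Hypothetical sketch; returns None when neither family has a usable route.
    """
    for family, remote in PROBES:
        try:
            with socket.socket(family, socket.SOCK_DGRAM) as sock:
                # Connecting a UDP socket only selects a route and source
                # address; no traffic is sent.
                sock.connect((remote, 80))
                return sock.getsockname()[0]
        except OSError:
            continue  # no route or family unsupported: try the next probe
    return None
```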
Collaborator
We do accept vibe coding PRs, but for such a large refactor, please submit a roadmap first and make sure you know every single line you're about to change.
4 tasks
Collaborator
Contributor
Author
@hnyls2002 yes I can do that
Motivation
SGLang currently assumes IPv4 in many places: socket.gethostbyname() calls (IPv4-only), naive host.split(":") parsing that breaks on IPv6 colons, hard-coded 127.0.0.1 loopback, and bare IPv6 addresses in URLs without bracket wrapping. This makes SGLang unusable on IPv6-only or dual-stack networks.
This PR adds comprehensive IPv6 support so that SGLang works correctly on both IPv4 and IPv6 networks without any special configuration.
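To make the host.split(":") problem concrete, here is a minimal bracket-aware parser. The name parse_host_port matches the utility this PR introduces, but the signature, error handling, and return type below are assumptions made for illustration, not the PR's implementation.

```python
from typing import Tuple


def parse_host_port(addr: str) -> Tuple[str, int]:
    """Split 'host:port' into (host, port), accepting '[ipv6]:port'.

    Illustrative sketch; the real utility may behave differently.
    """
    # Naive splitting breaks on IPv6:
    #   "[::1]:8000".split(":") -> ['[', '', '1]', '8000']
    if addr.startswith("["):
        host, closing, rest = addr[1:].partition("]")
        if not closing or not rest.startswith(":"):
            raise ValueError(f"malformed bracketed address: {addr!r}")
        return host, int(rest[1:])
    host, sep, port = addr.rpartition(":")
    if not sep or ":" in host:
        # Either no port at all, or a bare IPv6 literal whose port is ambiguous.
        raise ValueError(f"expected host:port or [ipv6]:port, got {addr!r}")
    return host, int(port)
```

Rejecting unbracketed IPv6 input is a deliberate choice in this sketch, since there is no reliable way to tell the last address group from a port.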
Modifications
The changes are organized into 7 logical groups:
1. Core IPv6 utilities (srt/utils/common.py)
- resolve_hostname(): uses socket.getaddrinfo() instead of gethostbyname() to support both IPv4 and IPv6
- parse_host_port(): safely parses host:port strings, handling bracketed IPv6 ([::1]:8000) and plain IPv4 (127.0.0.1:8000)
- is_port_available(), get_free_port(), bind_port(), get_open_port(): try IPv6 first with IPv4 fallback
- get_zmq_socket_on_host(): sets the zmq.IPV6 flag when the host is IPv6
- get_local_ip_by_remote() and get_local_ip_by_nic(): exclude ::1 from local IP detection

2. Replace gethostbyname with IPv6-compatible alternatives
- dumper.py: uses inline socket.getaddrinfo() (avoids sglang imports)
- loader.py, model_runner.py: use resolve_hostname()
- conn.py (disaggregation): simplified with parse_host_port() + resolve_hostname()

3. Fix host:port parsing
- server_args.py: use parse_host_port() instead of .split(":")
- data_parallel_controller.py: replace multi-branch parsing with parse_host_port() + format_tcp_address()
- model_runner.py, encode_server.py, encode_grpc_server.py, remote_instance.py, mindspore_runner.py, mooncake_store.py: use parse_host_port() / format_tcp_address() consistently

4. Wrap IPv6 addresses in brackets for URLs and address strings
- maybe_wrap_ipv6_address(): used across 18 files wherever host:port strings are constructed for URLs, log messages, or network addresses
- normalize_base_url() in utils.py: wraps IPv6 hosts in brackets
- Covers bench_serving.py, compile_deep_gemm.py, server entrypoints, disaggregation modules, model loader, weight loader utils, etc.

5. Default to IPv6 loopback (::1) instead of 127.0.0.1
- ServerArgs.host default changed from "127.0.0.1" to "::1" (works on both dual-stack and IPv6-only systems)
- multimodal_gen/server_args.py, gpu_worker.py, shm_broadcast.py updated similarly

6. Set zmq.IPV6 flag on ZMQ sockets
- common.py, dumper.py, scheduler_client.py, kv_events.py, encode_server.py, expert_backup_client.py, expert_backup_manager.py
- kv_events.py: fix the "::" in endpoint heuristic that falsely matched IPv6 addresses

7. Enhanced Mooncake transfer engine logging
- Logging in mooncake/conn.py and mooncake_transfer_engine.py for debugging connectivity issues on IPv6 networks

Accuracy Tests
This PR does not modify model forward code, kernels, or inference logic. All changes are to networking/address handling code paths. No accuracy impact.
Benchmarking and Profiling
This PR does not affect inference speed. Changes are limited to:
Checklist
parse_host_port() and resolve_hostname() utilities
::1) and configuration