Ipv6 pr by psaab · Pull Request #19981 · sgl-project/sglang

psaab · 2026-03-05T20:57:51Z

Motivation

SGLang currently assumes IPv4 in many places: socket.gethostbyname() calls (IPv4-only), naive host.split(":") parsing that breaks on IPv6 colons, hard-coded 127.0.0.1 loopback, and bare IPv6 addresses in URLs without bracket wrapping. This makes SGLang unusable on IPv6-only or dual-stack networks.

This PR adds comprehensive IPv6 support so that SGLang works correctly on both IPv4 and IPv6 networks without any special configuration.

Modifications

The changes are organized into 7 logical groups:

1. Core IPv6 utilities (`srt/utils/common.py`)

Add resolve_hostname() — uses socket.getaddrinfo() instead of gethostbyname() to support both IPv4 and IPv6
Add parse_host_port() — safely parses host:port strings, handling bracketed IPv6 ([::1]:8000) and plain IPv4 (127.0.0.1:8000)
Update is_port_available(), get_free_port(), bind_port(), get_open_port() to try IPv6 first with IPv4 fallback
Add zmq.IPV6 flag in get_zmq_socket_on_host() when the host is IPv6
Exclude ::1 from local IP detection in get_local_ip_by_remote() and get_local_ip_by_nic()
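
For readers who have not seen the diff, the two parsing and resolution helpers listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the signatures, the `default_port` parameter, and the error handling are assumptions.

```python
import socket
from typing import Optional, Tuple


def parse_host_port(
    addr: str, default_port: Optional[int] = None
) -> Tuple[str, Optional[int]]:
    """Split 'host:port' into (host, port), accepting bracketed IPv6.

    Hypothetical signature; the real helper in the PR may differ.
    """
    if addr.startswith("["):
        # Bracketed IPv6, e.g. "[::1]:8000" or "[::1]"
        end = addr.find("]")
        if end == -1:
            raise ValueError(f"missing closing bracket in {addr!r}")
        host, rest = addr[1:end], addr[end + 1:]
        if not rest:
            return host, default_port
        if not rest.startswith(":"):
            raise ValueError(f"unexpected text after bracket in {addr!r}")
        return host, int(rest[1:])
    if addr.count(":") > 1:
        # Bare IPv6 literal with no port, e.g. "::1"
        return addr, default_port
    host, sep, port = addr.rpartition(":")
    if not sep:
        # Plain hostname or IPv4 address with no port
        return addr, default_port
    return host, int(port)


def resolve_hostname(host: str) -> str:
    """Resolve a hostname to the first address getaddrinfo() returns.

    Unlike gethostbyname(), getaddrinfo() can return IPv6 results.
    """
    infos = socket.getaddrinfo(host, None, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both AF_INET and AF_INET6.
    return infos[0][4][0]
```

Under these assumptions, `parse_host_port("[::1]:8000")` yields `("::1", 8000)` and `parse_host_port("127.0.0.1:8000")` yields `("127.0.0.1", 8000)`, whereas a naive `split(":")` would break the first input into four pieces.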

2. Replace `gethostbyname` with IPv6-compatible alternatives

dumper.py — uses inline socket.getaddrinfo() (avoids sglang imports)
loader.py, model_runner.py — use resolve_hostname()
conn.py (disaggregation) — simplified with parse_host_port() + resolve_hostname()

3. Fix host:port parsing

server_args.py — use parse_host_port() instead of .split(":")
data_parallel_controller.py — replace multi-branch parsing with parse_host_port() + format_tcp_address()
model_runner.py, encode_server.py, encode_grpc_server.py, remote_instance.py, mindspore_runner.py, mooncake_store.py — use parse_host_port() / format_tcp_address() consistently

4. Wrap IPv6 addresses in brackets for URLs and address strings

Apply maybe_wrap_ipv6_address() across 18 files wherever host:port strings are constructed for URLs, log messages, or network addresses
Fix normalize_base_url() in utils.py to wrap IPv6 hosts in brackets
Covers: bench_serving.py, compile_deep_gemm.py, server entrypoints, disaggregation modules, model loader, weight loader utils, etc.
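
The bracket-wrapping rule can be illustrated with a small sketch built on the standard `ipaddress` module. The function names match the helpers mentioned in this PR, but the bodies below are assumptions written for illustration, not the PR's implementation.

```python
import ipaddress


def is_valid_ipv6_address(host: str) -> bool:
    """Return True only for a bare IPv6 literal such as '::1'."""
    try:
        ipaddress.IPv6Address(host)
        return True
    except ValueError:
        # Hostnames, IPv4 literals, and already-bracketed strings land here.
        return False


def maybe_wrap_ipv6_address(host: str) -> str:
    """Wrap bare IPv6 literals in brackets; leave everything else as-is."""
    return f"[{host}]" if is_valid_ipv6_address(host) else host


def format_tcp_address(host: str, port: int) -> str:
    """Build a tcp:// URI that stays well-formed for IPv6 hosts."""
    return f"tcp://{maybe_wrap_ipv6_address(host)}:{port}"


# Example: an f-string without wrapping would yield "http://::1:30000/health",
# in which the port cannot be told apart from the address.
health_url = f"http://{maybe_wrap_ipv6_address('::1')}:30000/health"
```

Because an already-bracketed string is not a valid IPv6 literal, this sketch never double-wraps: `maybe_wrap_ipv6_address("[::1]")` returns its input unchanged.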

5. Default to IPv6 loopback (`::1`) instead of `127.0.0.1`

Change ServerArgs.host default from "127.0.0.1" to "::1" (works on both dual-stack and IPv6-only systems)
Update multimodal_gen/server_args.py, gpu_worker.py, shm_broadcast.py similarly

6. Set `zmq.IPV6` flag on ZMQ sockets

Enable IPv6 on all ZMQ PUB, SUB, PUSH, PULL, ROUTER, DEALER sockets that may bind/connect to IPv6 endpoints
Files: common.py, dumper.py, scheduler_client.py, kv_events.py, encode_server.py, expert_backup_client.py, expert_backup_manager.py
Fix the `"::" in endpoint` heuristic in kv_events.py, which falsely matched IPv6 addresses
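
The decision of when a socket needs the flag can be sketched without importing pyzmq. The helper name `zmq_endpoint` is hypothetical and is not part of this PR. It only inspects address literals, so a hostname that happens to resolve to an IPv6 address would not be detected by this sketch.

```python
import ipaddress
from typing import Tuple


def zmq_endpoint(host: str, port: int) -> Tuple[str, bool]:
    """Return (endpoint, needs_ipv6) for a tcp:// ZMQ endpoint.

    Hypothetical helper for illustration only.
    """
    # Accept both "::1" and "[::1]" as input.
    bare = host[1:-1] if host.startswith("[") and host.endswith("]") else host
    try:
        is_v6 = ipaddress.ip_address(bare).version == 6
    except ValueError:
        is_v6 = False  # a hostname, not an address literal
    wrapped = f"[{bare}]" if is_v6 else bare
    return f"tcp://{wrapped}:{port}", is_v6


# With pyzmq installed, the flag would be set before bind/connect:
#   sock = zmq.Context.instance().socket(zmq.PULL)
#   endpoint, needs_ipv6 = zmq_endpoint("::1", 5557)
#   if needs_ipv6:
#       sock.setsockopt(zmq.IPV6, 1)
#   sock.bind(endpoint)
```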

7. Enhanced Mooncake transfer engine logging

Add detailed logging around transfer failures, session lifecycle, and ZMQ operations in mooncake/conn.py and mooncake_transfer_engine.py for debugging connectivity issues on IPv6 networks

Accuracy Tests

This PR does not modify model forward code, kernels, or inference logic. All changes are to networking/address handling code paths. No accuracy impact.

Benchmarking and Profiling

This PR does not affect inference speed. Changes are limited to:

Server startup address binding (one-time cost)
Log message formatting (negligible)
Socket creation order (IPv6 first, IPv4 fallback — same total cost)
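
The "IPv6 first, IPv4 fallback" socket creation order might look roughly like the sketch below, shown here for a free-port lookup. It is an assumption about the general pattern, not the PR's actual `get_free_port()`.

```python
import socket


def get_free_port() -> int:
    """Ask the OS for a free TCP port, preferring an IPv6 socket."""
    candidates = (
        (socket.AF_INET6, "::"),      # tried first
        (socket.AF_INET, "0.0.0.0"),  # fallback when IPv6 is unavailable
    )
    for family, addr in candidates:
        try:
            with socket.socket(family, socket.SOCK_STREAM) as s:
                s.bind((addr, 0))  # port 0 lets the OS choose
                return s.getsockname()[1]
        except OSError:
            # For example, the address family is not supported on this host.
            continue
    raise RuntimeError("could not bind on IPv6 or IPv4")
```

At most one extra socket is created when IPv6 is unavailable. As with any lookup of this kind, the returned port can be taken by another process before the caller binds it.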

Checklist

Format code according to the project style guide
Add unit tests for parse_host_port() and resolve_hostname() utilities
Update documentation for IPv6 default (::1) and configuration
No accuracy or speed impact (networking-only changes)
Follow the SGLang code style guidance

gemini-code-assist · 2026-03-05T20:57:54Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

yhyang201 · 2026-03-05T21:16:41Z

/tag-and-rerun-ci

Kangyan-Zhou · 2026-03-07T04:57:24Z

/tag-and-rerun-ci

- Prefer IPv6 sockets (try first, fall back to IPv4) in is_port_available, get_free_port, bind_port, and get_open_port
- Wrap IPv6 addresses with brackets in gRPC listen address and log messages
- Use format_tcp_address/parse_host_port for torch distributed init and DP controller endpoints instead of hardcoded 127.0.0.1
- Use maybe_wrap_ipv6_address for PortArgs fallback host
- Add ::1 to loopback exclusion checks in dumper.py and common.py
- Replace gethostbyname with getaddrinfo (supports both IPv4 and IPv6)
- Use parse_host_port in conn.py, server_args.py, data_parallel_controller.py
- Add IPv6 support to get_zmq_socket_on_host
- Fix normalize_base_url to handle bracketed IPv6 addresses

Change the default host from 127.0.0.1 to ::1 across all runtime code:

- ServerArgs default host in both srt and multimodal_gen
- All fallback loopback addresses in model_runner, data_parallel_controller, PortArgs, shm_broadcast, encode_server, encode_grpc_server, and gpu_worker
- Use format_tcp_address for proper bracket wrapping in tcp:// URIs
- Normalize localhost to ::1 in multimodal_gen scheduler_endpoint

Tests, eval scripts, docs, and loopback exclusion checks are left unchanged.

Replace the temp bracket hack with proper maybe_wrap_ipv6_address on the local_hostname config value before passing to store.setup(). The shared engine path (get_session_id) already returns bracketed IPv6 so it needs no wrapping. Also fix the embedding store.

The bootstrap_addr was built as f"{host}:{port}" which breaks IPv6 addresses in health check URLs and mooncake session lookups. Wrap the host with maybe_wrap_ipv6_address so IPv6 addresses get brackets, producing correct URLs like http://[2803:...]:30000/health.

Wrap all remaining unprotected host:port string constructions with maybe_wrap_ipv6_address() or format_tcp_address() to prevent malformed URLs and TCP addresses when using IPv6 addresses.

Files fixed:

- grpc_server.py: gRPC warmup URL
- http_server_engine.py: log message and HTTP URL
- model_runner.py: transfer engine session_id, init_method tcp:// URIs, log messages
- remote_instance.py: init_method tcp:// URI
- encode_server.py: ZMQ tcp:// endpoint
- encode_grpc_server.py: gRPC listen address
- encode_receiver.py: receive_url host:port
- mindspore_runner.py: dist_init_method tcp:// URIs
- ascend/transfer_engine.py: session_id
- dumper.py: ZMQ local_addr
- loader.py: instance:// and http:// URLs for remote weight loading
- remote_instance_weight_loader_utils.py: seed instance service URLs
- mooncake_store.py: master_server_ip parsing and URL (fix broken split on IPv6)
- mini_3fs_metadata_server.py: log message
- elastic_ep/expert_backup_manager.py: ZMQ bind addresses
- elastic_ep/expert_backup_client.py: ZMQ connect addresses
- bench_serving.py: all http:// URL constructions
- compile_deep_gemm.py: base_url construction
- multimodal_gen/benchmarks/bench_serving.py: base_url construction

- Add info-level init/session_id logs to MooncakeTransferEngine
- Upgrade transfer failure logs from debug to error with full context (session_id, addresses, lengths, return codes)
- Upgrade memory registration failures from debug to warning
- Include return code, endpoint, room, and chunk info in conn.py session failure logs
- Wrap IPv6 endpoints in log messages with maybe_wrap_ipv6_address

- Log registered buffer ptrs and lengths at init time
- Log failing transfer block address ranges (src..src+len, dst..dst+len) to correlate with C++ "address not found" errors
- Add debug logging of base pointers and layer count in send_kvcache
- Fix kv_chunk_idx NameError in session failure log

- Log why a session is being skipped (num_failures, endpoint, room) when hitting the "not alive" early exit
- Prefix session failure log with "Marking session as failed" for easy correlation with later "not alive" messages
- Log when a session is cleared from failed state on re-registration
- Upgrade KVArgs registration log from debug to warning

Add fallback in get_zmq_socket_on_host: if binding to a specific address fails (e.g., address from get_local_ip_by_remote() is not assigned to any local interface due to tunneling/NAT66), fall back to binding all interfaces with tcp://*.

hnyls2002 · 2026-03-09T19:29:15Z

We do accept vibe coding PRs, but for such a large refactor, please submit a roadmap first and make sure you know every single line you're about to change.

hnyls2002 · 2026-03-11T05:04:50Z

@psaab Hi, I have refactored the ipv6 utils in SGLang through #20306. Could you please reintroduce IPv6 support based on this PR and make minimal changes to support your workload?

psaab · 2026-03-11T05:37:10Z

@hnyls2002 yes I can do that

psaab requested review from ByronHsu, CatherineSue, Fridge003, JustinTong0323, Ying1123, ch-wan, hanming-lu, hnyls2002, iforgetmyname, ispobock, merrymercy, ping1jing2, slin1237, xiezhq-hermann and yizhang2077 as code owners March 5, 2026 20:57

psaab requested review from ShangmingCai, mickqian and yhyang201 as code owners March 5, 2026 20:57

github-actions Bot added the diffusion SGLang Diffusion label Mar 5, 2026

github-actions Bot added the run-ci label Mar 5, 2026

mickqian removed the run-ci label Mar 6, 2026

psaab force-pushed the ipv6-pr branch from 4156d36 to e185c66 Compare March 6, 2026 21:06

github-actions Bot added the run-ci label Mar 7, 2026

psaab added 4 commits March 9, 2026 10:43

Log prefill instance address on kvcache transfer failure

dd1a9d8

Include the prefill bootstrap address and room number in the error message when a kvcache transfer fails, making it easier to identify which prefill instance is unreachable.

Log Mooncake store setup parameters for debugging

2d441f3

Log all parameters passed to store.setup() and store.setup_dummy() before the call, making it easier to diagnose configuration issues.

psaab force-pushed the ipv6-pr branch from ddda02a to 35a3db5 Compare March 9, 2026 17:44

psaab added 15 commits March 9, 2026 10:47

Use maybe_wrap_ipv6_address in bench_serving files instead of inline …

7420a5d

…check

Use maybe_wrap_ipv6_address in mini_3fs_metadata_server instead of in…

23cb2da

…line check

Set zmq.IPV6 on all ZMQ sockets that bind/connect to IPv6 addresses

ed7b964

ZMQ requires zmq.IPV6=1 on each socket before bind/connect to IPv6 endpoints; otherwise the socket stays in IPv4-only mode and silently fails.

Enrich session failure log with local session, index counts, and backend

6fc792c

Now logs: local_session_id, prefill/dst index counts, MLA/MHA backend, return code, and includes ret/session/indices in the record_failure message that propagates to KVTransferError.

Add debug logging to ZMQ send operations in mooncake conn

d463f18

Log endpoint, room, session_id, and payload metadata at each ZMQ send_multipart site: aux data sends, status syncs, decode registration, and decode init.

Add before/after logging with error handling to ZMQ sends

7df5212

Wrap all 4 ZMQ send_multipart calls with try/except:

- Log debug before send (what we're about to send)
- Log debug after successful send
- Log error on exception with full context, then re-raise

Upgrade Mooncake logging from info/debug to warning level

1cb5826

SGLang default log_level is 'warning', so info/debug logs are suppressed. Upgrade all new diagnostic logs to warning level so they appear with default configuration.

Apply pre-commit formatting fixes (isort, black, ruff)

59e6b8a

- isort: fix import ordering in gpu_worker.py
- ruff: remove 3 unused imports (socket, is_valid_ipv6_address, maybe_wrap_ipv6_address)
- black: reformat long lines and ternary expressions across 8 files

psaab force-pushed the ipv6-pr branch from 35a3db5 to 6960306 Compare March 9, 2026 17:47

Fix remaining IPv4-only patterns: reserve_port and get_local_ip_by_re…

0e4714d

…mote

- reserve_port(): try AF_INET6 first with AF_INET fallback
- get_local_ip_by_remote(): try IPv6 (2001:4860:4860::8888) first, then fall back to IPv4 (8.8.8.8), then hostname resolution
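
The probe order described in this commit can be approximated with the sketch below. It is an illustration under assumptions, not the committed code: the final hostname-resolution step is omitted and replaced by returning `None`, and the port number is arbitrary.

```python
import socket
from typing import Optional


def get_local_ip_by_remote() -> Optional[str]:
    """Find the local address the OS would use to reach a public host.

    Tries IPv6 first, then IPv4. Returns None if neither has a route
    (the hostname-resolution fallback is left out of this sketch).
    """
    probes = (
        (socket.AF_INET6, "2001:4860:4860::8888"),
        (socket.AF_INET, "8.8.8.8"),
    )
    for family, remote in probes:
        try:
            with socket.socket(family, socket.SOCK_DGRAM) as s:
                # connect() on a UDP socket sends no packets; it only
                # makes the kernel pick a route and a source address.
                s.connect((remote, 80))
                ip = s.getsockname()[0]
        except OSError:
            continue  # no route or no support for this address family
        if ip not in ("::1", "127.0.0.1", "::", "0.0.0.0"):
            return ip
    return None
```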

psaab force-pushed the ipv6-pr branch from 6960306 to 0e4714d Compare March 9, 2026 18:01

hnyls2002 closed this Mar 9, 2026

hnyls2002 mentioned this pull request Mar 11, 2026

[Utils] Add NetworkAddress abstraction for IPv6-safe address handling #20306

Merged

4 tasks

psaab wants to merge 20 commits into sgl-project:main from psaab:ipv6-pr