Ipv6 pr #19981

Closed
psaab wants to merge 20 commits into sgl-project:main from psaab:ipv6-pr

@psaab
Contributor

@psaab psaab commented Mar 5, 2026

Motivation

SGLang currently assumes IPv4 in many places: socket.gethostbyname() calls (IPv4-only), naive host.split(":") parsing that breaks on IPv6 colons, hard-coded 127.0.0.1 loopback, and bare IPv6 addresses in URLs without bracket wrapping. This makes SGLang unusable on IPv6-only or dual-stack networks.

This PR adds comprehensive IPv6 support so that SGLang works correctly on both IPv4 and IPv6 networks without any special configuration.

Modifications

The changes are organized into 7 logical groups:

1. Core IPv6 utilities (srt/utils/common.py)

  • Add resolve_hostname() — uses socket.getaddrinfo() instead of gethostbyname() to support both IPv4 and IPv6
  • Add parse_host_port() — safely parses host:port strings, handling bracketed IPv6 ([::1]:8000) and plain IPv4 (127.0.0.1:8000)
  • Update is_port_available(), get_free_port(), bind_port(), get_open_port() to try IPv6 first with IPv4 fallback
  • Add zmq.IPV6 flag in get_zmq_socket_on_host() when the host is IPv6
  • Exclude ::1 from local IP detection in get_local_ip_by_remote() and get_local_ip_by_nic()
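
For illustration, the host:port parsing described above could look roughly like the sketch below. Only the name parse_host_port() comes from this PR; the signature, return type, and error handling are assumptions and may differ from the code in srt/utils/common.py.

```python
from typing import Tuple


def parse_host_port(addr: str) -> Tuple[str, int]:
    """Split "host:port" into (host, port).

    Hypothetical sketch: accepts bracketed IPv6 ("[::1]:8000") and
    plain IPv4 or hostnames ("127.0.0.1:8000"). A bare IPv6 address
    such as "::1:8000" is rejected because it is ambiguous.
    """
    if addr.startswith("["):
        # Bracketed IPv6: everything up to "]" is the host.
        host, closing, rest = addr[1:].partition("]")
        if not closing or not rest.startswith(":"):
            raise ValueError(f"malformed bracketed address: {addr!r}")
        return host, int(rest[1:])

    # IPv4 or hostname: split on the last colon only.
    host, sep, port = addr.rpartition(":")
    if not sep or ":" in host:
        raise ValueError(f"cannot parse host:port from {addr!r}")
    return host, int(port)


print(parse_host_port("[::1]:8000"))      # ('::1', 8000)
print(parse_host_port("127.0.0.1:8000"))  # ('127.0.0.1', 8000)
```

Returning the host without brackets keeps it usable for socket calls; brackets are only added back when the address is formatted into a URL.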

2. Replace gethostbyname with IPv6-compatible alternatives

  • dumper.py — uses inline socket.getaddrinfo() (avoids sglang imports)
  • loader.py, model_runner.py — use resolve_hostname()
  • conn.py (disaggregation) — simplified with parse_host_port() + resolve_hostname()
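
As a rough sketch of the replacement described above, socket.getaddrinfo() can stand in for gethostbyname() as shown below. The function name resolve_hostname() is from this PR; the body is an assumption, and the real helper may choose among the returned addresses differently.

```python
import socket


def resolve_hostname(hostname: str) -> str:
    """Resolve a hostname to an IP address string (IPv4 or IPv6).

    Hypothetical sketch: returns the first address reported by
    getaddrinfo(), whose ordering depends on the system resolver.
    """
    infos = socket.getaddrinfo(hostname, None, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address for both AF_INET and AF_INET6.
    return infos[0][4][0]


# A numeric IPv4 literal resolves to itself without a DNS lookup.
print(resolve_hostname("127.0.0.1"))  # 127.0.0.1
```

Unlike gethostbyname(), this also works for hosts that only have AAAA records.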

3. Fix host:port parsing

  • server_args.py — use parse_host_port() instead of .split(":")
  • data_parallel_controller.py — replace multi-branch parsing with parse_host_port() + format_tcp_address()
  • model_runner.py, encode_server.py, encode_grpc_server.py, remote_instance.py, mindspore_runner.py, mooncake_store.py — use parse_host_port() / format_tcp_address() consistently

4. Wrap IPv6 addresses in brackets for URLs and address strings

  • Apply maybe_wrap_ipv6_address() across 18 files wherever host:port strings are constructed for URLs, log messages, or network addresses
  • Fix normalize_base_url() in utils.py to wrap IPv6 hosts in brackets
  • Covers: bench_serving.py, compile_deep_gemm.py, server entrypoints, disaggregation modules, model loader, weight loader utils, etc.
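
The bracket wrapping above can be sketched as follows. The names maybe_wrap_ipv6_address() and format_tcp_address() appear in this PR, but the signatures and bodies here are assumptions based on the standard ipaddress module, not the actual SGLang code.

```python
import ipaddress


def maybe_wrap_ipv6_address(host: str) -> str:
    """Return "[host]" if host is a bare IPv6 literal, else host unchanged.

    Hypothetical sketch: hostnames, IPv4 addresses, and already
    bracketed values fail IPv6 parsing and pass through as-is.
    """
    try:
        ipaddress.IPv6Address(host)
    except ValueError:
        return host
    return f"[{host}]"


def format_tcp_address(host: str, port: int) -> str:
    """Build a tcp:// URI, bracketing the host when it is IPv6."""
    return f"tcp://{maybe_wrap_ipv6_address(host)}:{port}"


print(maybe_wrap_ipv6_address("::1"))        # [::1]
print(maybe_wrap_ipv6_address("127.0.0.1"))  # 127.0.0.1
print(format_tcp_address("::1", 5000))       # tcp://[::1]:5000
```

Because an already bracketed host is returned unchanged, applying the wrapper twice is harmless.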

5. Default to IPv6 loopback (::1) instead of 127.0.0.1

  • Change ServerArgs.host default from "127.0.0.1" to "::1" (works on both dual-stack and IPv6-only systems)
  • Update multimodal_gen/server_args.py, gpu_worker.py, shm_broadcast.py similarly

6. Set zmq.IPV6 flag on ZMQ sockets

  • Enable IPv6 on all ZMQ PUB, SUB, PUSH, PULL, ROUTER, DEALER sockets that may bind/connect to IPv6 endpoints
  • Files: common.py, dumper.py, scheduler_client.py, kv_events.py, encode_server.py, expert_backup_client.py, expert_backup_manager.py
  • Fix the "::" in endpoint heuristic in kv_events.py, which falsely matched IPv6 addresses

7. Enhanced Mooncake transfer engine logging

  • Add detailed logging around transfer failures, session lifecycle, and ZMQ operations in mooncake/conn.py and mooncake_transfer_engine.py for debugging connectivity issues on IPv6 networks

Accuracy Tests

This PR does not modify model forward code, kernels, or inference logic. All changes are to networking/address handling code paths. No accuracy impact.

Benchmarking and Profiling

This PR does not affect inference speed. Changes are limited to:

  • Server startup address binding (one-time cost)
  • Log message formatting (negligible)
  • Socket creation order (IPv6 first, IPv4 fallback — same total cost)
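
The socket creation order mentioned above can be sketched as below. This is a minimal illustration rather than the code in this PR: get_free_port() is a name from the PR, but binding to the loopback addresses and the error handling are assumptions.

```python
import socket


def get_free_port() -> int:
    """Ask the OS for a free TCP port, trying IPv6 first, then IPv4.

    Hypothetical sketch: binds to the loopback address of each family
    and falls through to IPv4 when IPv6 is unsupported or unavailable.
    """
    candidates = ((socket.AF_INET6, "::1"), (socket.AF_INET, "127.0.0.1"))
    for family, host in candidates:
        try:
            with socket.socket(family, socket.SOCK_STREAM) as sock:
                sock.bind((host, 0))  # port 0 lets the OS pick a free port
                return sock.getsockname()[1]
        except OSError:
            continue  # family unsupported or address unavailable
    raise RuntimeError("no free port found on IPv6 or IPv4")


port = get_free_port()
print(port)
```

On a host with working IPv6, only one socket is created, so the cost matches the previous IPv4-only path.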

Checklist

  • Format code according to the project style guide
  • Add unit tests for parse_host_port() and resolve_hostname() utilities
  • Update documentation for IPv6 default (::1) and configuration
  • No accuracy or speed impact (networking-only changes)
  • Follow the SGLang code style guidance

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the diffusion SGLang Diffusion label Mar 5, 2026
@yhyang201
Collaborator

/tag-and-rerun-ci

@Kangyan-Zhou
Collaborator

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Mar 7, 2026
psaab added 4 commits March 9, 2026 10:43
- Prefer IPv6 sockets (try first, fall back to IPv4) in is_port_available,
  get_free_port, bind_port, and get_open_port
- Wrap IPv6 addresses with brackets in gRPC listen address and log messages
- Use format_tcp_address/parse_host_port for torch distributed init and
  DP controller endpoints instead of hardcoded 127.0.0.1
- Use maybe_wrap_ipv6_address for PortArgs fallback host
- Add ::1 to loopback exclusion checks in dumper.py and common.py
- Replace gethostbyname with getaddrinfo (supports both IPv4 and IPv6)
- Use parse_host_port in conn.py, server_args.py, data_parallel_controller.py
- Add IPv6 support to get_zmq_socket_on_host
- Fix normalize_base_url to handle bracketed IPv6 addresses
Change the default host from 127.0.0.1 to ::1 across all runtime code:
- ServerArgs default host in both srt and multimodal_gen
- All fallback loopback addresses in model_runner, data_parallel_controller,
  PortArgs, shm_broadcast, encode_server, encode_grpc_server, and gpu_worker
- Use format_tcp_address for proper bracket wrapping in tcp:// URIs
- Normalize localhost to ::1 in multimodal_gen scheduler_endpoint

Tests, eval scripts, docs, and loopback exclusion checks are left unchanged.
Include the prefill bootstrap address and room number in the error
message when a kvcache transfer fails, making it easier to identify
which prefill instance is unreachable.
Log all parameters passed to store.setup() and store.setup_dummy()
before the call, making it easier to diagnose configuration issues.
psaab added 15 commits March 9, 2026 10:47
Replace the temp bracket hack with proper maybe_wrap_ipv6_address on
the local_hostname config value before passing to store.setup(). The
shared engine path (get_session_id) already returns bracketed IPv6
so it needs no wrapping. Also fix the embedding store.
The bootstrap_addr was built as f"{host}:{port}" which breaks IPv6
addresses in health check URLs and mooncake session lookups. Wrap the
host with maybe_wrap_ipv6_address so IPv6 addresses get brackets,
producing correct URLs like http://[2803:...]:30000/health.
Wrap all remaining unprotected host:port string constructions with
maybe_wrap_ipv6_address() or format_tcp_address() to prevent malformed
URLs and TCP addresses when using IPv6 addresses.

Files fixed:
- grpc_server.py: gRPC warmup URL
- http_server_engine.py: log message and HTTP URL
- model_runner.py: transfer engine session_id, init_method tcp:// URIs, log messages
- remote_instance.py: init_method tcp:// URI
- encode_server.py: ZMQ tcp:// endpoint
- encode_grpc_server.py: gRPC listen address
- encode_receiver.py: receive_url host:port
- mindspore_runner.py: dist_init_method tcp:// URIs
- ascend/transfer_engine.py: session_id
- dumper.py: ZMQ local_addr
- loader.py: instance:// and http:// URLs for remote weight loading
- remote_instance_weight_loader_utils.py: seed instance service URLs
- mooncake_store.py: master_server_ip parsing and URL (fix broken split on IPv6)
- mini_3fs_metadata_server.py: log message
- elastic_ep/expert_backup_manager.py: ZMQ bind addresses
- elastic_ep/expert_backup_client.py: ZMQ connect addresses
- bench_serving.py: all http:// URL constructions
- compile_deep_gemm.py: base_url construction
- multimodal_gen/benchmarks/bench_serving.py: base_url construction
ZMQ requires zmq.IPV6=1 on each socket before bind/connect to
IPv6 endpoints; otherwise the socket stays in IPv4-only mode and fails silently.
- Add info-level init/session_id logs to MooncakeTransferEngine
- Upgrade transfer failure logs from debug to error with full context
  (session_id, addresses, lengths, return codes)
- Upgrade memory registration failures from debug to warning
- Include return code, endpoint, room, and chunk info in conn.py
  session failure logs
- Wrap IPv6 endpoints in log messages with maybe_wrap_ipv6_address
- Log registered buffer ptrs and lengths at init time
- Log failing transfer block address ranges (src..src+len, dst..dst+len)
  to correlate with C++ "address not found" errors
- Add debug logging of base pointers and layer count in send_kvcache
- Fix kv_chunk_idx NameError in session failure log
Now logs: local_session_id, prefill/dst index counts, MLA/MHA backend,
return code, and includes ret/session/indices in the record_failure
message that propagates to KVTransferError.
Log endpoint, room, session_id, and payload metadata at each
ZMQ send_multipart site: aux data sends, status syncs, decode
registration, and decode init.
Wrap all 4 ZMQ send_multipart calls with try/except:
- Log debug before send (what we're about to send)
- Log debug after successful send
- Log error on exception with full context, then re-raise
SGLang default log_level is 'warning', so info/debug logs are
suppressed. Upgrade all new diagnostic logs to warning level so
they appear with default configuration.
- Log why a session is being skipped (num_failures, endpoint, room)
  when hitting the "not alive" early exit
- Prefix session failure log with "Marking session as failed" for
  easy correlation with later "not alive" messages
- Log when a session is cleared from failed state on re-registration
- Upgrade KVArgs registration log from debug to warning
- isort: fix import ordering in gpu_worker.py
- ruff: remove 3 unused imports (socket, is_valid_ipv6_address, maybe_wrap_ipv6_address)
- black: reformat long lines and ternary expressions across 8 files
Add fallback in get_zmq_socket_on_host: if binding to a specific address
fails (e.g., address from get_local_ip_by_remote() is not assigned to any
local interface due to tunneling/NAT66), fall back to binding all interfaces
with tcp://*.
…mote

- reserve_port(): try AF_INET6 first with AF_INET fallback
- get_local_ip_by_remote(): try IPv6 (2001:4860:4860::8888) first, then
  fall back to IPv4 (8.8.8.8), then hostname resolution
@hnyls2002
Collaborator

We do accept vibe coding PRs, but for such a large refactor, please submit a roadmap first and make sure you know every single line you're about to change.

@hnyls2002
Collaborator

@psaab Hi, I have refactored the ipv6 utils in SGLang through #20306. Could you please reintroduce IPv6 support based on this PR and make minimal changes to support your workload?

@psaab
Contributor Author

psaab commented Mar 11, 2026

@hnyls2002 yes I can do that
