Add heartbeat and id to session server by maocheng23 · Pull Request #866 · radixark/miles

maocheng23 · 2026-04-03T00:10:04Z

Summary

Add /health endpoint to session server that returns session_server_instance_id
Propagate session_server_id (host:port) and session_server_instance_id (uuid) through agentic_tool_call metadata to the agent server
OpenAIEndpointTracer.create() fetches instance_id from /health on startup
Generate uuid-based session_server_instance_id in _start_session_server
Forward session_server_id and session_server_instance_id in swe_agent_function to agent server requests

Test plan

Unit test: test_create_fetches_session_server_instance_id — verifies tracer fetches instance_id from health endpoint
Unit test: test_health_reports_stable_instance_id — verifies health endpoint returns consistent uuid
Unit test: test_session_server_identity_forwarded_to_agent_metadata — verifies id propagation through generate pipeline

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request introduces a session server auto-discovery and health monitoring system for the SWE-agent-v2 experimental example. Key changes include a background health checker in the agent server to cancel stale trials, new admin endpoints for flushing tasks, and enhanced configuration options for training and rollout, including W&B and Prometheus integration. Feedback highlights security concerns regarding unauthenticated admin endpoints and the use of shell=True in subprocess calls, as well as opportunities to improve efficiency in DNS lookups and avoid hardcoded timeouts.

gemini-code-assist · 2026-04-03T00:16:19Z

+    if not _ADMIN_SECRET:
+        return


If HARBOR_ADMIN_SECRET is not set, the admin endpoints (/flush, /clients) are completely unprotected. This allows any user to cancel in-flight trials or view active session IDs. It is recommended to require this secret to be set for the server to start if these endpoints are enabled, or to default to a secure state that denies access.

Suggested change

if not _ADMIN_SECRET:

return

if not _ADMIN_SECRET:

raise HTTPException(status_code=403, detail="Admin secret not configured")

gemini-code-assist · 2026-04-03T00:16:20Z

-        response = await post(f"{agent_server_url}/run", request)
+        response = await asyncio.wait_for(
+            post(f"{agent_server_url}/run", request),
+            timeout=3600,  # 1 hour max per trial


The 1-hour timeout (3600s) for the agent server call is hardcoded. Following repository guidelines to avoid hardcoding parameters, it would be better to make this configurable via an environment variable (e.g., AGENT_TRIAL_TIMEOUT_SEC) or the model configuration to allow for different task complexities without modifying the code.

Suggested change

timeout=3600, # 1 hour max per trial

timeout=int(os.getenv("AGENT_TRIAL_TIMEOUT_SEC", "3600")), # 1 hour max per trial

References

Model parameters and configuration values should be retrieved from the configuration rather than being hardcoded.

gemini-code-assist · 2026-04-03T00:16:20Z

+            port = urlsplit(f"http://{session_server_id}").port
+            session_server_id = f"{external_host}:{port}"


If urlsplit does not find a port in session_server_id, port will be None, resulting in a malformed session_server_id string like host:None when external_host is set.

Suggested change

port = urlsplit(f"http://{session_server_id}").port

session_server_id = f"{external_host}:{port}"

parsed_sid = urlsplit(f"http://{session_server_id}")

port = parsed_sid.port

session_server_id = f"{external_host}:{port}" if port else external_host

gemini-code-assist · 2026-04-03T00:16:20Z

+async def _validate_session_server_id(session_server_id: str, base_url: str, client_host: str | None) -> str:
+    sid_host, sid_port, sid = _parse_session_server_id(session_server_id)
+    base_host, base_port = _parse_url_origin(base_url)
+    if sid_port != base_port:
+        raise ValueError("session_server_id port must match base_url port")
+    if not await _hosts_overlap(sid_host, base_host):
+        raise ValueError("session_server_id host must match base_url host")
+    if client_host is not None:
+        client_host = _normalize_host(client_host)
+        if not await _hosts_overlap(sid_host, client_host):
+            raise ValueError("session_server_id host must resolve to the caller")
+    return sid


_validate_session_server_id performs redundant DNS lookups by calling _hosts_overlap multiple times. Resolving sid_host once and reusing the result would be more efficient and reduce latency for the /run endpoint.

gemini-code-assist · 2026-04-03T00:16:20Z

+        subprocess.run(
+            f"pgrep -f '{t}' | {exclude} | xargs -r kill 2>/dev/null || true",
+            shell=True,
+        )


Using shell=True with string formatting in subprocess.run is a security risk if any part of the command string comes from external input. Although the targets are currently hardcoded, it is safer to avoid shell=True and use subprocess with a list of arguments. Additionally, the PID exclusion logic could be simplified using pgrep's native flags if available.

guapisolo

Checked offline and approve to unblock.

Shi-Dong

LGTM with minor comments.

Shi-Dong · 2026-04-05T03:46:09Z

+            session_server_id = f"{external_host}:{port}"
+        request["session_server_id"] = session_server_id
+
+    session_server_instance_id = metadata.get("session_server_instance_id")


I'd suggest adding a comment here to explain what session_server_id and session_server_instance_id are for.

Shi-Dong · 2026-04-05T03:59:34Z

+                if session_server_instance_id is not None:
+                    args.session_server_instance_id = session_server_instance_id
+        except Exception as e:
+            logger.warning("Failed to get session server health from %s: %s", session_url, e)


It feels like this warning is only logged when the post times out (the following if section is unlikely to throw an exception). I wonder whether it'd be more helpful to also log the warning when the session doesn't have a session_server_instance_id.

Shi-Dong · 2026-04-05T04:04:13Z

            assert "reward" not in s.metadata

+    def test_session_server_identity_forwarded_to_agent_metadata(self, variant, generation_env):
+        from miles.utils.test_utils import mock_tools


Nit: the import here is rather weird because some functions in mock_tools are already imported at the beginning.

Shi-Dong · 2026-04-05T04:14:32Z


    router_url = f"http://{args.sglang_router_ip}:{args.sglang_router_port}"

    from miles.rollout.session.session_server import run_session_server


Super nit: can we move this import to the beginning as a drive-by?

I think the style guide mandates that all imports be placed at the beginning. Personally it'd be OK for me make an exception if the import is heavy-weight, but it doesn't seem so in this case :/

…region clusters (#10) * Revert "[BUGFIX] [P2PRDMA] Add rollout post-processing after P2PRDMA weight updates" (radixark#882) * [Fix] fix ci (radixark#894) * Avoid threading for ray getting object (radixark#886) * Add explicit errors for unsupported Megatron profiles (radixark#887) * Add nvfp4 quantizer files (radixark#907) * Bump flash-linear-attention version to 0.4.2 (radixark#892) * [BUGFIX] Invoke "post_process_quantization" by default after weight updating (radixark#890) Co-authored-by: Yueming Yuan <yym022502@gmail.com> * Add heartbeat and id to session server (radixark#866) * fix: adding thin glm5 image to docker build + latest tag sync (radixark#871) * Add consistent hashing routing policy for rollout (radixark#891) Co-authored-by: Yueming Yuan <yueming@Mac.attlocal.net> * [example] add retool v2 example with multi-turn framework interfaces (radixark#654) Co-authored-by: GuanxingLu <gxlu02@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Expose rollout-batch-size, n-samples-per-prompt, global-batch-size as CLI args in swe-agent-v2 (radixark#954) Co-authored-by: Shi Dong <shi.dong@radixark.ai> * chore: remove obsolete swe-agent server.py and run-qwen3.sh (radixark#952) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add weight staleness control for fully async rollout (radixark#958) * Fix/pause generation mode (radixark#924) Co-authored-by: Yueming Yuan <yym022502@gmail.com> * [v0.5.10][1] Bump sglang to v0.5.10 (radixark#898) * [v0.5.10][2] Fix apply_chat_template behavior for transformers >=5.0 (radixark#926) Co-authored-by: guapisolo <guapisolo@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [v0.5.10][3] Fix processor return_tensors duplicate kwarg for transformers >=5.0 (radixark#927) Co-authored-by: guapisolo <guapisolo@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [v0.5.10][4] Fix _no_split_modules set not subscriptable in transformers >=5.0 (radixark#931) * [v0.5.10][5] Disable piecewise cuda graph to avoid NVLS oom (radixark#935) * [v0.5.10][6][FSDP] fix outdated weight update logic in FSDP (radixark#948) Co-authored-by: guapisolo <guapisolo@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [v0.5.10][7][FSDP] move FSDP to experimental and disable by default (radixark#961) * Add skiplist and more robust calculation on val (radixark#965) * [fix] tiny fix debug rollout only in weight version check (radixark#967) * feat: real cp support with relayout fix for qwen3.5 train/rollout mismatch (radixark#885) * [AMD] Upgrade to sglv0.5.10 (radixark#973) * switch model to actor (radixark#756) * [fix] support general logic to bypass fp32 downcast and fix qwen35 A_log dtype (radixark#975) Co-authored-by: yueming-yuan <yym022502@gmail.com> * fix: populate prefix_cache_info in OpenAI/session rollout path (radixark#960) * Remove prepare_harbor_tasks.py; use harbor-private adapters (radixark#982) * [fix] Skip flush_cache in in_place mode and add fully async example (radixark#974) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * GLM47 full cmd for async and sync reasoning (radixark#986) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: handle non-tool appended messages in TITO incremental tokenization (radixark#949) Co-authored-by: Yanbin Jiang <jybsuper@gmail.com> * [docker] Add sgl-model-gateway install and download .tar.gz assets (radixark#895) * [ci] fix hf rate limit error by caching tokenizer loading (radixark#1014) Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com> * Use load_generate_function in legacy sglang_rollout path (radixark#1016) * Update CODEOWNERS to add new reviewers (radixark#1021) * Support moe lora for gpt-oss (radixark#798) Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com> * [fix] restore expert_bias to fp32 before bridge weight export (radixark#811) * [chore] drop legacy transformers upgrade pin for glm47-flash and qwen35 (radixark#1018) * [fix] Enforce param dtype before wrap ddp (radixark#992) Co-authored-by: Zhichen Zeng <zczeng@uw.edu> * [upgrade] update Megatron-Bridge source and LoRA CI to megatron e2e tests and (radixark#1023) * [CI] Drop --use-miles-router from R3 tests and add r3 comparasion test between sgl & miles router (radixark#1015) * wandb: raise init_timeout, add retry wrapper, fix shared-mode init for cross-region clusters In online + shared mode, both `init_wandb_primary` and `init_wandb_secondary` make HTTPS round-trips to wandb cloud (login + run create/attach). On high-latency cross-region clusters (e.g. Abu Dhabi MBZUAI ↔ wandb-cloud US-West) with concurrent actor bursts, a single round-trip can exceed the wandb SDK's 90s default `init_timeout` — tearing down the whole run with a silent handshake abort. Observed on RL360 job 1564420, which forced `WANDB_MODE=offline` as a global default ever since (see https://github.com/LLM360/RL360/issues/87). The issue's original diagnosis assumed a local primary↔secondary socket handshake race. That's not how shared mode works — per wandb's own feature PR (wandb/wandb#6882), each writer spawns an independent wandb-core that talks to the cloud directly; aggregation is server-side by run_id. No local socket exists. The failure mode is pure network/latency, not a local readiness race. Changes ------- - Bump `init_timeout` to 300s for primary and secondary Settings. Configurable via `WANDB_INIT_TIMEOUT_SECS` env var for tuning. - Wrap both init paths in a bounded exponential-backoff retry (`_wandb_init_with_retry`) that re-attempts on wandb.errors.CommError and wandb.errors.UsageError. 3 attempts with 5→10→20s backoff by default, tunable via `WANDB_INIT_RETRY_ATTEMPTS` / `WANDB_INIT_RETRY_BACKOFF_SECS`. - Add `x_label` tagging per wandb distributed-training docs: primary gets `rank_<rank>_primary`, secondaries get `rank_<rank>_secondary`. Enables per-rank console-log filtering in the wandb UI. - Drop `reinit=True` from secondary init_kwargs. Shared mode natively supports concurrent writers on a single run; `reinit=True` triggered stale-state warnings on secondary actors without functional benefit. Followups this change enables ----------------------------- - `WANDB_MODE=offline` can be removed from scale.yaml's extra_env default once a pilot run confirms online mode boots cleanly. - The tmux-based `~/bin/wandb-sync-rl360.sh` workaround on David's M2 account becomes obsolete (no more offline-only default). - Near-realtime wandb dashboards replace the ~2-minute-lag offline sync; per-rank system metrics via x_label filtering. --------- Co-authored-by: JD <jaedon.guo@gmail.com> Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: Ziang Li <ziangli@umich.edu> Co-authored-by: Zhichen Zeng <zczeng@uw.edu> Co-authored-by: JensenFire <xinji1@microsoft.com> Co-authored-by: Yueming Yuan <yym022502@gmail.com> Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com> Co-authored-by: Douglas Yang <douglasyang88@gmail.com> Co-authored-by: Yueming Yuan <yueming@Mac.attlocal.net> Co-authored-by: Huapeng Zhou <73010314+PopSoda2002@users.noreply.github.com> Co-authored-by: GuanxingLu <gxlu02@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Shi-Dong <Shi-Dong@users.noreply.github.com> Co-authored-by: Shi Dong <shi.dong@radixark.ai> Co-authored-by: Jiajun Li <48857426+guapisolo@users.noreply.github.com> Co-authored-by: guapisolo <guapisolo@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Yuzhen Zhou <82826991+zyzshishui@users.noreply.github.com> Co-authored-by: Yanbin Jiang <jybsuper@gmail.com> Co-authored-by: Ying Sheng <sqy1415@gmail.com> Co-authored-by: Yisheng Gong <yishenggong9437@gmail.com>

Note: rollout_ft/0 was already merged into main as PR #897 and deleted from origin, so the cascade starts from main → rollout_ft/1. Conflict 1: miles/ray/rollout.py (modify/delete) - HEAD (rollout_ft/1) deleted the file as part of a mechanical split: commit d3fc26e "mechanically move" splits the original 1298-line miles/ray/rollout.py into the directory miles/ray/rollout/{addr_allocator, metrics,observability,rollout_manager,rollout_server,router_manager, server_group}.py. - origin/main modified the file (4 commits since merge-base): a772c33 zero_std all_zero_ratio/all_one_ratio metrics (#1034) 41615af weight staleness control for fully async rollout (#958) c198efa consistent hashing routing policy (#891) eaa36a2 heartbeat and id to session server (#866) Resolution: removed miles/ray/rollout.py (preserve mechanical split) and re-applied main's 7 hunks to the new directory files: - miles/ray/rollout/router_manager.py: + import uuid + router_args.policy = args.sglang_router_policy (in start_router) + args.session_server_instance_id = uuid.uuid4().hex (in start_session_server) - miles/ray/rollout/train_data_conversion.py: + train_data["weight_versions"] population (after multimodal block) + "weight_versions" added to per-DP-split key whitelist - miles/ray/rollout/metrics.py: + oldest_weight_version statistics + mixed_version_ratio in _compute_metrics_from_samples + zero_std/all_zero_percentage and all_one_percentage in _compute_zero_std_metrics Verified: all dependent symbols (Sample.weight_versions, Sample.oldest_weight_version, args.sglang_router_policy, args.session_server_instance_id) are already present in the merged tree from cleanly-merged peer files (miles/utils/types.py, miles/backends/sglang_utils/arguments.py, etc.). Verified: merge_diff_check.py shows the only "lost" deviations are PR #897 skill files (already in main) — no real loss.

gemini-code-assist Bot reviewed Apr 3, 2026

View reviewed changes

maocheng23 force-pushed the fix/agentic_error_hendle branch 2 times, most recently from 3782432 to e5cc2a1 Compare April 3, 2026 21:24

maocheng23 added 2 commits April 3, 2026 15:29

[feat] Add heartbeat and session_server_id

49e8416

Add tests

f4cf54e

maocheng23 force-pushed the fix/agentic_error_hendle branch from e5cc2a1 to f4cf54e Compare April 3, 2026 22:51

fix

f513b0c

maocheng23 changed the title ~~[feat] Add heartbeat for session_server, and book-keeping for agentic server~~ Add heartbeat and id to session server Apr 3, 2026

maocheng23 marked this pull request as ready for review April 3, 2026 22:58

maocheng23 requested review from fzyzcjy, guapisolo and yueming-yuan as code owners April 3, 2026 22:58

guapisolo approved these changes Apr 5, 2026

View reviewed changes

Shi-Dong approved these changes Apr 5, 2026

View reviewed changes

Shi-Dong reviewed Apr 5, 2026

View reviewed changes

yueming-yuan approved these changes Apr 7, 2026

View reviewed changes

maocheng23 merged commit eaa36a2 into main Apr 7, 2026
19 checks passed

maocheng23 deleted the fix/agentic_error_hendle branch April 7, 2026 03:02

GuanxingLu pushed a commit to GuanxingLu/miles that referenced this pull request Apr 21, 2026

Add heartbeat and id to session server (radixark#866)

f418615

fzyzcjy mentioned this pull request May 5, 2026

Refactor rollout.py by file decompositions #899

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add heartbeat and id to session server#866

Add heartbeat and id to session server#866
maocheng23 merged 3 commits intomainfrom
fix/agentic_error_hendle

maocheng23 commented Apr 3, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Uh oh!

guapisolo left a comment

Uh oh!

Shi-Dong left a comment

Uh oh!

Shi-Dong Apr 5, 2026

Uh oh!

Shi-Dong Apr 5, 2026

Uh oh!

Shi-Dong Apr 5, 2026

Uh oh!

Shi-Dong Apr 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

	timeout=3600, # 1 hour max per trial
	timeout=int(os.getenv("AGENT_TRIAL_TIMEOUT_SEC", "3600")), # 1 hour max per trial

		port = urlsplit(f"http://{session_server_id}").port
		session_server_id = f"{external_host}:{port}"

-            port = urlsplit(f"http://{session_server_id}").port
-            session_server_id = f"{external_host}:{port}"
+            parsed_sid = urlsplit(f"http://{session_server_id}")
+            port = parsed_sid.port
+            session_server_id = f"{external_host}:{port}" if port else external_host


		router_url = f"http://{args.sglang_router_ip}:{args.sglang_router_port}"

		from miles.rollout.session.session_server import run_session_server

Conversation

maocheng23 commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

guapisolo left a comment

Choose a reason for hiding this comment

Uh oh!

Shi-Dong left a comment

Choose a reason for hiding this comment

Uh oh!

Shi-Dong Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Shi-Dong Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Shi-Dong Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Shi-Dong Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

maocheng23 commented Apr 3, 2026 •

edited

Loading