[fix] Skip flush_cache in in_place mode and add fully async example by maocheng23 · Pull Request #974 · radixark/miles

maocheng23 · 2026-04-12T05:55:33Z

Summary

Skip the flush_cache call when pause_generation_mode is in_place. In fully async mode the flush is unnecessary, and it hangs because the engine never becomes fully idle while paused (the waiting queue still holds requests).
Add an example script, run_qwen3_30b_a3b_fully_async.py, for Qwen3-30B-A3B fully async training with configurable pause-generation and weight-transfer modes.

Test plan

Run fully async training with --pause-generation-mode in_place and verify no hang during weight update
Run with --pause-generation-mode retract to confirm flush_cache still executes
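
The two test-plan cases can be illustrated without a cluster. The following is a minimal sketch, not the repository's code: `StubEngine` and `pause_for_weight_update` are hypothetical names, the Ray remote calls are replaced by plain method calls, and only the mode values "in_place" and "retract" named in this PR are exercised.

```python
from typing import List, Tuple


class StubEngine:
    """Hypothetical stand-in for a rollout engine; it only records calls."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, ...]] = []

    def pause_generation(self, mode: str) -> None:
        self.calls.append(("pause_generation", mode))

    def flush_cache(self) -> None:
        self.calls.append(("flush_cache",))


def pause_for_weight_update(engines: List[StubEngine], mode: str) -> None:
    """Pause every engine, then flush caches unless pausing in place."""
    for engine in engines:
        engine.pause_generation(mode=mode)
    # In in_place mode the waiting queue still holds requests, so the
    # flush is skipped instead of waiting for an idle engine.
    if mode != "in_place":
        for engine in engines:
            engine.flush_cache()


in_place_engine = StubEngine()
pause_for_weight_update([in_place_engine], mode="in_place")

retract_engine = StubEngine()
pause_for_weight_update([retract_engine], mode="retract")

print(in_place_engine.calls)
print(retract_engine.calls)
```

In this sketch the in_place engine records only the pause call, while the retract engine records the pause followed by the flush.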

🤖 Generated with Claude Code

…mple In fully async (in_place) mode, flush_cache is unnecessary and can hang because the engine never becomes fully idle while paused. Skip it when pause_generation_mode is "in_place". Also adds an example script for Qwen3-30B-A3B fully async training with configurable pause-generation and weight-transfer modes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces a new script for running Qwen3-30B-A3B in a fully asynchronous mode and modifies the weight update logic to skip cache flushing when using in-place generation pausing. Feedback focuses on improving code quality by using idiomatic string comparisons, moving imports to the top level per PEP 8, and correctly implementing dynamic default values in dataclasses using field(default_factory=...) to prevent shared state across instances.

gemini-code-assist · 2026-04-12T05:58:40Z

            mode = self.args.pause_generation_mode
            ray.get([engine.pause_generation.remote(mode=mode) for engine in self.rollout_engines])
-            ray.get([engine.flush_cache.remote() for engine in self.rollout_engines])
+            if mode not in ("in_place"):


The expression mode not in ("in_place") is evaluated as a substring check because ("in_place") is a parenthesized string literal, not a tuple (a one-element tuple needs a trailing comma). It gives the intended result for the current mode values, but any substring of "in_place", including the empty string, would also be treated as in_place. A direct inequality check is clearer and more robust.

Suggested change

-            if mode not in ("in_place"):
+            if mode != "in_place":
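
The difference between the two forms can be checked directly. This is a small sketch with hypothetical helper names; the mode strings "place" and "" are made-up inputs used only to expose the substring behavior.

```python
def skips_flush_substring(mode: str) -> bool:
    # ("in_place") is just the string "in_place", so `in` tests for a substring.
    return not (mode not in ("in_place"))


def skips_flush_tuple(mode: str) -> bool:
    # The trailing comma makes a one-element tuple, so `in` tests membership.
    return not (mode not in ("in_place",))


def skips_flush_equality(mode: str) -> bool:
    # The form suggested in the review.
    return not (mode != "in_place")


for mode in ["in_place", "retract", "place", ""]:
    print(
        repr(mode),
        skips_flush_substring(mode),
        skips_flush_tuple(mode),
        skips_flush_equality(mode),
    )
```

All three agree on "in_place" and "retract"; only the substring form also skips the flush for "place" and for the empty string.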

gemini-code-assist · 2026-04-12T05:58:40Z

+from dataclasses import dataclass
+from typing import Literal


Move the os import to the top level and include field from dataclasses to support dynamic default values for dataclass fields.

Suggested change

-from dataclasses import dataclass
-from typing import Literal
+from dataclasses import dataclass, field
+import os
+from typing import Literal

References

PEP 8: Imports should be at the top of the file, before any other code except module docstrings.

gemini-code-assist · 2026-04-12T05:58:40Z

+@dataclass
+class ScriptArgs(U.ExecuteTrainConfig):
+    mode: Literal["normal", "debug_minimal"] = "normal"
+    run_id: str = U.create_run_id()


In Python dataclasses, dynamic default values should be defined using field(default_factory=...). Using a function call directly in the class definition assigns the result of that call at module load time, meaning all instances of ScriptArgs will share the same run_id generated when the script is first imported. Using a factory ensures a fresh ID is generated upon instantiation.

Suggested change

-    run_id: str = U.create_run_id()
+    run_id: str = field(default_factory=U.create_run_id)
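
A minimal sketch of the behavior described in this comment follows. `create_run_id` here is a hypothetical counter-based stand-in for `U.create_run_id`, whose real implementation is not shown in this PR, and the two dataclasses are illustrative rather than the script's `ScriptArgs`.

```python
import itertools
from dataclasses import dataclass, field

_counter = itertools.count()


def create_run_id() -> str:
    """Hypothetical stand-in for U.create_run_id: returns a new ID per call."""
    return f"run-{next(_counter)}"


@dataclass
class EagerArgs:
    # The call runs once, when the class body is executed.
    run_id: str = create_run_id()


@dataclass
class FactoryArgs:
    # The factory runs each time an instance is created without a run_id.
    run_id: str = field(default_factory=create_run_id)


eager_a, eager_b = EagerArgs(), EagerArgs()
factory_a, factory_b = FactoryArgs(), FactoryArgs()

print(eager_a.run_id, eager_b.run_id)
print(factory_a.run_id, factory_b.run_id)
```

Both `EagerArgs` instances carry the same ID, while each `FactoryArgs` instance gets its own. Whether a shared ID matters depends on how many instances a script creates.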

gemini-code-assist · 2026-04-12T05:58:40Z

+    import os
+
+    fully_async_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)))


The local import of os is unnecessary if moved to the top level, and os.path.join is redundant when called with a single argument.

Suggested change

-    import os
-
-    fully_async_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)))
+    fully_async_dir = os.path.dirname(os.path.abspath(__file__))
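
The redundancy is easy to confirm: `os.path.join` with a single argument returns that argument unchanged. The sketch below uses a made-up file path instead of `__file__` so it runs anywhere.

```python
import os

# Hypothetical script location, used only for illustration.
script_path = os.path.abspath(os.path.join("examples", "fully_async", "run.py"))

script_dir = os.path.dirname(script_path)
joined_dir = os.path.join(script_dir)  # single-argument join is a no-op

print(script_dir == joined_dir)
```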

…adixark#974) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…region clusters (#10)

* Revert "[BUGFIX] [P2PRDMA] Add rollout post-processing after P2PRDMA weight updates" (radixark#882)
* [Fix] fix ci (radixark#894)
* Avoid threading for ray getting object (radixark#886)
* Add explicit errors for unsupported Megatron profiles (radixark#887)
* Add nvfp4 quantizer files (radixark#907)
* Bump flash-linear-attention version to 0.4.2 (radixark#892)
* [BUGFIX] Invoke "post_process_quantization" by default after weight updating (radixark#890) Co-authored-by: Yueming Yuan <yym022502@gmail.com>
* Add heartbeat and id to session server (radixark#866)
* fix: adding thin glm5 image to docker build + latest tag sync (radixark#871)
* Add consistent hashing routing policy for rollout (radixark#891) Co-authored-by: Yueming Yuan <yueming@Mac.attlocal.net>
* [example] add retool v2 example with multi-turn framework interfaces (radixark#654) Co-authored-by: GuanxingLu <gxlu02@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Expose rollout-batch-size, n-samples-per-prompt, global-batch-size as CLI args in swe-agent-v2 (radixark#954) Co-authored-by: Shi Dong <shi.dong@radixark.ai>
* chore: remove obsolete swe-agent server.py and run-qwen3.sh (radixark#952) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add weight staleness control for fully async rollout (radixark#958)
* Fix/pause generation mode (radixark#924) Co-authored-by: Yueming Yuan <yym022502@gmail.com>
* [v0.5.10][1] Bump sglang to v0.5.10 (radixark#898)
* [v0.5.10][2] Fix apply_chat_template behavior for transformers >=5.0 (radixark#926) Co-authored-by: guapisolo <guapisolo@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* [v0.5.10][3] Fix processor return_tensors duplicate kwarg for transformers >=5.0 (radixark#927) Co-authored-by: guapisolo <guapisolo@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* [v0.5.10][4] Fix _no_split_modules set not subscriptable in transformers >=5.0 (radixark#931)
* [v0.5.10][5] Disable piecewise cuda graph to avoid NVLS oom (radixark#935)
* [v0.5.10][6][FSDP] fix outdated weight update logic in FSDP (radixark#948) Co-authored-by: guapisolo <guapisolo@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [v0.5.10][7][FSDP] move FSDP to experimental and disable by default (radixark#961)
* Add skiplist and more robust calculation on val (radixark#965)
* [fix] tiny fix debug rollout only in weight version check (radixark#967)
* feat: real cp support with relayout fix for qwen3.5 train/rollout mismatch (radixark#885)
* [AMD] Upgrade to sglv0.5.10 (radixark#973)
* switch model to actor (radixark#756)
* [fix] support general logic to bypass fp32 downcast and fix qwen35 A_log dtype (radixark#975) Co-authored-by: yueming-yuan <yym022502@gmail.com>
* fix: populate prefix_cache_info in OpenAI/session rollout path (radixark#960)
* Remove prepare_harbor_tasks.py; use harbor-private adapters (radixark#982)
* [fix] Skip flush_cache in in_place mode and add fully async example (radixark#974) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* GLM47 full cmd for async and sync reasoning (radixark#986) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: handle non-tool appended messages in TITO incremental tokenization (radixark#949) Co-authored-by: Yanbin Jiang <jybsuper@gmail.com>
* [docker] Add sgl-model-gateway install and download .tar.gz assets (radixark#895)
* [ci] fix hf rate limit error by caching tokenizer loading (radixark#1014) Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com>
* Use load_generate_function in legacy sglang_rollout path (radixark#1016)
* Update CODEOWNERS to add new reviewers (radixark#1021)
* Support moe lora for gpt-oss (radixark#798) Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
* [fix] restore expert_bias to fp32 before bridge weight export (radixark#811)
* [chore] drop legacy transformers upgrade pin for glm47-flash and qwen35 (radixark#1018)
* [fix] Enforce param dtype before wrap ddp (radixark#992) Co-authored-by: Zhichen Zeng <zczeng@uw.edu>
* [upgrade] update Megatron-Bridge source and LoRA CI to megatron e2e tests and (radixark#1023)
* [CI] Drop --use-miles-router from R3 tests and add r3 comparasion test between sgl & miles router (radixark#1015)
* wandb: raise init_timeout, add retry wrapper, fix shared-mode init for cross-region clusters

In online + shared mode, both `init_wandb_primary` and `init_wandb_secondary` make HTTPS round-trips to wandb cloud (login + run create/attach). On high-latency cross-region clusters (e.g. Abu Dhabi MBZUAI ↔ wandb-cloud US-West) with concurrent actor bursts, a single round-trip can exceed the wandb SDK's 90s default `init_timeout`, tearing down the whole run with a silent handshake abort. Observed on RL360 job 1564420, which forced `WANDB_MODE=offline` as a global default ever since (see https://github.com/LLM360/RL360/issues/87).

The issue's original diagnosis assumed a local primary↔secondary socket handshake race. That's not how shared mode works: per wandb's own feature PR (wandb/wandb#6882), each writer spawns an independent wandb-core that talks to the cloud directly; aggregation is server-side by run_id. No local socket exists. The failure mode is pure network/latency, not a local readiness race.

Changes
-------
- Bump `init_timeout` to 300s for primary and secondary Settings. Configurable via `WANDB_INIT_TIMEOUT_SECS` env var for tuning.
- Wrap both init paths in a bounded exponential-backoff retry (`_wandb_init_with_retry`) that re-attempts on wandb.errors.CommError and wandb.errors.UsageError. 3 attempts with 5→10→20s backoff by default, tunable via `WANDB_INIT_RETRY_ATTEMPTS` / `WANDB_INIT_RETRY_BACKOFF_SECS`.
- Add `x_label` tagging per wandb distributed-training docs: primary gets `rank_<rank>_primary`, secondaries get `rank_<rank>_secondary`. Enables per-rank console-log filtering in the wandb UI.
- Drop `reinit=True` from secondary init_kwargs. Shared mode natively supports concurrent writers on a single run; `reinit=True` triggered stale-state warnings on secondary actors without functional benefit.

Followups this change enables
-----------------------------
- `WANDB_MODE=offline` can be removed from scale.yaml's extra_env default once a pilot run confirms online mode boots cleanly.
- The tmux-based `~/bin/wandb-sync-rl360.sh` workaround on David's M2 account becomes obsolete (no more offline-only default).
- Near-realtime wandb dashboards replace the ~2-minute-lag offline sync; per-rank system metrics via x_label filtering.

---------

Co-authored-by: JD <jaedon.guo@gmail.com>
Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Ziang Li <ziangli@umich.edu>
Co-authored-by: Zhichen Zeng <zczeng@uw.edu>
Co-authored-by: JensenFire <xinji1@microsoft.com>
Co-authored-by: Yueming Yuan <yym022502@gmail.com>
Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com>
Co-authored-by: Douglas Yang <douglasyang88@gmail.com>
Co-authored-by: Yueming Yuan <yueming@Mac.attlocal.net>
Co-authored-by: Huapeng Zhou <73010314+PopSoda2002@users.noreply.github.com>
Co-authored-by: GuanxingLu <gxlu02@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Shi-Dong <Shi-Dong@users.noreply.github.com>
Co-authored-by: Shi Dong <shi.dong@radixark.ai>
Co-authored-by: Jiajun Li <48857426+guapisolo@users.noreply.github.com>
Co-authored-by: guapisolo <guapisolo@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Yuzhen Zhou <82826991+zyzshishui@users.noreply.github.com>
Co-authored-by: Yanbin Jiang <jybsuper@gmail.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Yisheng Gong <yishenggong9437@gmail.com>
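
The bounded exponential-backoff retry mentioned in the wandb commit message above can be sketched generically. This is not the `_wandb_init_with_retry` helper from that commit, whose body is not shown here: `call_with_retry`, its parameters, and the use of `ConnectionError` as the retried exception are all assumptions, and no wandb API is called. The sketch sleeps only between attempts, so three attempts produce delays of 5s and 10s; the commit may count its backoff steps differently.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_backoff_secs: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_backoff_secs * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Usage with a fake sleep so the example finishes instantly.
delays: List[float] = []
state = {"calls": 0}


def flaky_init() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated slow round-trip")
    return "ok"


result = call_with_retry(flaky_init, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the retry policy testable without waiting for real delays.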

maocheng23 requested review from fzyzcjy, yueming-yuan and yushengsu-thu as code owners April 12, 2026 05:55

gemini-code-assist Bot reviewed Apr 12, 2026

nit

5951813

yueming-yuan approved these changes Apr 14, 2026

maocheng23 merged commit f144961 into main Apr 15, 2026
17 checks passed

maocheng23 deleted the fix/fully_async_no_hang branch April 15, 2026 18:51

GuanxingLu pushed a commit to GuanxingLu/miles that referenced this pull request Apr 21, 2026

[fix] Skip flush_cache in in_place mode and add fully async example (r…

109247b

…adixark#974) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

maocheng23 merged 2 commits into main from fix/fully_async_no_hang

2 participants
