[example] add retool v2 example with multi-turn framework interfaces by PopSoda2002 · Pull Request #654 · radixark/miles

PopSoda2002 · 2026-03-02T08:10:22Z

What does this PR do?

Use the new multi-turn interface to provide a cleaner retool example.
Add a sandbox for executing tools.
Compared with v1, we no longer need to hand-write a generate file; we can reuse the existing interface from generate_hub with a custom config:

custom_args = (
        "--custom-generate-function-path miles.rollout.generate_hub.multi_turn.generate "
        "--generate-tool-specs-path examples.retool_v2.tool_sandbox.tool_specs "
        "--generate-execute-tool-function-path examples.retool_v2.tool_sandbox.execute_tool "
        "--generate-tool-call-parser qwen25 "
        f"--generate-max-turns {args.generate_max_turns} "
        "--log-multi-turn "
    )

Instead, we only need to provide the tool specs and the tool-execution function, which makes it much easier to verify new ideas.

Fix a bug in retool v1: the answer parser was too strict when extracting the correct answer.
Fix a bug in log_passrate.
Use the deprecated Megatron environment to avoid bugs with norm_epsilon.
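
For readers who want a feel for the two plug-in pieces named in the config above (tool specs and an execute-tool function), here is a minimal sketch. It assumes the specs are a list of OpenAI-style function schemas and that the execute function is an async callable taking a tool name and an argument dict; neither signature is confirmed in this thread, and the `code_interpreter` name and `_run_python` helper are hypothetical.

```python
import asyncio
import contextlib
import io
from typing import Any

# Hypothetical tool schema list; the exact format the framework expects is an assumption.
tool_specs = [
    {
        "type": "function",
        "function": {
            "name": "code_interpreter",
            "description": "Run a Python snippet and return its stdout.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    }
]


def _run_python(code: str) -> str:
    """Toy stand-in for a real sandbox: runs code in-process and captures stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # NOT safe for untrusted code; illustration only
    return buffer.getvalue()


async def execute_tool(name: str, arguments: dict[str, Any]) -> str:
    """Dispatch one parsed tool call (assumed signature, not the framework's)."""
    if name != "code_interpreter":
        return f"Error: unknown tool {name!r}"
    code = arguments.get("code", "")
    # Some parsers emit code as a list of lines; join defensively.
    if isinstance(code, list):
        code = "\n".join(str(item) for item in code)
    return _run_python(code)
```

A call such as `asyncio.run(execute_tool("code_interpreter", {"code": "print(1 + 1)"}))` returns the captured stdout as a string, which is what would be fed back to the model as the tool result.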

Result

We compare retool v1 and retool v2 to test correctness.

In the AIME eval graph below, the 2 bottom curves are the original retool v1, and the upper 4 curves are retool v1 after the fix and retool v2, each with qwen3-4b and qwen3-4b-sft. These curves follow the same trend.

We also tested the pass rate on the raw DAPO dataset, which is used for training:

The pass rates are close to each other.

Full wandb report:

https://api.wandb.ai/links/popsoda-university-of-washington/5yzh5361

Co-authored with @GuanxingLu

gemini-code-assist · 2026-03-02T08:10:39Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an advanced retool_v2 example that showcases the miles framework's enhanced multi-turn capabilities for RL training with integrated tool usage. The primary goal is to simplify the development experience by abstracting complex multi-turn orchestration and providing clear interfaces for defining tools, executing them safely, and calculating rewards, thereby enabling more efficient and robust tool-augmented model training.

Highlights

New Retool v2 Example: A new example, retool_v2, has been added to demonstrate multi-turn Reinforcement Learning (RL) training with tool calls using updated framework interfaces.
Simplified Multi-Turn Framework Interfaces: The retool_v2 example leverages the miles framework's standard plugin interfaces, allowing users to implement tool definitions, execution logic, and reward functions with significantly less boilerplate compared to v1.
Python Sandbox for Safe Tool Execution: A PythonSandbox class is introduced within tool_sandbox.py to provide a secure environment for executing Python code called by tools, including safety checks, resource limits, and memory management.
Tool-Aware Reward Function: The example includes a reward_func that utilizes math_dapo_compute_score and applies a bonus for tool usage, encouraging the model to effectively integrate tools into its problem-solving process.
Dedicated Training Launch Script: A new training launch script, run_retool_multi_turn.py, is provided to configure and execute the multi-turn RL training, handling model preparation, dataset downloading, and argument passing to the framework.

Changelog

examples/retool_v2/readme.md
- Added a comprehensive README file detailing the purpose, key differences from the v1 example, file structure, and quick start instructions for the Retool v2 example.
examples/retool_v2/run_retool_multi_turn.py
- Added a Python script to serve as the main entry point for launching the multi-turn RL training, configuring various parameters, and managing model and data preparation.
examples/retool_v2/tool_sandbox.py
- Added a module containing the PythonSandbox for safe code execution, a ToolRegistry for managing tool specifications, and an asynchronous reward function that incentivizes tool calls.

Activity

No specific activity (comments, reviews, progress updates) has been recorded for this pull request yet.


Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

PopSoda2002 · 2026-03-02T08:13:23Z

Hi @yueming-yuan @fzyzcjy I think this PR is ready to review. But I am not sure if the result is fine or I need to do more experiments:

It seems Qwen3-4B's result is better than Qwen3-4B(After SFT)
Both of their evaluation results are getting better due to RL training

Additionally thanks a lot for the incredible work!

PopSoda2002 · 2026-03-02T08:18:32Z

cc @GuanxingLu

gemini-code-assist

Code Review

This pull request introduces retool_v2, a new example showcasing multi-turn RL training with tool-calling capabilities using updated framework interfaces, including a README, a training script, and a tool sandboxing module. A security audit identified critical vulnerabilities: a hardcoded W&B API key risking unauthorized access, a potential command injection in the training launch script, and a weak regex-based blacklist in the Python sandbox that could lead to remote code execution. Additionally, the review noted areas for improvement in code portability and overall clarity.

gemini-code-assist · 2026-03-02T08:22:41Z

+## TODO: Delete in the future
+def _get_wandb_args() -> str:
+    WANDB_API_KEY = "a37f4796e6205800c4212556a38e1319b5f144b7"
+    return (


A hardcoded Weights & Biases (W&B) API key was found. Committing secrets directly to the source code is a security risk, as it allows unauthorized access to the associated service and exposes it to anyone with repository access. It is strongly recommended to load the key from a secure source, such as an environment variable.

Suggested change

-    return (
+def _get_wandb_args() -> str:
+    WANDB_API_KEY = os.environ.get("WANDB_API_KEY", "")
+    return (
+        "--use-wandb "
+        f"--wandb-project {WANDB_PROJECT} "
+        f"--wandb-group {WANDB_GROUP} "
+        f"--wandb-key {WANDB_API_KEY} "
+    )
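
One possible refinement of the environment-variable approach, sketched under assumptions: skip the W&B flags entirely when no key is set, so an empty `--wandb-key` is never passed. The project and group values below are placeholders, and the optional `env` parameter exists only to make the function easy to test.

```python
import os
from typing import Mapping, Optional

WANDB_PROJECT = "retool-v2-example"  # placeholder value, not from the PR
WANDB_GROUP = "qwen3-4b"  # placeholder value, not from the PR


def get_wandb_args(env: Optional[Mapping[str, str]] = None) -> str:
    """Build the W&B CLI flags, reading the key from the environment."""
    env = os.environ if env is None else env
    api_key = env.get("WANDB_API_KEY", "")
    if not api_key:
        # No key available: return no flags rather than an empty --wandb-key.
        return ""
    return (
        "--use-wandb "
        f"--wandb-project {WANDB_PROJECT} "
        f"--wandb-group {WANDB_GROUP} "
        f"--wandb-key {api_key} "
    )
```

Whether disabling W&B silently is preferable to failing fast depends on how the example is meant to be run; raising an error instead would be an equally reasonable choice.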

gemini-code-assist · 2026-03-02T08:22:41Z

+            r"import\s+sys",
+            r"import\s+subprocess",
+            r"import\s+shutil",
+            r"import\s+glob",
+            r"import\s+pathlib",
+            r"__import__",
+            r"eval\s*\(",
+            r"exec\s*\(",
+            r"open\s*\(",
+            r"file\s*\(",
+            r"input\s*\(",
+            r"raw_input\s*\(",
+            r"compile\s*\(",
+            r"execfile\s*\(",
+            r"getattr\s*\(",
+            r"setattr\s*\(",
+            r"delattr\s*\(",
+            r"hasattr\s*\(",
+            r"globals\s*\(",
+            r"locals\s*\(",
+            r"vars\s*\(",
+            r"dir\s*\(",
+            r"type\s*\(",
+            r"isinstance\s*\(",
+            r"issubclass\s*\(",
+            r"super\s*\(",
+            r"property\s*\(",
+            r"staticmethod\s*\(",
+            r"classmethod\s*\(",
+            r"__\w+__",  # double underscore methods
+        ]
+
+        for pattern in dangerous_patterns:
+            if re.search(pattern, code, re.IGNORECASE):
+                return False, f"Code contains dangerous pattern: {pattern}"
+


The PythonSandbox relies on a regex-based blacklist. Blacklists of this kind are notoriously difficult to make secure in Python and can often be bypassed. At the same time, the overly broad pattern r"__\w+__" blocks essential dunder names such as __init__ and __name__, which severely restricts the sandbox's utility. Since this sandbox executes LLM-generated code, a bypass could lead to Remote Code Execution (RCE). It is strongly recommended to use a more robust isolation mechanism, such as a containerized environment or a whitelist-based approach, and to refine or remove the r"__\w+__" pattern.
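
To make the concern concrete, here is a small sketch comparing two of the quoted patterns with an AST-based import allowlist. The allowlist contents and function names are hypothetical, only two of the patterns shown in the diff are reproduced, and an AST check alone is still not a sandbox; it only illustrates why parsing tends to be more reliable than text matching.

```python
import ast
import re

ALLOWED_IMPORTS = {"math", "fractions", "itertools"}  # hypothetical allowlist


def regex_check(code: str) -> bool:
    """Blacklist in the style quoted above, reduced to two of its patterns."""
    patterns = [r"import\s+subprocess", r"__\w+__"]
    return not any(re.search(p, code, re.IGNORECASE) for p in patterns)


def ast_check(code: str) -> tuple[bool, str]:
    """Allow only imports whose top-level module is in ALLOWED_IMPORTS."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg}"
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports have module=None and are rejected as "".
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        for name in names:
            if name not in ALLOWED_IMPORTS:
                return False, f"import of {name!r} is not allowed"
    return True, "ok"
```

With these two patterns, `from subprocess import run` passes `regex_check` but is rejected by `ast_check`, while a harmless class that defines `__init__` is rejected by `regex_check` and accepted by `ast_check`.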

gemini-code-assist · 2026-03-02T08:22:41Z

+        f"{custom_args} "
+        f"{args.extra_args} "
+    )
+


The extra_args parameter is directly concatenated into a shell command string. Since this script is designed to be executed with command-line arguments, an attacker who can control these arguments could potentially perform command injection. Consider validating or sanitizing this input before use.
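
A minimal sketch of one way to handle this with the standard library: tokenize `extra_args` with `shlex.split` and keep the command as an argv list, quoting only if a shell string is unavoidable. The helper names and the `train.py` entry point are hypothetical; how the launch script actually spawns the trainer is not shown in this thread.

```python
import shlex


def build_command(base_cmd: list[str], extra_args: str) -> list[str]:
    """Return an argv list; extra_args is tokenized, never interpreted by a shell."""
    return base_cmd + shlex.split(extra_args)


def to_shell_string(argv: list[str]) -> str:
    """Quote every token so shell metacharacters lose their special meaning."""
    return " ".join(shlex.quote(token) for token in argv)


# An input that would run a second command if pasted into a shell string unquoted.
argv = build_command(["python", "train.py"], "--lr 1e-6; rm -rf /tmp/x")
quoted = to_shell_string(argv)
```

Passing `argv` to `subprocess.run(argv)` without `shell=True` means the `;` is just part of an argument, so the worst case is an argument-parsing error rather than a second command.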

gemini-code-assist · 2026-03-02T08:22:41Z

+    prompt_data: str = "/root/dapo-math-17k/dapo-math-17k.jsonl"
+    generate_max_turns: int = 16


The script contains hardcoded absolute paths starting with /root/ (e.g., for save_path, prompt_data, and paths within __post_init__). This makes the script difficult to run in different environments and not portable. It's a best practice to avoid hardcoding paths. Consider adding a data_root argument to ScriptArgs and constructing all other paths relative to it. This would make the example script more reusable and easier for other developers to run.
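
The `data_root` idea could look roughly like the sketch below. Only `prompt_data`, `save_path`, `generate_max_turns` and the default dataset path come from the diff; the `data_root` field and the checkpoint directory layout are assumptions made for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class ScriptArgs:
    data_root: str = "/root"  # single place to override for other machines
    prompt_data: str = ""  # derived from data_root when left empty
    save_path: str = ""  # derived from data_root when left empty
    generate_max_turns: int = 16

    def __post_init__(self) -> None:
        root = PurePosixPath(self.data_root)
        if not self.prompt_data:
            self.prompt_data = str(root / "dapo-math-17k" / "dapo-math-17k.jsonl")
        if not self.save_path:
            # Hypothetical checkpoint layout, not taken from the PR.
            self.save_path = str(root / "checkpoints" / "retool_v2")
```

With the defaults this reproduces the path from the diff, while `ScriptArgs(data_root="/data")` moves every derived path at once and explicit values still take precedence.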

gemini-code-assist · 2026-03-02T08:22:41Z

+                    module.clear_cache()
+
+
+def check_and_cleanup_memory():
+    """Check memory usage and perform appropriate cleanup"""
+    current_memory = get_memory_usage()


The logic for clearing module caches in aggressive_cleanup_memory by calling a clear_cache() method is fragile. This method is not a standard part of Python modules, and popular libraries like numpy or pandas do not expose such a function. This means the code will likely have no effect for most modules, giving a false sense of memory optimization. This logic should be removed to avoid confusion unless you can rely on a specific, documented cache-clearing API for the modules in use.

gemini-code-assist · 2026-03-02T08:22:41Z

+        # Check memory usage before execution
+        current_memory = get_memory_usage()


Using a bare except Exception: pass to handle errors during the cleanup of the temporary directory is risky. It silently ignores any issues that might occur, such as permission errors or problems with shutil.rmtree. This can lead to temporary files being left behind, potentially consuming significant disk space over time. It would be better to at least log the exception, so that cleanup failures can be detected and debugged.

Suggested change

# Check memory usage before execution
current_memory = get_memory_usage()
except Exception as e:
    print(f"Warning: Failed to clean up temporary directory {temp_dir}: {e}")
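
A slightly fuller sketch of the same idea, using the `logging` module instead of `print` and catching `OSError` (the base class of the errors `shutil.rmtree` raises for missing paths and permission problems). The function name is hypothetical and the sandbox's real cleanup code is not shown in this thread.

```python
import logging
import shutil
import tempfile

logger = logging.getLogger(__name__)


def cleanup_temp_dir(temp_dir: str) -> bool:
    """Remove temp_dir; log and report failure instead of hiding it."""
    try:
        shutil.rmtree(temp_dir)
        return True
    except OSError as exc:
        logger.warning("Failed to clean up temporary directory %s: %s", temp_dir, exc)
        return False


scratch = tempfile.mkdtemp(prefix="sandbox_")
first = cleanup_temp_dir(scratch)  # directory exists and is removed
second = cleanup_temp_dir(scratch)  # already gone: logged, not raised
```

Returning a boolean lets the caller count failed cleanups, which can help spot leaked temporary directories during long training runs.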

gemini-code-assist · 2026-03-02T08:22:41Z

+
+    if result["pred"] is None:
+        result["pred"] = ""
+


The reward calculation tool_call_count / 2 * 0.1 uses magic numbers. This makes the code harder to understand and maintain. To improve readability, these numbers should be defined as named constants with meaningful names (e.g., TOOL_CALL_REWARD_FACTOR). This will make the intention of the reward shaping logic clearer.

Suggested change

TOOL_CALL_REWARD_FACTOR = 0.05  # 0.1 / 2
tool_call_reward = tool_call_count * TOOL_CALL_REWARD_FACTOR
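
As a quick sanity check that the named-constant form matches the original expression `tool_call_count / 2 * 0.1`, a sketch like the following compares the two over a few counts. The function names are hypothetical, and how the bonus is combined with the math score is not shown in this thread.

```python
import math

TOOL_CALL_REWARD_FACTOR = 0.05  # 0.1 / 2, as in the suggested change


def tool_call_bonus(tool_call_count: int) -> float:
    """Bonus using the named constant."""
    return tool_call_count * TOOL_CALL_REWARD_FACTOR


def original_bonus(tool_call_count: int) -> float:
    """The expression quoted in the review comment."""
    return tool_call_count / 2 * 0.1


# Compare with a tolerance, since the two forms may round differently in floating point.
matches = all(
    math.isclose(tool_call_bonus(n), original_bonus(n), abs_tol=1e-12)
    for n in range(17)
)
```

For example, 4 tool calls give a bonus of about 0.2 under either form.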

GuanxingLu · 2026-03-19T05:39:40Z

I think this is ready to review.

fzyzcjy · 2026-03-25T12:02:11Z

nit: README.md

fzyzcjy · 2026-03-25T12:04:25Z

+
+## TODO: Delete in the future
+def _get_wandb_args() -> str:
+    WANDB_API_KEY = "a37f4796e6205800c4212556a38e1319b5f144b7"


nit: your key

fzyzcjy · 2026-03-25T12:04:46Z

qq: is it copy-pasted from v1 w/o change, if so I will skip review


fzyzcjy · 2026-03-25T12:05:55Z

+    if isinstance(sample.prompt, str):
+        solution_str = sample.prompt + sample.response
+    else:
+        solution_str = sample.response


qq: why asymmetric b/t str and non-str prompt

fzyzcjy · 2026-03-25T12:06:34Z

not check in detail but if it works then looks ok

PopSoda2002

Thanks for commenting. Please tell me if there is anything else I need to do. Once it's ready, I will merge the latest main branch and rerun the experiments as a final check.

PopSoda2002 · 2026-03-29T07:03:42Z

PopSoda2002 · 2026-03-29T07:23:13Z

In async def _execute_python(self, arguments: dict[str, Any]) -> str:

This is the only difference; everything else is exactly the same:

if isinstance(code, list): code = "\n".join(str(item) for item in code)

And in the retool v2 tool sandbox, we add the reward calculation. Do you think these are fine?

fzyzcjy · 2026-04-05T06:25:36Z

oops I missed the github inbox (the inbox exploded), next time feel free to ping me on slack if I do not reply on github!

fzyzcjy · 2026-04-05T06:31:15Z

LGTM I think it is ready

PopSoda2002 · 2026-04-05T07:00:08Z

LGTM I think it is ready

Thanks so much! Please allow me to rebase the new branch and rerun again


PopSoda2002 · 2026-04-06T06:48:44Z

I have run the experiments again, tests passed on my end, I think it's ready to merge

…ismatch DEPRECATED_MEGATRON_COMPATIBLE causes norm_epsilon attribute error with current Megatron which uses layernorm_epsilon. Also fix log_passrate() call passing an extra parallel_state argument not accepted by the function. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PopSoda2002 · 2026-04-06T23:35:46Z

I have rebased onto the current main branch and run the experiments again; everything looks fine.

GuanxingLu · 2026-04-10T13:44:19Z

I was able to reproduce the results on my end.
Setup: 4xH200 GPUs
Results:

Just wanted to add this as an additional data point :)

yangninghua · 2026-04-18T01:15:40Z

I was able to reproduce the results on my end. Setup: 4xH200 GPUs Results:

Just wanted to add this as an additional data point :)

Which of the following models is it?

Qwen3-4B-Instruct-2507?

Qwen3-4B?

yangninghua · 2026-04-19T03:56:31Z

Hi @yueming-yuan @fzyzcjy I think this PR is ready to review. But I am not sure if the result is fine or I need to do more experiments:

It seems Qwen3-4B's result is better than Qwen3-4B(After SFT)

Both of their evaluation results are getting better due to RL training

Additionally thanks a lot for the incredible work!

font-info/qwen3-4b-sft-SGLang-RL

Qwen3-4B-Instruct-2507

My conclusion differs from yours; I'm not sure where I went wrong.

Did the open-source model make a mistake?

PopSoda2002 · 2026-04-21T03:31:46Z

Hi @yangninghua, I used Qwen3-4B for the experiments. Hope it helps!

yangninghua · 2026-04-21T03:50:07Z

Hi @yangninghua I use qwen3 4B for experiments, hope it help!

Thank you for your prompt reply despite your busy schedule. I attempted to run experiments on your branch, but I suspect there are several discrepancies that may be preventing me from reproducing the results:

The Docker Hub image version I used is miles-dev-202604140056.
My GPU hardware is an H20-141G, not an H100 or H200.
I used the qwen3-4B-Instruct-2507 model along with its corresponding script at scripts/models/qwen3-4B-Instruct-2507.sh.

https://github.com/PopSoda2002/slime qwen3-4B-Instruct-2507 result:


PopSoda2002 · 2026-04-21T06:08:31Z

Hi @yangninghua I use qwen3 4B for experiments, hope it help!

Thank you for your prompt reply despite your busy schedule. I attempted to run experiments on your branch, but I suspect there are several discrepancies that may be preventing me from reproducing the results:

The Docker Hub image version I used is miles-dev-202604140056.

My GPU hardware is an H20-141G, not an H100 or H200.

I used the qwen3-4B-Instruct-2507 model along with its corresponding script at scripts/models/qwen3-4B-Instruct-2507.sh.

https://github.com/PopSoda2002/slime qwen3-4B-Instruct-2507 result:

I think the instruct model may be the difference?


gemini-code-assist Bot reviewed Mar 2, 2026


PopSoda2002 force-pushed the examples/retool_v2 branch from 7d3fad0 to f402ab8 Compare March 9, 2026 07:18

PopSoda2002 requested review from fzyzcjy, maocheng23 and yueming-yuan as code owners March 19, 2026 18:57

fzyzcjy reviewed Mar 25, 2026


PopSoda2002 commented Mar 29, 2026


PopSoda2002 requested a review from fzyzcjy March 31, 2026 07:22

fzyzcjy approved these changes Apr 5, 2026


PopSoda2002 and others added 8 commits April 6, 2026 06:16

[example] add retool v2 example with multi-turn framework interfaces

2deb94d

Co-authored-by: GuanxingLu <gxlu02@gmail.com> Made-with: Cursor

Clean

57d619b

Clean

80c36b8

Fix retool v1 bug

5f0d806

fix logpass bug

88b01ec

Clean code

f163b75

Add megatron config

c12a6aa

Clean code

db87c85

PopSoda2002 force-pushed the examples/retool_v2 branch 2 times, most recently from 2ca3cdd to db87c85 Compare April 6, 2026 06:34

fix: isort import order in run_retool_multi_turn.py

3ea8862

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fzyzcjy approved these changes Apr 6, 2026


fzyzcjy merged commit afc5b55 into radixark:main Apr 7, 2026
15 checks passed

