
Add GenerateFnInput compatibility shim in sglang_rollout#1008

Closed
Shi-Dong wants to merge 1 commit into main from shi/260418-glm47-flash-async-agentic

Conversation

Contributor

@Shi-Dong Shi-Dong commented Apr 19, 2026

Summary

Small compatibility shim so generate_and_rm can dispatch to the new single-argument GenerateFnInput-style custom generate functions (e.g. miles/rollout/generate_hub/agentic_tool_call.py::generate) in addition to the legacy 3-arg and 4-arg signatures.

Without the shim, any rollout wired to a GenerateFnInput-style function crashes immediately with:

TypeError: generate() takes 1 positional argument but 3 were given

because the old dispatcher always called the target as custom_generate_func(args, sample, sampling_params[, evaluation=...]).

The shim inspects the target's signature: if it has exactly one parameter, the four runtime values are packed into a GenerateFnInput dataclass and awaited as a single-arg call, and output.samples replaces sample. Otherwise it falls through to the existing 3-arg / 4-arg paths. Fully backwards-compatible with all existing custom generate functions.
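The dispatch described above can be sketched as follows. This is a minimal reconstruction, not the PR's actual diff: the field names on GenerateFnInput and the exact legacy-path handling are assumptions (the real definitions live under miles/rollout).

```python
import inspect
from dataclasses import dataclass
from typing import Any

@dataclass
class GenerateFnInput:
    # Field names are illustrative; the real dataclass lives in miles/rollout.
    args: Any
    sample: Any
    sampling_params: Any
    evaluation: bool = False

async def dispatch(custom_generate_func, args, sample, sampling_params, evaluation=False):
    params = list(inspect.signature(custom_generate_func).parameters.values())
    if len(params) == 1:
        # New contract: pack the four runtime values into one GenerateFnInput,
        # await the single-arg call, and let output.samples replace the sample.
        output = await custom_generate_func(
            GenerateFnInput(args, sample, sampling_params, evaluation)
        )
        return output.samples
    if len(params) >= 4:
        # Legacy 4-arg signature with the evaluation keyword.
        return await custom_generate_func(args, sample, sampling_params, evaluation=evaluation)
    # Legacy 3-arg signature.
    return await custom_generate_func(args, sample, sampling_params)
```

Because the single-parameter branch is checked first and the legacy paths are untouched, existing 3-arg and 4-arg functions keep their old call shape.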

Test plan

  • Existing 3-arg and 4-arg custom generate functions still dispatch correctly (no behavior change — the len(params) == 1 branch only fires for the new contract).
  • miles/rollout/generate_hub/agentic_tool_call.py::generate (single-arg GenerateFnInput) dispatches through the new branch without the TypeError, verified on a 2-node async agentic rollout against GLM-4.7-Flash on 2026-04-18 PT.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new script for fully-async agentic training of GLM-4.7-Flash using SWE-bench data and updates the rollout logic to support a new single-argument signature for custom generation functions. The review feedback suggests improving the robustness of function signature detection by checking argument counts, warns about the broad process termination patterns in the cleanup function, and recommends retrieving model parameters from configuration instead of using hardcoded values.

Comment thread: miles/rollout/sglang_rollout.py (Outdated)
sig = inspect.signature(custom_generate_func)
params = list(sig.parameters.values())
# Support GenerateFnInput-style generate functions (single-arg with typed input)
if len(params) == 1 and params[0].annotation is GenerateFnInput:
Contributor


high

The check params[0].annotation is GenerateFnInput is fragile. If the module defining the custom function uses from __future__ import annotations, the annotation will be a string, causing this check to fail. Since the legacy signatures use 3 or 4 arguments, checking for len(params) == 1 is a more reliable way to identify the new single-argument signature.

Suggested change
if len(params) == 1 and params[0].annotation is GenerateFnInput:
if len(params) == 1:
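A small self-contained demonstration of the failure mode the reviewer describes (the class and function names are stand-ins for the real ones):

```python
from __future__ import annotations  # PEP 563: annotations are stored as strings
import inspect

class GenerateFnInput:  # stand-in for the real dataclass
    pass

async def generate(inp: GenerateFnInput):
    ...

params = list(inspect.signature(generate).parameters.values())
# Under PEP 563 the stored annotation is the string "GenerateFnInput",
# so an identity check against the class object silently fails:
assert params[0].annotation == "GenerateFnInput"
assert params[0].annotation is not GenerateFnInput
# Counting parameters is robust to this:
assert len(params) == 1
```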

Comment on lines +100 to +106
targets = ["sglang", "train.py", "train_async.py", "MegatronTrain"]
exclude = f"grep -v '^{my_pid}$' | grep -v '^{ppid}$'"
for t in targets:
subprocess.run(
f"pgrep -f '{t}' | {exclude} | xargs -r kill 2>/dev/null || true",
shell=True,
)
Contributor


security / medium

Using pgrep -f with broad patterns like 'sglang' or 'train.py' in the cleanup function can be risky on shared systems, as it might match and kill unrelated processes (e.g., a text editor with a file open that matches the pattern). Consider using more specific patterns or restricting the cleanup to processes owned by the current user.
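One way to narrow the blast radius along the lines of this suggestion: scope pgrep to the current user with `-u` and use fuller command-line patterns. The target patterns below are illustrative, not the script's real ones, and command construction is split from execution so it can be checked without killing anything.

```python
import getpass
import subprocess

def build_cleanup_cmds(user: str, my_pid: int, ppid: int, targets: list[str]) -> list[str]:
    """Build one kill pipeline per target, scoped to a single user's processes."""
    exclude = f"grep -v '^{my_pid}$' | grep -v '^{ppid}$'"
    return [
        f"pgrep -u {user} -f '{t}' | {exclude} | xargs -r kill 2>/dev/null || true"
        for t in targets
    ]

def cleanup(my_pid: int, ppid: int) -> None:
    # Fuller patterns than bare 'sglang' / 'train.py' (illustrative examples):
    targets = ["sglang.launch_server", "python .*train_async\\.py"]
    for cmd in build_cleanup_cmds(getpass.getuser(), my_pid, ppid, targets):
        subprocess.run(cmd, shell=True)
```

A text editor with train.py open no longer matches, both because the pattern requires the launcher's command line and because other users' processes are excluded.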


def prepare(args: ScriptArgs):
"""Convert HF checkpoint to torch_dist format."""
max_convert_nodes = 92 // args.num_gpus_per_node
Contributor


medium

The value 92 appears to be a hardcoded parameter. Model parameters should be retrieved from the model configuration rather than being hardcoded to ensure consistency and maintainability. Additionally, this value does not seem to be a multiple of the default num_gpus_per_node (8).

References
  1. Model parameters should be retrieved from the model configuration rather than being hardcoded.
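A sketch of what the suggestion amounts to: derive the node count from configuration rather than hardcoding 92. `max_convert_gpus` is a hypothetical config field introduced for illustration, not the script's real name for this value.

```python
from dataclasses import dataclass

@dataclass
class ScriptArgs:
    num_gpus_per_node: int = 8
    max_convert_gpus: int = 96  # assumed to come from the model/cluster config

def max_convert_nodes(args: ScriptArgs) -> int:
    # Whole nodes only; always allow at least one node for the conversion job.
    return max(1, args.max_convert_gpus // args.num_gpus_per_node)
```

With the budget expressed in GPUs and floor-divided by num_gpus_per_node, the result is a whole-node count by construction, addressing the multiple-of-8 concern as well.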

@Shi-Dong Shi-Dong force-pushed the shi/260418-glm47-flash-async-agentic branch from 10e7d47 to 165855c on April 20, 2026 04:44
@Shi-Dong Shi-Dong changed the title from "Add GLM-4.7-Flash fully-async agentic training launcher" to "Add GenerateFnInput compatibility shim in sglang_rollout" on Apr 20, 2026
Lets generate_and_rm dispatch to the new single-argument
GenerateFnInput-style custom generate functions (e.g.
miles/rollout/generate_hub/agentic_tool_call.py::generate)
alongside the existing 3-arg and 4-arg signatures, so async
agentic rollouts don't crash with "TypeError: generate() takes
1 positional argument but 3 were given".
@Shi-Dong
Contributor Author

Superseded by #1016 which lands the same fix more cleanly: PR #1016 centralizes the legacy ↔ new generate_function adapter in miles/rollout/inference_rollout/compatibility.py (LegacyGenerateFnAdapter + load_generate_function), whereas this PR inlined the dispatch in generate_and_rm. PR #1016 is on main as commit 9a00364. Closing.
