fix: honor exec approval security and clean up stale tasks#60310
Conversation
Greptile SummaryThis PR bundles three bug fixes across different subsystems: (1) the main fix aligns gateway exec security with node-host by making The security fix is backward-compatible: when no explicit entry exists in Confidence Score: 5/5Safe to merge — all three fixes are backward-compatible, well-commented, and the maintenance path has direct regression test coverage. No P0 or P1 findings. The exec security fix is provably backward-compatible (verified via resolveExecApprovalsFromFile fallback chain). The timer change follows the same requestHeartbeatNow pattern already used elsewhere in the same function. The hasBackingSession refactor correctly handles all four TaskRuntime values. Tests cover the three newly fixed stale-task cases plus a healthy subagent non-regression case. No files require special attention. Reviews (1): Last reviewed commit: "fix: subagents now inherit exec approval..." | Re-trigger Greptile |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 73e9a2b1c9
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (task.runtime === "cron") { | ||
| return false; | ||
| } |
There was a problem hiding this comment.
Stop treating every active cron task as orphaned
Returning false for all runtime === "cron" tasks makes shouldMarkLost() reconcile any running cron task to lost once the 5-minute grace expires, even when the job is still executing normally. This is a real runtime path because cron agent turns are allowed to run far longer (default safety timeout is up to 60 minutes), so long jobs will be mislabeled as lost mid-run and can emit incorrect terminal state/notifications before they actually finish.
Useful? React with 👍 / 👎.
73e9a2b to
c4b1bcd
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c4b1bcdff7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const chatType = deriveSessionChatType(childSessionKey); | ||
| if (chatType === "channel" || chatType === "group" || chatType === "direct") { | ||
| return false; | ||
| } |
There was a problem hiding this comment.
Preserve liveness checks for active CLI chat runs
Returning false for CLI tasks whose childSessionKey parses as channel/group/direct makes shouldMarkLost() flip them to lost after the 5-minute grace even when the run is still healthy. This path is reachable for gateway-ingress tasks (runtime: "cli") that use chat-scoped session keys, and agent runs are allowed to exceed 5 minutes (default timeout is far longer), so long-running turns will be mislabeled as terminal mid-execution and can be swept/inspected as dead work before they actually finish.
Useful? React with 👍 / 👎.
c4b1bcd to
b19e07c
Compare
88e11a2 to
0126ab9
Compare
0126ab9 to
258e5e7
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 258e5e7d33
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| function resolveBaselineFloor( | ||
| provider: ProviderKey, | ||
| lane: BaselineLane, | ||
| ): LiveCacheFloor | undefined { | ||
| return (LIVE_CACHE_REGRESSION_BASELINE[provider] as Record<string, LiveCacheFloor>)[lane]; |
There was a problem hiding this comment.
Remove duplicate resolveBaselineFloor implementation
This introduces a second resolveBaselineFloor implementation while the original one earlier in this file is still present. TypeScript treats this as duplicate function implementations, which can break pnpm tsgo and prevent importing this module (for example through scripts/check-live-cache.ts and the live cache regression test) before any runtime logic executes. Keep a single helper and change its parameter type in place instead of redefining it.
Useful? React with 👍 / 👎.
…lost marker Upstream 7d1575b (openclaw#60310, 2026-04-04) introduced activeJobIds plus markCronJobActive/clearCronJobActive so task-registry maintenance has a backing-session signal for runtime='cron' tasks (task-registry.maintenance.ts hasBackingSession). That patch wired the pair into runDueJob and executeJob but left the manual-run path (prepareManualRun + finishPreparedManualRun in src/cron/service/ops.ts) without mark/clear. Symptom: when a user triggers `openclaw cron run <id> --force` (or any manual-run RPC) and the run exceeds TASK_RECONCILE_GRACE_MS (5 min) — common for force-mode `agentTurn` runs which can reach AGENT_TURN_SAFETY_TIMEOUT_MS (60 min) — task-registry sweeper marks the active task `lost` and emits a `Background task lost` system message to the session, even though the run is still progressing normally. The merged commit 1fae716 (resolveDurableCronTaskRecovery, 2026-04-26) reconciles terminal status retroactively from cron run-log + store.lastRunStatus, but only after the run finishes. This patch suppresses the transient `lost` marker by adding the producer-side mark/clear pair, restoring symmetry with runDueJob/executeJob: * prepareManualRun: markCronJobActive(job.id) after tryCreateManualTaskRun. * finishPreparedManualRun: wrap body in try/finally with clearCronJobActive(jobId). Scope intentionally narrower than openclaw#71040 (closed): * No change to runStartupCatchupCandidate — the `deferAgentTurnJobs:true` policy added in 7877182 reroutes long-running startup catchups to runDueJob/executeJob (already wired). Non-agentTurn startup catchups are theoretical hot-path per pre-PR cross-review. * No change to active-jobs.ts API. Regression test (src/cron/active-jobs-manual-run.test.ts): two cases exercising production hot-path via cron.run("<id>", "force") — success and inner-throw — assert isCronJobActive transitions true→false around the manual run. Fixes openclaw#78233. [AI-assisted, fully tested]
…lost marker Upstream 7d1575b (openclaw#60310, 2026-04-04) introduced activeJobIds plus markCronJobActive/clearCronJobActive so task-registry maintenance has a backing-session signal for runtime='cron' tasks (task-registry.maintenance.ts hasBackingSession). That patch wired the pair into runDueJob and executeJob but left the manual-run path (prepareManualRun + finishPreparedManualRun in src/cron/service/ops.ts) without mark/clear. Symptom: when a user triggers `openclaw cron run <id> --force` (or any manual-run RPC) and the run exceeds TASK_RECONCILE_GRACE_MS (5 min) — common for force-mode `agentTurn` runs which can reach AGENT_TURN_SAFETY_TIMEOUT_MS (60 min) — task-registry sweeper marks the active task `lost` and emits a `Background task lost` system message to the session, even though the run is still progressing normally. The merged commit 1fae716 (resolveDurableCronTaskRecovery, 2026-04-26) reconciles terminal status retroactively from cron run-log + store.lastRunStatus, but only after the run finishes. This patch suppresses the transient `lost` marker by adding the producer-side mark/clear pair, restoring symmetry with runDueJob/executeJob: * prepareManualRun: markCronJobActive(job.id) after tryCreateManualTaskRun. * finishPreparedManualRun: wrap body in try/finally with clearCronJobActive(jobId). Scope intentionally narrower than openclaw#71040 (closed): * No change to runStartupCatchupCandidate — the `deferAgentTurnJobs:true` policy added in 7877182 reroutes long-running startup catchups to runDueJob/executeJob (already wired). Non-agentTurn startup catchups are theoretical hot-path per pre-PR cross-review. * No change to active-jobs.ts API. Regression test (src/cron/active-jobs-manual-run.test.ts): two cases exercising production hot-path via cron.run("<id>", "force") — success and inner-throw — assert isCronJobActive transitions true→false around the manual run. Fixes openclaw#78233. [AI-assisted, fully tested]
…lost marker Upstream 7d1575b (openclaw#60310, 2026-04-04) introduced activeJobIds plus markCronJobActive/clearCronJobActive so task-registry maintenance has a backing-session signal for runtime='cron' tasks (task-registry.maintenance.ts hasBackingSession). That patch wired the pair into runDueJob and executeJob but left the manual-run path (prepareManualRun + finishPreparedManualRun in src/cron/service/ops.ts) without mark/clear. Symptom: when a user triggers `openclaw cron run <id> --force` (or any manual-run RPC) and the run exceeds TASK_RECONCILE_GRACE_MS (5 min) — common for force-mode `agentTurn` runs which can reach AGENT_TURN_SAFETY_TIMEOUT_MS (60 min) — task-registry sweeper marks the active task `lost` and emits a `Background task lost` system message to the session, even though the run is still progressing normally. The merged commit 1fae716 (resolveDurableCronTaskRecovery, 2026-04-26) reconciles terminal status retroactively from cron run-log + store.lastRunStatus, but only after the run finishes. This patch suppresses the transient `lost` marker by adding the producer-side mark/clear pair, restoring symmetry with runDueJob/executeJob: * prepareManualRun: markCronJobActive(job.id) after tryCreateManualTaskRun. * finishPreparedManualRun: wrap body in try/finally with clearCronJobActive(jobId). Scope intentionally narrower than openclaw#71040 (closed): * No change to runStartupCatchupCandidate — the `deferAgentTurnJobs:true` policy added in 7877182 reroutes long-running startup catchups to runDueJob/executeJob (already wired). Non-agentTurn startup catchups are theoretical hot-path per pre-PR cross-review. * No change to active-jobs.ts API. Regression test (src/cron/active-jobs-manual-run.test.ts): two cases exercising production hot-path via cron.run("<id>", "force") — success and inner-throw — assert isCronJobActive transitions true→false around the manual run. Fixes openclaw#78233. [AI-assisted, fully tested]
…lost marker Upstream 7d1575b (openclaw#60310, 2026-04-04) introduced activeJobIds plus markCronJobActive/clearCronJobActive so task-registry maintenance has a backing-session signal for runtime='cron' tasks (task-registry.maintenance.ts hasBackingSession). That patch wired the pair into runDueJob and executeJob but left the manual-run path (prepareManualRun + finishPreparedManualRun in src/cron/service/ops.ts) without mark/clear. Symptom: when a user triggers `openclaw cron run <id> --force` (or any manual-run RPC) and the run exceeds TASK_RECONCILE_GRACE_MS (5 min) — common for force-mode `agentTurn` runs which can reach AGENT_TURN_SAFETY_TIMEOUT_MS (60 min) — task-registry sweeper marks the active task `lost` and emits a `Background task lost` system message to the session, even though the run is still progressing normally. The merged commit 1fae716 (resolveDurableCronTaskRecovery, 2026-04-26) reconciles terminal status retroactively from cron run-log + store.lastRunStatus, but only after the run finishes. This patch suppresses the transient `lost` marker by adding the producer-side mark/clear pair, restoring symmetry with runDueJob/executeJob: * prepareManualRun: markCronJobActive(job.id) after tryCreateManualTaskRun. * finishPreparedManualRun: wrap body in try/finally with clearCronJobActive(jobId). Scope intentionally narrower than openclaw#71040 (closed): * No change to runStartupCatchupCandidate — the `deferAgentTurnJobs:true` policy added in 7877182 reroutes long-running startup catchups to runDueJob/executeJob (already wired). Non-agentTurn startup catchups are theoretical hot-path per pre-PR cross-review. * No change to active-jobs.ts API. Regression test (src/cron/active-jobs-manual-run.test.ts): two cases exercising production hot-path via cron.run("<id>", "force") — success and inner-throw — assert isCronJobActive transitions true→false around the manual run. Fixes openclaw#78233. [AI-assisted, fully tested]
…lost marker Upstream 7d1575b (openclaw#60310, 2026-04-04) introduced activeJobIds plus markCronJobActive/clearCronJobActive so task-registry maintenance has a backing-session signal for runtime='cron' tasks (task-registry.maintenance.ts hasBackingSession). That patch wired the pair into runDueJob and executeJob but left the manual-run path (prepareManualRun + finishPreparedManualRun in src/cron/service/ops.ts) without mark/clear. Symptom: when a user triggers `openclaw cron run <id> --force` (or any manual-run RPC) and the run exceeds TASK_RECONCILE_GRACE_MS (5 min) — common for force-mode `agentTurn` runs which can reach AGENT_TURN_SAFETY_TIMEOUT_MS (60 min) — task-registry sweeper marks the active task `lost` and emits a `Background task lost` system message to the session, even though the run is still progressing normally. The merged commit 1fae716 (resolveDurableCronTaskRecovery, 2026-04-26) reconciles terminal status retroactively from cron run-log + store.lastRunStatus, but only after the run finishes. This patch suppresses the transient `lost` marker by adding the producer-side mark/clear pair, restoring symmetry with runDueJob/executeJob: * prepareManualRun: markCronJobActive(job.id) after tryCreateManualTaskRun. * finishPreparedManualRun: wrap body in try/finally with clearCronJobActive(jobId). Scope intentionally narrower than openclaw#71040 (closed): * No change to runStartupCatchupCandidate — the `deferAgentTurnJobs:true` policy added in 7877182 reroutes long-running startup catchups to runDueJob/executeJob (already wired). Non-agentTurn startup catchups are theoretical hot-path per pre-PR cross-review. * No change to active-jobs.ts API. Regression test (src/cron/active-jobs-manual-run.test.ts): two cases exercising production hot-path via cron.run("<id>", "force") — success and inner-throw — assert isCronJobActive transitions true→false around the manual run. Fixes openclaw#78233. [AI-assisted, fully tested]
Summary
This PR now contains two verified fixes:
exec-approvals.jsonagent security directly, matching the node-host path.The earlier cron wake fallback experiment was removed from this PR.
Changes
Exec approvals (
#60268)src/agents/bash-tools.exec-host-shared.tshostSecurityto useapprovals.agent.securitydirectly instead of intersecting it with the tool default viaminSecurity(...)askprecedence logic unchangedWhy this is correct:
exec-approvals.jsonis the explicit user security policyapprovals.agent.securityfalls back toparams.security, so behavior stays backward-compatibleTask maintenance (
#60299)src/tasks/task-registry.maintenance.tsruntime="cron"tasks as having no persistent backing session after the run exitsruntime="cli"tasks with channel/group/direct child session keys as stale even if that long-lived chat session still exists in the session storesubagentandacpbehavior intactsrc/tasks/task-registry.maintenance.issue-60299.test.tsVerification
pnpm checkpnpm test src/tasks/task-registry.maintenance.issue-60299.test.tsI also added a targeted exec regression test in-tree. A standalone local runner for that file was sticky in this environment, so the landed proof for the exec path is the code change plus that focused unit coverage.