fix(routing): update main session lastChannel/lastTo for internal sources #34603
caiqinghua wants to merge 2 commits into openclaw:main from
Conversation
…34530) After gateway restart, if a job's nextRunAtMs is more than 24 hours ahead of its lastRunAtMs, it may have been set incorrectly before shutdown. This could cause jobs to skip scheduled runs. Added an additional startup pass that detects and corrects these stale timestamps by recomputing nextRunAtMs based on the current time. This ensures jobs resume their correct schedule after restart, fixing the issue where cron jobs would run once per hour instead of every 10 minutes. All 481 cron-related tests pass.
…rces When TUI/WebChat sends messages to main session, update lastChannel/lastTo to internal instead of preserving stale external routes (e.g. WhatsApp). This prevents replies from routing to wrong channel after TUI input. Root cause: commit 7f2708a added special handling for main sessions that returned early for internal channels, but this prevented lastChannel/lastTo from being updated, causing replies to leak to stale external routes. Fix: Allow internal channels to update lastChannel/lastTo for main sessions so replies route back to the correct internal surface (TUI/WebChat). Closes openclaw#34582 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Greptile Summary
This PR fixes a stale-routing bug where replies after TUI/WebChat input on the main session were incorrectly sent to a prior external channel (e.g. WhatsApp) because `lastChannel`/`lastTo` were never updated for internal sources.
Confidence Score: 4/5
Last reviewed commit: d661610
```typescript
function isFiniteTimestamp(value: unknown): value is number {
  return typeof value === "number" && Number.isFinite(value);
}
```
Duplicate of private helper in jobs.ts
isFiniteTimestamp is already defined as a private helper in src/cron/service/jobs.ts (line 83–85) with the identical implementation. Since it isn't exported there, a local copy is needed here — but it may be worth exporting the one from jobs.ts and importing it to avoid the duplication and keep the guard logic in sync if it ever changes.
```suggestion
// Consider exporting isFiniteTimestamp from jobs.ts and importing here
function isFiniteTimestamp(value: unknown): value is number {
  return typeof value === "number" && Number.isFinite(value);
}
```
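The refactor the comment proposes could look roughly like the following sketch. The file paths come from the review; everything else is an illustration of the export/import shape, not the repo's actual layout:

```typescript
// src/cron/service/jobs.ts — export the guard once instead of duplicating it.
export function isFiniteTimestamp(value: unknown): value is number {
  return typeof value === "number" && Number.isFinite(value);
}

// src/cron/service/ops.ts would then drop its local copy and import it:
// import { isFiniteTimestamp } from "./jobs";
```

Keeping a single exported guard means any future change to what counts as a valid timestamp (say, rejecting negatives) propagates to both call sites automatically.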
```typescript
// If nextRunAtMs is more than a day ahead of lastRunAtMs, it might be stale.
// For example, if the gateway was down for multiple days, nextRunAtMs
// might have been set to a time that's already passed, but the job hasn't
// run yet. Recomputing ensures the job runs at the next valid slot.
if (nextRun - lastRun > 24 * 60 * 60 * 1000) {
  const updated = computeJobNextRunAtMs(job, now);
  if (isFiniteTimestamp(updated) && updated !== nextRun) {
    job.state.nextRunAtMs = updated;
    state.deps.log.info(
      { jobId: job.id, oldNextRun: nextRun, newNextRun: updated },
      "cron: corrected stale nextRunAtMs after restart",
    );
  }
}
```
Heuristic may be misleading for long-interval schedules
The condition nextRun - lastRun > 24 * 60 * 60 * 1000 will always be true for any job whose schedule interval is longer than 24 hours (e.g. a weekly or monthly cron job), causing computeJobNextRunAtMs to be called on every gateway restart for those jobs. In practice this is harmless — computeJobNextRunAtMs is deterministic and will return the same value that is already stored, so updated !== nextRun keeps the guard from applying an incorrect change — but it produces unnecessary log lines and makes the intent of the heuristic harder to reason about.
The comment also says "if the gateway was down for multiple days, nextRunAtMs might have been set to a time that's already passed", but that scenario is already fully handled by runMissedJobs + recomputeNextRuns above. What this loop actually targets is a stale future nextRunAtMs (set too far ahead by a prior bug) that recomputeNextRuns intentionally skips because now < nextRunAtMs. Tightening the comment (and optionally the guard) to reflect this would make the intent clearer:
```suggestion
// Guard: if nextRunAtMs is more than a day past what the current schedule
// would produce from lastRunAtMs, the stored value was likely set incorrectly
// (e.g. by a prior bug) and recomputeNextRuns() won't fix it because it
// preserves still-future values. Recompute so the job fires at the correct slot.
if (nextRun - lastRun > 24 * 60 * 60 * 1000) {
  const updated = computeJobNextRunAtMs(job, now);
  if (isFiniteTimestamp(updated) && updated !== nextRun) {
    job.state.nextRunAtMs = updated;
    state.deps.log.info(
      { jobId: job.id, oldNextRun: nextRun, newNextRun: updated },
      "cron: corrected stale nextRunAtMs after restart",
    );
  }
}
```
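If the guard itself were tightened as the comment suggests, one hedged option is to compare the stored `nextRunAtMs` against the recomputed value rather than against `lastRunAtMs`, so schedules with intervals longer than a day never trip the heuristic. This is a sketch under assumed shapes, not the repo's code:

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Sketch only: treat the stored nextRun as stale when it is more than a day
// past what the current schedule would produce, instead of comparing it to
// lastRun. Weekly/monthly jobs then no longer match on every restart.
function isStaleNextRun(
  nextRun: number,
  recomputed: number | undefined,
): boolean {
  if (typeof recomputed !== "number" || !Number.isFinite(recomputed)) {
    return false; // cannot recompute; leave the stored value alone
  }
  return nextRun - recomputed > DAY_MS;
}
```

With this shape the `updated !== nextRun` check is no longer load-bearing for correctness, and the info log only fires when the stored value genuinely diverges from the schedule.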
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d661610201
```typescript
if (nextRun - lastRun > 24 * 60 * 60 * 1000) {
  const updated = computeJobNextRunAtMs(job, now);
  if (isFiniteTimestamp(updated) && updated !== nextRun) {
    job.state.nextRunAtMs = updated;
```
Preserve one-shot retry backoff during startup correction
This startup pass recomputes nextRunAtMs for any job where nextRunAtMs - lastRunAtMs > 24h, but for schedule.kind === "at" the recompute path returns the original at timestamp (see computeJobNextRunAtMs), not the transient-error retry timestamp set by applyJobResult. If a deployment uses a long retry backoff (e.g. cron.retry.backoffMs > 24h), restarting between failure and retry will overwrite the delayed retry with the original (often past) at time, making the one-shot job run immediately and bypassing the configured backoff.
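One way to address this, sketched under the assumption that `schedule.kind` distinguishes one-shot jobs (the field names are taken from the comment above; everything else is hypothetical):

```typescript
// Hypothetical shapes, inferred from the review comment.
type Schedule = { kind: "at" | "cron" };

// Sketch: exclude one-shot ("at") jobs from the startup correction so a
// transient-error retry timestamp set by applyJobResult is preserved,
// even when cron.retry.backoffMs exceeds the 24h staleness threshold.
function shouldCorrectOnStartup(schedule: Schedule): boolean {
  return schedule.kind !== "at";
}
```

The trade-off is that a genuinely stale `nextRunAtMs` on an "at" job would no longer be corrected at startup, so the retry path would need to own that case.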
```typescript
// might have been set to a time that's already passed, but the job hasn't
// run yet. Recomputing ensures the job runs at the next valid slot.
if (nextRun - lastRun > 24 * 60 * 60 * 1000) {
  const updated = computeJobNextRunAtMs(job, now);
```
Wrap startup recompute in the same schedule-error guard
computeJobNextRunAtMs can throw on malformed persisted cron schedules/timezones, and the normal recompute flow already protects this with try/catch (recomputeJobNextRunAtMs). This new startup branch calls it directly without error handling, so one bad stored job matching this condition can abort start() before timers are armed, preventing all cron jobs from running after restart.
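A minimal way to mirror the existing protection, assuming (per the comment) that `computeJobNextRunAtMs` can throw on malformed persisted schedules or timezones; the wrapper name and logger shape here are illustrative:

```typescript
// Sketch: guard the recompute the same way recomputeJobNextRunAtMs
// reportedly does, so one malformed stored job cannot abort start()
// before timers are armed.
function safeComputeNextRun(
  compute: () => number | undefined,
  logWarn: (msg: string) => void,
): number | undefined {
  try {
    return compute();
  } catch (err) {
    logWarn(`cron: failed to recompute nextRunAtMs: ${String(err)}`);
    return undefined; // leave the stored value untouched
  }
}
```

Returning `undefined` on failure composes cleanly with the existing `isFiniteTimestamp(updated)` check, which already skips the update in that case.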
Summary
Fixes #34582
When TUI/WebChat sends messages to the main session, update `lastChannel`/`lastTo` to internal instead of preserving stale external routes (e.g. WhatsApp). This prevents replies from routing to the wrong channel after TUI input.
Root Cause
Commit 7f2708a (#33786) added special handling for main sessions that returned early for internal channels in `resolveLastChannelRaw`/`resolveLastToRaw`. While this was intended to preserve external routes, it prevented `lastChannel`/`lastTo` from being updated when internal sources (TUI/WebChat) sent messages to the main session.

Result: when the agent replied after TUI input, the reply used stale `lastChannel`/`lastTo` values from a prior external channel (WhatsApp) instead of routing back to TUI.
Changes
Modified `src/auto-reply/reply/session-delivery.ts`:

- `resolveLastChannelRaw`: added an exception for main session + internal channel to update `lastChannel` instead of preserving stale external routes
- `resolveLastToRaw`: added an exception for main session + internal channel to clear `lastTo` instead of preserving stale external targets

Verification

- `src/auto-reply/reply/session.test.ts`: 42 tests ✓
- `src/gateway/server-methods/chat.directive-tags.test.ts`: 13 tests ✓

Risk & Rollback

Low risk: the change only affects main-session routing for internal sources (TUI/WebChat). External channel behavior unchanged.
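The main-session exception described in the Changes section could be sketched roughly as follows. Apart from the channel names TUI/WebChat/WhatsApp taken from the PR description, every identifier here is an assumption for illustration; the real logic lives in `resolveLastChannelRaw`/`resolveLastToRaw` in `session-delivery.ts`:

```typescript
// Hypothetical sketch of the routing exception; shapes are illustrative.
const INTERNAL_CHANNELS = new Set(["tui", "webchat"]);

type SessionRoute = { lastChannel?: string; lastTo?: string };

function updateMainSessionRoute(
  route: SessionRoute,
  incomingChannel: string,
): SessionRoute {
  if (INTERNAL_CHANNELS.has(incomingChannel)) {
    // Internal source: point replies back at the internal surface and
    // clear the stale external target instead of preserving it.
    return { lastChannel: incomingChannel, lastTo: undefined };
  }
  // External source: keep the existing behavior of tracking the channel
  // and preserving the delivery target.
  return { lastChannel: incomingChannel, lastTo: route.lastTo };
}
```

Under this shape, a WhatsApp route left over from before a TUI message no longer survives the update, which is exactly the leak the PR closes.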
Issue Reference
Closes #34582