Skip to content

Heartbeat: spread interval runs across stable phases#64560

Merged
odysseus0 merged 4 commits into
openclaw:mainfrom
odysseus0:fix/heartbeat-stable-phase
Apr 11, 2026
Merged

Heartbeat: spread interval runs across stable phases#64560
odysseus0 merged 4 commits into
openclaw:mainfrom
odysseus0:fix/heartbeat-stable-phase

Conversation

@odysseus0

@odysseus0 odysseus0 commented Apr 11, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Problem: interval heartbeats are scheduled relative to gateway startup, which clusters provider traffic around the same startup-relative slots instead of distributing it uniformly across the configured interval.
  • Why it matters: synchronized heartbeat turns can hammer model providers even when the average heartbeat rate is reasonable.
  • What changed: interval heartbeats now use a stable per-agent phase derived from gateway identity plus agent id, spreading runs uniformly across the configured interval.
  • What did NOT change (scope boundary): default cadence stays at 30m, and targeted/action-driven wakes keep the existing cooldown behavior.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #
  • Related #
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: the scheduler used now + interval for initial and follow-up interval heartbeats, so each gateway stayed pinned to startup-relative times instead of being distributed across the interval.
  • Missing detection / guardrail: there was no scheduler test asserting stable phase distribution independent of startup time.
  • Contributing context (if known): no central coordinator exists, so startup-relative scheduling becomes the de facto global distribution strategy.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/infra/heartbeat-schedule.test.ts, src/infra/heartbeat-runner.scheduler.test.ts
  • Scenario the test should lock in: heartbeats for a given gateway/agent get a stable phase inside the configured interval, config reload preserves future due slots, and retry wakes do not push the schedule out indefinitely.
  • Why this is the smallest reliable guardrail: the bug lives in pure scheduling math plus the runner seam; end-to-end coverage would be slower without adding signal.
  • Existing test that already covers this (if any): src/infra/heartbeat-runner.scheduler.test.ts covered the old interval scheduler behavior.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

  • Heartbeat interval runs are spread across the configured interval instead of clustering around startup-relative times.
  • Default heartbeat cadence remains 30m.

Diagram (if applicable)

Before:
[heartbeat stream] -> [startup-relative timer] -> [traffic clusters at similar interval slots]

After:
[heartbeat stream] -> [stable hashed per-agent phase] -> [traffic distributed across the interval]

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 24 / pnpm
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): agents.defaults.heartbeat.every="30m"

Steps

  1. Start multiple gateways with the same heartbeat interval.
  2. Compare when their next interval heartbeat fires.
  3. Restart them and compare the next interval heartbeat again.

Expected

  • Each gateway/agent keeps its own stable slot within the interval.

Actual

  • Before this change, heartbeats stayed clustered around startup-relative times.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios: targeted scheduler tests pass for stable phase math and runner scheduling behavior.
  • Edge cases checked: config reload preserves future due slots; retry wakes do not permanently shift interval scheduling; targeted wakes still use cooldown-style now + interval behavior.
  • What you did not verify: full pnpm build currently fails on an unrelated baseline issue resolving acpx/runtime from extensions/acpx/src/runtime.ts.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Risks and Mitigations

  • Risk: stable phase scheduling changes when interval changes or when a gateway identity is regenerated.
    • Mitigation: that is expected and bounded; with a stable identity and unchanged interval, phases remain deterministic across restarts.

@aisle-research-bot

aisle-research-bot Bot commented Apr 11, 2026

Copy link
Copy Markdown

🔒 Aisle Security Analysis

We found 1 potential security issue(s) in this PR:

# Severity Title
1 🟡 Medium Targeted heartbeat wakes can indefinitely postpone scheduled interval heartbeats (scheduler starvation)
1. 🟡 Targeted heartbeat wakes can indefinitely postpone scheduled interval heartbeats (scheduler starvation)
Property Value
Severity Medium
CWE CWE-400
Location src/infra/heartbeat-runner.ts:1240-1251

Description

advanceAgentSchedule() updates an agent’s nextDueMs differently based on reason.

  • For normal interval ticks (reason === "interval"), nextDueMs is aligned to the stable phase (computeNextHeartbeatPhaseDueMs).
  • For targeted/action-driven wakes (hook/manual/exec-event/cron/etc), nextDueMs is set to now + intervalMs.
  • The global timer (scheduleNext) schedules the next interval wake based on the minimum nextDueMs across agents.

As a result, any targeted wake that successfully runs a heartbeat pushes that agent’s next scheduled interval slot out to now + intervalMs, overriding the phase-aligned cadence. If an untrusted or attacker-controllable event source can repeatedly trigger targeted wakes (e.g. HTTP hook wake via requestHeartbeatNow({ reason: "hook:wake" })), they can keep pushing nextDueMs forward, effectively starving/suppressing interval-driven heartbeats indefinitely.

Impact:

  • Availability/monitoring: periodic health/heartbeat checks can be delayed indefinitely.
  • Security monitoring: heartbeat-driven alerts/telemetry may be masked by suppressing scheduled runs.

Vulnerable code:

agent.nextDueMs =
  reason === "interval"
    ? computeNextHeartbeatPhaseDueMs({ nowMs: now, intervalMs: agent.intervalMs, phaseMs: agent.phaseMs })
    : now + agent.intervalMs;

Recommendation

Keep the phase-aligned schedule independent from targeted/action-driven runs.

One approach:

  • Always compute the next phase due time for interval scheduling.
  • Track a separate cooldownUntilMs / lastRunMs to prevent too-frequent runs due to targeted wakes.
  • When handling targeted wakes, update only the cooldown, not the phase schedule.

Example (sketch):

// state
type HeartbeatAgentState = {// ...
  phaseMs: number;
  nextDueMs: number;        // next interval slot (phase aligned)
  cooldownUntilMs: number;  // min time before any run is allowed
};

function advanceAfterRun(agent: HeartbeatAgentState, now: number) {
  agent.cooldownUntilMs = now + agent.intervalMs;
  agent.nextDueMs = computeNextHeartbeatPhaseDueMs({
    nowMs: now,
    intervalMs: agent.intervalMs,
    phaseMs: agent.phaseMs,
  });
}// When deciding to run on interval ticks:
if (reason === "interval" && (now < agent.nextDueMs || now < agent.cooldownUntilMs)) continue;// When deciding to run on targeted wakes:
if (now < agent.cooldownUntilMs) return { status: "skipped", reason: "cooldown" };

Additionally, consider rate-limiting/authenticating external wake sources (e.g. hook endpoints) so they cannot be abused to manipulate scheduler timing.


Analyzed PR: #64560 at commit 774ede6

Last updated on: 2026-04-11T02:41:05Z

@openclaw-barnacle openclaw-barnacle Bot added size: M maintainer Maintainer-authored PR labels Apr 11, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 42d50ed944

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1240 to +1244
reason === "interval"
? computeNextHeartbeatPhaseDueMs({
nowMs: now,
intervalMs: agent.intervalMs,
phaseMs: agent.phaseMs,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve cooldown after non-interval wakes

After a manual/exec/hook wake, the code still sets nextDueMs = now + interval, but the next interval-triggered run immediately snaps back to the hashed phase via computeNextHeartbeatPhaseDueMs, which can schedule another heartbeat far sooner than a full interval (e.g., minutes or milliseconds later depending on phase). Before this commit, subsequent runs stayed on now + interval, so this introduces a cadence regression for action-driven wakes and can unexpectedly increase heartbeat frequency after user/event-triggered runs.

Useful? React with 👍 / 👎.

@greptile-apps

greptile-apps Bot commented Apr 11, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR replaces startup-relative now + interval heartbeat scheduling with a stable per-agent phase derived from SHA-256(schedulerSeed:agentId) % intervalMs, so that restart waves no longer cluster heartbeats into provider spikes. Interval wakes snap to the phase slot; targeted/action-driven wakes retain the existing cooldown (now + intervalMs) and then snap back on the next interval tick. Test coverage covers phase stability, config-reload preservation, retry non-drift, and multi-agent routing.

Confidence Score: 5/5

Safe to merge — scheduling math is correct, edge cases are handled, and test coverage is thorough.

No P0 or P1 findings. The phase derivation, cycle-position arithmetic, config-reload preservation, and cooldown/phase bifurcation all behave correctly. The only possible nit is that resolveHeartbeatPhaseMs reads only 4 bytes of the SHA-256 digest (minor modulo bias for large intervals, negligible for the spreading goal), but this is firmly P2 and does not affect correctness.

No files require special attention.

Reviews (1): Last reviewed commit: "Heartbeat: spread interval runs across s..." | Re-trigger Greptile

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c67169b4ef

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +156 to +160
return createHash("sha256")
.update(process.env.HOME ?? "")
.update("\0")
.update(process.cwd())
.digest("hex");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use unique fallback scheduler seed per gateway

When loadOrCreateDeviceIdentity() fails (for example on read-only or permission-limited state dirs), the fallback seed is derived only from HOME and cwd. In many containerized/service deployments those values are identical across replicas, so different gateways end up with the same schedulerSeed and therefore the same per-agent phase, re-synchronizing interval heartbeats and negating this commit’s load-spreading goal under exactly the degraded path this fallback is meant to handle.

Useful? React with 👍 / 👎.

@odysseus0 odysseus0 force-pushed the fix/heartbeat-stable-phase branch from 1f3d716 to 774ede6 Compare April 11, 2026 02:39
@odysseus0 odysseus0 merged commit 9a4a9a5 into openclaw:main Apr 11, 2026
26 checks passed
@odysseus0

Copy link
Copy Markdown
Contributor Author

Merged via squash.

Thanks @odysseus0!

@odysseus0 odysseus0 deleted the fix/heartbeat-stable-phase branch April 11, 2026 02:40

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 774ede6408

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

prev &&
prev.intervalMs === intervalMs &&
prev.phaseMs === phaseMs &&
prev.nextDueMs > params.nowMs

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep missed due slot when heartbeat config hot-reloads

resolveNextHeartbeatDueMs only preserves the previous schedule when prev.nextDueMs > nowMs. On hot reloads (src/gateway/server-reload-handlers.ts:87-89 calls heartbeatRunner.updateConfig for heartbeat-relevant config changes), if reload occurs just after a due boundary, this logic recomputes from nowMs and skips the overdue run until the next phase slot, delaying heartbeats by up to a full interval. Before this change, lastRunMs + interval allowed immediate catch-up for overdue slots.

Useful? React with 👍 / 👎.

lovewanwan pushed a commit to lovewanwan/openclaw that referenced this pull request Apr 28, 2026
Merged via squash.

Prepared head SHA: 774ede6
Co-authored-by: odysseus0 <8635094+odysseus0@users.noreply.github.com>
Co-authored-by: odysseus0 <8635094+odysseus0@users.noreply.github.com>
Reviewed-by: @odysseus0
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
Merged via squash.

Prepared head SHA: 774ede6
Co-authored-by: odysseus0 <8635094+odysseus0@users.noreply.github.com>
Co-authored-by: odysseus0 <8635094+odysseus0@users.noreply.github.com>
Reviewed-by: @odysseus0
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
Merged via squash.

Prepared head SHA: 774ede6
Co-authored-by: odysseus0 <8635094+odysseus0@users.noreply.github.com>
Co-authored-by: odysseus0 <8635094+odysseus0@users.noreply.github.com>
Reviewed-by: @odysseus0
globalcaos pushed a commit to globalcaos/tinkerclaw that referenced this pull request May 13, 2026
Merged via squash.

Prepared head SHA: 774ede6
Co-authored-by: odysseus0 <8635094+odysseus0@users.noreply.github.com>
Co-authored-by: odysseus0 <8635094+odysseus0@users.noreply.github.com>
Reviewed-by: @odysseus0
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 24, 2026
Merged via squash.

Prepared head SHA: 774ede6
Co-authored-by: odysseus0 <8635094+odysseus0@users.noreply.github.com>
Co-authored-by: odysseus0 <8635094+odysseus0@users.noreply.github.com>
Reviewed-by: @odysseus0
jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026
Merged via squash.

Prepared head SHA: 774ede6
Co-authored-by: odysseus0 <8635094+odysseus0@users.noreply.github.com>
Co-authored-by: odysseus0 <8635094+odysseus0@users.noreply.github.com>
Reviewed-by: @odysseus0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

maintainer Maintainer-authored PR size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant