Skip to content

fix(cron): implement true LRU cache eviction#39706

Closed
liangruochong44-ui wants to merge 2 commits into
openclaw:mainfrom
liangruochong44-ui:fix/cron-cache-lru
Closed

fix(cron): implement true LRU cache eviction#39706
liangruochong44-ui wants to merge 2 commits into
openclaw:mainfrom
liangruochong44-ui:fix/cron-cache-lru

Conversation

@liangruochong44-ui

Copy link
Copy Markdown

Summary

The cronEvalCache was using FIFO eviction (first inserted = first evicted), but Map.get() doesn't reorder entries. This caused frequently accessed entries to be prematurely evicted.

Changes

  • On cache hit, delete and re-set the entry to move it to the end of Map iteration order, implementing true LRU behavior.

Testing

  • Existing cron tests pass

Closes #39679

lrc added 2 commits March 8, 2026 18:10
- Add WAL journal mode to memory SQLite database (fixes openclaw#36035)
  WAL mode is crash-safe and survives SIGTERM/SIGKILL mid-write
- Add transient SQLite error handling (fixes openclaw#34678)
  Classify SQLITE_CANTOPEN, SQLITE_BUSY, SQLITE_LOCKED, SQLITE_IOERR as
  non-fatal errors to prevent LaunchAgent crash loops
The cronEvalCache was using FIFO eviction (first inserted = first
evicted), but Map.get() doesn't reorder entries. This caused frequently
accessed entries to be prematurely evicted.

Fix: On cache hit, delete and re-set the entry to move it to the end
of Map iteration order, implementing true LRU behavior.

Fixes openclaw#39679
@greptile-apps

greptile-apps Bot commented Mar 8, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR bundles three distinct changes under the cron LRU cache fix title: (1) the stated LRU fix in schedule.ts, (2) a new isTransientSqliteError handler in unhandled-rejections.ts, and (3) enabling WAL journal mode in manager-sync-ops.ts.

  • src/cron/schedule.ts — The LRU fix is correct. Delete-then-re-insert on a cache hit moves the entry to the tail of Map iteration order, ensuring the eviction path always removes the least-recently-used entry.
  • src/infra/unhandled-rejections.tsSQLITE_BUSY and SQLITE_LOCKED are legitimately transient, but SQLITE_CANTOPEN (missing file / bad permissions that won't self-heal) and SQLITE_IOERR (a broad parent code covering disk-level failures) are not reliably transient. Treating them as such means the gateway can silently continue running while its database is inaccessible, masking data-loss or corruption rather than crashing fast. The message-matching path also skips the case-normalization (.toUpperCase()) used by the analogous network-error checker, leaving lowercase variants undetected.
  • src/memory/manager-sync-ops.ts — WAL mode is a good addition; it is idempotent and improves crash-safety. The return value of the pragma is not checked, so a silent fallback on unsupported filesystems would go unnoticed, but this is a low-impact edge case.

Confidence Score: 2/5

  • The SQLite transient-error classification in unhandled-rejections.ts can cause the gateway to silently continue after non-recoverable database failures; needs tightening before merge.
  • The cron LRU fix and WAL mode change are both correct and safe. However, classifying SQLITE_CANTOPEN and SQLITE_IOERR as transient in the unhandled-rejection handler is a meaningful logic issue: these codes often represent permanent conditions, and suppressing them turns a crash-fast failure into a silent, hard-to-diagnose broken state. This lowers confidence significantly.
  • src/infra/unhandled-rejections.ts — specifically the TRANSIENT_SQLITE_CODES set and the isTransientSqliteError message-matching logic.

Last reviewed commit: 0a9c952

Comment on lines +72 to +76
"SQLITE_CANTOPEN",
"SQLITE_BUSY",
"SQLITE_LOCKED",
"SQLITE_IOERR",
]);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQLITE_CANTOPEN and SQLITE_IOERR are not truly transient

SQLITE_CANTOPEN typically indicates a permanent condition — the database file or its parent directory doesn't exist, or the process lacks permissions that won't fix themselves. Classifying it as transient means the gateway will silently continue running while unable to access its database, causing silent data-loss or reads returning stale data instead of crashing-fast.

SQLITE_IOERR is similarly dangerous: it's the parent code for disk-level I/O failures (SQLITE_IOERR_READ, SQLITE_IOERR_WRITE, SQLITE_IOERR_FSYNC, etc.) that can indicate disk corruption or hardware failure — not temporary contention. Swallowing it prevents the operator from knowing the database is in a bad state.

Consider narrowing the set to only codes that are genuinely retryable:

Suggested change
"SQLITE_CANTOPEN",
"SQLITE_BUSY",
"SQLITE_LOCKED",
"SQLITE_IOERR",
]);
const TRANSIENT_SQLITE_CODES = new Set([
"SQLITE_BUSY",
"SQLITE_LOCKED",
]);

If SQLITE_CANTOPEN or SQLITE_IOERR must be handled at all, they should at minimum not be suppressed silently — they warrant a distinct code-path that either retries with backoff or shuts down with a clear error.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/unhandled-rejections.ts
Line: 72-76

Comment:
**`SQLITE_CANTOPEN` and `SQLITE_IOERR` are not truly transient**

`SQLITE_CANTOPEN` typically indicates a permanent condition — the database file or its parent directory doesn't exist, or the process lacks permissions that won't fix themselves. Classifying it as transient means the gateway will silently continue running while unable to access its database, causing silent data-loss or reads returning stale data instead of crashing-fast.

`SQLITE_IOERR` is similarly dangerous: it's the parent code for disk-level I/O failures (`SQLITE_IOERR_READ`, `SQLITE_IOERR_WRITE`, `SQLITE_IOERR_FSYNC`, etc.) that can indicate disk corruption or hardware failure — not temporary contention. Swallowing it prevents the operator from knowing the database is in a bad state.

Consider narrowing the set to only codes that are genuinely retryable:

```suggestion
const TRANSIENT_SQLITE_CODES = new Set([
  "SQLITE_BUSY",
  "SQLITE_LOCKED",
]);
```

If `SQLITE_CANTOPEN` or `SQLITE_IOERR` must be handled at all, they should at minimum not be suppressed silently — they warrant a distinct code-path that either retries with backoff or shuts down with a clear error.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +234 to +238
const rawMessage = (candidate as { message?: unknown }).message;
const message = typeof rawMessage === "string" ? rawMessage : "";
if (message.includes("SQLITE_CANTOPEN") || message.includes("SQLITE_BUSY") ||
message.includes("SQLITE_LOCKED") || message.includes("SQLITE_IOERR")) {
return true;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Message check skips case normalization

Unlike isTransientNetworkError (line 186), which calls .toLowerCase().trim() before matching, the message here is used as-is and matched against uppercase literals. If a SQLite binding ever surfaces a lowercase variant (e.g. "sqlite_busy") or mixed-case message, these checks would silently miss it.

Suggested change
const rawMessage = (candidate as { message?: unknown }).message;
const message = typeof rawMessage === "string" ? rawMessage : "";
if (message.includes("SQLITE_CANTOPEN") || message.includes("SQLITE_BUSY") ||
message.includes("SQLITE_LOCKED") || message.includes("SQLITE_IOERR")) {
return true;
const message = typeof rawMessage === "string" ? rawMessage.toUpperCase() : "";
if (message.includes("SQLITE_CANTOPEN") || message.includes("SQLITE_BUSY") ||
message.includes("SQLITE_LOCKED") || message.includes("SQLITE_IOERR")) {
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/unhandled-rejections.ts
Line: 234-238

Comment:
**Message check skips case normalization**

Unlike `isTransientNetworkError` (line 186), which calls `.toLowerCase().trim()` before matching, the message here is used as-is and matched against uppercase literals. If a SQLite binding ever surfaces a lowercase variant (e.g. `"sqlite_busy"`) or mixed-case message, these checks would silently miss it.

```suggestion
    const message = typeof rawMessage === "string" ? rawMessage.toUpperCase() : "";
    if (message.includes("SQLITE_CANTOPEN") || message.includes("SQLITE_BUSY") ||
        message.includes("SQLITE_LOCKED") || message.includes("SQLITE_IOERR")) {
```

How can I resolve this? If you propose a fix, please make it concise.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0a9c952f7b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

const db = new DatabaseSync(dbPath, { allowExtension: this.settings.store.vector.enabled });
// WAL mode is crash-safe and survives SIGTERM/SIGKILL mid-write.
// It also allows concurrent reads during writes.
db.exec("PRAGMA journal_mode = WAL");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Set busy_timeout before enabling WAL mode

Running PRAGMA journal_mode = WAL before PRAGMA busy_timeout = 5000 makes the first lock-sensitive statement execute with SQLite's default 0ms timeout, so a brief concurrent writer can cause immediate database is locked failures during DB open/reopen. This regresses the retry behavior described in the nearby comment ("retry instead of failing immediately"), and can make startup/reindexing flaky under normal multi-process contention.

Useful? React with 👍 / 👎.

Comment on lines +236 to +237
if (message.includes("SQLITE_CANTOPEN") || message.includes("SQLITE_BUSY") ||
message.includes("SQLITE_LOCKED") || message.includes("SQLITE_IOERR")) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match transient SQLite errors using normalized messages

The new SQLite suppression path only matches uppercase substrings like SQLITE_BUSY/SQLITE_LOCKED, but common runtime forms in this codebase are normalized lowercase or plain text such as database is locked (for example, qmd-manager already checks those forms). As written, many transient SQLite lock rejections will not be recognized and still trigger process.exit(1), so the intended non-fatal handling is incomplete.

Useful? React with 👍 / 👎.

@openclaw-barnacle

Copy link
Copy Markdown

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle Bot added the stale Marked as stale due to inactivity label Apr 24, 2026
@clawsweeper

clawsweeper Bot commented Apr 26, 2026

Copy link
Copy Markdown
Contributor

Closing this as duplicate or superseded after Codex automated review.

Close as superseded by #39703. The cron LRU behavior is still not implemented on current main, but #39703 already tracks the same remaining fix as a narrower PR with regression-test coverage, while #39706 bundles unrelated SQLite/WAL changes that were flagged in review and have since diverged on main.

Best possible solution:

Close #39706 as superseded by #39703 and keep the remaining cron LRU work on that focused PR or an equivalent narrow branch. The eventual fix should promote cache hits before eviction and include a regression test that proves hot cron entries remain cached after capacity eviction.

What I checked:

  • Current main still has FIFO cron eviction: At HEAD 73e2151, resolveCachedCron() returns a cached Cron immediately without deleting/re-setting the key, then evicts cronEvalCache.keys().next().value when full. That is insertion-order FIFO behavior, not true LRU. (src/cron/schedule.ts:17, 73e21511075d)
  • Current cron tests do not cover eviction order: The existing cron cache test verifies reuse and final cache size for repeated expression/timezone lookups, but it does not fill the cache, promote a hot key, or assert which key survives eviction. (src/cron/schedule.test.ts:127, 73e21511075d)
  • Canonical related PR tracks the same remaining work: The provided/local ClawSweeper report for fix: use LRU eviction for cron schedule cache instead of FIFO #39703 shows it is an open PR titled “fix: use LRU eviction for cron schedule cache instead of FIFO”; it was kept open because current main still has FIFO eviction, and its branch is focused on the cron cache with a key-level regression test asserting promoted entries survive eviction. (src/cron/schedule.test.ts:145, 00cb426b2c66)
  • This PR is dirty relative to the focused cron fix: The provided PR context shows fix(cron): implement true LRU cache eviction #39706 changes src/cron/schedule.ts plus unrelated SQLite transient-error handling and WAL-mode code. Review comments flagged those unrelated SQLite/WAL changes, and current main now has a different SQLite handler/WAL layout, including src/infra/unhandled-rejections.ts and WAL setup under src/tasks and src/proxy-capture rather than the old src/memory/manager-sync-ops.ts path. (src/infra/unhandled-rejections.ts:70, 73e21511075d)

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Codex Review notes: model gpt-5.5, reasoning high; reviewed against 73e21511075d.

@openclaw-barnacle openclaw-barnacle Bot removed the stale Marked as stale due to inactivity label Apr 26, 2026
@clawsweeper clawsweeper Bot closed this Apr 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Cron schedule cache eviction is FIFO, not LRU — hot entries get evicted prematurely

1 participant