Skip to content

fix(sessions): fall back to latest reset archive for missing async transcripts#76134

Open
masatohoshino wants to merge 3 commits into
openclaw:mainfrom
masatohoshino:pr/openclaw-archived-session-async-fallback
Open

fix(sessions): fall back to latest reset archive for missing async transcripts#76134
masatohoshino wants to merge 3 commits into
openclaw:mainfrom
masatohoshino:pr/openclaw-archived-session-async-fallback

Conversation

@masatohoshino

@masatohoshino masatohoshino commented May 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Why Refs and not Fixes:

Root Cause (if applicable)

  • Root cause: the candidate-resolution path used by all three async readers (findExistingTranscriptPath plus the inline resolveSessionTranscriptCandidates + existsSync block in readSessionTitleFieldsFromTranscriptAsync) only considers active <id>.jsonl candidates, even though archiveFileOnDisk (session-transcript-files.fs.ts) atomically renames the active to <id>.jsonl.reset.<ts> on reset.
  • Missing detection / guardrail: cleanupArchivedSessionTranscripts already enumerates archives via parseSessionArchiveTimestamp(entry, "reset"), so core was maintaining the on-disk archive while the read side had no async-path equivalent. The sync side is being addressed in fix(session): fall back to reset archive in chat.history when primary transcript is missing #60409; this PR brings the async side to behavioral parity for the missing-active case only.
  • Contributing context: the rename and the next active-write are not atomic from a reader's point of view, so a reader arriving in that gap saw nothing.

Regression Test Plan (if applicable)

  • Coverage level: [x] Unit test
  • Target file: src/gateway/session-utils.fs.test.ts — new describe("async archive fallback (missing-primary auto-fallback)") block.
  • Invariants asserted (10 cases across the three async readers):
    • Active-wins: when active and archive both exist, the result is identical to today's output for the active file (no merging, archive ignored).
    • Missing-active fallback: when only an archive exists, the result is the archive's content (or its title fields).
    • Latest-only: when multiple .reset.<ts> archives exist, only the newest archive's content is returned (no chain aggregation).
    • Both-missing: when no candidate and no archive exist, the result is the existing default ([] or {firstUserMessage: null, lastMessagePreview: null}).
  • Why this is the smallest reliable guardrail: each invariant maps directly to a public Acceptance Criterion in this PR and to a single branch in the new helper. The fixtures use formatSessionArchiveTimestamp(ms) for deterministic archive names, so the cases run in milliseconds and do not depend on Date.now() ordering.
  • Existing coverage: none for the missing-primary async read path. archiveSessionTranscripts describe block elsewhere in the same file covers archive creation but not the read-side fallback.

User-visible / Behavior Changes

  • When the active transcript is missing, async readers used by chat.history, sessions-history HTTP, and session listing/title now return the newest reset-archive content (or its title fields) instead of an empty result. Active-present behavior is intended to remain unchanged and is covered by the active-wins tests above. No new config knob, no schema change, no opt-in flag.

Diagram (if applicable)

N/A (single-step decision: "no active candidate exists → consult findLatestResetArchiveAsync → read that one path").

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Concrete invariant: the fallback uses the same candidate roots as resolveSessionTranscriptCandidates, only inspects sibling filenames in those candidate parent directories, and accepts only entries whose suffix passes the existing parseSessionArchiveTimestamp(entry, "reset") validation (already used by cleanupArchivedSessionTranscripts). No new directory roots, no caller-controlled path components.

Repro + Verification

Environment

  • OS: Linux x86_64 (workspace dev environment)
  • Runtime: Node 22, pnpm 10.33.2
  • Model/provider: N/A (server-side fs path)
  • Integration/channel (if any): N/A
  • Relevant config: default ~/.openclaw layout / temp store under os.tmpdir() for tests

Steps

  1. Start with no <id>.jsonl for sessionId in the sessions store; place a <id>.jsonl.reset.<UTC timestamp> sibling with content.
  2. Call each of readRecentSessionMessagesAsync, readSessionMessagesAsync({mode:"full"}), and readSessionTitleFieldsFromTranscriptAsync.
  3. Add an active <id>.jsonl with different content and repeat — assert active wins.
  4. Add a second <id>.jsonl.reset.<later-ts> and re-remove the active file — assert only the newer archive is returned.

Expected

  • Step 2: archive content / archive-derived title fields.
  • Step 3: active content only (archive ignored).
  • Step 4: newer archive content only (older archive bytes do not appear).

Actual

  • Matches Expected. The 10 new cases in src/gateway/session-utils.fs.test.ts enforce all four invariants.

Evidence

  • The new regression cases cover behavior that is absent from the PR base (current upstream/main HEAD bd6035d97707b7a62376a30de2eeae9579d56466): three of the new invariants (missing-active fallback, latest-archive-only, archive-derived title fields) exercise code paths that did not exist before this PR, so without these commits the asserted post-conditions could not be produced. I have not separately cherry-picked the new describe block onto the base commit to demonstrate a direct red run.
  • Existing 71 cases in session-utils.fs.test.ts continue to pass alongside the 10 new cases (81 / 81 green locally on this branch). Adjacent regression: session-archive.imports.test.ts, session-transcript-key.test.ts, session-history-state.test.ts, sessions-history-http.revocation.test.ts, managed-image-attachments.test.ts, embedded-backend.test.ts all pass alongside.
  • Real-behavior trace — see ## Real behavior proof section below.
  • Screenshot/recording — N/A (server-side fs).
  • Perf numbers — N/A (extra cost is one async readdir per deduped candidate parent dir, only on the missing-primary slow path; on the fast path the only added work is one extra await).

Real behavior proof

Captured 2026-05-17 against this branch rebased onto upstream/main 9616aa6e5a. Branch head SHA after rebase: 21bd4d3ed080c6cb75f725dbf14a527f22a6074b.

Behavior or issue addressed: when a session has been reset (<id>.jsonl atomically rotated to <id>.jsonl.reset.<UTC timestamp> with no active file recreated yet), the production async readers consumed by chat.history (readRecentSessionMessagesAsync), sessions-history HTTP (readSessionMessagesAsync + readSessionMessageCountAsync + readRecentSessionMessagesWithStatsAsync), and session listing/title (readSessionTitleFieldsFromTranscriptAsync) used to return an empty response. After this PR they auto-fall back to the newest reset-archive sibling.

Real environment tested: Linux x86_64 local dev workspace (Node v22, pnpm 11.1.0) running against this branch's actual source tree under src/gateway/. The fs state was constructed on the real local filesystem (/tmp/proof-archive-async-*), the readers were imported directly from src/gateway/session-utils.fs.ts via tsx (no mocks, no stubs, no vitest harness), and the sessions-history HTTP path was exercised with the real gateway HTTP server bound to a real loopback port.

Exact steps or command run after this patch:

  1. Build the temp store: write a sessions.json and a sibling <id>.jsonl.reset.2026-05-01T00-00-00.000Z archive carrying two messages; do not create the active <id>.jsonl.
  2. Import the production readers from src/gateway/session-utils.fs.ts directly and invoke each public async reader against that fs state.
  3. Run the real gateway HTTP server in the regression test, archive the active via the production archiveSessionTranscripts helper (reset reason → atomic rename), and follow nextCursor pagination with real fetch calls.

Concrete commands run on the rebased branch:

$ node node_modules/.bin/tsx scripts/proof-archive-async-fallback.ts
$ CI=true pnpm test src/gateway/sessions-history-http.test.ts -- --reporter=verbose -t archive
$ CI=true pnpm test src/gateway/session-utils.fs.test.ts src/gateway/sessions-history-http.test.ts
$ CI=true pnpm check:changed

Evidence after fix:

Terminal transcript from the standalone fs probe (no test framework involved — actual production readers invoked against real disk state):

=== fs state (active missing, reset archive present) ===
tmpDir:           /tmp/proof-archive-async-kRSk7e
primary (active): sess-proof.jsonl
  exists?         false
archive (reset):  sess-proof.jsonl.reset.2026-05-01T00-00-00.000Z
  exists?         true
  size bytes:     197

=== ls of tmpDir ===
  sess-proof.jsonl.reset.2026-05-01T00-00-00.000Z
  sessions.json

=== production async readers (PR #76134 fallback path) ===
readRecentSessionMessagesAsync:
[
  { "role": "user",      "content": "archived: please summarize log A", "__openclaw": { "seq": 1 } },
  { "role": "assistant", "content": "archived: summary = A1, A2, A3",   "__openclaw": { "seq": 2 } }
]

readSessionMessagesAsync (mode=full):
[
  { "role": "user",      "content": "archived: please summarize log A", "__openclaw": { "seq": 1 } },
  { "role": "assistant", "content": "archived: summary = A1, A2, A3",   "__openclaw": { "seq": 2 } }
]

readSessionMessageCountAsync:
  2

readRecentSessionMessagesWithStatsAsync (totalMessages + seq):
{ "messages": [
    { "role": "user",      "content": "archived: please summarize log A", "__openclaw": { "seq": 1 } },
    { "role": "assistant", "content": "archived: summary = A1, A2, A3",   "__openclaw": { "seq": 2 } }
  ],
  "totalMessages": 2
}

readSessionTitleFieldsFromTranscriptAsync:
{ "firstUserMessage": "archived: please summarize log A",
  "lastMessagePreview": "archived: summary = A1, A2, A3" }

Live HTTP transcript from the regression scenario (real gateway HTTP server, archive-only state, real fetch cursor pagination):

GET /v1/sessions/agent:main:main/history?limit=2
  items     : ["second message","third message"]
  msg seqs  : [2, 3]
  hasMore   : true
  nextCursor: "2"

GET /v1/sessions/agent:main:main/history?limit=2&cursor=2
  items     : ["first message"]
  msg seqs  : [1]
  hasMore   : false
  nextCursor: undefined

The seq numbering 2,3 then 1 (rather than 1,2 then end) only holds if totalMessages is computed from the archived transcript (3 messages) rather than from the missing active file (0) — so the HTTP pagination path is observably routed through the new archive fallback in both the readers and the count helper.

Observed result after fix: every public async surface consumed by chat.history, sessions-history HTTP, and session listing/title returns the archived content against the archive-only fs state (recent.length=2, full.length=2, messageCount=2, stats.totalMessages=2, title.firstUserMsg="archived: please summarize log A"). Cursor pagination over the same archive-only state crosses both pages without skipping or duplicating, with stable __openclaw.seq numbering anchored to the archived totalMessages. Against the same fs state on the prior commit db66238880 the readers return [] / 0 / null title fields and the cursor pagination cannot reach the middle archive page.

What was not tested:

  • Live chat.history round trip over the OpenClaw protocol seam against a real running OpenClaw gateway with a real client; the change sits below the protocol layer and the protocol shape is unchanged, so the public readers above are the contract surface.
  • macOS / Windows filesystem semantics; the new code path only depends on fs.promises.readdir + fs.promises.stat, which are platform-stable.
  • Active-file present case under live operator traffic; the active-wins boundary is covered by the regression suite (prefers active over archive when both exist cases) and the change deliberately does not touch the active path.

Human Verification (required)

  • Verified scenarios:
    • pnpm test src/gateway/session-utils.fs.test.ts (rebased branch, locally) — 81 / 81 passed.
    • pnpm test for the consumer-side test files listed in Evidence — all passed.
    • pnpm check:changed — typecheck-all (tsgo), lint (oxlint), no-conflict-markers, no wildcard re-export drift, no runtime sidecar drift, no import cycles, all green locally.
  • Edge cases checked:
    • Empty candidate list → findLatestResetArchiveAsync returns null.
    • readdir failure on a candidate directory → swallowed by .catch(() => []), fallback continues with other deduped candidate dirs.
    • Filename starts with <basename>. but is not a .reset.<ts> archive → rejected by parseSessionArchiveTimestamp returning null.
    • Distinct timestamps → strict > keeps the newer one; same-timestamp edge case is not relied on by any test.
  • What I did not verify:
    • Live chat.history end-to-end (the change is below the protocol seam; the protocol is unchanged so the unit tests on the three readers are the contract).
    • macOS / Windows behavior; the change relies only on fs.promises.readdir + fs.promises.stat semantics that are stable across platforms.
  • Note on local vs Testbox: I am an external contributor without Blacksmith/Testbox access, so the broad gates above ran locally rather than on Testbox. The full pnpm test:changed fan-out also surfaced 8 unrelated upstream-pre-existing failures in src/agents/pi-embedded-runner/compact.hooks.test.ts (its vi.doMock("../session-write-lock.js", ...) factory does not list resolveSessionWriteLockAcquireTimeoutMs, which was added by upstream commit f7ed29e118 "fix: thread session write-lock timeout config"). I confirmed those failures reproduce on bare bd6035d97707b7a62376a30de2eeae9579d56466 without this PR's commits.

Review Conversations

  • I will reply to or resolve every bot review conversation I address in this PR.
  • I will leave unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes — async readers return strictly more (a previously-empty result becomes the archive's content) only when no active exists; active-present behavior is intended to remain unchanged and is covered by the active-wins tests.
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: a reader that arrives during the rename window could see archive content one read and active content the next.
    • Mitigation: this PR does not change the rename ordering; it only adds a fallback for the missing-active case. If a writer transitions the active back into existence between two reads, the second read picks the active (existing behavior).
  • Risk: a corrupt archive timestamp suffix could be selected as "newest".
    • Mitigation: parseSessionArchiveTimestamp validates the suffix against ARCHIVE_TIMESTAMP_RE and returns null for non-matching suffixes; rogue entries are skipped before sorting.
  • Risk: extra readdir cost on the missing-primary path.
    • Mitigation: a seenDirs set deduplicates across overlapping candidate parent directories; readdir errors are swallowed and the fallback yields no archive (caller's existing default applies).

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime size: M labels May 2, 2026
@clawsweeper

clawsweeper Bot commented May 2, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed June 2, 2026, 1:07 AM ET / 05:07 UTC.

Summary
The PR reroutes async gateway session-history/title readers, counts, and visitor paths to fall back from a missing active transcript to the newest .reset.<timestamp> archive, with regression coverage for active-wins and archive-only pagination.

PR surface: Source +55, Tests +352, Docs +1. Total +408 across 5 files.

Reproducibility: yes. from source inspection: current main's async readers and count paths use active-only transcript resolution while reset archives are valid sibling files. I did not run tests because this was a required read-only review.

Review metrics: 2 noteworthy metrics.

  • Release-owned changelog edits: 1 added entry. Normal OpenClaw PRs should leave CHANGELOG.md to release generation and keep release-note context in the PR body or squash message.
  • Async transcript read surfaces: 5 rerouted. The PR changes full reads, recent reads, message counts, visitor traversal, and async title fields, so active-vs-archive invariants matter across all of them before merge.

Merge readiness
Overall: 🦞 diamond lobster
Proof: 🦞 diamond lobster
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

  • [P1] GitHub context reports this PR as mergeableState: dirty, so maintainers need a rebase or conflict resolution before trusting the final merge result.
  • [P2] The fallback changes which physical transcript backs async session history when the active file is missing; active-present behavior is tested, but maintainers should still treat this as session-state-sensitive.
  • [P1] The CHANGELOG.md hunk should be removed because OpenClaw release generation owns changelog updates for normal PRs.

Maintainer options:

  1. Refresh and keep the narrow fallback (recommended)
    Rebase or otherwise refresh the branch, remove the changelog hunk, and preserve the active-wins plus archive-only pagination tests before merge.
  2. Accept automatic archive fallback deliberately
    Maintainers can accept the changed missing-active session-history behavior as the intended default, but should do so knowing it can surface reset-archived transcript content automatically.
  3. Pause for the broader archive-history design
    If maintainers prefer only explicit archive recovery, pause this PR and route the remaining work through the open includeArchived/logical-history efforts instead.
Copy recommended automerge instruction
@clawsweeper automerge

Special instructions:
Refresh PR 76134 against current main, remove the `CHANGELOG.md` edit while preserving release-note context in the PR body or squash message, and keep the async session archive fallback plus focused gateway/session-history regression tests.

Next step before merge

  • Continue normal maintainer review; ClawSweeper found no patch-correctness issue.

Security
Cleared: No concrete security or supply-chain concern was found; the diff reads local sibling transcript archives under existing candidate roots and does not add dependencies, network calls, command execution, workflows, or secret handling.

Review details

Do we have a high-confidence way to reproduce the issue?

Yes, from source inspection: current main's async readers and count paths use active-only transcript resolution while reset archives are valid sibling files. I did not run tests because this was a required read-only review.

Is this the best way to solve the issue?

Mostly yes: the fallback is the narrowest maintainable fix for the missing-active async slice and avoids schema/config expansion. It is not merge-ready until the dirty branch state and release-owned changelog edit are cleaned up.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against ebf20241bd17.

Label changes

Label changes:

  • add merge-risk: 🚨 session-state: The PR changes async session-history source selection when an active transcript is missing, which can affect what context users and gateway callers see after resets.
  • add proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix terminal/live-output proof using real filesystem state and a real loopback gateway HTTP sessions-history path, not only mocks or CI.
  • add rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body includes after-fix terminal/live-output proof using real filesystem state and a real loopback gateway HTTP sessions-history path, not only mocks or CI.
  • remove impact:session-state: Current review selected no impact labels.
  • remove impact:message-loss: Current review selected no impact labels.
  • remove rating: 🌊 off-meta tidepool: Current PR rating is rating: 🦞 diamond lobster, so this older rating label is no longer current.

Label justifications:

  • P2: This fixes a bounded but user-visible session-history/session-state gap without evidence of a current emergency outage.
  • merge-risk: 🚨 session-state: The PR changes async session-history source selection when an active transcript is missing, which can affect what context users and gateway callers see after resets.
  • rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body includes after-fix terminal/live-output proof using real filesystem state and a real loopback gateway HTTP sessions-history path, not only mocks or CI.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix terminal/live-output proof using real filesystem state and a real loopback gateway HTTP sessions-history path, not only mocks or CI.
Evidence reviewed

PR surface:

Source +55, Tests +352, Docs +1. Total +408 across 5 files.

View PR surface stats
Area Files Added Removed Net
Source 2 62 7 +55
Tests 2 352 0 +352
Docs 1 1 0 +1
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 5 415 7 +408

What I checked:

  • Current main still resolves async full reads through active transcript paths only: On current main, readSessionMessagesAsync calls findExistingTranscriptPath, returning [] when no active candidate exists instead of scanning reset archives. (src/gateway/session-utils.fs.ts:585, ebf20241bd17)
  • Current main still resolves async recent/count paths through the same active-only helper: readSessionMessageCountAsync and readRecentSessionMessagesAsync also call findExistingTranscriptPath, so bounded session-history stats can be empty or inconsistent when only a reset archive exists. (src/gateway/session-utils.fs.ts:647, ebf20241bd17)
  • Reset archives are existing supported artifacts: Current main creates archive names as <file>.<reason>.<timestamp> with archiveFileOnDisk, and timestamp parsing validates .reset.<timestamp> suffixes through parseSessionArchiveTimestamp. (src/gateway/session-transcript-files.fs.ts:128, ebf20241bd17)
  • PR head adds the missing archive resolver: The PR adds findLatestResetArchiveAsync(candidatePaths), scanning each candidate parent directory and accepting only reset entries whose suffix passes the existing timestamp parser. (src/gateway/session-transcript-files.fs.ts:243, abc9c164f481)
  • PR head keeps active transcripts ahead of archive fallback: The new internal findActiveOrLatestResetArchiveAsync first returns the first existing active candidate, then falls back to the latest reset archive only when active candidates are missing. (src/gateway/session-utils.fs.ts:1015, abc9c164f481)
  • Regression coverage includes the pagination edge that prior review flagged: The PR adds an HTTP sessions-history test proving archive-only bounded pagination returns [2,3] then [1] with consistent nextCursor behavior after archiveSessionTranscripts({ reason: "reset" }). (src/gateway/sessions-history-http.test.ts:456, abc9c164f481)

Likely related people:

  • Vincent Koc: Current-main blame for the central async session reader block points to commit 459abfc26b, and recent local history includes gateway/session-history helper work by the same author. (role: recent area contributor; confidence: medium; commits: 459abfc26baf, 27afd01577; files: src/gateway/session-utils.fs.ts, src/gateway/session-transcript-files.fs.ts, src/gateway/sessions-history-http.test.ts)
  • Peter Steinberger: Recent history includes fix(gateway): unify session history snapshots and refactor: consolidate session history sanitization, both touching the HTTP/session-history surfaces this PR relies on. (role: session-history state contributor; confidence: medium; commits: 9e0d632928, b04dd6d05c; files: src/gateway/session-history-state.ts, src/gateway/sessions-history-http.ts, src/gateway/server-methods/chat.ts)
  • Byungsker: Commit 709dc671e4 introduced daily/scheduled reset transcript archiving, which is the on-disk state this PR now reads from async session-history paths. (role: archive behavior introducer; confidence: medium; commits: 709dc671e4; files: src/auto-reply/reply/session.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@masatohoshino masatohoshino force-pushed the pr/openclaw-archived-session-async-fallback branch from 532865f to db66238 Compare May 2, 2026 15:45
masatohoshino added a commit to masatohoshino/openclaw that referenced this pull request May 2, 2026
…hive fallback as readers

Codex/clawsweeper review on PR openclaw#76134 found that
`readRecentSessionMessagesWithStatsAsync` still derived
`totalMessages` from `readSessionMessageCountAsync`, which kept the
active-only `findExistingTranscriptPath` resolution. After the new
async reader fallback, an archive-only transcript longer than the
bounded tail window therefore returned the newest messages with low
`seq` values and `hasMore = false`. The sessions-history HTTP first
page passes `boundedSnapshot.totalMessages` into
`buildSessionHistorySnapshot`, so a `nextCursor` follow-up could skip
the middle archive page.

This commit routes the async count and visit helpers
(`readSessionMessageCountAsync` and `visitSessionMessagesAsync`)
through the same `findActiveOrLatestResetArchiveAsync` resolution that
the message readers use, so an archive-only transcript reports a
consistent total, gives messages the right tail-end `seq` values, and
exposes the archive's earlier pages via `nextCursor`.

Behavior boundaries (kept narrow, same as the original PR):

- Active wins when present (no active+archive merging).
- Only the single newest archive is returned (no chain aggregation).
- Sync paths, `findExistingTranscriptPath`, the protocol schema, the
  `includeArchived` opt-in, sessions_history schema, the session-memory
  hook, and the session-logs skill are still untouched.

Tests: 4 new cases in `session-utils.fs.test.ts` cover the count and
stats helpers (active-only, archive-only with archive longer than the
tail window, both-exist active-priority, both-missing) and assert the
returned `seq` metadata matches the archive's tail position. One new
case in `sessions-history-http.test.ts` exercises the archive-only
HTTP pagination path: archive a 3-message transcript, request the
first `limit=2` page, follow `nextCursor`, and assert the older
archive message is visible without skipping middle pages.

Refs openclaw#56131 openclaw#43929 openclaw#45003 openclaw#60409 openclaw#73883
@masatohoshino

Copy link
Copy Markdown
Contributor Author

Thanks for the precise root-cause trace. New head db66238880 addresses the [P2] finding without expanding scope.

Stats/count now uses the same active-or-archive resolution as the readers. Both readSessionMessageCountAsync and visitSessionMessagesAsync (src/gateway/session-utils.fs.ts) now go through the existing findActiveOrLatestResetArchiveAsync helper, so readRecentSessionMessagesWithStatsAsync derives totalMessages from the same file the messages came from. The active-only findExistingTranscriptPath resolution is no longer used by any of the async readers, count/stats helpers, or the visitor; sync paths and findExistingTranscriptPath itself are still untouched per scope.

Archive-only bounded pagination is now covered by tests.

  • src/gateway/session-utils.fs.test.ts adds 4 new cases:
    • readSessionMessageCountAsync returns archive count when primary is missing.
    • readSessionMessageCountAsync prefers active count when both exist.
    • readRecentSessionMessagesWithStatsAsync reports archive totals and consistent seq for archive-only transcripts (5-message archive, maxMessages: 2totalMessages: 5, seq: [4, 5]).
    • readRecentSessionMessagesWithStatsAsync prefers active totals + seq when active exists.
  • src/gateway/sessions-history-http.test.ts adds an end-to-end pagination case that mirrors the existing cursor pagination over direct REST test against an archive-only transcript: append 3 messages, archive via archiveSessionTranscripts({reason: "reset"}), GET ?limit=2 → asserts messages seq: [2, 3], hasMore: true, nextCursor: "2", then GET ?limit=2&cursor=2 → asserts seq: [1], hasMore: false, nextCursor: undefined. This is the exact "older archive page" path the review flagged as previously skippable.

Acceptance criteria run locally on the rebased branch:

  • pnpm test src/gateway/session-utils.fs.test.ts src/gateway/session-history-state.test.ts src/gateway/sessions-history-http.test.ts → 93 + 13 = 106 / 106 pass (gateway and e2e shards).
  • pnpm exec oxfmt --check --threads=1 src/gateway/session-utils.fs.ts src/gateway/session-utils.fs.test.ts src/gateway/sessions-history-http.test.ts CHANGELOG.md → clean.
  • pnpm check:changed → tsgo, lint, conflict-markers, wildcard-reexport, runtime-sidecar, import-cycles, all green.

Scope remains narrow. Still no includeArchived opt-in (#73883's surface), no active+archive merging (active still wins when present), no archive-chain aggregation (single newest .reset.<ts> only), no previousFragments / chained logical-history (#43929's broader UX), and no touches to the protocol schema, sessions_history tool schema, sync paths (#60409's surface), findExistingTranscriptPath, the session-memory hook, or the session-logs skill.

The CHANGELOG conflict is also resolved: rebased onto current upstream/main and the entry now sits among the latest Unreleased Fixes, no code-side hunk conflicts.

#76134

@masatohoshino masatohoshino force-pushed the pr/openclaw-archived-session-async-fallback branch from db66238 to 21bd4d3 Compare May 17, 2026 10:09
masatohoshino added a commit to masatohoshino/openclaw that referenced this pull request May 17, 2026
…hive fallback as readers

Codex/clawsweeper review on PR openclaw#76134 found that
`readRecentSessionMessagesWithStatsAsync` still derived
`totalMessages` from `readSessionMessageCountAsync`, which kept the
active-only `findExistingTranscriptPath` resolution. After the new
async reader fallback, an archive-only transcript longer than the
bounded tail window therefore returned the newest messages with low
`seq` values and `hasMore = false`. The sessions-history HTTP first
page passes `boundedSnapshot.totalMessages` into
`buildSessionHistorySnapshot`, so a `nextCursor` follow-up could skip
the middle archive page.

This commit routes the async count and visit helpers
(`readSessionMessageCountAsync` and `visitSessionMessagesAsync`)
through the same `findActiveOrLatestResetArchiveAsync` resolution that
the message readers use, so an archive-only transcript reports a
consistent total, gives messages the right tail-end `seq` values, and
exposes the archive's earlier pages via `nextCursor`.

Behavior boundaries (kept narrow, same as the original PR):

- Active wins when present (no active+archive merging).
- Only the single newest archive is returned (no chain aggregation).
- Sync paths, `findExistingTranscriptPath`, the protocol schema, the
  `includeArchived` opt-in, sessions_history schema, the session-memory
  hook, and the session-logs skill are still untouched.

Tests: 4 new cases in `session-utils.fs.test.ts` cover the count and
stats helpers (active-only, archive-only with archive longer than the
tail window, both-exist active-priority, both-missing) and assert the
returned `seq` metadata matches the archive's tail position. One new
case in `sessions-history-http.test.ts` exercises the archive-only
HTTP pagination path: archive a 3-message transcript, request the
first `limit=2` page, follow `nextCursor`, and assert the older
archive message is visible without skipping middle pages.

Refs openclaw#56131 openclaw#43929 openclaw#45003 openclaw#60409 openclaw#73883
@openclaw-barnacle openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 17, 2026
@clawsweeper clawsweeper Bot added P2 Normal backlog priority with limited blast radius. impact:message-loss Channel message delivery can be lost, duplicated, or misrouted. impact:session-state Session, memory, transcript, context, or agent state can drift or corrupt. proof: sufficient ClawSweeper judged the real behavior proof convincing. labels May 17, 2026
@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. proof: sufficient ClawSweeper judged the real behavior proof convincing. labels May 17, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 17, 2026
…reads when active is missing

When the active session transcript file is missing, async read paths used by
chat.history (`readSessionMessagesAsync`, `readRecentSessionMessagesAsync`)
and the session title/preview lookup
(`readSessionTitleFieldsFromTranscriptAsync`) returned an empty result instead
of any archived content. Reset rotates the active `<id>.jsonl` to a sibling
`<id>.jsonl.reset.<UTC timestamp>` and creates a new empty active file, so
sessions reset before the next read became invisible through the production
async API even when the archive still existed on disk.

This change adds a narrow auto-fallback on the async paths only:

- New `findLatestResetArchiveAsync(candidatePaths)` helper in
  `session-transcript-files.fs.ts` does an async `readdir` of each candidate
  parent directory, validates archive entries via the existing
  `parseSessionArchiveTimestamp(entry, "reset")` primitive, and returns only
  the path with the newest archive timestamp.
- A small internal helper `findActiveOrLatestResetArchiveAsync` in
  `session-utils.fs.ts` reuses the existing active-candidate `existsSync`
  check first; only when no active candidate exists does it consult the new
  helper.
- The three async functions above are re-routed through this helper.

Behavior boundaries (kept narrow on purpose):

- Active wins when present (no active+archive merging).
- Only the single newest archive is returned (no chain aggregation, no
  `previousFragments`, no rolled logical-history reconstruction).
- Sync paths, `findExistingTranscriptPath`, the protocol schema, the
  `includeArchived` opt-in, sessions_history schema, the session-memory hook,
  and the session-logs skill are untouched.

Refs openclaw#56131 openclaw#43929 openclaw#45003 openclaw#60409 openclaw#73883
…hive fallback as readers

Codex/clawsweeper review on PR openclaw#76134 found that
`readRecentSessionMessagesWithStatsAsync` still derived
`totalMessages` from `readSessionMessageCountAsync`, which kept the
active-only `findExistingTranscriptPath` resolution. After the new
async reader fallback, an archive-only transcript longer than the
bounded tail window therefore returned the newest messages with low
`seq` values and `hasMore = false`. The sessions-history HTTP first
page passes `boundedSnapshot.totalMessages` into
`buildSessionHistorySnapshot`, so a `nextCursor` follow-up could skip
the middle archive page.

This commit routes the async count and visit helpers
(`readSessionMessageCountAsync` and `visitSessionMessagesAsync`)
through the same `findActiveOrLatestResetArchiveAsync` resolution that
the message readers use, so an archive-only transcript reports a
consistent total, gives messages the right tail-end `seq` values, and
exposes the archive's earlier pages via `nextCursor`.

Behavior boundaries (kept narrow, same as the original PR):

- Active wins when present (no active+archive merging).
- Only the single newest archive is returned (no chain aggregation).
- Sync paths, `findExistingTranscriptPath`, the protocol schema, the
  `includeArchived` opt-in, sessions_history schema, the session-memory
  hook, and the session-logs skill are still untouched.

Tests: 4 new cases in `session-utils.fs.test.ts` cover the count and
stats helpers (active-only, archive-only with archive longer than the
tail window, both-exist active-priority, both-missing) and assert the
returned `seq` metadata matches the archive's tail position. One new
case in `sessions-history-http.test.ts` exercises the archive-only
HTTP pagination path: archive a 3-message transcript, request the
first `limit=2` page, follow `nextCursor`, and assert the older
archive message is visible without skipping middle pages.

Refs openclaw#56131 openclaw#43929 openclaw#45003 openclaw#60409 openclaw#73883
@masatohoshino masatohoshino force-pushed the pr/openclaw-archived-session-async-fallback branch from 21bd4d3 to abc9c16 Compare May 17, 2026 12:14
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 17, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 17, 2026
@openclaw-barnacle

Copy link
Copy Markdown

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle Bot added the stale Marked as stale due to inactivity label Jun 1, 2026
@clawsweeper clawsweeper Bot added rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. and removed proof: sufficient ClawSweeper judged the real behavior proof convincing. labels Jun 1, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the stale Marked as stale due to inactivity label Jun 2, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. and removed rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. impact:session-state Session, memory, transcript, context, or agent state can drift or corrupt. impact:message-loss Channel message delivery can be lost, duplicated, or misrouted. labels Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P2 Normal backlog priority with limited blast radius. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. size: M status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant