fix: use high-water mark for incremental indexing (fixes #84) by anastasiiaanfimova · Pull Request #85 · obra/episodic-memory

anastasiiaanfimova · 2026-04-28T09:40:26Z

Problem

The indexer skips a file entirely if COUNT(*) > 0 for its archive_path:

const alreadyIndexed = db.prepare(
  'SELECT COUNT(*) as count FROM exchanges WHERE archive_path = ?'
).get(archivePath) as { count: number };

if (alreadyIndexed.count > 0) continue;

This means any conversation that was indexed in a previous run will never receive new exchanges — even if the source .jsonl file has grown since then. Fixes #84.
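To make the failure mode concrete, here is a minimal sketch of the old gate. It is not the project's code: `rowCounts` and `indexWithCountGate` are hypothetical names, and a plain `Map` stands in for the `exchanges` table.

```typescript
// Hypothetical in-memory stand-in for the exchanges table: archive path -> row count.
const rowCounts = new Map<string, number>();

// Old gate: index a file only if it has no rows at all.
// Returns how many exchanges this run added.
function indexWithCountGate(archivePath: string, exchangesInFile: number): number {
  const count = rowCounts.get(archivePath) ?? 0;
  if (count > 0) return 0; // skipped entirely, even if the file has grown
  rowCounts.set(archivePath, exchangesInFile);
  return exchangesInFile;
}

// First run sees 2 exchanges; the file then grows to 5 before the second run.
const addedFirst = indexWithCountGate('conv.jsonl', 2);
const addedSecond = indexWithCountGate('conv.jsonl', 5);
```

Under these assumptions the second run adds nothing, so the three appended exchanges never reach the index.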

Fix

Replace the count check with a high-water mark using MAX(line_end). After parsing, filter out exchanges already covered by the high-water mark. The archive copy is also refreshed whenever the file already has indexed exchanges (maxIndexedLine > 0), because the source may have grown since the last copy.

// High-water mark: the last line already indexed for this archive (0 if none).
const maxIndexedResult = db.prepare(
  'SELECT COALESCE(MAX(line_end), 0) as max_line FROM exchanges WHERE archive_path = ?'
).get(archivePath) as { max_line: number };
const maxIndexedLine = maxIndexedResult.max_line;

// Copy on first sight, and refresh the copy for already-indexed files,
// since the source may have grown since the last copy.
if (!fs.existsSync(archivePath) || maxIndexedLine > 0) {
  fs.copyFileSync(sourcePath, archivePath);
}

const exchanges = await parseConversation(sourcePath, project, archivePath);
if (exchanges.length === 0) continue;

// Keep only exchanges that start after the high-water mark.
const newExchanges = maxIndexedLine > 0
  ? exchanges.filter(e => e.lineStart > maxIndexedLine)
  : exchanges;

if (newExchanges.length === 0) continue;

unprocessed.push({ ...conv, exchanges: newExchanges });

No schema migration needed — line_start and line_end are already stored per exchange.
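The filtering step can be isolated as two pure functions, which may make it easier to reason about without a database. This is a sketch under assumptions: `ExchangeSpan`, `highWaterMark`, and `selectNew` are hypothetical names, and line numbers are assumed to start at 1 so that 0 can mean "nothing indexed yet", as the COALESCE default suggests.

```typescript
// Hypothetical minimal shape; the real exchange type likely carries more fields.
interface ExchangeSpan {
  id: string;
  lineStart: number; // first line of the exchange in the .jsonl file (assumed 1-based)
  lineEnd: number;   // last line of the exchange
}

// Mirrors COALESCE(MAX(line_end), 0): 0 means nothing has been indexed yet.
function highWaterMark(indexed: ExchangeSpan[]): number {
  return indexed.reduce((max, e) => Math.max(max, e.lineEnd), 0);
}

// Keep only exchanges that start strictly after the mark.
function selectNew(parsed: ExchangeSpan[], mark: number): ExchangeSpan[] {
  return mark > 0 ? parsed.filter(e => e.lineStart > mark) : parsed;
}

// Two exchanges were indexed earlier; a re-parse of the grown file yields a third.
const indexed: ExchangeSpan[] = [
  { id: 'a', lineStart: 1, lineEnd: 2 },
  { id: 'b', lineStart: 3, lineEnd: 4 },
];
const parsed: ExchangeSpan[] = [
  ...indexed,
  { id: 'c', lineStart: 5, lineEnd: 6 },
];
const mark = highWaterMark(indexed);
const fresh = selectNew(parsed, mark);
```

Here `mark` is 4 and only exchange `c` survives the filter. The strict `>` comparison relies on the source file being append-only: an exchange that was rewritten in place below the mark would not be picked up again.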

Testing

Verified locally: after rebuild, subsequent index runs correctly pick up new exchanges appended to previously-indexed files.

@anastasiiaanfimova

The indexer skipped any file with COUNT(*) > 0 in exchanges, so once a transcript had been indexed it never gained new rows. Resumed sessions and concurrent SessionStart syncs that raced a still-running session left the tail permanently invisible to search.

Schema already stored line_start/line_end per exchange, so the fix is a high-water-mark check: SELECT MAX(line_end), then filter parsed exchanges to those with lineStart past that mark. Transcript JSONLs are append-only, so monotonicity holds and IDs from prior runs are preserved. The archive copy is refreshed whenever maxIndexedLine > 0, since the source may have grown since the last copy.

Test: indexUnprocessed against a temp source with 2 exchanges, append 3 more, re-index, verify all 5 are present in the DB.

Credit: independent re-derivation of @anastasiiaanfimova's #85 via TDD.

Closes #84
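The test scenario above (index 2, append 3, re-index, expect 5) can be approximated without SQLite or the filesystem. The sketch below is not the repository's test: `FakeStore`, `parsePairs`, and `indexOnce` are hypothetical, and the parser simply assumes every two transcript lines form one exchange.

```typescript
interface Row { archivePath: string; lineStart: number; lineEnd: number; }

// In-memory stand-in for the exchanges table (an assumption for illustration).
class FakeStore {
  rows: Row[] = [];
  // Equivalent of SELECT COALESCE(MAX(line_end), 0) ... WHERE archive_path = ?
  maxLineEnd(archivePath: string): number {
    return this.rows
      .filter(r => r.archivePath === archivePath)
      .reduce((max, r) => Math.max(max, r.lineEnd), 0);
  }
  insert(rows: Row[]): void { this.rows.push(...rows); }
}

// Hypothetical parser: treats every two lines of the transcript as one exchange.
function parsePairs(lines: string[], archivePath: string): Row[] {
  const out: Row[] = [];
  for (let i = 0; i + 1 < lines.length; i += 2) {
    out.push({ archivePath, lineStart: i + 1, lineEnd: i + 2 });
  }
  return out;
}

// One indexing pass using the high-water mark; returns the number of rows added.
function indexOnce(store: FakeStore, lines: string[], archivePath: string): number {
  const mark = store.maxLineEnd(archivePath);
  const parsed = parsePairs(lines, archivePath);
  const fresh = mark > 0 ? parsed.filter(r => r.lineStart > mark) : parsed;
  store.insert(fresh);
  return fresh.length;
}

const store = new FakeStore();
const transcript = ['u1', 'a1', 'u2', 'a2'];                   // 2 exchanges
const firstRun = indexOnce(store, transcript, 'conv.jsonl');
transcript.push('u3', 'a3', 'u4', 'a4', 'u5', 'a5');           // append 3 more
const secondRun = indexOnce(store, transcript, 'conv.jsonl');
const thirdRun = indexOnce(store, transcript, 'conv.jsonl');   // nothing new
```

The third pass is included to show that re-indexing an unchanged file adds no duplicate rows under this scheme.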

obra · 2026-05-02T22:37:16Z

Hi @anastasiiaanfimova — thanks for diagnosing this and proposing the fix. The bug + your high-water-mark approach were correct. Re-derived in fa02569 via TDD: red test that indexes 2 exchanges, appends 3 more, re-indexes, asserts 5 in the DB. Closing in favor of the merged version, but the credit is yours.

Original reporter @jamster — your detection script and the file-distribution stats were what made this actionable. Thank you.

Closes #84.

— Claude Opus 4.7, Claude Code 2.1.119

fix: use high-water mark for incremental indexing (fixes #84)

bdab133

obra added the bug and area:indexer labels on May 2, 2026

obra closed this May 2, 2026

anastasiiaanfimova wants to merge 1 commit into obra:main from anastasiiaanfimova:fix/incremental-indexing-high-water-mark
