Indexer skips entire file when any rows exist, leaving appended turns permanently unindexed #84

@jamster

Summary

src/indexer.ts:294-298 checks whether any rows already exist for a given archive_path and skips the entire file if so. There is no comparison against mtime or MAX(line_end). Any .jsonl transcript that gets appended to after its first indexing pass — which is the normal case for resumed sessions, or any long session where a sibling Claude Code session triggered a SessionStart sync mid-way — has its tail permanently excluded from the index.

Sync still copies the updated archive correctly (mtime check works); only the indexer is at fault.

Evidence (real index, plugin v1.0.15)

| Metric | Value |
|---|---|
| Total archived conversations | 2,361 |
| Files with fresh index | 1,173 (49.7%) |
| Files with unindexed tail content | 1,188 (50.3%) |
| Total lines on disk | 394,291 |
| Total lines indexed | 386,526 |
| Total lines unindexed | 7,765 (2.0%) |

Distribution of per-file unindexed-line delta (stale files only):

| Stat | Lines |
|---|---|
| median | 5 |
| p95 | 12 |
| max | 1,308 |
| files with delta ≥ 50 | 5 |
| files with delta ≥ 100 | 3 |
| files with delta ≥ 1000 | 1 |

So the typical case is "last few turns of a session never made it in" (consistent with a SessionStart background sync racing the still-running session that produced it). The long tail is the painful case — large multi-hour sessions where most of the conversation is silently missing from semantic search.

Detection query

For anyone wanting to check their own index:

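import os
import sqlite3

con = sqlite3.connect(os.path.expanduser("~/.config/superpowers/conversation-index/db.sqlite"))
# COALESCE guards against a NULL line_end, which would break the subtraction below.
rows = con.execute("SELECT archive_path, COALESCE(MAX(line_end), 0) FROM exchanges GROUP BY archive_path").fetchall()
for path, last in rows:
    if not os.path.exists(path):
        continue  # archive moved or deleted since it was indexed
    with open(path, 'rb') as f:
        n = sum(1 for _ in f)  # lines currently on disk
    if n > last:
        print(n - last, path)  # unindexed line count, then path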

Root cause (current behavior)

// src/indexer.ts:294
const alreadyIndexed = db.prepare(
  'SELECT COUNT(*) as count FROM exchanges WHERE archive_path = ?'
).get(archivePath);
if (alreadyIndexed.count > 0) continue;

Proposed fix

The schema already stores line_start and line_end per exchange, so the data model supports incremental indexing without a migration. Replace the boolean skip with a high-water-mark check:

  1. SELECT COALESCE(MAX(line_end), 0) FROM exchanges WHERE archive_path = ?
  2. Parse the file and only index exchanges whose line_start > maxIndexedLine
  3. Embed/insert only the tail; existing rows untouched
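
The three steps above can be sketched as follows. This is a Python illustration of the logic only, run against an in-memory SQLite table; the real change would live in src/indexer.ts. The `parse_exchanges` and `index_file` helpers, the "two lines per exchange" parsing, and the reduced schema (just `archive_path`, `line_start`, `line_end`) are assumptions made for the example, not the plugin's actual code.

```python
import sqlite3

def parse_exchanges(lines):
    """Hypothetical parser: treats each pair of lines as one exchange.
    Returns (line_start, line_end) tuples, 1-based and inclusive."""
    return [(i + 1, min(i + 2, len(lines))) for i in range(0, len(lines), 2)]

def index_file(con, archive_path, lines):
    """Index only exchanges that start after the high-water mark."""
    # Step 1: highest line already indexed for this file (0 if never indexed).
    (max_indexed,) = con.execute(
        "SELECT COALESCE(MAX(line_end), 0) FROM exchanges WHERE archive_path = ?",
        (archive_path,),
    ).fetchone()
    # Step 2: keep only exchanges beyond that line.
    tail = [(s, e) for s, e in parse_exchanges(lines) if s > max_indexed]
    # Step 3: insert the tail only (embedding would happen here); old rows untouched.
    con.executemany(
        "INSERT INTO exchanges (archive_path, line_start, line_end) VALUES (?, ?, ?)",
        [(archive_path, s, e) for s, e in tail],
    )
    con.commit()
    return len(tail)

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE exchanges (archive_path TEXT, line_start INTEGER, line_end INTEGER)"
)

transcript = ["user 1", "assistant 1", "user 2", "assistant 2"]
first = index_file(con, "a.jsonl", transcript)    # initial pass indexes everything
transcript += ["user 3", "assistant 3"]           # session resumed, file appended
second = index_file(con, "a.jsonl", transcript)   # only the appended exchange
third = index_file(con, "a.jsonl", transcript)    # nothing new, nothing inserted
```

One case the sketch does not cover: an exchange that was only partly written when the first pass ran would start at or below the high-water mark and so would not be re-read. Whether that can happen depends on how the real parser delimits exchanges.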

The unused last_indexed column hints this was the original intent.

Workaround until fixed

Delete affected rows so the next sync re-indexes from scratch (re-embeds the whole file — non-trivial cost on a large archive):

DELETE FROM exchanges WHERE archive_path IN (...stale paths...);
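
A minimal sketch of that workaround, assuming the stale paths have already been collected (for example, from the detection script's output). The `delete_stale` helper and the batch size are illustrative choices; only the `exchanges` table and the `archive_path` column come from this issue. It binds the paths as parameters instead of interpolating them into the SQL, and works in batches so that a long list stays under SQLite's per-statement variable limit.

```python
import sqlite3

def delete_stale(con, stale_paths, batch_size=500):
    """Delete all index rows for the given archive paths; returns rows removed."""
    removed = 0
    for i in range(0, len(stale_paths), batch_size):
        batch = stale_paths[i:i + batch_size]
        placeholders = ",".join("?" * len(batch))  # one "?" per path in the batch
        cur = con.execute(
            f"DELETE FROM exchanges WHERE archive_path IN ({placeholders})", batch
        )
        removed += cur.rowcount
    con.commit()
    return removed

# Demo against a throwaway in-memory table, not the real index.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE exchanges (archive_path TEXT, line_start INTEGER, line_end INTEGER)"
)
con.executemany(
    "INSERT INTO exchanges VALUES (?, ?, ?)",
    [("a.jsonl", 1, 2), ("a.jsonl", 3, 4), ("b.jsonl", 1, 2)],
)
removed = delete_stale(con, ["a.jsonl"])
remaining = con.execute("SELECT COUNT(*) FROM exchanges").fetchone()[0]
```

Copying db.sqlite somewhere safe before running anything like this against the real index seems prudent. If the database keeps embeddings in a separate table, those rows may need the same cleanup; this issue does not say either way.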

Happy to open a PR if you'd like — the change is small and well-scoped.

Labels: area:indexer (Indexer/file discovery), bug (Something isn't working), priority:high (High priority - blocks users)
