codedb_index writes codedb.snapshot into the process CWD (and dumps full index shards there on data-dir fallback), polluting source trees #496

@edouard-andrei

Summary

When indexing, codedb writes index output into the process's current working directory, in addition to the indexed root and the ~/.codedb/projects/<hash>/ data dir. In the benign case this is a stray codedb.snapshot. In the wild I hit a worse variant: the full loose index (trigram.lookup, trigram.postings, word.index, pair_freq.bin) was dumped into a tracked source subdirectory, and ~55 MB showed up as untracked files in git status.

Version: codedb 0.2.5817 (latest; codedb update is a no-op).

Reproduced: stray codedb.snapshot in CWD

Drive the MCP server with its cwd set to a subdirectory of a non-temp git repo, then index the repo root:

T=~/cdbrepro; rm -rf "$T"; mkdir -p "$T/src/feat/deep"; cd "$T"; git init -q
for i in $(seq 1 60); do echo "export function fn$i(){ return $i }" > "src/feat/deep/m$i.ts"; done

cd "$T/src/feat/deep"          # cwd = a subdirectory, NOT the indexed root
{
  printf '%s\n' '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"t","version":"1"}}}'
  printf '%s\n' '{"jsonrpc":"2.0","method":"notifications/initialized"}'
  printf '%s\n' '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"codedb_index","arguments":{"path":"'"$T"'"}}}'
  sleep 6
} | codedb mcp >/dev/null 2>&1

ls "$T/src/feat/deep/codedb.snapshot"   # <-- stray snapshot in the cwd subdir (bug)
ls "$T/codedb.snapshot"                 # snapshot at indexed root (expected)

A codedb.snapshot is written into src/feat/deep/ (the cwd) even though that directory is not the indexed root.

(Note: codedb refuses /tmp roots — "refusing to index temporary root" — so the repro must live under a non-temp path like ~.)
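
To check a working tree for leaked artifacts after running the repro, a helper along these lines can be used. It is a minimal sketch: the function name find_stray_codedb_files is made up for illustration, and the file names are only the ones mentioned in this report, which may not be the full set codedb writes.

```shell
# List codedb artifacts anywhere under a repo, skipping .git and the
# snapshot at the repo root (which this report treats as expected).
# Hypothetical helper; pass the root without a trailing slash.
find_stray_codedb_files() {
  root=$1
  find "$root" -path "$root/.git" -prune -o -type f \
    \( -name codedb.snapshot -o -name trigram.lookup \
       -o -name trigram.postings -o -name word.index \
       -o -name pair_freq.bin \) -print |
    grep -v -x -F "$root/codedb.snapshot"
}
```

Running find_stray_codedb_files "$T" after the repro should list the snapshot under src/feat/deep/ if the bug is present, and print nothing otherwise.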

Observed in the wild: full index shards in CWD

A real repo ended up with trigram.lookup (1.6M), trigram.postings (34M), and word.index (19M) inside a deeply nested source subdirectory (e.g. src/features/alpha/AlphaPanel/). Evidence that this was the whole-repo index (root = repo root, not that subdir): word.index starts with the magic CDBW, and its header references sibling paths such as src/features/beta/BetaPanel/.... The central ~/.codedb/projects/<hash>/codedb.snapshot for the repo root was updated in the same second, so the snapshot persisted correctly while the loose shards leaked into the source tree.
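
The magic-byte check described above can be repeated on any suspicious word.index with a small helper. This is a sketch only: has_cdbw_magic is a hypothetical name, and the only thing it relies on is the observation in this report that the file begins with the four ASCII bytes CDBW.

```shell
# Succeeds (exit 0) if the first four bytes of the file are "CDBW".
# Hypothetical helper based on the magic observed in this report.
has_cdbw_magic() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "CDBW" ]
}
```

For example, has_cdbw_magic src/features/alpha/AlphaPanel/word.index && echo "looks like a codedb word index".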

Suspected mechanism

The binary contains the string "could not create data dir" paired with "fallback_cwd" (and "CwdNotSupported"). It looks like when the per-project data dir under ~/.codedb/projects/<hash>/ can't be created or used, codedb falls back to writing index files into the current working directory.
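
The strings mentioned above can be looked for in an installed binary with a scan like the one below. It is a sketch under assumptions: fallback_strings_in is a hypothetical name, and it uses grep -a instead of the strings utility only to avoid a binutils dependency. Finding the strings shows they exist in the binary, not that the fallback path actually ran.

```shell
# Print each of the suspected fallback-related strings found in a file.
# grep -a treats binary data as text; -o prints only the matched part.
fallback_strings_in() {
  LC_ALL=C grep -a -o -E \
    'could not create data dir|fallback_cwd|CwdNotSupported' "$1" |
    LC_ALL=C sort -u
}
```

Usage: fallback_strings_in "$(command -v codedb)".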

Impact

  • Large binary index files (tens of MB) appear as untracked files in git status, in arbitrary source folders.
  • Easy to accidentally commit; noisy; confusing to diagnose (the files reappear whenever the indexer runs from that cwd).
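
Until this is fixed, the accidental-commit risk can be reduced by listing the artifact names in the repo-local exclude file. The helper below is a stopgap sketch, not a fix: exclude_codedb_artifacts is a hypothetical name, it assumes a standard checkout where .git is a directory, and it also hides the snapshot at the indexed root.

```shell
# Append the artifact names from this report to .git/info/exclude,
# skipping any pattern that is already listed (safe to run twice).
exclude_codedb_artifacts() {
  [ -d "$1/.git" ] || { echo "not a git checkout: $1" >&2; return 1; }
  exclude_file="$1/.git/info/exclude"
  mkdir -p "$1/.git/info"
  touch "$exclude_file"
  for pattern in codedb.snapshot trigram.lookup trigram.postings \
                 word.index pair_freq.bin; do
    grep -q -x -F "$pattern" "$exclude_file" ||
      printf '%s\n' "$pattern" >> "$exclude_file"
  done
}
```

Unlike a committed .gitignore, .git/info/exclude stays local to the clone, so this does not change the repository for other contributors.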

Suggested fix

  • Never write index artifacts into the process CWD. Write the portable codedb.snapshot only to the indexed root and/or the ~/.codedb data dir.
  • If the data dir can't be created, fail loudly (or fall back to a $TMPDIR location) instead of silently writing the index into CWD.
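
The policy in the two bullets above can be sketched as follows. This is an illustration in shell, not codedb's code: resolve_index_dir and the codedb-index.XXXXXX temp-dir prefix are made-up names, and the real fix would live in codedb's own source.

```shell
# Pick a directory for index shards without ever using the CWD.
# $1 = preferred per-project data dir. On failure, fall back to a fresh
# directory under ${TMPDIR:-/tmp} and warn; if that fails too, give up.
resolve_index_dir() {
  preferred=$1
  if mkdir -p "$preferred" 2>/dev/null && [ -w "$preferred" ]; then
    printf '%s\n' "$preferred"
    return 0
  fi
  if fallback=$(mktemp -d "${TMPDIR:-/tmp}/codedb-index.XXXXXX" 2>/dev/null); then
    echo "warning: cannot use $preferred; writing index to $fallback" >&2
    printf '%s\n' "$fallback"
    return 0
  fi
  echo "error: no usable index directory; refusing to write into the CWD" >&2
  return 1
}
```

The point of the sketch is that the CWD never appears as a candidate: the caller gets either the data dir, a clearly reported temp location, or an error.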
