
codedb 0.2.5823: a few correctness/UX findings (non-ASCII outline, codedb_find false hits, kind labels, search cap, snapshot staleness) #518

@ahndohun


codedb v0.2.5823 — a few correctness/UX findings from a controlled audit

Hi, Pro user here and a big fan of codedb. In our tests, symbol recall was effectively 100% to the exact file:line, codedb_word is exhaustive and fast, and codedb_outline matched Python ast exactly on real files. While auditing, we hit a few smaller things and would like to know whether they are bugs, intended behavior, or misuse on our part. Environment: codedb 0.2.5823, macOS arm64; ground truth computed with Python ast/re.

1. codedb_outline / codedb_symbol return nothing for non-ASCII (e.g. Korean) identifiers

printf 'def \xed\x95\x9c():\n    return 1\n' > /tmp/uni.py    # "def 한():"
# index a folder containing it, then:
# codedb_outline uni.py        -> header only, 0 symbols
# codedb_symbol  한            -> "no results"
# codedb_search/word for the bytes DOES find it

Python ast parses the function fine (valid Python 3 identifier). Is non-ASCII identifier support intended for the structural layer?
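
The ground-truth side of this check can be sketched with the standard library alone. The helper name `function_names` is ours, chosen for illustration; it is not part of codedb.

```python
import ast

# Same bytes as the printf repro above: "def 한():\n    return 1\n"
SOURCE = b"def \xed\x95\x9c():\n    return 1\n".decode("utf-8")


def function_names(source: str) -> list[tuple[str, int]]:
    """Return (name, line) for every function definition that ast finds."""
    tree = ast.parse(source)
    return [
        (node.name, node.lineno)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]


print(function_names(SOURCE))   # the Korean-named function, on line 1
print("한".isidentifier())      # str.isidentifier() follows Python's identifier rules
```

This is what we mean by "ast parses the function fine": the identifier is reported with its line number, so a structural index could in principle expose it.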

2. codedb_find returns confident hits for queries that match no filename (no score floor)

codedb_find "zzznosuchfilexyz"  -> notrail.py (32.79), oracle.json (31.36), unicode.py (25.30) ...
codedb_find "Widget"            -> empty.py (28.29), crlf.txt (24.96) ...   (no file named Widget)

The fuzzy subsequence matcher never returns an empty result, so a non-match looks like a ranked result. A score floor (or an explicit "no match") would help callers distinguish a real hit from a non-match.
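
Until something like that exists, a caller-side guard is one possible workaround. The sketch below is our own assumption about what a sanity check could look like; we do not know how codedb computes its scores, and the helper names `is_subsequence` and `plausible_hits` are hypothetical.

```python
def is_subsequence(query: str, candidate: str) -> bool:
    """True if every character of query appears in candidate, in order."""
    remaining = iter(candidate.lower())
    # `ch in remaining` consumes the iterator, so order is enforced.
    return all(ch in remaining for ch in query.lower())


def plausible_hits(query: str, hits: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Keep only hits whose filename contains the query as a subsequence."""
    return [(name, score) for name, score in hits if is_subsequence(query, name)]


# Filenames and scores copied from the codedb_find "Widget" output above.
hits = [("empty.py", 28.29), ("crlf.txt", 24.96)]
print(plausible_hits("Widget", hits))
```

A strict subsequence check also rejects typo-tolerant matches, so it is stricter than a fuzzy matcher is meant to be; a score floor inside codedb would be the cleaner fix.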

3. Kind labeling: Python class shown as struct_def / uniform fn header

Running codedb_symbol Widget against a Python class Widget reports kind struct_def, and the status header labels every symbol as fn regardless of its type. Python ast reports it as a class (ast.ClassDef). This is cosmetic, but it confuses class-vs-function navigation in a Python codebase.
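
For reference, this is roughly how we derived the expected kinds. The label strings ("class", "function") and the helper `symbol_kinds` are our own naming, not codedb's.

```python
import ast

SOURCE = """\
class Widget:
    def render(self):
        return 1

def helper():
    return 2
"""

# Our own labels for the ast node types we care about.
KIND_BY_NODE = {
    ast.ClassDef: "class",
    ast.FunctionDef: "function",
    ast.AsyncFunctionDef: "function",
}


def symbol_kinds(source: str) -> list[tuple[str, str, int]]:
    """Return (name, kind, line) for each class and function, ordered by line."""
    found = []
    for node in ast.walk(ast.parse(source)):
        kind = KIND_BY_NODE.get(type(node))
        if kind is not None:
            found.append((node.name, kind, node.lineno))
    return sorted(found, key=lambda item: item[2])


for name, kind, line in symbol_kinds(SOURCE):
    print(f"{line}: {kind} {name}")
```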

4. codedb_search caps at 50 results — is codedb_word the intended exhaustive tool?

For a token occurring 180x, codedb_search returns at most 50 (default 20), while codedb_word returned all 180+ uncapped. We've adopted codedb_word for exhaustive single-identifier lookups — just confirming that's the intended pattern, and whether the codedb_search cap is configurable.
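
The ground-truth count we compared against can be sketched as below. It assumes a "whole identifier" definition of a match (no word character on either side); whether codedb_word uses the same definition is an assumption on our part, and both helper names are ours.

```python
import re


def count_token(text: str, token: str) -> int:
    """Count whole-identifier occurrences of token in text."""
    pattern = re.compile(r"(?<!\w)" + re.escape(token) + r"(?!\w)")
    return sum(1 for _ in pattern.finditer(text))


def hidden_by_cap(total: int, cap: int) -> int:
    """How many occurrences a result cap would leave out."""
    return max(0, total - cap)


sample = "foo = 1\nfoo_bar = foo + 2\nprint(foo)\n"
print(count_token(sample, "foo"))       # foo_bar is not counted
print(hidden_by_cap(total=180, cap=50))
```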

5. Snapshot staleness + /tmp refusal — intended? (please confirm the supported workflow)

  • After an out-of-band edit (appending a function) without re-indexing, codedb_find/search/outline miss the new symbol, because they serve the precomputed snapshot. We assume re-indexing is required. Is there an auto-refresh or a "stale" indicator?
  • codedb index refuses paths under /tmp ("refusing to index temporary root"). We worked around it by copying under $HOME; just confirming this is by design.
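
In the meantime, a caller could detect staleness itself. The sketch below assumes the caller already knows when the snapshot was built; we do not know where codedb stores its snapshot or its timestamp, so `snapshot_mtime` is supplied by hand and `files_newer_than` is a hypothetical helper.

```python
import os
import tempfile
from pathlib import Path


def files_newer_than(root: str, snapshot_mtime: float) -> list[Path]:
    """Return files under root modified after snapshot_mtime (epoch seconds)."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.stat().st_mtime > snapshot_mtime
    )


# Demo under $TMPDIR: one file older than the pretend snapshot, one newer.
with tempfile.TemporaryDirectory() as root:
    old, new = Path(root, "old.py"), Path(root, "new.py")
    old.write_text("def a():\n    return 1\n")
    new.write_text("def b():\n    return 2\n")
    os.utime(old, (100, 100))   # set (atime, mtime) explicitly
    os.utime(new, (300, 300))
    stale = [p.name for p in files_newer_than(root, snapshot_mtime=200)]

print(stale)
```

A non-empty list would be the signal to re-index before trusting find/search/outline results.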

6. Benchmark question: "63us per op (p50)"

We measured ~46-49ms/op cold from the CLI (each invocation reloads the ~4069-file snapshot, which alone takes ~28-34ms) and ~1-5ms/op via the warm MCP daemon; the internal "⚡" tick is ~451us. Does the 63us figure refer to the internal in-memory lookup only, excluding snapshot load and process/transport overhead? Is there a way to keep the snapshot warm across CLI calls?
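
For context on how we read "p50", here is a minimal sketch of a median-latency measurement. The dictionary stands in for an in-memory index and is purely illustrative; it is not how codedb stores symbols, and `p50_microseconds` is our own helper.

```python
import statistics
import time


def p50_microseconds(operation, runs: int = 200) -> float:
    """Median wall-clock time of operation() in microseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        operation()
        samples.append((time.perf_counter_ns() - start) / 1_000)
    return statistics.median(samples)


# Stand-in for a warm, in-memory index of roughly the size we tested.
index = {f"symbol_{i}": ("file.py", i) for i in range(4069)}

in_memory = p50_microseconds(lambda: index["symbol_2034"])
print(f"in-memory lookup p50: {in_memory:.2f} us")
```

Passing a callable that launches the CLI through `subprocess.run` instead would fold process startup and snapshot load into every sample, which is the difference we suspect between our cold numbers and the published figure.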

Thanks for codedb — genuinely useful. Happy to share our corpus generator/repros if helpful.
