codedb v0.2.5823 — a few correctness/UX findings from a controlled audit
Hi — Pro user here, big fan of codedb (symbol recall was effectively 100% to the exact file:line in our tests, codedb_word is exhaustive and fast, and codedb_outline matched Python ast exactly on real files). While auditing we hit a few smaller things and would like to know if they're bugs, intended, or our misuse. Environment: codedb 0.2.5823, macOS arm64; ground truth computed with Python ast/re.
1. codedb_outline / codedb_symbol return nothing for non-ASCII (e.g. Korean) identifiers
printf 'def \xed\x95\x9c():\n return 1\n' > /tmp/uni.py # "def 한():"
# index a folder containing it, then:
# codedb_outline uni.py -> header only, 0 symbols
# codedb_symbol 한 -> "no results"
# codedb_search/word for the bytes DOES find it
Python ast parses the function fine (valid Python 3 identifier). Is non-ASCII identifier support intended for the structural layer?
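For reference, this is roughly how we computed the ground truth for this case. It is a minimal sketch using only the standard library; the byte string mirrors the printf repro above.

```python
import ast

# Same bytes as the printf repro: "def 한():" followed by a one-line body, UTF-8 encoded.
source_bytes = b"def \xed\x95\x9c():\n return 1\n"

# ast.parse accepts bytes and decodes them as UTF-8 by default.
tree = ast.parse(source_bytes)

# Collect every function definition name found in the module.
function_names = [
    node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)
]

print(function_names)
print(all(name.isidentifier() for name in function_names))
```

The function is recovered with its Korean name intact, which is why we expected the outline to list one symbol.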
2. codedb_find returns confident hits for queries that match no filename (no score floor)
codedb_find "zzznosuchfilexyz" -> notrail.py (32.79), oracle.json (31.36), unicode.py (25.30) ...
codedb_find "Widget" -> empty.py (28.29), crlf.txt (24.96) ... (no file named Widget)
The fuzzy subsequence matcher never returns empty, so a non-match looks like a ranked result. A score floor (or an explicit "no match") would help callers distinguish.
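As a stopgap on our side, we could post-filter the hits. The sketch below is a caller-side guard, not a description of how codedb scores: `is_subsequence`, `filter_hits`, and `min_score` are hypothetical names of ours, and the sample hits are the ones quoted above.

```python
def is_subsequence(query: str, candidate: str) -> bool:
    """True if every character of query appears in candidate, in order (case-insensitive)."""
    remaining = iter(candidate.lower())
    # `ch in remaining` advances the iterator, so character order is enforced.
    return all(ch in remaining for ch in query.lower())


def filter_hits(query, scored_hits, min_score=0.0):
    """Keep (filename, score) pairs that clear min_score and contain the query as a subsequence."""
    return [
        (name, score)
        for name, score in scored_hits
        if score >= min_score and is_subsequence(query, name)
    ]


# Hits reported for a query that matches no filename.
hits = [("notrail.py", 32.79), ("oracle.json", 31.36), ("unicode.py", 25.30)]
print(filter_hits("zzznosuchfilexyz", hits))
print(filter_hits("uni", hits))
```

With this guard the nonsense query yields an empty list, which a caller can treat as an explicit "no match".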
3. Kind labeling: Python class shown as struct_def / uniform fn header
For a Python class Widget, codedb_symbol Widget reports kind struct_def, and the status header labels every symbol fn regardless of its type. ast says class. Cosmetic, but it confuses class-vs-function navigation in a Python codebase.
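The kind comparison came from a check along these lines. It is a sketch only: the sample source and the `KIND_BY_NODE` labels are ours, chosen for illustration, and are not codedb's kind vocabulary.

```python
import ast

source = "class Widget:\n    pass\n\n\ndef build():\n    return Widget()\n"

# Map ast node types to the kind label we would expect for Python code.
KIND_BY_NODE = {
    ast.ClassDef: "class",
    ast.FunctionDef: "function",
    ast.AsyncFunctionDef: "async function",
}

# Record the kind of every class or function definition in the module.
kinds = {
    node.name: KIND_BY_NODE[type(node)]
    for node in ast.walk(ast.parse(source))
    if type(node) in KIND_BY_NODE
}

print(kinds)
```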
4. codedb_search caps at 50 results — is codedb_word the intended exhaustive tool?
For a token occurring 180x, codedb_search returns at most 50 (default 20), while codedb_word returned all 180+ uncapped. We've adopted codedb_word for exhaustive single-identifier lookups — just confirming that's the intended pattern, and whether the codedb_search cap is configurable.
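Our occurrence counts came from a regex pass similar to the sketch below. It assumes a word-boundary match approximates what codedb_word treats as a word, which we have not confirmed; `count_identifier` is our own helper.

```python
import re


def count_identifier(text: str, identifier: str) -> int:
    """Count whole-word occurrences of identifier in text."""
    # re.escape guards against identifiers containing regex metacharacters.
    pattern = re.compile(r"\b" + re.escape(identifier) + r"\b")
    return len(pattern.findall(text))


sample = "foo = foo + foobar\nfoo()\n"
print(count_identifier(sample, "foo"))
```

Note that `foobar` is not counted, so a substring-based search would report a higher number than this.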
5. Snapshot staleness + /tmp refusal — intended? (please confirm the supported workflow)
- After an out-of-band edit (appending a function) without re-indexing, codedb_find/search/outline miss the new symbol (they serve the precomputed snapshot). We assume re-indexing is required. Is there an auto-refresh or a "stale" indicator?
- codedb index refuses paths under /tmp ("refusing to index temporary root"). We worked around it by copying under $HOME; just confirming this is by design.
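Until we know whether a stale indicator exists, we are considering a check like the one below before trusting results. It is a sketch under our own assumptions: `stale_files` is a hypothetical helper, `indexed_at` is a timestamp we would record ourselves when re-indexing, and nothing here reads codedb's snapshot.

```python
import os
import tempfile
from pathlib import Path


def stale_files(root, indexed_at):
    """Return files under root whose mtime is later than indexed_at (seconds since epoch)."""
    return sorted(
        str(path)
        for path in Path(root).rglob("*")
        if path.is_file() and path.stat().st_mtime > indexed_at
    )


# Small demonstration with explicit modification times.
with tempfile.TemporaryDirectory() as demo_root:
    untouched = Path(demo_root, "untouched.py")
    edited = Path(demo_root, "edited.py")
    untouched.write_text("x = 1\n")
    edited.write_text("def added():\n    return 1\n")
    os.utime(untouched, (1000, 1000))  # modified before the index time
    os.utime(edited, (3000, 3000))     # modified after the index time
    demo_result = [Path(p).name for p in stale_files(demo_root, indexed_at=2000)]

print(demo_result)
```

This only catches files modified or added after indexing; deleted files would need a separate comparison against the indexed file list.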
6. Benchmark question: "63us per op (p50)"
We measured ~46-49ms/op cold from the CLI (each invocation reloads the ~4069-file snapshot, ~28-34ms) and ~1-5ms/op via the warm MCP daemon; the internal "⚡" tick is ~451us. Does the 63us figure refer to the internal in-memory lookup (excluding snapshot load + process/transport)? Is there a way to keep the snapshot warm across CLI calls?
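For context on how the cold numbers were taken, the harness was along these lines. It is a sketch: `time_command` and `p50` are our own helpers, and the command shown is a placeholder that only measures bare process startup, to be replaced with the actual CLI invocation under test.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=20):
    """Return wall-clock times in milliseconds, spawning a fresh process for each run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False
        )
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def p50(samples):
    """Median of the samples."""
    return statistics.median(samples)


# Placeholder command: a no-op Python process, used as a process-startup baseline.
baseline = time_command([sys.executable, "-c", "pass"], runs=5)
print(f"p50 = {p50(baseline):.2f} ms over {len(baseline)} runs")
```

Because every run pays process startup and snapshot load, numbers from this kind of harness are not comparable to an in-memory per-lookup figure.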
Thanks for codedb — genuinely useful. Happy to share our corpus generator/repros if helpful.