Problem
codedb detects C and C++ file types, but currently does not parse them into outlines/symbols.
Current code path:
- src/explore.zig maps .c / .h to Language.c.
- .cpp / .hpp map to Language.cpp.
- Block-comment handling and computeSymbolEnds already treat C/C++ as brace languages.
- parseOutlineWithParser never dispatches .c or .cpp to a parser, so these files produce zero symbols.
This is visible in the current wiki.codes production corpus.
Corpus evidence
From /var/lib/codedb-cloud/storage after the 2026-04-25 common-repo expansion:
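| language | files | files_with_symbols | symbols |
| --- | --- | --- | --- |
| unknown | 594170 | 0 | 0 |
| c | 177123 | 0 | 0 |
| cpp | 50596 | 0 | 0 |
| javascript | 144310 | 104161 | 848776 |
| typescript | 110948 | 93312 | 1224590 |
| rust | 86190 | 84087 | 1270028 |
| python | 42282 | 37762 | 853342 |
| go_lang | 41522 | 39726 | 769793 |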
Top impacted C/C++-heavy indexed repos:
| slug | c_like_files | c_header_files | symbols |
| --- | --- | --- | --- |
| chromium | 70703 | 34035 | 0 |
| llvm-llvm-project | 70616 | 30046 | 0 |
| torvalds-linux | 63179 | 63174 | 0 |
| ziglang-zig | 14010 | 13791 | 0 |
| nodejs-node | 13413 | 9530 | 0 |
| godotengine-godot | 8138 | 5106 | 0 |
| envoyproxy-envoy | 6749 | 2810 | 0 |
| duckdb-duckdb | 5076 | 1008 | 0 |
| pytorch-pytorch | 4816 | 2467 | 0 |
| postgres-postgres | 2564 | 2559 | 0 |
| openssl-openssl | 2151 | 2151 | 0 |
| curl-curl | 965 | 963 | 0 |
| git-git | 961 | 961 | 0 |
| redis-redis | 757 | 749 | 0 |
| nginx-nginx | 396 | 395 | 0 |
| sqlite-sqlite | 355 | 355 | 0 |
Proposed scope for the next codedb RC
Add a native, zero-dependency C outline parser in src/explore.zig.
Minimum useful extraction:
- #include <...> and #include "..." as imports
- #define NAME and #define NAME(...) as macro symbols
- function definitions for common forms:
  - int main(void) {
  - static inline int foo(...) {
  - void *foo(...) {
  - EXPORT_SYMBOL(foo) should not become a fake function
- function declarations/prototypes if safe, or skip prototypes initially to avoid noise
- struct name {, enum name {, union name {
- typedef struct name name_t;, typedef enum ... name_t;, simple typedef ... name_t;
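To make the non-function part of that list concrete, here is a minimal sketch of a per-line classifier for includes, macros, tagged types, and simple typedefs. It is written in Python for brevity and is not the Zig code that would land in src/explore.zig. The function name `classify_line` and the kind strings ("import", "macro", "typedef") are assumptions for illustration; codedb's real symbol kinds may be named differently.

```python
import re

# Hypothetical patterns, one per construct in the extraction list.
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*([<"])([^>"]+)[>"]')
DEFINE_RE = re.compile(r'^\s*#\s*define\s+([A-Za-z_]\w*)')
# "struct name {", "enum name {", "union name {" (optionally after "typedef").
TAGGED_RE = re.compile(r'^\s*(?:typedef\s+)?(struct|enum|union)\s+([A-Za-z_]\w*)\s*\{')
# "typedef ... name_t;" where the alias is the last identifier before ";".
# Parentheses are excluded so function-pointer typedefs are skipped, not guessed.
TYPEDEF_RE = re.compile(r'^\s*typedef\b[^;(]*?\b([A-Za-z_]\w*)\s*;')


def classify_line(line):
    """Return (kind, name) for one source line, or None if nothing is recognized."""
    m = INCLUDE_RE.match(line)
    if m:
        return ("import", m.group(2))
    m = DEFINE_RE.match(line)
    if m:
        return ("macro", m.group(1))
    m = TAGGED_RE.match(line)
    if m:
        return (m.group(1), m.group(2))
    m = TYPEDEF_RE.match(line)
    if m:
        return ("typedef", m.group(1))
    return None


if __name__ == "__main__":
    for text in ['#include <stdio.h>', '#define MAX(a, b) ((a) > (b) ? (a) : (b))',
                 'struct name {', 'typedef struct name name_t;']:
        print(text, "->", classify_line(text))
```

The sketch deliberately returns None for anything it cannot classify with confidence, such as `typedef int (*cb_t)(int);`, which matches the "skip rather than emit noise" stance above.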
Things to avoid:
- control flow keywords becoming symbols: if, for, while, switch, return, sizeof
- function pointer declarations becoming fake function definitions unless confidently parsed
- macro invocations becoming functions
- comments/strings leaking fake symbols
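One way to satisfy most of these constraints is to accept a function definition only when a line has a return-type prefix, a name, a parameter list with no nested parentheses, and an opening brace, and then reject keyword names. The Python sketch below illustrates that heuristic; `find_function`, `FUNC_RE`, and the exact keyword set are assumptions for illustration, not existing codedb code. It only handles the same-line brace forms listed earlier, so a definition with `{` on the following line is not detected.

```python
import re

# Names that must never be reported as functions (list taken from the issue,
# plus "else" and "do" as an assumption).
KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof", "else", "do"}

FUNC_RE = re.compile(
    r'^\s*(?P<prefix>[A-Za-z_][\w\s\*]*?)'   # return type and qualifiers (required)
    r'\b(?P<name>[A-Za-z_]\w*)\s*'           # candidate function name
    r'\((?P<params>[^()]*)\)\s*\{\s*$'       # flat parameter list, then "{"
)


def find_function(line):
    """Return the function name if the line looks like a definition, else None."""
    m = FUNC_RE.match(line)
    if not m:
        return None
    name = m.group("name")
    prefix_words = re.findall(r'[A-Za-z_]\w*', m.group("prefix"))
    # Reject "else if (x) {" and similar control-flow lines.
    if name in KEYWORDS or any(word in KEYWORDS for word in prefix_words):
        return None
    return name


if __name__ == "__main__":
    for text in ["int main(void) {", "void *foo(void) {",
                 "EXPORT_SYMBOL(foo);", "    if (x > 0) {", "void (*cb)(int);"]:
        print(repr(text), "->", find_function(text))
```

Requiring a prefix is what keeps a bare macro invocation such as `EXPORT_SYMBOL(foo)` from matching, and forbidding nested parentheses is what keeps function pointer declarations out. A macro invocation that follows a qualifier and ends in `{` would still slip through, so this is a starting point rather than a complete filter.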
Acceptance criteria
- codedb outline on a small C fixture emits functions, macros, includes, and struct/enum/typedef symbols.
- codedb find main / codedb find ngx_http_* works for C functions after indexing.
- Existing supported language tests still pass.
- Add regression tests for:
  - basic C function definitions
  - static/inline/pointer-return functions
  - #include parsing
  - #define parsing
  - struct/enum/union and typedef parsing
  - comments and strings containing fake C-looking code
  - function pointer declarations and macro invocations not being misclassified
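For the "comments and strings containing fake C-looking code" case, one possible approach is to blank out comment bodies and literal contents before any symbol matching runs. The Python sketch below shows the idea with a small state machine; `mask_comments_and_strings` is a hypothetical helper, not an existing codedb function, and it ignores corner cases such as backslash-continued line comments.

```python
def mask_comments_and_strings(src):
    """Replace comment text and string/char literal contents with spaces.

    Output length and newline positions match the input, so line numbers
    computed on the masked text still apply to the original.
    """
    out = []
    i, n = 0, len(src)
    state = "code"  # code | line_comment | block_comment | string | char
    while i < n:
        c = src[i]
        nxt = src[i + 1] if i + 1 < n else ""
        if state == "code":
            if c == "/" and nxt in ("/", "*"):
                state = "line_comment" if nxt == "/" else "block_comment"
                out.append("  ")
                i += 2
                continue
            if c == '"':
                state = "string"
            elif c == "'":
                state = "char"
            out.append(c)
        elif state == "line_comment":
            if c == "\n":
                state = "code"
            out.append(c if c == "\n" else " ")
        elif state == "block_comment":
            if c == "*" and nxt == "/":
                state = "code"
                out.append("  ")
                i += 2
                continue
            out.append(c if c == "\n" else " ")
        else:  # inside a string or char literal
            quote = '"' if state == "string" else "'"
            if c == "\\" and nxt:
                # Skip the escaped character, keeping an escaped newline intact.
                out.append(" " + ("\n" if nxt == "\n" else " "))
                i += 2
                continue
            if c == quote or c == "\n":
                state = "code"  # a bare newline means an unterminated literal
                out.append(c)
            else:
                out.append(" ")
        i += 1
    return "".join(out)


if __name__ == "__main__":
    sample = 'int x; /* int fake(void) { */\nchar *s = "void g(void) {";\n'
    print(mask_comments_and_strings(sample))
```

Because masking also blanks the path inside `#include "..."`, include extraction would need to run on the raw line, with the masked text used only for function and type detection.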
Other language coverage gaps seen in the same corpus
High-volume gaps worth tracking after C:
| extension | files | note |
| --- | --- | --- |
| .cc | 38893 | C++; currently unknown because detection only covers .cpp/.hpp |
| .mm | 7542 | Objective-C++ / C++ family |
| .java | 12478 | Java |
| .kt | 1567 | Kotlin |
| .svelte | 5013 | Svelte component files |
| .vue | 1146 | Vue component files |
| .astro | 1571 | Astro component files |
| .sh | 5394 | shell scripts |
| .css | 5136 | CSS |
| .scss | 3687 | SCSS |
| .sql | 2495 | SQL |
| .proto | 2091 | protobuf |
| .f90 | 3673 | Fortran |
| .ll | 44852 | LLVM IR, mostly from LLVM/Rust corpora |
| .mlir | 2872 | MLIR |
| .td | 1732 | LLVM TableGen |
Suggested priority after native C:
- C++ extension detection/parsing: .cc, .cxx, .hh, .hxx, .hpp, .cpp.
- Java/Kotlin for backend/mobile repos.
- Vue/Svelte/Astro for frontend repos.
- Shell and SQL for operational/security scanning.
- LLVM-specific formats only if we want deep compiler-repo coverage.
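The first item is mostly a lookup-table change. As a rough Python illustration (the real mapping lives in src/explore.zig and is not reproduced here), the extended detection could look like the following; `EXTENSION_LANGUAGE` and `detect_language` are hypothetical names, and only the .c/.h and .cpp/.hpp rows reflect behaviour the issue confirms today.

```python
import os

# Assumed extension table. The last four entries are the proposed additions.
EXTENSION_LANGUAGE = {
    ".c": "c",
    ".h": "c",
    ".cpp": "cpp",
    ".hpp": "cpp",
    ".cc": "cpp",
    ".cxx": "cpp",
    ".hh": "cpp",
    ".hxx": "cpp",
}


def detect_language(path):
    """Map a file path to a language name, or "unknown" if the extension is not listed."""
    ext = os.path.splitext(path)[1]
    return EXTENSION_LANGUAGE.get(ext, "unknown")


if __name__ == "__main__":
    for path in ["src/base/files/file.cc", "include/foo.hxx", "main.c", "README.md"]:
        print(path, "->", detect_language(path))
```

The lookup is case-sensitive on purpose in this sketch, since an uppercase .C extension conventionally means C++ on case-sensitive filesystems and would need its own decision.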
Related: #318 covers a broader tree-sitter spike. This issue is narrower: add immediately useful native C coverage without waiting for a general parser framework.