Add native C parser for outlines and symbols #319

@justrach

Problem

codedb detects C and C++ file types, but currently does not parse them into outlines/symbols.

Current code path:

  • src/explore.zig maps .c / .h to Language.c.
  • .cpp / .hpp map to Language.cpp.
  • Block-comment handling and computeSymbolEnds already treat C/C++ as brace languages.
  • parseOutlineWithParser never dispatches .c or .cpp to a parser, so these files produce zero symbols.

This is visible in the current wiki.codes production corpus.

Corpus evidence

From /var/lib/codedb-cloud/storage after the 2026-04-25 common-repo expansion:

language      files    files_with_symbols  symbols
unknown       594170   0                   0
c             177123   0                   0
cpp            50596   0                   0
javascript    144310   104161              848776
typescript    110948    93312              1224590
rust           86190    84087              1270028
python         42282    37762              853342
go_lang        41522    39726              769793

Top impacted C/C++-heavy indexed repos:

slug                   c_like_files  c_header_files  symbols
chromium               70703         34035           0
llvm-llvm-project      70616         30046           0
torvalds-linux         63179         63174           0
ziglang-zig            14010         13791           0
nodejs-node            13413          9530           0
godotengine-godot       8138          5106           0
envoyproxy-envoy        6749          2810           0
duckdb-duckdb           5076          1008           0
pytorch-pytorch         4816          2467           0
postgres-postgres       2564          2559           0
openssl-openssl         2151          2151           0
curl-curl                965           963           0
git-git                  961           961           0
redis-redis              757           749           0
nginx-nginx              396           395           0
sqlite-sqlite            355           355           0

Proposed scope for the next codedb RC

Add a native, zero-dependency C outline parser in src/explore.zig.

Minimum useful extraction:

  • #include <...> and #include "..." as imports
  • #define NAME and #define NAME(...) as macro symbols
  • function definitions for common forms:
    • int main(void) {
    • static inline int foo(...) {
    • void *foo(...) {
    • negative case: EXPORT_SYMBOL(foo) is a macro invocation and must not become a fake function
  • function declarations/prototypes only if they can be told apart from definitions safely; otherwise skip prototypes initially to avoid noise
  • struct name {, enum name {, union name {
  • typedef struct name name_t;, typedef enum ... name_t;, simple typedef ... name_t;
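As a rough illustration of the extraction list above, here is a minimal line-oriented sketch in Python. It is not the proposed Zig implementation: the symbol kind names ("import", "macro", "function", "typedef"), the regular expressions, and `classify_line` are all assumptions made for this example, and the function only recognizes single-line forms such as those listed.

```python
import re

# Hypothetical patterns for illustration; the real parser in src/explore.zig may differ.
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]')
DEFINE_RE = re.compile(r'^\s*#\s*define\s+([A-Za-z_]\w*)')
TAGGED_RE = re.compile(r'^\s*(?:typedef\s+)?(struct|enum|union)\s+([A-Za-z_]\w*)\s*\{')
TYPEDEF_RE = re.compile(r'^\s*typedef\b.*?([A-Za-z_]\w*)\s*;\s*$')
# One or more type/qualifier tokens, a name, a parameter list, then an opening brace.
FUNC_RE = re.compile(
    r'^\s*(?:[A-Za-z_]\w*[\s\*]+)+([A-Za-z_]\w*)\s*\([^;{]*\)\s*\{\s*$'
)

# Control-flow keywords that must never be reported as function names.
KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}


def classify_line(line):
    """Return (kind, name) for a recognized single-line form, else None."""
    m = INCLUDE_RE.match(line)
    if m:
        return ("import", m.group(1))
    m = DEFINE_RE.match(line)
    if m:
        return ("macro", m.group(1))
    m = TAGGED_RE.match(line)
    if m:
        return (m.group(1), m.group(2))
    m = TYPEDEF_RE.match(line)
    if m:
        return ("typedef", m.group(1))
    m = FUNC_RE.match(line)
    if m and m.group(1) not in KEYWORDS:
        return ("function", m.group(1))
    return None
```

Because the function pattern requires at least one type token before the name and a trailing `{`, a bare `EXPORT_SYMBOL(foo)` and a prototype ending in `;` both fall through to `None`. Multi-line signatures and `} name_t;` typedef tails are deliberately out of scope for this sketch.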

Things to avoid:

  • control flow keywords becoming symbols: if, for, while, switch, return, sizeof
  • function pointer declarations becoming fake function definitions unless confidently parsed
  • macro invocations becoming functions
  • comments/strings leaking fake symbols
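One way to keep comments and strings from leaking fake symbols is to blank them out before any pattern matching runs. The sketch below is a minimal Python illustration of that idea, not the codedb implementation; `mask_comments_and_strings` is a hypothetical helper. It replaces comment text and literal contents with spaces while preserving newlines, so line numbers and column offsets stay stable.

```python
def mask_comments_and_strings(src):
    """Blank out C comments and string/char literal contents, keeping length and newlines."""
    out = []
    i, n = 0, len(src)
    while i < n:
        ch = src[i]
        nxt = src[i + 1] if i + 1 < n else ""
        if ch == "/" and nxt == "/":
            # Line comment: blank everything up to (not including) the newline.
            while i < n and src[i] != "\n":
                out.append(" ")
                i += 1
        elif ch == "/" and nxt == "*":
            # Block comment: blank through the closing */ (or end of input).
            end = src.find("*/", i + 2)
            end = n if end == -1 else end + 2
            out.extend("\n" if c == "\n" else " " for c in src[i:end])
            i = end
        elif ch in ('"', "'"):
            # String or char literal: keep the quotes, blank the contents.
            quote = ch
            out.append(quote)
            i += 1
            while i < n and src[i] != quote and src[i] != "\n":
                step = 2 if src[i] == "\\" and i + 1 < n else 1
                out.extend("\n" if c == "\n" else " " for c in src[i:i + step])
                i += step
            if i < n and src[i] == quote:
                out.append(quote)
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

A caveat with this approach: the path in `#include "..."` looks like a string literal, so it would be blanked too. A parser built this way would need to read include lines from the raw text before masking. Line comments continued with a trailing backslash are also not handled here.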

Acceptance criteria

  • codedb outline on a small C fixture emits functions, macros, includes, and struct/enum/typedef symbols.
  • codedb find main and codedb find ngx_http_* work for C functions after indexing.
  • Existing supported language tests still pass.
  • Add regression tests for:
    • basic C function definitions
    • static/inline/pointer-return functions
    • #include parsing
    • #define parsing
    • struct/enum/union and typedef parsing
    • comments and strings containing fake C-looking code
    • function pointer declarations and macro invocations not being misclassified
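The regression cases above could share one small fixture. The following Python sketch shows one possible shape: a C fixture string, the symbols a parser is expected to emit, and names that must never appear. The fixture contents, the `(kind, name)` tuple format, and `check_outline` are assumptions for illustration; the actual output format of `codedb outline` is not specified in this issue.

```python
# Hypothetical C fixture; it only needs to parse, not compile (EXPORT_SYMBOL is undefined).
C_FIXTURE = r'''
#include <stdio.h>
#include "util.h"

#define MAX_ITEMS 16
#define SQUARE(x) ((x) * (x))

/* int fake_in_comment(void) { return 0; } */
static const char *banner = "int fake_in_string(void) {";

struct point {
    int x;
    int y;
};

enum color {
    RED,
    GREEN
};

typedef struct point point_t;

int (*callback)(int);

static inline int add(int a, int b) {
    return a + b;
}

void *lookup(const char *key) {
    if (key == NULL) {
        return NULL;
    }
    return (void *)key;
}
EXPORT_SYMBOL(lookup)

int main(void) {
    return add(1, 2);
}
'''

# Symbols the outline should contain, as assumed (kind, name) pairs.
EXPECTED = {
    ("import", "stdio.h"), ("import", "util.h"),
    ("macro", "MAX_ITEMS"), ("macro", "SQUARE"),
    ("struct", "point"), ("enum", "color"), ("typedef", "point_t"),
    ("function", "add"), ("function", "lookup"), ("function", "main"),
}

# Names that indicate a misclassification if they show up as symbols.
FORBIDDEN_NAMES = {
    "fake_in_comment", "fake_in_string", "callback", "EXPORT_SYMBOL", "if", "return",
}


def check_outline(actual):
    """Compare parser output against the fixture expectations; return a list of problems."""
    actual = set(actual)
    problems = [f"missing {kind} {name}" for kind, name in sorted(EXPECTED - actual)]
    problems += [
        f"unexpected {kind} {name}"
        for kind, name in sorted(actual)
        if name in FORBIDDEN_NAMES
    ]
    return problems
```

An empty list from `check_outline` means every expected symbol was found and none of the decoys leaked through.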

Other language coverage gaps seen in the same corpus

High-volume gaps worth tracking after C:

extension  files   note
.cc        38893   C++; currently unknown because detection only covers .cpp/.hpp
.mm         7542   Objective-C++ / C++ family
.java      12478   Java
.kt         1567   Kotlin
.svelte     5013   Svelte component files
.vue        1146   Vue component files
.astro      1571   Astro component files
.sh         5394   shell scripts
.css        5136   CSS
.scss       3687   SCSS
.sql        2495   SQL
.proto      2091   protobuf
.f90        3673   Fortran
.ll        44852   LLVM IR, mostly from LLVM/Rust corpora
.mlir       2872   MLIR
.td         1732   LLVM TableGen

Suggested priority after native C:

  1. C++ extension detection/parsing: add detection for .cc, .cxx, .hh, .hxx, and add parsing for those plus the already-detected .cpp and .hpp.
  2. Java/Kotlin for backend/mobile repos.
  3. Vue/Svelte/Astro for frontend repos.
  4. Shell and SQL for operational/security scanning.
  5. LLVM-specific formats only if we want deep compiler-repo coverage.
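For the first priority item, widening detection is mostly a lookup-table change. The sketch below shows the idea in Python; the table and `detect_language` are hypothetical stand-ins for the extension mapping in src/explore.zig, and the language names simply mirror the labels in the corpus table above.

```python
import os

# Hypothetical extension table: the first four entries mirror current detection,
# the last four are the proposed C++ additions.
EXTENSION_LANGUAGE = {
    ".c": "c",
    ".h": "c",
    ".cpp": "cpp",
    ".hpp": "cpp",
    ".cc": "cpp",
    ".cxx": "cpp",
    ".hh": "cpp",
    ".hxx": "cpp",
}


def detect_language(path):
    """Map a file path to a language label by extension, or "unknown"."""
    _, ext = os.path.splitext(path)
    # No case folding: an upper-case .C is conventionally C++ on some toolchains,
    # so lowercasing would silently misclassify it as C.
    return EXTENSION_LANGUAGE.get(ext, "unknown")
```

With this table, `detect_language("base/files/file.cc")` returns `"cpp"`, while `.mm` files still come back as `"unknown"` until Objective-C++ is handled separately.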

Related: #318 covers a broader tree-sitter spike. This issue is narrower: add immediately useful native C coverage without waiting for a general parser framework.
