LLM comprehension benchmark comparing GCF, TOON, and JSON at 500 symbols.
Generates a 500-symbol, 200-edge code graph payload, encodes it in all three formats using the official libraries, sends each to an LLM with zero format instructions, and measures accuracy on 13 structured extraction questions.
| # | Category | Question |
|---|---|---|
| 1 | Counting | How many symbols? |
| 2 | Counting | How many edges? |
| 3 | Counting | How many targets (distance 0)? |
| 4 | Counting | How many related (distance 1)? |
| 5 | Counting | How many extended (distance 2)? |
| 6 | Counting | How many functions? |
| 7 | Counting | How many 'calls' edges? |
| 8 | Extraction | Highest-scored symbol name? |
| 9 | Extraction | Kind of highest-scored symbol? |
| 10 | Extraction | Kind of last symbol? |
| 11 | Extraction | All unique edge types? |
| 12 | Structure | Does it have an edges section? |
| 13 | Structure | What is the tool name? |
All answers are deterministic (computed from the payload). No LLM judge.
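As a rough illustration of what "computed from the payload" can look like, the sketch below derives several of the expected answers directly from an in-memory graph. The `Symbol`, `Edge`, and `Truth` types and the `computeTruth` function are hypothetical names chosen for this example; the eval module's real types may differ.

```go
package main

import "fmt"

// Symbol and Edge are hypothetical payload shapes used only for illustration.
type Symbol struct {
	Name     string
	Kind     string // e.g. "function", "type"
	Distance int    // 0 = target, 1 = related, 2 = extended
	Score    float64
}

type Edge struct {
	From string
	To   string
	Type string // e.g. "calls"
}

// Truth holds a subset of the expected answers for the question table.
type Truth struct {
	Symbols    int
	Edges      int
	ByDistance map[int]int
	Functions  int
	CallsEdges int
	TopName    string
	TopKind    string
	LastKind   string
}

// computeTruth derives the expected answers from the payload alone,
// so scoring needs no LLM judge.
func computeTruth(syms []Symbol, edges []Edge) Truth {
	t := Truth{Symbols: len(syms), Edges: len(edges), ByDistance: map[int]int{}}
	best := -1
	for i, s := range syms {
		t.ByDistance[s.Distance]++
		if s.Kind == "function" {
			t.Functions++
		}
		// Keep the first symbol with the strictly highest score.
		if best < 0 || s.Score > syms[best].Score {
			best = i
		}
	}
	if best >= 0 {
		t.TopName, t.TopKind = syms[best].Name, syms[best].Kind
		t.LastKind = syms[len(syms)-1].Kind
	}
	for _, e := range edges {
		if e.Type == "calls" {
			t.CallsEdges++
		}
	}
	return t
}

func main() {
	syms := []Symbol{
		{"Parse", "function", 0, 0.9},
		{"Config", "type", 1, 0.4},
		{"Load", "function", 2, 0.7},
	}
	edges := []Edge{{"Parse", "Load", "calls"}, {"Load", "Config", "uses"}}
	fmt.Printf("%+v\n", computeTruth(syms, edges))
}
```

Because every expected value is a plain count or lookup, a model's answer can be compared against it with ordinary equality checks.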
23 comprehension runs across 10 models and 3 providers. GCF wins 22, ties 1, loses 0.
| Model | Runs | GCF | TOON | JSON |
|---|---|---|---|---|
| Claude Opus 4.6 | 2 | 96.2% | 84.6% | 73.1% |
| Claude Sonnet 4.6 | 2 | 100% | 73.1% | 53.8% |
| Claude Haiku 4.5 | 2 | 96.2% | 69.2% | 57.7% |
| GPT-5.5 | 5 | 84.1% | 67.7% | 45.8% |
| GPT-5.4 | 4 | 76.4% | 56.0% | 44.1% |
| GPT-5.4-mini | 2 | 71.8% | 64.1% | 54.2% |
| Gemini 2.5 Pro | 1 | 100% | 76.9% | 58.3% |
| Gemini 3.1 Pro | 1 | 100% | 76.9% | 46.2% |
| Gemini 3.5 Flash | 1 | 100% | 61.5% | 46.2% |
| Gemini 2.5 Flash | 3 | 80.6% | 54.6% | 57.0% |
GCF wins on every model. TOON beats JSON on every model except Gemini 2.5 Flash, where JSON edges ahead (57.0% vs 54.6%).
Four models achieve 100% with GCF: Claude Sonnet 4.6 (Anthropic) and three Google models (Gemini 2.5 Pro, Gemini 3.1 Pro, and Gemini 3.5 Flash). All raw logs are in gcf/eval/results.
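The per-model ordering can be checked mechanically from the results table. The sketch below copies the table's percentages into a slice and lists any model where the strict ordering GCF > TOON > JSON does not hold. The `Row` type, the `results` variable, and `orderingExceptions` are hypothetical names for this example and are not part of the eval module.

```go
package main

import "fmt"

// Row is one line of the results table (accuracy in percent).
type Row struct {
	Model string
	GCF   float64
	TOON  float64
	JSON  float64
}

// results mirrors the table above; values are copied, not recomputed.
var results = []Row{
	{"Claude Opus 4.6", 96.2, 84.6, 73.1},
	{"Claude Sonnet 4.6", 100, 73.1, 53.8},
	{"Claude Haiku 4.5", 96.2, 69.2, 57.7},
	{"GPT-5.5", 84.1, 67.7, 45.8},
	{"GPT-5.4", 76.4, 56.0, 44.1},
	{"GPT-5.4-mini", 71.8, 64.1, 54.2},
	{"Gemini 2.5 Pro", 100, 76.9, 58.3},
	{"Gemini 3.1 Pro", 100, 76.9, 46.2},
	{"Gemini 3.5 Flash", 100, 61.5, 46.2},
	{"Gemini 2.5 Flash", 80.6, 54.6, 57.0},
}

// orderingExceptions returns the models where GCF > TOON > JSON does not hold.
func orderingExceptions(rows []Row) []string {
	var out []string
	for _, r := range rows {
		if !(r.GCF > r.TOON && r.TOON > r.JSON) {
			out = append(out, r.Model)
		}
	}
	return out
}

func main() {
	fmt.Println(orderingExceptions(results))
}
```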
# Claude CLI (default)
GOWORK=off go test -run TestComprehension -v -timeout 0
# Anthropic API
EVAL_BACKEND=api ANTHROPIC_API_KEY=sk-... GOWORK=off go test -run TestComprehension -v -timeout 0
# OpenAI (GPT-4o)
EVAL_BACKEND=openai OPENAI_API_KEY=sk-... EVAL_MODEL=gpt-4o GOWORK=off go test -run TestComprehension -v -timeout 0
# Google (Gemini)
EVAL_BACKEND=google GOOGLE_API_KEY=... EVAL_MODEL=gemini-2.0-flash GOWORK=off go test -run TestComprehension -v -timeout 0
# xAI (Grok)
EVAL_BACKEND=xai XAI_API_KEY=... EVAL_MODEL=grok-3 GOWORK=off go test -run TestComprehension -v -timeout 0

The eval is a separate Go module (eval/go.mod) to avoid polluting the root gcf-go library with test-only dependencies:
- github.com/blackwell-systems/gcf-go: GCF encoding
- github.com/toon-format/toon-go: TOON encoding (official library)

Consumers of gcf-go never pull toon-go transitively.

Three properties of the format account for GCF's lead:

- Distance grouping: `## targets`, `## related`, and `## extended` headers make group counting trivial. TOON has no grouping; the model must scan all 500 rows and filter by a column.
- Edge count in header: `## edges [200]` gives the count directly. JSON and TOON require the model to count manually.
- No noise: every token is content. JSON wastes 2,500+ tokens on repeated field names that dilute attention.

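To get a feel for the repeated-field-name cost in JSON, the sketch below marshals a slice of records with Go's `encoding/json` and counts the bytes spent on `"key":` text. The `Symbol` struct, its field names, and `keyOverhead` are assumptions made for this example, not the benchmark's actual schema, and the count is in bytes rather than tokens, so it does not reproduce the token figure above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Symbol is a hypothetical record; the field names are illustrative only.
type Symbol struct {
	Name     string  `json:"name"`
	Kind     string  `json:"kind"`
	Distance int     `json:"distance"`
	Score    float64 `json:"score"`
}

// keyOverhead returns the total JSON size in bytes and the bytes spent on
// repeated `"key":` text across n records.
func keyOverhead(n int) (total, keys int, err error) {
	syms := make([]Symbol, n)
	for i := range syms {
		syms[i] = Symbol{
			Name:     fmt.Sprintf("sym%d", i),
			Kind:     "function",
			Distance: i % 3,
			Score:    0.5,
		}
	}
	b, err := json.Marshal(syms)
	if err != nil {
		return 0, 0, err
	}
	perRecord := 0
	for _, k := range []string{"name", "kind", "distance", "score"} {
		perRecord += len(k) + 3 // two quotes and a colon around each key
	}
	return len(b), n * perRecord, nil
}

func main() {
	total, keys, err := keyOverhead(500)
	if err != nil {
		panic(err)
	}
	fmt.Printf("total=%d bytes, keys=%d bytes (%.0f%%)\n",
		total, keys, 100*float64(keys)/float64(total))
}
```

A format that declares field names once per section pays this cost once instead of once per row.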
At 8 symbols, all formats pass trivially. At 500, the formats separate clearly: in the results table, GCF leads JSON by roughly 18 to 54 percentage points depending on the model. The scale is large enough to stress-test counting accuracy without exceeding model context limits. This is where JSON breaks and format design decisions matter.