GCF Comprehension Eval

LLM comprehension benchmark comparing GCF, TOON, and JSON at 500 symbols.

What It Measures

Generates a 500-symbol, 200-edge code graph payload, encodes it in all three formats using the official libraries, sends each to an LLM with zero format instructions, and measures accuracy on 13 structured extraction questions.
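The loop described above can be sketched roughly as follows. This is a minimal illustration, not the eval's actual code: `Question`, `askFunc`, `score`, the stub model, and the case-insensitive exact-match rule are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Question pairs a prompt with its precomputed expected answer.
// (Hypothetical type; the real eval may structure this differently.)
type Question struct{ Prompt, Want string }

// askFunc stands in for a model call: it receives the encoded payload and one
// question, with no description of the format, and returns the model's answer.
type askFunc func(encoded, question string) string

// score returns the fraction of questions answered correctly for one encoding.
// Matching is case-insensitive exact match after trimming (an assumption).
func score(encoded string, qs []Question, ask askFunc) float64 {
	if len(qs) == 0 {
		return 0
	}
	correct := 0
	for _, q := range qs {
		got := strings.ToLower(strings.TrimSpace(ask(encoded, q.Prompt)))
		if got == strings.ToLower(q.Want) {
			correct++
		}
	}
	return float64(correct) / float64(len(qs))
}

func main() {
	qs := []Question{{"How many symbols?", "500"}, {"How many edges?", "200"}}
	// A stub "model" that always answers 500, for illustration only.
	stub := func(encoded, question string) string { return "500" }
	for _, format := range []string{"gcf", "toon", "json"} {
		// A real run would pass the payload encoded in each format here.
		fmt.Printf("%s: %.2f\n", format, score("<encoded payload>", qs, stub))
	}
}
```

The same questions and the same scoring function are applied to every format, so the only variable is the encoding.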

Questions (13)

| # | Category | Question |
|---|----------|----------|
| 1 | Counting | How many symbols? |
| 2 | Counting | How many edges? |
| 3 | Counting | How many targets (distance 0)? |
| 4 | Counting | How many related (distance 1)? |
| 5 | Counting | How many extended (distance 2)? |
| 6 | Counting | How many functions? |
| 7 | Counting | How many 'calls' edges? |
| 8 | Extraction | Highest-scored symbol name? |
| 9 | Extraction | Kind of highest-scored symbol? |
| 10 | Extraction | Kind of last symbol? |
| 11 | Extraction | All unique edge types? |
| 12 | Structure | Does it have an edges section? |
| 13 | Structure | What is the tool name? |

All answers are deterministic (computed from the payload). No LLM judge.
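As a rough sketch of what "computed from the payload" can look like, the function below derives several of the expected answers from an in-memory graph. The `Symbol`, `Edge`, and `Truth` types and the kind name `"function"` are hypothetical; only the distance levels and the `calls` edge type come from the questions above.

```go
package main

import "fmt"

// Symbol and Edge are hypothetical shapes for the generated payload;
// the real eval's types may differ.
type Symbol struct {
	Name     string
	Kind     string
	Score    float64
	Distance int // 0 = target, 1 = related, 2 = extended
}

type Edge struct {
	From, To, Type string
}

// Truth holds expected answers for a subset of the 13 questions.
type Truth struct {
	Symbols, Edges             int
	Targets, Related, Extended int
	Functions, CallEdges       int
	TopName, TopKind, LastKind string
}

// groundTruth derives expected answers directly from the payload,
// so scoring needs no LLM judge.
func groundTruth(syms []Symbol, edges []Edge) Truth {
	t := Truth{Symbols: len(syms), Edges: len(edges)}
	best := -1
	for i, s := range syms {
		switch s.Distance {
		case 0:
			t.Targets++
		case 1:
			t.Related++
		case 2:
			t.Extended++
		}
		if s.Kind == "function" { // assumed kind name
			t.Functions++
		}
		if best < 0 || s.Score > syms[best].Score {
			best = i
		}
	}
	if best >= 0 {
		t.TopName, t.TopKind = syms[best].Name, syms[best].Kind
		t.LastKind = syms[len(syms)-1].Kind
	}
	for _, e := range edges {
		if e.Type == "calls" {
			t.CallEdges++
		}
	}
	return t
}

func main() {
	syms := []Symbol{
		{"Parse", "function", 0.9, 0},
		{"Lexer", "struct", 0.7, 1},
		{"emit", "function", 0.4, 2},
	}
	edges := []Edge{{"Parse", "Lexer", "uses"}, {"Parse", "emit", "calls"}}
	fmt.Printf("%+v\n", groundTruth(syms, edges))
}
```

Because every expected value is a pure function of the generated graph, the same payload always yields the same answer key regardless of format.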

Results (multi-model, 2026-06-06)

23 comprehension runs across 10 models and 3 providers. GCF wins 22, ties 1, loses 0.

| Model | Runs | GCF | TOON | JSON |
|-------|------|-----|------|------|
| Claude Opus 4.6 | 2 | 96.2% | 84.6% | 73.1% |
| Claude Sonnet 4.6 | 2 | 100% | 73.1% | 53.8% |
| Claude Haiku 4.5 | 2 | 96.2% | 69.2% | 57.7% |
| GPT-5.5 | 5 | 84.1% | 67.7% | 45.8% |
| GPT-5.4 | 4 | 76.4% | 56.0% | 44.1% |
| GPT-5.4-mini | 2 | 71.8% | 64.1% | 54.2% |
| Gemini 2.5 Pro | 1 | 100% | 76.9% | 58.3% |
| Gemini 3.1 Pro | 1 | 100% | 76.9% | 46.2% |
| Gemini 3.5 Flash | 1 | 100% | 61.5% | 46.2% |
| Gemini 2.5 Flash | 3 | 80.6% | 54.6% | 57.0% |

GCF scores highest on every model. TOON beats JSON on nine of the ten; the exception is Gemini 2.5 Flash, where JSON (57.0%) edges out TOON (54.6%).

Four models reach 100% on GCF: Claude Sonnet 4.6 (Anthropic) and Gemini 2.5 Pro, Gemini 3.1 Pro, and Gemini 3.5 Flash (Google). All raw logs are in gcf/eval/results.

Running

# Claude CLI (default)
GOWORK=off go test -run TestComprehension -v -timeout 0

# Anthropic API
EVAL_BACKEND=api ANTHROPIC_API_KEY=sk-... GOWORK=off go test -run TestComprehension -v -timeout 0

# OpenAI (GPT-4o)
EVAL_BACKEND=openai OPENAI_API_KEY=sk-... EVAL_MODEL=gpt-4o GOWORK=off go test -run TestComprehension -v -timeout 0

# Google (Gemini)
EVAL_BACKEND=google GOOGLE_API_KEY=... EVAL_MODEL=gemini-2.0-flash GOWORK=off go test -run TestComprehension -v -timeout 0

# xAI (Grok)
EVAL_BACKEND=xai XAI_API_KEY=... EVAL_MODEL=grok-3 GOWORK=off go test -run TestComprehension -v -timeout 0
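
The commands above differ only in environment variables. As a hedged sketch of a preflight check you could run before a long eval, the snippet below maps each `EVAL_BACKEND` value to the key variable it is paired with in those commands. The type `backendConfig`, the function `configFromEnv`, and the internal name `"cli"` for the default are hypothetical and are not taken from the eval's source.

```go
package main

import (
	"fmt"
	"os"
)

// backendConfig is a hypothetical summary of the environment settings.
type backendConfig struct {
	Backend string // value of EVAL_BACKEND, or "cli" when unset (assumed name)
	Model   string // value of EVAL_MODEL, may be empty
	KeyVar  string // API key variable this backend is paired with, if any
}

// configFromEnv reads settings through getenv so it can be tested
// without touching the real environment.
func configFromEnv(getenv func(string) string) (backendConfig, error) {
	cfg := backendConfig{Backend: getenv("EVAL_BACKEND"), Model: getenv("EVAL_MODEL")}
	switch cfg.Backend {
	case "":
		cfg.Backend = "cli" // Claude CLI is the default; no key variable shown
	case "api":
		cfg.KeyVar = "ANTHROPIC_API_KEY"
	case "openai":
		cfg.KeyVar = "OPENAI_API_KEY"
	case "google":
		cfg.KeyVar = "GOOGLE_API_KEY"
	case "xai":
		cfg.KeyVar = "XAI_API_KEY"
	default:
		return cfg, fmt.Errorf("unknown EVAL_BACKEND %q", cfg.Backend)
	}
	if cfg.KeyVar != "" && getenv(cfg.KeyVar) == "" {
		return cfg, fmt.Errorf("backend %q needs %s to be set", cfg.Backend, cfg.KeyVar)
	}
	return cfg, nil
}

func main() {
	cfg, err := configFromEnv(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("backend=%s model=%s\n", cfg.Backend, cfg.Model)
}
```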

Dependencies

The eval is a separate Go module (eval/go.mod) to avoid polluting the root gcf-go library with test-only dependencies:

  • github.com/blackwell-systems/gcf-go: GCF encoding
  • github.com/toon-format/toon-go: TOON encoding (official library)

Consumers of gcf-go never pull toon-go transitively.

Why GCF Wins

  • Distance grouping: ## targets, ## related, ## extended headers make group counting trivial. TOON has no grouping; the model must scan all 500 rows and filter by a column.
  • Edge count in header: ## edges [200] gives the count directly. JSON and TOON require the model to count manually.
  • No noise: every token is content. JSON wastes 2,500+ tokens on repeated field names that dilute attention.
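
To make the header-count point concrete, here is a small sketch that reads a count from a `## name [N]` header line instead of counting rows. Only the header shape `## edges [200]` comes from this README; the function `declaredCount`, the regular expression, and the placeholder row are illustrative assumptions, not part of the gcf-go API.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// sectionCount matches header lines shaped like "## edges [200]".
var sectionCount = regexp.MustCompile(`^## (\w+) \[(\d+)\]$`)

// declaredCount returns the count announced in the header of the named
// section, without looking at any of the rows that follow it.
func declaredCount(payload, section string) (int, bool) {
	for _, line := range strings.Split(payload, "\n") {
		m := sectionCount.FindStringSubmatch(strings.TrimSpace(line))
		if m != nil && m[1] == section {
			n, err := strconv.Atoi(m[2])
			return n, err == nil
		}
	}
	return 0, false
}

func main() {
	// The row content is a placeholder; real GCF rows are not shown here.
	payload := "## edges [200]\n(edge rows elided)"
	n, ok := declaredCount(payload, "edges")
	fmt.Println(n, ok) // 200 true
}
```

A reader of the format, human or model, can answer "How many edges?" from one line, whereas a format without a declared count forces a scan of every row.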

Why 500 Symbols?

At 8 symbols, all formats pass trivially. At 500 symbols the formats separate clearly: JSON accuracy falls to between 44% and 73% across the models tested, while GCF stays between 72% and 100%. The payload is large enough to stress counting accuracy without exceeding model context limits, which is where format design decisions start to matter.