This directory contains benchmark specifications for measuring AI code generation efficiency in AILANG vs Python.
NEW: Benchmarks specifically designed to test AILANG's vision goals and differentiation from Python.
See VISION_BENCHMARKS.md for detailed documentation on:
- How benchmarks map to vision goals
- Current vs future evaluation capabilities
- Metrics tracking and success criteria
| Benchmark | Vision Goal | Status |
|---|---|---|
| `explicit_state_threading` | Explicit state vs implicit globals | 🆕 Ready |
| `deterministic_list_transform` | One canonical form vs multiple ways | 🆕 Ready |
| `effect_tracking_io_fs` | Explicit effect tracking (!: IO, FS) | 🆕 Ready |
| `effect_pure_separation` | Pure vs effectful separation | 🆕 Ready |
| `effect_composition` | Effect propagation in signatures | 🆕 Ready |
| `exhaustive_pattern_matching` | Total functions, no crashes | 🆕 Ready |
| `type_safe_record_access` | Static type safety vs runtime errors | 🆕 Ready |
| `referential_transparency` | Same input → same output | 🆕 Ready |
| `canonical_normalization` | Idiomatic, canonical code structure | 🆕 Ready |
| `no_runtime_crashes_option` | Option types prevent null errors | 🆕 Ready |
| `immutable_data_structures` | Immutable updates vs mutations | 🆕 Ready |
Key Differentiators:
- 🚀 Explicit Effects: AILANG forces effect declarations, Python doesn't track them
- 🎯 Determinism: AILANG has one way to do things, Python has many
- 🛡️ Type Safety: AILANG catches errors at compile time, Python at runtime
- ✅ Totality: AILANG requires exhaustive patterns, Python allows missing cases
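To make the totality point concrete, here is a small Python sketch of a function that forgets one case. The `Circle`, `Square`, and `Triangle` types are hypothetical and not taken from any benchmark in this directory. The interpreter accepts the definition, and the gap only surfaces at runtime as a silent `None`.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Circle:
    radius: float


@dataclass
class Square:
    side: float


@dataclass
class Triangle:
    base: float
    height: float


Shape = Union[Circle, Square, Triangle]


def area(shape: Shape) -> float:
    # Triangle is not handled. The interpreter accepts this definition
    # and only reveals the gap when a Triangle is actually passed in.
    if isinstance(shape, Circle):
        return 3.14159 * shape.radius ** 2
    if isinstance(shape, Square):
        return shape.side ** 2
    # No final branch: execution falls off the end and returns None.


print(area(Square(2.0)))         # 4.0
print(area(Triangle(3.0, 4.0)))  # None
```

A static type checker may flag the missing return, but nothing in the language itself requires running one.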
| ID | Description | Difficulty | Focus Area | Expected Gain | Status |
|---|---|---|---|---|---|
| `fizzbuzz` | Classic FizzBuzz (1-100) | Easy | Control flow | Low | ✅ Passing |
| `recursion_factorial` | Factorial via recursion | Easy | Recursion | Low | ✅ Passing |
| `recursion_fibonacci` | Fibonacci via recursion | Easy | Recursion | Low | ✅ Passing |
| `records_person` | Record types with field access | Medium | Records | High | ✅ Passing |
| `adt_option` | Option/Maybe monad operations | Medium | Algebraic types | Very High | ✅ Passing |
| `float_eq` | Float equality comparisons | Easy | Type system | Low | |
| `numeric_modulo` | Modulo operator (%) | Easy | Operators | Low | ❌ Failing |
| `json_parse` | Parse JSON, filter, output | Medium | Data parsing | Medium | ❌ Failing |
| `pipeline` | stdin → transform → stdout | Medium | IO + lists | High | ❌ Failing |
| `cli_args` | Read file, process, sum | Hard | IO + FS | High | ❌ Failing |
| ID | Description | Difficulty | Focus Area | Expected Gain | Expected Status |
|---|---|---|---|---|---|
| `list_operations` | List construction, pattern matching, recursion | Easy | Lists + recursion | High | Should Pass ✅ |
| `string_manipulation` | String concat, show(), comparisons | Easy | Strings | Medium | Should Pass ✅ |
| `nested_records` | Nested record access | Medium | Records | High | Should Pass ✅ |
| `higher_order_functions` | Compose, map, filter, currying | Medium | Functional programming | High | May Pass |
| `pattern_matching_complex` | Nested patterns, guards, Tree ADT | Hard | Pattern matching | Medium | Should Pass ✅ |
| ID | Description | Difficulty | Focus Area | Expected Gain | Expected Status |
|---|---|---|---|---|---|
| `record_update` | Record update syntax `{r \| field: value}` | Medium | Records | Very High | Will Fail ❌ (M-R5b) |
| `list_comprehension` | Map/filter/fold on lists | Hard | Stdlib | Very High | Will Fail ❌ (no stdlib) |
| `error_handling` | Result/Either type for errors | Medium | Error handling | High | Will Fail ❌ (no Result) |
Test the harness with pre-written mock code:
# Single benchmark
ailang eval --benchmark fizzbuzz --mock
# Both languages
ailang eval --benchmark fizzbuzz --mock --langs python,ailang
# All existing benchmarks (mock)
for bench in fizzbuzz recursion_factorial recursion_fibonacci records_person adt_option \
float_eq numeric_modulo json_parse pipeline cli_args; do
ailang eval --benchmark $bench --mock --langs python,ailang
done
# All new benchmarks (mock)
for bench in list_operations string_manipulation nested_records higher_order_functions \
pattern_matching_complex record_update list_comprehension error_handling; do
ailang eval --benchmark $bench --mock --langs python,ailang
done
# Generate report
make eval-report
Requires an API key from OpenAI or Anthropic:
# Set API key
export OPENAI_API_KEY="sk-..."
# or
export ANTHROPIC_API_KEY="..."
# Run single benchmark
ailang eval --benchmark fizzbuzz --model gpt-4 --seed 42
# Run all benchmarks (this will cost API credits!)
# Existing benchmarks
for bench in fizzbuzz recursion_factorial recursion_fibonacci records_person adt_option \
float_eq numeric_modulo json_parse pipeline cli_args; do
ailang eval --benchmark $bench --model gpt-4 --seed 42 --langs python,ailang
done
# New benchmarks (testing current features)
for bench in list_operations string_manipulation nested_records higher_order_functions \
pattern_matching_complex; do
ailang eval --benchmark $bench --model gpt-4 --seed 42 --langs python,ailang
done
# New benchmarks (expected to fail - identifies gaps)
for bench in record_update list_comprehension error_handling; do
ailang eval --benchmark $bench --model gpt-4 --seed 42 --langs python,ailang
done
# Generate report
make eval-report
Each benchmark is a YAML file with this structure:
id: my_benchmark
description: "Short description"
languages: ["python", "ailang"] # Supported languages
entrypoint: "main" # Entry function name
caps: ["IO", "FS"] # Required capabilities
difficulty: "medium" # easy | medium | hard
expected_gain: "high" # low | medium | high | very_high
prompt: |
  Task description with <LANG> placeholder.
  Requirements:
  - Specific requirement 1
  - Specific requirement 2
  Output only the code, no explanations.
expected_stdout: |
  expected
  output
  here

- Create YAML file: `benchmarks/my_benchmark.yml`
- Test with mock: `ailang eval --benchmark my_benchmark --mock`
- Verify output: Check `eval_results/*.json`
- Test with real API: `ailang eval --benchmark my_benchmark --model gpt-4`
- Generate report: `make eval-report`
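
Before running even the mock step, it can help to sanity-check a new spec. The sketch below is not part of the `ailang` tooling: `validate_spec` and `render_prompt` are hypothetical helpers, and the set of required fields is an assumption read off the example structure above. It works on an already-parsed spec (a plain dict), so it needs no YAML library.

```python
from pathlib import Path

ALLOWED_DIFFICULTY = {"easy", "medium", "hard"}
ALLOWED_GAIN = {"low", "medium", "high", "very_high"}
# Assumed required fields, taken from the example spec in this README.
REQUIRED_FIELDS = ("id", "description", "languages", "entrypoint",
                   "difficulty", "expected_gain", "prompt", "expected_stdout")


def validate_spec(spec: dict, filename: str) -> list:
    """Return a list of problems; an empty list means the spec looks fine."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in spec]
    if "difficulty" in spec and spec["difficulty"] not in ALLOWED_DIFFICULTY:
        problems.append(f"difficulty must be one of {sorted(ALLOWED_DIFFICULTY)}")
    if "expected_gain" in spec and spec["expected_gain"] not in ALLOWED_GAIN:
        problems.append(f"expected_gain must be one of {sorted(ALLOWED_GAIN)}")
    # The benchmark ID should match the filename without its .yml extension.
    if "id" in spec and spec["id"] != Path(filename).stem:
        problems.append(f"id {spec['id']!r} does not match file {filename!r}")
    if "prompt" in spec and "<LANG>" not in spec["prompt"]:
        problems.append("prompt has no <LANG> placeholder")
    return problems


def render_prompt(spec: dict, language: str) -> str:
    """Fill in the <LANG> placeholder (the real harness may do this differently)."""
    return spec["prompt"].replace("<LANG>", language)


spec = {
    "id": "my_benchmark",
    "description": "Short description",
    "languages": ["python", "ailang"],
    "entrypoint": "main",
    "caps": ["IO", "FS"],
    "difficulty": "medium",
    "expected_gain": "high",
    "prompt": "Write a program in <LANG> that prints hello.\n",
    "expected_stdout": "hello\n",
}
print(validate_spec(spec, "benchmarks/my_benchmark.yml"))  # []
print(render_prompt(spec, "AILANG"))
```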
❌ Bad (biased):
prompt: "Write Python code (the most popular language) that..."
✅ Good (neutral):
prompt: "Write a program in <LANG> that..."
❌ Bad (vague):
prompt: "Do FizzBuzz"
✅ Good (specific):
prompt: |
  Write a program in <LANG> that prints numbers 1-100:
  - Multiples of 3: print "Fizz"
  - Multiples of 5: print "Buzz"
  - Multiples of 15: print "FizzBuzz"
  - Others: print the number
After baseline tests reveal common errors, add AILANG-specific hints:
prompt: |
  Write a program in <LANG> that...
  <LANG=AILANG> Additional context:
  - Use `let x = value in body` syntax
  - Import from stdlib/std/* for standard types
  - Effects declared with ! syntax: func() -> T ! {IO}
Goal: Measure first-attempt quality
Expected outcomes:
- Python: Higher success rate (familiar to AI)
- AILANG: Lower success rate (unfamiliar syntax)
- AILANG: Fewer tokens per attempt (concise syntax)
What we learn:
- Which AILANG syntax confuses AI
- What documentation is missing
- How to improve prompts for Phase 2
Goal: Measure total effort with iteration
Expected outcomes:
- Python: Fewer turns to success
- AILANG: More turns initially (learning curve)
- AILANG: Lower total tokens (after prompt improvements)
benchmarks/ houses three sibling suites that measure different things and
should not be conflated:
benchmarks/
  README.md          # This file
  *.yml              # AI eval specs (suite 1): measure code-gen quality
  cross-language/    # ↳ shared specs run against AILANG and Python
  runtime/           # Runtime micro-benchmarks (suite 2): measure interpreter cost
  workloads/         # Latency-budget canonical workloads (suite 3): measure user-visible p95
eval_results/        # AI-eval output (git-ignored)
  .gitignore
  *.json             # Individual run results
  summary.csv        # Aggregated results
  leaderboard.md     # Human-readable report
| Suite | Question it answers | Measured by | Owner |
|---|---|---|---|
| `*.yml` AI evals | "How well do AI models generate AILANG?" | `ailang eval-suite` | M-EVAL |
| `runtime/*.ail` | "Did this commit slow down a specific evaluator path?" | `make bench` (Go benchmarks) | perf-reviewer skill |
| `workloads/*.ail` | "Did this release blow the latency budget on realistic user workloads?" | `tools/bench_workloads.sh` | M-LAT-BUDGET |
The three suites are deliberately separate. AI evals stress code-gen quality (can the model write the program at all?). Runtime micro-benchmarks isolate single hot-loop interpreter changes. Latency-budget workloads run end-to-end programs that mirror real user shapes — they are the only suite whose p95 is treated as a release-gating SLO.
See workloads/README.md for the canonical workload
catalog and the latency-budget process.
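
For readers unfamiliar with a p95 gate, the sketch below shows one common way to compute a 95th percentile (the nearest-rank method) and compare it with a budget. The sample latencies and the 50 ms budget are made-up illustration values, and `tools/bench_workloads.sh` may well compute its percentile differently.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def within_budget(samples_ms, budget_ms):
    """True when the p95 latency does not exceed the budget."""
    return p95(samples_ms) <= budget_ms


# Hypothetical latencies (ms) from 20 runs of one workload.
latencies = [12, 11, 13, 12, 14, 11, 12, 13, 12, 15,
             12, 11, 13, 12, 14, 12, 13, 11, 12, 90]
print(p95(latencies))                # 15
print(within_budget(latencies, 50))  # True
```

With 20 samples, a single 90 ms outlier does not move the p95, which is why a percentile is a steadier gate than the maximum.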
Approximate API costs (as of 2025):
| Model | Tokens/Benchmark | Cost/Benchmark | Full Suite (10) |
|---|---|---|---|
| GPT-4 | 300 | $0.009 | $0.09 |
| GPT-3.5 | 300 | $0.0003 | $0.003 |
| Claude-3 | 300 | $0.0045 | $0.045 |
Note: Multi-turn evaluation (Phase 2) will cost 2-5x more due to iteration.
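
The per-benchmark figures in the table are consistent with a simple tokens × price model, and the full-suite column equals ten times the per-benchmark cost. The sketch below back-derives an implied price per 1,000 tokens from each row and applies the 2-5x multi-turn multiplier from the note above. These rates are inferred from the table, not quoted from any provider price list.

```python
# Implied USD price per 1,000 tokens, back-derived from the table above
# (cost per benchmark / 300 tokens * 1000). Not official provider pricing.
IMPLIED_PRICE_PER_1K = {
    "GPT-4": 0.03,
    "GPT-3.5": 0.001,
    "Claude-3": 0.015,
}
TOKENS_PER_BENCHMARK = 300


def cost_per_benchmark(model, tokens=TOKENS_PER_BENCHMARK):
    """Single-turn cost in USD for one benchmark."""
    return tokens / 1000 * IMPLIED_PRICE_PER_1K[model]


def suite_cost(model, n_benchmarks, turns_multiplier=1.0):
    """Single-turn by default; pass 2.0 to 5.0 to approximate multi-turn runs."""
    return cost_per_benchmark(model) * n_benchmarks * turns_multiplier


for model in IMPLIED_PRICE_PER_1K:
    single = suite_cost(model, n_benchmarks=10)
    print(f"{model}: ${cost_per_benchmark(model):.4f}/benchmark, "
          f"${single:.3f} for 10 benchmarks, "
          f"${single * 2:.3f}-${single * 5:.3f} multi-turn")
```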
export OPENAI_API_KEY="sk-..."
Check the benchmark ID matches the filename (without `.yml`):
ls benchmarks/
# Should see: fizzbuzz.yml, json_parse.yml, etc.
This is expected! Mock code is for testing the harness, not for real benchmarks. Use `--model gpt-4` for actual evaluation.
When adding benchmarks:
- ✅ Use neutral prompts (no language bias)
- ✅ Test with `--mock` first
- ✅ Specify clear expected output
- ✅ Document difficulty and expected gain
- ✅ Test with at least 2 models
- ✅ Submit PR with results
NEW ✨ Automatically generate design documents from eval failures!
# Full workflow: run evals → analyze → generate design docs
make eval-to-design
# Or run steps individually:
make eval-suite # Run benchmarks
make eval-analyze # Analyze failures, generate design docs

- Run Evals: Benchmarks execute and save results to `eval_results/*.json`
- Analyze Patterns: `ailang eval-analyze` groups failures by error pattern
- Generate Designs: GPT-5 analyzes failures and creates design documents in `design_docs/planned/`
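
The grouping step can be pictured as a frequency count over failed runs. This is only a sketch of the idea, not the `ailang eval-analyze` implementation: the result fields used here (`benchmark`, `passed`, `category`, `error`) are an assumed schema for illustration, the error strings are invented, and the real `eval_results/*.json` layout may differ. The `min_frequency` parameter mirrors the `--min-frequency` option documented in this README.

```python
from collections import defaultdict


def group_failures(results, min_frequency=2):
    """Group failed runs by (category, error); keep patterns seen often enough."""
    groups = defaultdict(list)
    for run in results:
        if not run["passed"]:
            groups[(run["category"], run["error"])].append(run["benchmark"])
    frequent = {key: names for key, names in groups.items()
                if len(names) >= min_frequency}
    # Most frequent patterns first.
    return sorted(frequent.items(), key=lambda item: len(item[1]), reverse=True)


# Hypothetical results, shaped per the assumed schema above.
results = [
    {"benchmark": "record_update", "passed": False,
     "category": "compile_error", "error": "unexpected token"},
    {"benchmark": "list_comprehension", "passed": False,
     "category": "compile_error", "error": "unexpected token"},
    {"benchmark": "error_handling", "passed": False,
     "category": "compile_error", "error": "undefined type"},
    {"benchmark": "fizzbuzz", "passed": True,
     "category": None, "error": None},
]

for (category, error), benchmarks in group_failures(results):
    print(f"{category}: {error} ({len(benchmarks)}x) -> {benchmarks}")
```

With the default threshold of 2, the single "undefined type" failure is filtered out and only the repeated pattern is reported.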
# Run eval suite with multiple models
make eval-suite
# Analyze results (dry-run to see issues first)
ailang eval-analyze --results eval_results/ --dry-run
# Generate design docs
ailang eval-analyze --results eval_results/ \
--model gpt5 \
--output design_docs/planned/ \
--min-frequency 2
# Review generated designs
ls -lh design_docs/planned/
cat design_docs/planned/EVAL_ANALYSIS_*.md
ailang eval-analyze [options]
Options:
  --results <dir>        Directory with eval results (default: eval_results)
  --output <dir>         Output directory for design docs (default: design_docs/planned)
  --model <name>         LLM model for analysis (default: gpt5)
  --min-frequency <n>    Minimum failure count to report (default: 2)
  --categories <list>    Filter by category (compile_error,runtime_error,logic_error)
  --dry-run              Show issues without generating design docs
  --generate=false       Skip design doc generation (analysis only)
For each issue pattern discovered:
- Design Document (`YYYYMMDD_category_issue_name.md`)
  - Problem statement synthesized from error patterns
  - Root cause analysis
  - Proposed solution with implementation plan
  - Testing strategy
  - Success criteria
  - Estimated LOC and time
- Summary Report (`EVAL_ANALYSIS_YYYYMMDD.md`)
  - Overview of all issues by impact
  - Links to generated design docs
  - Next steps for implementation
- Analysis Data (`analysis_YYYYMMDD_HHMMSS.json`)
  - Machine-readable issue data
  - For further processing/tracking

- ~$0.10-0.50 per design doc (GPT-5)
- Typical analysis: 1-3 design docs
- Total cost: $0.10-$1.50 per analysis run
- Run mock tests: `make eval`
- Set up API key: `export OPENAI_API_KEY="..."`
- Run baseline suite: See "Real API Mode" above
- Analyze results: `make eval-report`
- Generate design docs: `make eval-analyze` ✨ NEW
- Share findings: Help improve AILANG documentation

- Documentation: See docs/guides/benchmarking.md
- Design Doc: See design_docs/20251002/m_eval_ai_benchmarking.md
- Phase 2: See design_docs/20251002/m_eval2_agentic.md