mingrath/election-ocr-hackathon


Election OCR Hackathon — Super AI Engineer Season 6

Kaggle Rank: #53 (private leaderboard, +16 jump) | Final Score: 0.1587 | 16 submissions

Extract structured voting data from 300 scanned Thai election result documents (Form สส.6/1, 846 page images) from the 2026 Thai general election. Built entirely with Claude Code over a 48-hour sprint.

Competition: [Super AI Engineer Season 6] การแข่งขัน OCR เอกสารผลเลือกตั้ง สส. 2569 (OCR competition on 2026 MP election result documents)


The Journey

Score Timeline

| Version | Approach | Kaggle Score | What Changed |
|---|---|---|---|
| V1 (Naive) | Single-pass OCR, row-number mapping | 1.57 | First attempt; format was completely wrong |
| V2 (Pagewise) | Page-by-page OCR with Gemini 2.5 Flash | 0.52 | Better structure, still wrong assembly |
| V3 (Nuclear) | Gemini 3 Flash, full 300-doc run | 0.73 | Only 74/300 docs complete, wrong constituency mapping |
| V4 (Simple) | Row-number alignment fix | 0.076 (local) | Local eval looked perfect; Kaggle told a different story |
| V5 (G3Flash) | Complete 300-doc Gemini 3 Flash | 0.081 (local) | Good locally, 0.929 on Kaggle (submitted the wrong file!) |
| V6 (Official) | Merged official กกต. election data | 0.229 | Huge jump; real data beats OCR for known candidates |
| V7 (Hybrids) | Template party-name alignment + multi-pass | 0.235 | Ensemble made it worse; too much noise |
| Final | Official data primary + G3Flash fallback + offset fix | 0.158 | Fixed the 22-doc ballot offset bug |

Score journey: 1.57 → 0.52 → 0.23 → 0.158


Architecture

Main Journey

846 PNG scans (300 election documents, 1-4 pages each)
    │
    ├── Discovery (src/discover.py)
    │   └── Classify: party_list vs constituency, extract metadata
    │
    ├── Multi-Model OCR
    │   ├── Gemini 3 Flash (nuclear_g3flash) — primary, best digit accuracy
    │   ├── Gemini 2.5 Pro (pagewise_pro) — backup, better structure
    │   ├── Gemini 2.5 Flash (pagewise_full) — fast fallback
    │   └── 3-pass overnight runs (900 API calls)
    │
    ├── Official Data Integration
    │   └── กกต. (Election Commission of Thailand) results from GitHub (killernay/election-69-OCR-result)
    │       ├── constituency.csv (3,458 candidates by party name)
    │       └── party_list.csv (22,604 rows by ballot number)
    │
    ├── Assembly (src/assemble.py)
    │   ├── Template-based party-name fuzzy matching (thefuzz, 60% threshold)
    │   ├── Ballot-number offset detection and correction
    │   └── Multi-source merge: official → G3Flash → Pro → Flash fallback
    │
    └── submission.csv (10,053 rows of extracted vote counts)
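
The ballot-number offset detection step in the assembly stage above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual `src/assemble.py` code: the function name, the `(ballot_number, party_name)` row shape, and the pluggable `match_fn` are all assumptions.

```python
def detect_offset(ocr_rows, official_names, match_fn, max_offset=2):
    """Try small shifts of the OCR'd ballot numbers and keep the shift
    that aligns the most OCR party names with the official list.

    ocr_rows:       list of (ballot_number, party_name) from OCR
    official_names: dict mapping ballot_number -> official party name
    match_fn:       name-similarity predicate (e.g. a fuzzy matcher)
    """
    best_offset, best_hits = 0, -1
    for offset in range(-max_offset, max_offset + 1):
        hits = sum(
            1
            for num, name in ocr_rows
            if match_fn(name, official_names.get(num + offset, ""))
        )
        if hits > best_hits:
            best_offset, best_hits = offset, hits
    return best_offset
```

Applied per document, a detected offset of +1 would flag the "position 1 was a placeholder" documents described below.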

Key Technical Decisions

1. Party-name alignment over ballot-number mapping

The naive approach maps ballot numbers directly. But 22+ party_list documents had shifted ballot numbers (position 1 was a placeholder). Template-based party-name matching solved this:

# Instead of: row_num → ballot_number → vote
# We do:     party_name (fuzzy match) → official data → vote

This single fix dropped the score from 0.307 to 0.228.
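
A minimal sketch of that party-name alignment, using stdlib `difflib` in place of `thefuzz` for portability (the project itself uses `thefuzz` with a 60% threshold); the function name and inputs are illustrative:

```python
from difflib import SequenceMatcher


def best_party_match(ocr_name, official_parties, threshold=0.60):
    """Return the official party name that best matches the OCR output,
    or None if nothing clears the similarity threshold."""
    def score(a, b):
        return SequenceMatcher(None, a, b).ratio()

    best = max(official_parties, key=lambda p: score(ocr_name, p))
    return best if score(ocr_name, best) >= threshold else None
```

Once a name is matched, the vote count is looked up from the official data by party rather than by the (possibly shifted) ballot number.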

2. Official data as primary source

OCR digit accuracy was ~68% — reading "7" as "1", "9" as "3". For candidates that appeared in the official กกต. dataset, we used the official numbers directly and only fell back to OCR for missing entries.
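
That precedence can be sketched as a priority merge over the four sources; the dict-keyed shape below is illustrative, not the actual `src/assemble.py` API:

```python
def merge_votes(keys, official, g3flash, pro, flash):
    """For each (document, candidate) key, take the first source that
    has it, in priority order: official > Gemini 3 Flash > 2.5 Pro >
    2.5 Flash. Each source is a dict of key -> vote count."""
    merged = {}
    for key in keys:
        for source in (official, g3flash, pro, flash):
            if key in source:
                merged[key] = source[key]
                break
    return merged
```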

3. Local eval is unreliable

We had ground truth for only 5 documents (2% of the dataset). Local eval showed 0.000 (perfect) while Kaggle showed 0.228. Zero correlation. Lesson: with a tiny eval set, only empirical Kaggle testing matters.
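
For what local checking is worth, the competition metric (mean Levenshtein distance on vote-count strings, per the Competition Context section) is easy to reproduce; this is a generic sketch, not the repo's EVAL_SYSTEM.py:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def mean_levenshtein(predicted, truth):
    """Mean edit distance over paired vote-count strings (lower is better)."""
    return sum(levenshtein(p, t) for p, t in zip(predicted, truth)) / len(truth)
```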

What Didn't Work

| Approach | Score | Why It Failed |
|---|---|---|
| Pure OCR with row-number mapping | 0.913 | Wrong constituency assembly logic |
| Multi-pass OCR voting (ensemble of 3 runs) | 0.234 | Adding noisy OCR to correct official data worsened it |
| 3-pass pure OCR without official data | 0.244 | OCR errors (~5%) worse than official data |
| Predicting Kaggle scores from local analysis | "0.03-0.07" predicted, 0.228 actual | 5 docs ≠ representative sample |

Stack

  • OCR: Google Gemini 3 Flash, Gemini 2.5 Pro, Gemini 2.5 Flash
  • Language: Python 3.11+
  • Matching: thefuzz (fuzzy string matching, 60% similarity threshold)
  • Data: pandas, pillow, aiohttp
  • Orchestration: Claude Code with /loop for 8+ hours of autonomous overnight monitoring
  • Evaluation: Custom TDD eval system (8 tests via pytest)

Project Structure

├── src/                    # Core pipeline (2,400 lines)
│   ├── discover.py         # Document classification and metadata extraction
│   ├── process.py          # Main OCR processing (single-page)
│   ├── process_pagewise.py # Page-by-page OCR with structured extraction
│   ├── ocr_nuclear.py      # Gemini 3 Flash OCR with aggressive prompting
│   ├── assemble.py         # Multi-source assembly and party-name matching
│   ├── postprocess.py      # Vote count cleanup and normalization
│   ├── validate.py         # Ground truth comparison and scoring
│   ├── config.py           # API keys and model configuration
│   └── pipeline.py         # End-to-end orchestration
├── submissions/            # All 36 submission CSVs (full iteration history)
├── BATTLE_PLAN*.md         # Strategy documents for each phase
├── EVAL_SYSTEM.py          # Local evaluation and cross-validation
├── BUILD_SUBMISSION.py     # Final submission builder
├── BUILD_PERFECT.py        # Optimal assembly with offset detection
├── election_ocr_pipeline.ipynb  # Kaggle notebook submission
└── diagram-*.png           # Journey and score visualizations

How to Run

# 1. Install dependencies
uv sync

# 2. Set up API keys
cp .env.example .env
# Add your GOOGLE_API_KEY

# 3. Run the full pipeline (requires competition data in data/)
uv run python -m src.pipeline

# 4. Or run individual steps
uv run python -m src.discover      # Classify documents
uv run python -m src.process       # OCR processing
uv run python -m src.assemble      # Assembly and matching
uv run python -m src.validate      # Evaluate against ground truth

Competition Context

  • Organizer: Super AI Engineer (AIAT Thailand), Season 6
  • Task: OCR + structured extraction from scanned Thai election tally sheets
  • Metric: Mean Levenshtein distance on vote count strings (lower is better)
  • Dataset: 300 documents (846 page images), ~10,053 vote entries
  • Duration: ~48 hours (March 20-22, 2026)
  • Total participants: 100+ teams

Author

Ohm Mingrath (มิ่งรัฐ เมฆวิชัย) — DVM, Chulalongkorn University

