Kaggle Rank: #53 (private leaderboard, +16 jump) | Final Score: 0.1587 | 16 submissions
Extract structured voting data from 846 scanned pages (300 Thai election result documents, Form สส.6/1) from the 2026 Thai general election. Built entirely with Claude Code over a 48-hour sprint.
Competition: [Super AI Engineer Season 6] MP Election Result Document OCR Competition 2026 (การแข่งขัน OCR เอกสารผลเลือกตั้ง สส. 2569)
| Version | Approach | Kaggle Score | What Changed |
|---|---|---|---|
| V1 Naive | Single-pass OCR, row-number mapping | 1.57 | First attempt — format was completely wrong |
| V2 Pagewise | Page-by-page OCR with Gemini 2.5 Flash | 0.52 | Better structure, still wrong assembly |
| V3 Nuclear | Gemini 3 Flash, full 300-doc run | 0.73 | Only 74/300 docs complete, wrong constituency mapping |
| V4 Simple | Row-number alignment fix | 0.076 (local) | Local eval looked perfect — Kaggle told a different story |
| V5 G3Flash | Complete 300-doc Gemini 3 Flash | 0.081 (local) | Good locally, 0.929 on Kaggle (submitted wrong file!) |
| V6 Official | Merged official กกต. election data | 0.229 | Huge jump — real data > OCR for known candidates |
| V7 Hybrids | Template party-name alignment + multi-pass | 0.235 | Ensemble made it worse — too much noise |
| Final | Official data primary + G3Flash fallback + offset fix | 0.158 | Fixed 22-doc ballot offset bug |
Score journey: 1.57 → 0.52 → 0.23 → 0.158
846 PNG scans (300 election documents, 1-4 pages each)
│
├── Discovery (src/discover.py)
│ └── Classify: party_list vs constituency, extract metadata
│
├── Multi-Model OCR
│ ├── Gemini 3 Flash (nuclear_g3flash) — primary, best digit accuracy
│ ├── Gemini 2.5 Pro (pagewise_pro) — backup, better structure
│ ├── Gemini 2.5 Flash (pagewise_full) — fast fallback
│ └── 3-pass overnight runs (900 API calls)
│
├── Official Data Integration
│ └── กกต. election results from GitHub (killernay/election-69-OCR-result)
│ ├── constituency.csv (3,458 candidates by party name)
│ └── party_list.csv (22,604 rows by ballot number)
│
├── Assembly (src/assemble.py)
│ ├── Template-based party-name fuzzy matching (thefuzz, 60% threshold)
│ ├── Ballot-number offset detection and correction
│ └── Multi-source merge: official → G3Flash → Pro → Flash fallback
│
└── submission.csv (10,053 rows of extracted vote counts)
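The multi-source merge in the assembly step can be sketched as follows (the function and dict names are illustrative, not the actual `src/assemble.py` API):

```python
def merge_vote(key, official, g3flash, pro, flash):
    """Return the first available vote count for a (doc, position) key,
    preferring official กกต. data over the three OCR passes."""
    for source in (official, g3flash, pro, flash):  # priority order
        value = source.get(key)
        if value is not None:
            return value
    return ""  # no source extracted this entry

# Toy data: official data wins where present, OCR fills the gaps.
official = {("doc001", 1): "4521"}
g3flash = {("doc001", 1): "4527", ("doc001", 2): "389"}
print(merge_vote(("doc001", 1), official, g3flash, {}, {}))  # 4521 (official)
print(merge_vote(("doc001", 2), official, g3flash, {}, {}))  # 389 (G3Flash fallback)
```

The priority order mirrors the pipeline: trusted official data first, then OCR passes in decreasing digit-accuracy order.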
The naive approach mapped ballot numbers directly, but 22+ party_list documents had shifted ballot numbers (position 1 was a placeholder). Template-based party-name matching solved this:

```python
# Instead of: row_num → ballot_number → vote
# We do:      party_name (fuzzy match) → official data → vote
```

This single fix improved the score from 0.307 to 0.228.
OCR digit accuracy was ~68% — reading "7" as "1", "9" as "3". For candidates that appeared in the official กกต. dataset, we used the official numbers directly and only fell back to OCR for missing entries.
We had ground truth for only 5 documents (2% of the dataset). Local eval showed 0.000 (perfect) while Kaggle showed 0.228. Zero correlation. Lesson: with a tiny eval set, only empirical Kaggle testing matters.
| Approach | Score | Why It Failed |
|---|---|---|
| Pure OCR with row-number mapping | 0.913 | Wrong constituency assembly logic |
| Multi-pass OCR voting (ensemble of 3 runs) | 0.234 | Adding noisy OCR to correct official data worsened it |
| 3-pass pure OCR without official data | 0.244 | OCR errors (~5%) worse than official data |
| Predicting Kaggle scores from local analysis | "0.03-0.07" predicted, 0.228 actual | 5 docs ≠ representative sample |
- OCR: Google Gemini 3 Flash, Gemini 2.5 Pro, Gemini 2.5 Flash
- Language: Python 3.11+
- Matching: thefuzz (fuzzy string matching with 60% length ratio)
- Data: pandas, pillow, aiohttp
- Orchestration: Claude Code with `/loop` for 8+ hours of autonomous overnight monitoring
- Evaluation: Custom TDD eval system (8 tests via pytest)
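In the spirit of that TDD eval system, a pytest-style sanity check on the submission format might look like this (the `id`/`vote` column names are hypothetical; the real schema is defined by the competition):

```python
import csv
import io

def iter_votes(csv_text):
    """Yield vote strings from a submission CSV (hypothetical 'vote' column)."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield row["vote"]

def test_votes_are_digit_strings():
    # Every extracted vote count should be a plain digit string.
    sample = "id,vote\ndoc001_1,4521\ndoc001_2,389\n"
    assert all(v.isdigit() for v in iter_votes(sample))
```

Cheap structural checks like this catch format regressions (stray whitespace, non-numeric OCR residue) before a submission is wasted.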
├── src/ # Core pipeline (2,400 lines)
│ ├── discover.py # Document classification and metadata extraction
│ ├── process.py # Main OCR processing (single-page)
│ ├── process_pagewise.py # Page-by-page OCR with structured extraction
│ ├── ocr_nuclear.py # Gemini 3 Flash OCR with aggressive prompting
│ ├── assemble.py # Multi-source assembly and party-name matching
│ ├── postprocess.py # Vote count cleanup and normalization
│ ├── validate.py # Ground truth comparison and scoring
│ ├── config.py # API keys and model configuration
│ └── pipeline.py # End-to-end orchestration
├── submissions/ # All 36 submission CSVs (full iteration history)
├── BATTLE_PLAN*.md # Strategy documents for each phase
├── EVAL_SYSTEM.py # Local evaluation and cross-validation
├── BUILD_SUBMISSION.py # Final submission builder
├── BUILD_PERFECT.py # Optimal assembly with offset detection
├── election_ocr_pipeline.ipynb # Kaggle notebook submission
└── diagram-*.png # Journey and score visualizations
```shell
# 1. Install dependencies
uv sync

# 2. Set up API keys
cp .env.example .env
# Add your GOOGLE_API_KEY

# 3. Run the full pipeline (requires competition data in data/)
uv run python -m src.pipeline

# 4. Or run individual steps
uv run python -m src.discover   # Classify documents
uv run python -m src.process    # OCR processing
uv run python -m src.assemble   # Assembly and matching
uv run python -m src.validate   # Evaluate against ground truth
```

- Organizer: Super AI Engineer (AIAT Thailand), Season 6
- Task: OCR + structured extraction from scanned Thai election tally sheets
- Metric: Mean Levenshtein distance on vote count strings (lower is better)
- Dataset: 300 documents (846 page images), ~10,053 vote entries
- Duration: ~48 hours (March 20-22, 2026)
- Total participants: 100+ teams
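The metric, mean Levenshtein distance over vote-count strings, can be illustrated with a plain dynamic-programming implementation (any normalization Kaggle applies on top is not documented here):

```python
def levenshtein(a, b):
    """Classic DP edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Mean distance across predicted vs. true vote strings (lower is better).
pairs = [("4521", "4527"), ("389", "389"), ("71", "17")]
mean_dist = sum(levenshtein(p, t) for p, t in pairs) / len(pairs)
print(mean_dist)  # (1 + 0 + 2) / 3 = 1.0
```

This is why single-digit OCR confusions ("7" read as "1") are so costly: each wrong digit adds a full edit to the per-entry distance.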
Ohm Mingrath (มิ่งรัฐ เมฆวิชัย) — DVM, Chulalongkorn University
- GitHub: @mingrath
- Kaggle: mingrathmekavichai

