mingrath/election-ocr-hackathon


Election OCR Hackathon — Super AI Engineer Season 6

Kaggle Rank: #53 (private leaderboard, +16 jump) | Final Score: 0.1587 | 16 submissions

Extract structured voting data from 300 scanned Thai election result documents (Form สส.6/1, 846 page images) from the 2026 Thai general election. Built entirely with Claude Code over a 48-hour sprint.

Competition: [Super AI Engineer Season 6] การแข่งขัน OCR เอกสารผลเลือกตั้ง สส. 2569 (OCR competition on 2026 MP election result documents)


The Journey

Score Timeline

| Version | Approach | Kaggle Score | What Changed |
|---|---|---|---|
| V1 (Naive) | Single-pass OCR, row-number mapping | 1.57 | First attempt; format was completely wrong |
| V2 (Pagewise) | Page-by-page OCR with Gemini 2.5 Flash | 0.52 | Better structure, still wrong assembly |
| V3 (Nuclear) | Gemini 3 Flash, full 300-doc run | 0.73 | Only 74/300 docs complete, wrong constituency mapping |
| V4 (Simple) | Row-number alignment fix | 0.076 (local) | Local eval looked perfect; Kaggle told a different story |
| V5 (G3Flash) | Complete 300-doc Gemini 3 Flash | 0.081 (local) | Good locally, 0.929 on Kaggle (submitted the wrong file!) |
| V6 (Official) | Merged official กกต. election data | 0.229 | Huge jump; real data beats OCR for known candidates |
| V7 (Hybrids) | Template party-name alignment + multi-pass | 0.235 | Ensemble made it worse; too much noise |
| Final | Official data primary + G3Flash fallback + offset fix | 0.158 | Fixed the 22-doc ballot offset bug |

Score journey: 1.57 → 0.52 → 0.23 → 0.158


Architecture

Main Journey

846 PNG scans (300 election documents, 1-4 pages each)
    │
    ├── Discovery (src/discover.py)
    │   └── Classify: party_list vs constituency, extract metadata
    │
    ├── Multi-Model OCR
    │   ├── Gemini 3 Flash (nuclear_g3flash) — primary, best digit accuracy
    │   ├── Gemini 2.5 Pro (pagewise_pro) — backup, better structure
    │   ├── Gemini 2.5 Flash (pagewise_full) — fast fallback
    │   └── 3-pass overnight runs (900 API calls)
    │
    ├── Official Data Integration
    │   └── กกต. (Election Commission of Thailand) results from GitHub (killernay/election-69-OCR-result)
    │       ├── constituency.csv (3,458 candidates by party name)
    │       └── party_list.csv (22,604 rows by ballot number)
    │
    ├── Assembly (src/assemble.py)
    │   ├── Template-based party-name fuzzy matching (thefuzz, 60% threshold)
    │   ├── Ballot-number offset detection and correction
    │   └── Multi-source merge: official → G3Flash → Pro → Flash fallback
    │
    └── submission.csv (10,053 rows of extracted vote counts)
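
The ballot-number offset detection step in the assembly stage above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual `src/assemble.py` code: the function name, the `(ballot_number, party_name)` row shape, and the pluggable `match_fn` are all assumptions.

```python
def detect_offset(ocr_rows, official_names, match_fn, max_offset=2):
    """Try small shifts of the OCR'd ballot numbers and keep the shift
    that aligns the most OCR party names with the official list.

    ocr_rows:       list of (ballot_number, party_name) from OCR
    official_names: dict mapping ballot_number -> official party name
    match_fn:       name-similarity predicate (e.g. a fuzzy matcher)
    """
    best_offset, best_hits = 0, -1
    for offset in range(-max_offset, max_offset + 1):
        hits = sum(
            1
            for num, name in ocr_rows
            if match_fn(name, official_names.get(num + offset, ""))
        )
        if hits > best_hits:
            best_offset, best_hits = offset, hits
    return best_offset
```

Applied per document, a detected offset of +1 would flag the "position 1 was a placeholder" documents described below.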

Key Technical Decisions

1. Party-name alignment over ballot-number mapping

The naive approach maps ballot numbers directly. But 22+ party_list documents had shifted ballot numbers (position 1 was a placeholder). Template-based party-name matching solved this:

# Instead of: row_num → ballot_number → vote
# We do:     party_name (fuzzy match) → official data → vote

This single fix dropped the score from 0.307 to 0.228.
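
A minimal sketch of that party-name alignment, using stdlib `difflib` in place of `thefuzz` for portability (the project itself uses `thefuzz` with a 60% threshold); the function name and inputs are illustrative:

```python
from difflib import SequenceMatcher


def best_party_match(ocr_name, official_parties, threshold=0.60):
    """Return the official party name that best matches the OCR output,
    or None if nothing clears the similarity threshold."""
    def score(a, b):
        return SequenceMatcher(None, a, b).ratio()

    best = max(official_parties, key=lambda p: score(ocr_name, p))
    return best if score(ocr_name, best) >= threshold else None
```

Once a name is matched, the vote count is looked up from the official data by party rather than by the (possibly shifted) ballot number.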

2. Official data as primary source

OCR digit accuracy was ~68% — reading "7" as "1", "9" as "3". For candidates that appeared in the official กกต. dataset, we used the official numbers directly and only fell back to OCR for missing entries.
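
That precedence can be sketched as a priority merge over the four sources; the dict-keyed shape below is illustrative, not the actual `src/assemble.py` API:

```python
def merge_votes(keys, official, g3flash, pro, flash):
    """For each (document, candidate) key, take the first source that
    has it, in priority order: official > Gemini 3 Flash > 2.5 Pro >
    2.5 Flash. Each source is a dict of key -> vote count."""
    merged = {}
    for key in keys:
        for source in (official, g3flash, pro, flash):
            if key in source:
                merged[key] = source[key]
                break
    return merged
```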

3. Local eval is unreliable

We had ground truth for only 5 documents (2% of the dataset). Local eval showed 0.000 (perfect) while Kaggle showed 0.228. Zero correlation. Lesson: with a tiny eval set, only empirical Kaggle testing matters.
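
For what local checking is worth, the competition metric (mean Levenshtein distance on vote-count strings, per the Competition Context section) is easy to reproduce; this is a generic sketch, not the repo's EVAL_SYSTEM.py:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def mean_levenshtein(predicted, truth):
    """Mean edit distance over paired vote-count strings (lower is better)."""
    return sum(levenshtein(p, t) for p, t in zip(predicted, truth)) / len(truth)
```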

What Didn't Work

| Approach | Score | Why It Failed |
|---|---|---|
| Pure OCR with row-number mapping | 0.913 | Wrong constituency assembly logic |
| Multi-pass OCR voting (ensemble of 3 runs) | 0.234 | Adding noisy OCR to correct official data worsened it |
| 3-pass pure OCR without official data | 0.244 | OCR errors (~5%) worse than official data |
| Predicting Kaggle scores from local analysis | "0.03-0.07" predicted, 0.228 actual | 5 docs ≠ representative sample |

Stack

  • OCR: Google Gemini 3 Flash, Gemini 2.5 Pro, Gemini 2.5 Flash
  • Language: Python 3.11+
  • Matching: thefuzz (fuzzy string matching, 60% similarity threshold)
  • Data: pandas, pillow, aiohttp
  • Orchestration: Claude Code with /loop for 8+ hours of autonomous overnight monitoring
  • Evaluation: Custom TDD eval system (8 tests via pytest)

Project Structure

├── src/                    # Core pipeline (2,400 lines)
│   ├── discover.py         # Document classification and metadata extraction
│   ├── process.py          # Main OCR processing (single-page)
│   ├── process_pagewise.py # Page-by-page OCR with structured extraction
│   ├── ocr_nuclear.py      # Gemini 3 Flash OCR with aggressive prompting
│   ├── assemble.py         # Multi-source assembly and party-name matching
│   ├── postprocess.py      # Vote count cleanup and normalization
│   ├── validate.py         # Ground truth comparison and scoring
│   ├── config.py           # API keys and model configuration
│   └── pipeline.py         # End-to-end orchestration
├── submissions/            # All 36 submission CSVs (full iteration history)
├── BATTLE_PLAN*.md         # Strategy documents for each phase
├── EVAL_SYSTEM.py          # Local evaluation and cross-validation
├── BUILD_SUBMISSION.py     # Final submission builder
├── BUILD_PERFECT.py        # Optimal assembly with offset detection
├── election_ocr_pipeline.ipynb  # Kaggle notebook submission
└── diagram-*.png           # Journey and score visualizations

How to Run

# 1. Install dependencies
uv sync

# 2. Set up API keys
cp .env.example .env
# Add your GOOGLE_API_KEY

# 3. Run the full pipeline (requires competition data in data/)
uv run python -m src.pipeline

# 4. Or run individual steps
uv run python -m src.discover      # Classify documents
uv run python -m src.process       # OCR processing
uv run python -m src.assemble      # Assembly and matching
uv run python -m src.validate      # Evaluate against ground truth

Competition Context

  • Organizer: Super AI Engineer (AIAT Thailand), Season 6
  • Task: OCR + structured extraction from scanned Thai election tally sheets
  • Metric: Mean Levenshtein distance on vote count strings (lower is better)
  • Dataset: 300 documents (846 page images), ~10,053 vote entries
  • Duration: ~48 hours (March 20-22, 2026)
  • Total participants: 100+ teams

Author

Ohm Mingrath (มิ่งรัฐ เมฆวิชัย) — DVM, Chulalongkorn University

