GitHub - Cook4986/scribble: Archival document transcription tool

Archival document transcription for non-technical researchers.

Upload batches of archival images → receive structured transcription packages.

Overview

Scribble empowers non-technical researchers to transcribe and natively translate image-based document collections at scale — no Python, no Jupyter, no local setup. Researchers upload batches of archival photographs and receive structured transcription packages by email and in their dashboard.

Designed for closed beta deployments (≤ ~20 users). Built on free-tier infrastructure.

Features

Multi-format ingest — JPEG, PNG, HEIC/HEIF (iOS native), PDF, ZIP
AI provider routing — Gemini and OpenAI (direct or via institutional API portals)
Flexible output — free-text narrative, structured CSV tables, or both
Zero-shot translation — automatic translation of narrative text and data tables directly into a target language (default English)
Cost estimation — pre-flight estimate before any API calls are made; adaptive feedback loop from real spend data
Per-researcher budgets — hard cap enforced per-image; job pauses immediately at limit
Email delivery — completion email with a signed download link for the output package
Welcome emails — automated invite email with Google sign-in steps and a link to in-app help
Admin dashboard — invite users, manage budgets, monitor spend, generate impersonation links
Mobile-ready — HEIC uploads, responsive layout, bottom navigation on small screens
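
The per-researcher budget rule above (hard cap checked per image, job pauses at the limit) can be sketched as a loop that re-checks the cap before each image. This is a minimal illustration under stated assumptions, not the project's worker code: `process_batch`, `estimate_cost`, `transcribe`, and the `JobResult` fields are hypothetical names, and the exact comparison against the cap is a guess.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class JobResult:
    status: str                                  # "completed" or "paused"
    processed: List[str] = field(default_factory=list)
    spent: float = 0.0


def process_batch(
    images: List[str],
    budget_remaining: float,
    estimate_cost: Callable[[str], float],
    transcribe: Callable[[str], float],
) -> JobResult:
    """Process images one at a time and pause before the cap is exceeded.

    transcribe() is a stand-in that returns the actual cost of one image.
    """
    result = JobResult(status="completed")
    for name in images:
        # Per-image check: stop before making a call the budget cannot cover.
        if result.spent + estimate_cost(name) > budget_remaining:
            result.status = "paused"
            break
        result.spent += transcribe(name)
        result.processed.append(name)
    return result
```

Checking before each call, rather than once per job, is what lets a job stop partway through a batch with the already-processed images intact.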

Architecture

graph TD
    Browser[Browser / Next.js] -->|HTTPS + JWT| FastAPI[FastAPI Backend]
    FastAPI -->|JWT verification| SupabaseAuth[Supabase Auth]
    FastAPI -->|CRUD & RLS| SupabaseDB[Supabase Postgres DB]
    FastAPI -->|Upload / Download| SupabaseStorage[Supabase Storage]
    FastAPI -->|BackgroundTasks| Worker[Worker Thread]
    Worker -->|Paired OCR / Transcription| HUITGateway[HUIT Gemini / OpenAI Proxy]
    Worker -->|SMTP Notification| Gmail[Gmail SMTP Server]

Why BackgroundTasks instead of Celery/Redis? Zero broker overhead, and sufficient for ≤ 20 researchers with batches up to ~200 images. If a worker thread restarts, the recovery loop automatically resets images left in an active state back to pending.
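
The recovery behaviour described above can be pictured as a small reset pass over image records. This is only a sketch: the state name `processing`, the record shape, and the function name are assumptions for illustration, and the real backend would presumably run the equivalent update against Supabase rather than an in-memory list.

```python
from typing import Dict, List

# Assumed name for an in-flight image state; the source only says "active".
ACTIVE_STATES = {"processing"}


def reset_stale_images(images: List[Dict]) -> int:
    """Reset images left in an active state back to 'pending'.

    Returns the number of records that were reset.
    """
    reset = 0
    for image in images:
        if image["status"] in ACTIVE_STATES:
            image["status"] = "pending"
            reset += 1
    return reset
```

Because finished and pending images are left untouched, running the pass more than once has no further effect.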

Supported Models

The active list is discovered dynamically and admin-curated, so it evolves over time. A model appears in the UI only if its provider's API key is configured. Representative examples:

Model ID	Provider	UI tag
`huit-gpt-4.1-mini`	Harvard HUIT OpenAI	Fast
`huit-gpt-5.1`	Harvard HUIT OpenAI	Standard
`huit-gpt-5.5`	Harvard HUIT OpenAI	Premier
`huit-gemini-3.1-flash-lite-preview`	Harvard HUIT Gemini	Fast
`huit-gemini-3.5-flash`	Harvard HUIT Gemini	Fast
`huit-gemini-3.1-pro-preview`	Harvard HUIT Gemini	Premier
`gemini-` / `gpt-`	Direct Google / OpenAI	Depends on model family

Tags are assigned automatically by the model sync worker from naming conventions: Flash/Mini/Nano/Lite models → Fast, non-frontier workhorse models such as GPT 5.1 → Standard, and current frontier/SOTA models such as GPT 5.5 or Gemini 3.1 Pro → Premier. Admins still choose which discovered models are curated for users.
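
As a rough illustration of tagging by naming convention, the sketch below reproduces the tags in the table above. It is not the sync worker's actual logic: `assign_tag` and both marker lists are assumptions, and the Premier list in particular is a hardcoded stand-in for however the project identifies frontier models.

```python
import re

FAST_MARKERS = ("flash", "mini", "nano", "lite")
# Assumed stand-in for "current frontier" models; the real rules may differ.
PREMIER_MARKERS = ("gpt-5.5", "gemini-3.1-pro")


def assign_tag(model_id: str) -> str:
    """Map a model ID to a UI tag using naming conventions only."""
    name = model_id.lower()
    # Match whole tokens, because "gemini" contains "mini" as a substring.
    tokens = re.split(r"[-_.\s]+", name)
    if any(marker in tokens for marker in FAST_MARKERS):
        return "Fast"
    if any(marker in name for marker in PREMIER_MARKERS):
        return "Premier"
    return "Standard"
```

Token matching matters here: a plain substring test for `mini` would tag every Gemini model as Fast.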

Security

Layer	Mechanism
Authentication	Invite-only via Supabase Auth — Google SSO; admins can mint single-use magic links
JWT verification	JWKS via `PyJWKClient` (ES256/RS256/HS256), with an HS256 shared-secret fallback
Data isolation	Row-Level Security on all tables (`auth.uid()` scoping)
Rate limiting	In-process sliding window: 3 submissions / 60s per researcher
Budget enforcement	Pre-flight check + per-image worker check; job pauses at cap
Image validation	PIL magic-byte `verify()` + 50 MP decompression-bomb limit, downscaled to 1600px
API key safety	Shared keys live in server env vars; per-researcher keys are admin-managed and never returned by profile APIs
CORS	Locked to exact frontend domain via `CORS_ORIGINS`
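
The rate-limiting row above describes an in-process sliding window of 3 submissions per 60 seconds. A minimal sketch of that technique follows; the class and method names are hypothetical, and the `now` parameter exists only so the behaviour can be checked without waiting.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_events per user within a rolling time window."""

    def __init__(self, max_events: int = 3, window_seconds: float = 60.0):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # user_id -> recent timestamps

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        events = self._events[user_id]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_events:
            return False
        events.append(now)
        return True
```

Because the state lives in process memory, counts reset on restart and are not shared across multiple backend instances, which is an acceptable trade-off at the stated scale.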

Deployment

Requirements

Service	Purpose	Cost
Supabase	Database, auth, file storage	Free tier
Vercel	Frontend hosting	Free Hobby or Pro
Render	Backend API hosting	Free or Starter ($7/mo)
AI API key	Gemini / OpenAI (direct or via portal)	Pay-as-you-go
Gmail SMTP	Completion email notifications	Included in Google Workspace

Note on Render tier: The free tier sleeps after 15 min of inactivity (cold start ~30s) and blocks outbound SMTP. The Starter plan ($7/mo) is always-on and unblocks port 465. Use UptimeRobot (free) to keep the free tier warm.

Production Readiness

Schema migrations applied from api/supabase/migrations/ in order
Storage buckets created: job-images, deliverables (private)
Vercel environment variables set
Render backend environment variables set
CORS_ORIGINS locked to exact Vercel URL
Successful end-to-end (E2E) test job completed

Step 1 — Fork and clone

git clone https://github.com/YOUR_USERNAME/scribble.git
cd scribble
cp api/.env.example api/.env   # fill in your values

Step 2 — Supabase

Create a project at supabase.com
SQL Editor → run all SQL files in api/supabase/migrations/ in order
Storage → create two private buckets: job-images and deliverables
Auth → Providers → enable Google (primary sign-in); enable Email if you want admin-issued magic links
Auth → Settings → disable "Allow new users to sign up" to keep access invite-only
Note your Project URL, anon key, service_role key, and JWT Secret

Step 3 — Vercel (frontend)

New project → import your fork → Root Directory: web
Add environment variables:

NEXT_PUBLIC_SUPABASE_URL       = your Supabase project URL
NEXT_PUBLIC_SUPABASE_ANON_KEY  = your Supabase anon key
NEXT_PUBLIC_API_URL            = your Render URL (fill after Step 4)

Deploy — note your Vercel URL

The support address shown in the UI is set in web/src/lib/constants.ts (SUPPORT_EMAIL); admin authorization is gated server-side by ADMIN_EMAIL on Render (Step 4), not in the frontend.

Step 4 — Render (backend)

render.com → New → Blueprint → connect your fork
Render detects render.yaml — fill in secret env vars in the form
Set ADMIN_EMAIL = the email that should have admin access (must match a researcher's login email)
Set CORS_ORIGINS = your Vercel URL (no trailing slash)
Deploy — note your Render URL
Back in Vercel, set NEXT_PUBLIC_API_URL = your Render URL → redeploy
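
The "no trailing slash" rule in Step 4 exists because browsers send the `Origin` header as scheme, host, and optional port with no trailing slash, so an exact-match allowlist entry ending in `/` never matches. The helper below is a hypothetical guard, not part of the project; it assumes `CORS_ORIGINS` may hold a comma-separated list, and the example URLs are placeholders.

```python
from typing import List


def parse_cors_origins(raw: str) -> List[str]:
    """Split a comma-separated origin list and strip trailing slashes."""
    origins = []
    for item in raw.split(","):
        origin = item.strip().rstrip("/")
        if origin:
            origins.append(origin)
    return origins
```

For example, `parse_cors_origins("https://example.vercel.app/")` yields `["https://example.vercel.app"]`, which is the form a browser would send.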

Step 5 — AI keys

Add whichever keys you have to Render's environment:

# Institutional portal (e.g. Harvard HUIT)
HUIT_GEMINI_API_KEY    HUIT_GEMINI_BASE_URL
HUIT_OPENAI_API_KEY    HUIT_OPENAI_BASE_URL

# Direct keys
GOOGLE_API_KEY         # ai.google.dev
OPENAI_API_KEY         # platform.openai.com

Models without a configured key are automatically hidden from the UI — no code changes needed.
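
One way to picture key-based model visibility is a filter over the discovered models. The environment variable names come from the list above; the provider slugs, the record shape, and `visible_models` itself are assumptions made for this sketch.

```python
import os
from typing import Dict, List, Mapping, Optional

# Provider slug (hypothetical) -> environment variable from Step 5.
PROVIDER_KEYS = {
    "huit-gemini": "HUIT_GEMINI_API_KEY",
    "huit-openai": "HUIT_OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def visible_models(
    models: List[Dict], env: Optional[Mapping[str, str]] = None
) -> List[Dict]:
    """Keep only models whose provider has a non-empty API key configured."""
    env = os.environ if env is None else env
    visible = []
    for model in models:
        key_name = PROVIDER_KEYS.get(model["provider"])
        if key_name and env.get(key_name):
            visible.append(model)
    return visible
```

Passing `env` explicitly makes the filter easy to exercise without touching the real environment.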

Step 6 — Email (optional)

Configure Gmail SMTP in Render:

SMTP_USERNAME    your Google Workspace email
SMTP_PASSWORD    16-character App Password (Google Account → Security → App Passwords)
SMTP_FROM_EMAIL  Scribble <your@email.com>

If SMTP is not configured, jobs complete normally and results appear in the dashboard — email is simply skipped.
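
The "skip email when SMTP is not configured" behaviour can be sketched with the standard library. This is an illustration only: the function name, subject line, and message body are invented, while the variable names and port 465 come from this README and `smtp.gmail.com` is Gmail's standard SMTP host.

```python
import os
import smtplib
from email.message import EmailMessage
from typing import Mapping, Optional


def send_completion_email(
    to_addr: str, download_url: str, env: Optional[Mapping[str, str]] = None
) -> bool:
    """Send a completion email; return False if SMTP is not configured."""
    env = os.environ if env is None else env
    username = env.get("SMTP_USERNAME")
    password = env.get("SMTP_PASSWORD")
    if not (username and password):
        return False  # No credentials: skip email, the job still completes.

    msg = EmailMessage()
    msg["Subject"] = "Your transcription package is ready"  # placeholder text
    msg["From"] = env.get("SMTP_FROM_EMAIL", username)
    msg["To"] = to_addr
    msg.set_content(f"Download your output package: {download_url}")

    # Implicit TLS on port 465, matching the port named in the Render note.
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login(username, password)
        smtp.send_message(msg)
    return True
```

Returning a boolean instead of raising keeps a missing mail configuration from failing an otherwise successful job.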

Adding a Researcher

Navigate to /admin in the app.
Use Invite Researcher — enter email and set a starting budget.
The backend creates the user in Supabase Auth (auto-confirmed), provisions their profile, and automatically sends an invite email.
The user clicks Log In with Google in the invite email to sign in via Google SSO (or requests a magic link).

Important: To enforce invite-only with Google SSO, disable "Allow new users to sign up" in Supabase Auth settings.

Local Development

# Backend
cd api
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000

# Frontend (separate terminal)
cd web
npm install
npm run dev

Requires a running Supabase instance (supabase start) and a filled-in api/.env.

Output Package

Each completed job produces a ZIP download containing:

File	Contents
`transcriptions/`	Individual `.txt` (or `.csv` in Table mode) file per image
`{label}_collated.txt`	All transcriptions concatenated in filename order
`{label}_inventory.csv`	Per-image metadata: confidence, flags, cost, date
`{label}_metrics.jsonl`	Raw API latency and token usage
`{label}_table_output.csv`	Structured CSV (table/both output mode only)
`{label}_settings.json`	Job configuration snapshot (model, mode, prompt, totals)
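
To make the package layout concrete, here is a partial sketch that builds only the `transcriptions/` folder and the collated file. It assumes Text mode and a plain lexicographic filename sort; `build_package` and its input shape are hypothetical, and whether the project uses natural sorting (so that `page2` precedes `page10`) is not stated in this README.

```python
import io
import zipfile
from typing import Dict


def build_package(label: str, transcriptions: Dict[str, str]) -> bytes:
    """Build a ZIP with one .txt per image plus a collated file.

    transcriptions maps an image's base filename to its transcribed text.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        # Sort by filename so the order is deterministic across runs.
        ordered = sorted(transcriptions)
        for name in ordered:
            zf.writestr(f"transcriptions/{name}.txt", transcriptions[name])
        collated = "\n\n".join(transcriptions[name] for name in ordered)
        zf.writestr(f"{label}_collated.txt", collated)
    return buffer.getvalue()
```

Building the archive in memory avoids temporary files, though a large batch might be better served by writing to disk or streaming.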

Running Costs

Service	Cost
Vercel (Pro)	~$20/month (existing account)
Render (Starter)	$7/month — always-on, SMTP unblocked
Supabase (free tier)	$0
Gmail SMTP	$0
Infrastructure total	~$7–27/month
Gemini Flash family (per image)	~$0.02–0.05 (pay-as-you-go)
OpenAI / frontier models (per image)	higher; varies by model and page density
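
As a worked example using the Gemini Flash figures above, a 200-image batch at ~$0.02–0.05 per image comes to roughly $4–10. The sketch below turns that arithmetic into a pre-flight check; the function names and the choice to compare the high estimate against the budget are assumptions, and the project's adaptive estimator is not modelled here.

```python
from typing import Tuple


def estimate_job_cost(
    num_images: int, low: float = 0.02, high: float = 0.05
) -> Tuple[float, float]:
    """Return a (low, high) USD estimate for a batch, rounded to cents."""
    return round(num_images * low, 2), round(num_images * high, 2)


def passes_preflight(
    num_images: int, budget_remaining: float, high: float = 0.05
) -> bool:
    """Conservative check: the high estimate must fit the remaining budget."""
    _, worst_case = estimate_job_cost(num_images, high=high)
    return worst_case <= budget_remaining
```

Comparing against the high end of the range errs on the side of rejecting a job up front rather than pausing it partway through.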

Design Decisions

Decision	Rationale
BackgroundTasks over Redis/Celery	Zero broker overhead for ≤ 20 users
Supabase over raw Postgres	Auth + Storage + RLS in one managed service
Gmail SMTP over Brevo/SendGrid	Harvard Google Workspace SMTP avoids spam filters
Page-by-page PDF streaming	Yields pages sequentially to keep RAM footprint below 512MB on Render
Filename sort over filesystem order	Inode/creation-time ordering is non-deterministic
Dynamic model fetch over hardcoded list	Models appear/disappear as keys are added in Render

Scribble — Harvard Library Digital Scholarship Program Matthew Cook · matt_cook@harvard.edu · library.harvard.edu/how-to/digital-scholarship-program
