Archival document transcription for non-technical researchers.
Upload batches of archival images → receive structured transcription packages.
Scribble lets non-technical researchers transcribe and translate image-based document collections at scale, with no Python, Jupyter, or local setup. Researchers upload batches of archival photographs and receive structured transcription packages by email and in their dashboard.
Designed for closed beta deployments (≤ ~20 users). Built on free-tier infrastructure.
- Multi-format ingest — JPEG, PNG, HEIC/HEIF (iOS native), PDF, ZIP
- AI provider routing — Gemini and OpenAI (direct or via institutional API portals)
- Flexible output — free-text narrative, structured CSV tables, or both
- Zero-shot translation — automatic translation of narrative text and data tables directly into a target language (default English)
- Cost estimation — pre-flight estimate before any API calls are made; adaptive feedback loop from real spend data
- Per-researcher budgets — hard cap enforced per-image; job pauses immediately at limit
- Email delivery — completion email with a signed download link for the output package
- Welcome emails — automated invite email with Google sign-in steps and a link to in-app help
- Admin dashboard — invite users, manage budgets, monitor spend, generate impersonation links
- Mobile-ready — HEIC uploads, responsive layout, bottom navigation on small screens
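The per-researcher budget rule in the list above (check before every image, pause at the cap) can be sketched as follows. This is a minimal illustration under stated assumptions, not Scribble's actual worker code: `process_batch`, `JobResult`, the `cost_of` callback, and the status strings are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class JobResult:
    status: str = "complete"  # hypothetical status values: "complete" or "paused"
    processed: List[str] = field(default_factory=list)
    spent: float = 0.0


def process_batch(
    images: List[str],
    budget_remaining: float,
    cost_of: Callable[[str], float],
) -> JobResult:
    """Process images in order, pausing before the first one that would exceed the budget."""
    result = JobResult()
    for name in images:
        cost = cost_of(name)
        if result.spent + cost > budget_remaining:
            # Hard cap: stop here and leave the remaining images untouched.
            result.status = "paused"
            break
        result.processed.append(name)
        result.spent += cost
    return result
```

Checking before each image, rather than once per job, means a batch can never overshoot the cap by more than zero images, at the price of one budget lookup per image.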
graph TD
Browser[Browser / Next.js] -->|HTTPS + JWT| FastAPI[FastAPI Backend]
FastAPI -->|JWT verification| SupabaseAuth[Supabase Auth]
FastAPI -->|CRUD & RLS| SupabaseDB[Supabase Postgres DB]
FastAPI -->|Upload / Download| SupabaseStorage[Supabase Storage]
FastAPI -->|BackgroundTasks| Worker[Worker Thread]
Worker -->|Paired OCR / Transcription| HUITGateway[HUIT Gemini / OpenAI Proxy]
Worker -->|SMTP Notification| Gmail[Gmail SMTP Server]
Why BackgroundTasks instead of Celery/Redis? Zero broker overhead, and sufficient for ≤ 20 researchers with batches up to ~200 images. If a worker is interrupted, the recovery loop resets any images left in an active state to pending on restart.
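The recovery behaviour can be illustrated with a small sketch. It assumes image rows are plain dictionaries with a `status` field and that the in-flight state is called `processing`; both are hypothetical, and the real loop would read and write these rows in the database rather than in memory.

```python
from typing import Dict, List

# Assumed name for the in-flight state; the real schema may use different values.
ACTIVE_STATES = {"processing"}


def recover_interrupted_images(images: List[Dict[str, str]]) -> int:
    """Reset images stuck in an active state back to 'pending'.

    Returns the number of images that were reset.
    """
    reset = 0
    for image in images:
        if image["status"] in ACTIVE_STATES:
            image["status"] = "pending"
            reset += 1
    return reset
```

Because completed images keep their state, a restarted job only repeats the work that was actually in flight.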
The active list is discovered dynamically and admin-curated, so it evolves over time. A model appears in the UI only if its provider's API key is configured. Representative examples:
| Model ID | Provider | UI tag |
|---|---|---|
| `huit-gpt-4.1-mini` | Harvard HUIT OpenAI | Fast |
| `huit-gpt-5.1` | Harvard HUIT OpenAI | Standard |
| `huit-gpt-5.5` | Harvard HUIT OpenAI | Premier |
| `huit-gemini-3.1-flash-lite-preview` | Harvard HUIT Gemini | Fast |
| `huit-gemini-3.5-flash` | Harvard HUIT Gemini | Fast |
| `huit-gemini-3.1-pro-preview` | Harvard HUIT Gemini | Premier |
| `gemini-*` / `gpt-*` | Direct Google / OpenAI | Depends on model family |
Tags are assigned automatically by the model sync worker from naming conventions: Flash/Mini/Nano/Lite models → Fast, non-frontier workhorse models such as GPT 5.1 → Standard, and current frontier/SOTA models such as GPT 5.5 or Gemini 3.1 Pro → Premier. Admins still choose which discovered models are curated for users.
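A naming-convention tagger along these lines might look like the sketch below. It is an illustration only: `tag_for_model`, `FAST_MARKERS`, and `PREMIER_PATTERNS` are hypothetical names, and the list of frontier families is an assumption that would need to be kept up to date by hand.

```python
import re

# Name fragments that mark a lightweight model.
FAST_MARKERS = ("flash", "mini", "nano", "lite")
# Assumed: a maintained list of patterns for the current frontier families.
PREMIER_PATTERNS = (r"gpt-5\.5", r"gemini-3\.1-pro")


def tag_for_model(model_id: str) -> str:
    """Map a model ID to a UI tag using naming conventions only."""
    name = model_id.lower()
    # Match whole tokens, not substrings: "gemini" contains "mini".
    tokens = re.split(r"[-_.]", name)
    if any(marker in tokens for marker in FAST_MARKERS):
        return "Fast"
    if any(re.search(pattern, name) for pattern in PREMIER_PATTERNS):
        return "Premier"
    return "Standard"
```

Splitting the ID into tokens matters here: a plain substring test for `mini` would tag every Gemini model as Fast.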
| Layer | Mechanism |
|---|---|
| Authentication | Invite-only via Supabase Auth — Google SSO; admins can mint single-use magic links |
| JWT verification | JWKS via PyJWKClient (ES256/RS256/HS256), with an HS256 shared-secret fallback |
| Data isolation | Row-Level Security on all tables (auth.uid() scoping) |
| Rate limiting | In-process sliding window: 3 submissions / 60s per researcher |
| Budget enforcement | Pre-flight check + per-image worker check; job pauses at cap |
| Image validation | PIL magic-byte verify() + 50 MP decompression-bomb limit, downscaled to 1600px |
| API key safety | Shared keys live in server env vars; per-researcher keys are admin-managed and never returned by profile APIs |
| CORS | Locked to exact frontend domain via CORS_ORIGINS |
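As an illustration of the rate-limiting row, an in-process sliding window of 3 submissions per 60 seconds can be sketched with a deque of timestamps per researcher. `SlidingWindowLimiter` and its parameters are hypothetical names, not the project's actual class; the injectable `clock` is there only to make the behaviour easy to test.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `max_events` submissions per researcher in any `window_seconds` span."""

    def __init__(
        self,
        max_events: int = 3,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_events = max_events
        self.window = window_seconds
        self.clock = clock
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, researcher_id: str) -> bool:
        now = self.clock()
        events = self._events[researcher_id]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window:
            events.popleft()
        if len(events) >= self.max_events:
            return False
        events.append(now)
        return True
```

State held this way lives in one process and is lost on restart, which is acceptable for a single backend instance but would not coordinate across several.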
| Service | Purpose | Cost |
|---|---|---|
| Supabase | Database, auth, file storage | Free tier |
| Vercel | Frontend hosting | Free Hobby or Pro |
| Render | Backend API hosting | Free or Starter ($7/mo) |
| AI API key | Gemini / OpenAI (direct or via portal) | Pay-as-you-go |
| Gmail SMTP | Completion email notifications | Included in Google Workspace |
Note on Render tier: The free tier sleeps after 15 min of inactivity (cold start ~30s) and blocks outbound SMTP. The Starter plan ($7/mo) is always-on and unblocks port 465. Use UptimeRobot (free) to keep the free tier warm.
- Schema migrations applied from `api/supabase/migrations/` in order
- Storage buckets created: `job-images`, `deliverables` (private)
- Vercel environment variables set
- Render backend environment variables set
- `CORS_ORIGINS` locked to exact Vercel URL
- Successful end-to-end (E2E) test job completed
git clone https://github.com/YOUR_USERNAME/scribble.git
cd scribble/app
cp api/.env.example api/.env  # fill in your values
- Create a project at supabase.com
- SQL Editor → run all SQL files in `api/supabase/migrations/` in order
- Storage → create two private buckets: `job-images` and `deliverables`
- Auth → Providers → enable Google (primary sign-in); enable Email if you want admin-issued magic links
- Auth → Settings → disable "Allow new users to sign up" to keep access invite-only
- Note your Project URL, anon key, service_role key, and JWT Secret
- Vercel → New project → import your fork → Root Directory: `web`
- Add environment variables:
NEXT_PUBLIC_SUPABASE_URL = your Supabase project URL
NEXT_PUBLIC_SUPABASE_ANON_KEY = your Supabase anon key
NEXT_PUBLIC_API_URL = your Render URL (fill after Step 4)
- Deploy — note your Vercel URL
The support address shown in the UI is set in `web/src/lib/constants.ts` (`SUPPORT_EMAIL`); admin authorization is gated server-side by `ADMIN_EMAIL` on Render (Step 4), not in the frontend.
- render.com → New → Blueprint → connect your fork
- Render detects `render.yaml`; fill in secret env vars in the form
- Set `ADMIN_EMAIL` = the email that should have admin access (must match a researcher's login email)
- Set `CORS_ORIGINS` = your Vercel URL (no trailing slash)
- Deploy, then note your Render URL
- Back in Vercel, set `NEXT_PUBLIC_API_URL` = your Render URL → redeploy
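The "no trailing slash" rule for `CORS_ORIGINS` exists because browsers send the `Origin` header as scheme, host, and port only, so a configured value ending in `/` never matches exactly. A defensive parser could normalise the value as sketched below. `parse_cors_origins` is a hypothetical helper, and treating the variable as a comma-separated list is an assumption about how the backend reads it.

```python
from typing import List


def parse_cors_origins(raw: str) -> List[str]:
    """Split a comma-separated origin list and strip whitespace and trailing slashes."""
    origins = []
    for part in raw.split(","):
        origin = part.strip().rstrip("/")
        if origin:  # ignore empty entries such as a trailing comma
            origins.append(origin)
    return origins
```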
Add whichever keys you have to Render's environment:
# Institutional portal (e.g. Harvard HUIT)
HUIT_GEMINI_API_KEY HUIT_GEMINI_BASE_URL
HUIT_OPENAI_API_KEY HUIT_OPENAI_BASE_URL
# Direct keys
GOOGLE_API_KEY # ai.google.dev
OPENAI_API_KEY # platform.openai.com
Models without a configured key are automatically hidden from the UI — no code changes needed.
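One way to picture this behaviour is a filter that keeps a model only when its provider's key is present in the environment. The environment variable names below come from the list above; the provider slugs, the `visible_models` function, and the shape of the model records are hypothetical.

```python
import os
from typing import Dict, List, Mapping

# Provider slug (hypothetical) -> environment variable holding its API key.
PROVIDER_KEY_VARS = {
    "huit-gemini": "HUIT_GEMINI_API_KEY",
    "huit-openai": "HUIT_OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def visible_models(
    models: List[Dict[str, str]],
    env: Mapping[str, str] = os.environ,
) -> List[Dict[str, str]]:
    """Return only the models whose provider has a non-empty API key configured."""
    visible = []
    for model in models:
        key_var = PROVIDER_KEY_VARS.get(model["provider"])
        if key_var and env.get(key_var, "").strip():
            visible.append(model)
    return visible
```

Passing `env` explicitly keeps the function easy to test; in production it would default to the process environment set in Render.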
Configure Gmail SMTP in Render:
SMTP_USERNAME your Google Workspace email
SMTP_PASSWORD 16-character App Password (Google Account → Security → App Passwords)
SMTP_FROM_EMAIL Scribble <your@email.com>
If SMTP is not configured, jobs complete normally and results appear in the dashboard — email is simply skipped.
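The "skip when unconfigured" behaviour can be sketched with the standard library alone. The function name, subject line, and message body are hypothetical; the environment variable names match the ones above, and the connection uses Gmail's implicit-TLS endpoint on port 465.

```python
import smtplib
from email.message import EmailMessage
from typing import Mapping


def send_completion_email(env: Mapping[str, str], to_addr: str, download_url: str) -> bool:
    """Send the completion email, or return False without sending if SMTP is not configured."""
    username = env.get("SMTP_USERNAME")
    password = env.get("SMTP_PASSWORD")
    if not (username and password):
        # No credentials: the job still completes and results stay in the dashboard.
        return False

    msg = EmailMessage()
    msg["Subject"] = "Your Scribble transcription is ready"  # hypothetical wording
    msg["From"] = env.get("SMTP_FROM_EMAIL", username)
    msg["To"] = to_addr
    msg.set_content(f"Download your output package: {download_url}")

    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(username, password)
        server.send_message(msg)
    return True
```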
- Navigate to `/admin` in the app.
- Use Invite Researcher: enter email and set a starting budget.
- The backend creates the user in Supabase Auth (auto-confirmed), provisions their profile, and automatically sends an invite email.
- The user clicks Log In with Google in the invite email to sign in via Google SSO (or requests a magic link).
Important: To enforce invite-only with Google SSO, disable "Allow new users to sign up" in Supabase Auth settings.
# Backend
cd api
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000
# Frontend (separate terminal)
cd web
npm install
npm run dev

Requires a running Supabase instance (`supabase start`) and a filled-in `api/.env`.
Each completed job produces a ZIP download containing:
| File | Contents |
|---|---|
| `transcriptions/` | Individual `.txt` (or `.csv` in Table mode) file per image |
| `{label}_collated.txt` | All transcriptions concatenated in filename order |
| `{label}_inventory.csv` | Per-image metadata: confidence, flags, cost, date |
| `{label}_metrics.jsonl` | Raw API latency and token usage |
| `{label}_table_output.csv` | Structured CSV (table/both output mode only) |
| `{label}_settings.json` | Job configuration snapshot (model, mode, prompt, totals) |
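To show how the per-image files and the collated file relate, here is a minimal sketch that builds both in filename order using only the standard library. `build_package`, the per-image naming (`<image name>.txt`), and the `=== name ===` separator are assumptions for illustration; the real package also contains the inventory, metrics, and settings files listed above.

```python
import io
import zipfile
from typing import Dict


def build_package(label: str, transcriptions: Dict[str, str]) -> bytes:
    """Return ZIP bytes with one text file per image plus a collated file."""
    # Sort by filename so the output order is deterministic.
    ordered = sorted(transcriptions)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in ordered:
            archive.writestr(f"transcriptions/{name}.txt", transcriptions[name])
        collated = "\n\n".join(
            f"=== {name} ===\n{transcriptions[name]}" for name in ordered
        )
        archive.writestr(f"{label}_collated.txt", collated)
    return buffer.getvalue()
```

Plain `sorted` is lexicographic, so `page10.jpg` sorts before `page2.jpg`; zero-padded filenames avoid that surprise.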
| Service | Cost |
|---|---|
| Vercel (Pro) | ~$20/month (existing account) |
| Render (Starter) | $7/month — always-on, SMTP unblocked |
| Supabase (free tier) | $0 |
| Gmail SMTP | $0 |
| Infrastructure total | ~$7–27/month |
| Gemini Flash family (per image) | ~$0.02–0.05 (pay-as-you-go) |
| OpenAI / frontier models (per image) | higher; varies by model and page density |
| Decision | Rationale |
|---|---|
| BackgroundTasks over Redis/Celery | Zero broker overhead for ≤ 20 users |
| Supabase over raw Postgres | Auth + Storage + RLS in one managed service |
| Gmail SMTP over Brevo/SendGrid | Harvard Google Workspace SMTP avoids spam filters |
| Page-by-page PDF streaming | Yields pages sequentially to keep RAM footprint below 512 MB on Render |
| Filename sort over filesystem order | Inode/creation-time ordering is non-deterministic |
| Dynamic model fetch over hardcoded list | Models appear/disappear as keys are added in Render |
Scribble — Harvard Library Digital Scholarship Program Matthew Cook · matt_cook@harvard.edu · library.harvard.edu/how-to/digital-scholarship-program