Cook4986/scribble

Scribble

Archival document transcription for non-technical researchers.

Upload batches of archival images → receive structured transcription packages.



Overview

Scribble empowers non-technical researchers to transcribe and natively translate image-based document collections at scale — no Python, no Jupyter, no local setup. Researchers upload batches of archival photographs and receive structured transcription packages by email and in their dashboard.

Designed for closed beta deployments (≤ ~20 users). Built on free-tier infrastructure.


Features

  • Multi-format ingest — JPEG, PNG, HEIC/HEIF (iOS native), PDF, ZIP
  • AI provider routing — Gemini and OpenAI (direct or via institutional API portals)
  • Flexible output — free-text narrative, structured CSV tables, or both
  • Zero-shot translation — automatic translation of narrative text and data tables directly into a target language (default English)
  • Cost estimation — pre-flight estimate before any API calls are made; adaptive feedback loop from real spend data
  • Per-researcher budgets — hard cap enforced per-image; job pauses immediately at limit
  • Email delivery — completion email with a signed download link for the output package
  • Welcome emails — automated invite email with Google sign-in steps and a link to in-app help
  • Admin dashboard — invite users, manage budgets, monitor spend, generate impersonation links
  • Mobile-ready — HEIC uploads, responsive layout, bottom navigation on small screens
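The per-researcher budget behaviour listed above (a hard cap checked for every image, with the job pausing at the limit) can be sketched roughly as follows. This is a minimal illustration, not Scribble's actual worker code: the `Budget` class, `run_job`, and the "paused"/"completed" status strings are hypothetical names, and the sketch assumes the cost of each image is known before it is charged.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Budget:
    """Hypothetical per-researcher budget record (names are illustrative)."""
    cap: Decimal
    spent: Decimal = Decimal("0")

    def can_afford(self, cost: Decimal) -> bool:
        return self.spent + cost <= self.cap


def run_job(image_costs: list[Decimal], budget: Budget) -> tuple[str, int]:
    """Process images in order, checking the cap before every image.

    Returns (status, images_processed); status is "completed" or "paused".
    """
    processed = 0
    for cost in image_costs:
        if not budget.can_afford(cost):
            return "paused", processed  # stop as soon as the cap would be exceeded
        budget.spent += cost  # stand-in for the real API call and its recorded cost
        processed += 1
    return "completed", processed
```

With a cap of 0.10 and three images at 0.04 each, the first two are processed and the job pauses before the third.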

Architecture

graph TD
    Browser[Browser / Next.js] -->|HTTPS + JWT| FastAPI[FastAPI Backend]
    FastAPI -->|JWT verification| SupabaseAuth[Supabase Auth]
    FastAPI -->|CRUD & RLS| SupabaseDB[Supabase Postgres DB]
    FastAPI -->|Upload / Download| SupabaseStorage[Supabase Storage]
    FastAPI -->|BackgroundTasks| Worker[Worker Thread]
    Worker -->|Paired OCR / Transcription| HUITGateway[HUIT Gemini / OpenAI Proxy]
    Worker -->|SMTP Notification| Gmail[Gmail SMTP Server]

Why BackgroundTasks instead of Celery/Redis? Zero broker overhead, and in-process background tasks are sufficient for ≤ 20 researchers with batches up to ~200 images. If the backend restarts mid-job, the recovery loop automatically resets any images left in an active state to pending.
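The recovery step described above can be pictured with a small sketch. It is an assumption-heavy illustration: the real state names and storage layer are not shown in this README, so `ACTIVE_STATES`, the `"processing"` label, and the in-memory list of dicts are all hypothetical stand-ins for rows in the database.

```python
ACTIVE_STATES = {"processing"}  # assumed name for the in-flight state


def recover_interrupted(images: list[dict]) -> int:
    """Reset images stuck in an active state back to "pending".

    Returns the number of images that were reset.
    """
    reset = 0
    for image in images:
        if image["status"] in ACTIVE_STATES:
            image["status"] = "pending"
            reset += 1
    return reset
```

Running something like this once at startup would let a restarted worker pick up interrupted images again without touching finished ones.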


Supported Models

The active list is discovered dynamically and admin-curated, so it evolves over time. A model appears in the UI only if its provider's API key is configured. Representative examples:

| Model ID | Provider | UI tag |
| --- | --- | --- |
| huit-gpt-4.1-mini | Harvard HUIT OpenAI | Fast |
| huit-gpt-5.1 | Harvard HUIT OpenAI | Standard |
| huit-gpt-5.5 | Harvard HUIT OpenAI | Premier |
| huit-gemini-3.1-flash-lite-preview | Harvard HUIT Gemini | Fast |
| huit-gemini-3.5-flash | Harvard HUIT Gemini | Fast |
| huit-gemini-3.1-pro-preview | Harvard HUIT Gemini | Premier |
| gemini-* / gpt-* | Direct Google / OpenAI | Depends on model family |

Tags are assigned automatically by the model sync worker from naming conventions: Flash/Mini/Nano/Lite models → Fast, non-frontier workhorse models such as GPT 5.1 → Standard, and current frontier/SOTA models such as GPT 5.5 or Gemini 3.1 Pro → Premier. Admins still choose which discovered models are curated for users.
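A naming-convention tagger along these lines might look like the sketch below. The Fast keywords come from the paragraph above; how the sync worker actually recognizes frontier models is not documented here, so `FRONTIER_PATTERNS` is a hypothetical placeholder chosen only to reproduce the representative examples in the table.

```python
import re

# Keywords named in the README for the Fast tier.
FAST_TOKENS = {"flash", "mini", "nano", "lite"}

# Hypothetical frontier patterns: the README does not say how frontier
# models are detected, so these only mirror its examples.
FRONTIER_PATTERNS = (re.compile(r"gpt-5\.5"), re.compile(r"(^|-)pro(-|$)"))


def tag_for_model(model_id: str) -> str:
    """Return "Fast", "Standard", or "Premier" for a model ID."""
    name = model_id.lower()
    tokens = set(re.split(r"[-_.]", name))
    if tokens & FAST_TOKENS:
        return "Fast"
    if any(pattern.search(name) for pattern in FRONTIER_PATTERNS):
        return "Premier"
    return "Standard"
```

Checking the Fast keywords first means a model such as a hypothetical "pro-mini" variant would be tagged Fast; that ordering is a design choice in this sketch, not a documented rule.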


Security

| Layer | Mechanism |
| --- | --- |
| Authentication | Invite-only via Supabase Auth with Google SSO; admins can mint single-use magic links |
| JWT verification | JWKS via PyJWKClient (ES256/RS256/HS256), with an HS256 shared-secret fallback |
| Data isolation | Row-Level Security on all tables (auth.uid() scoping) |
| Rate limiting | In-process sliding window: 3 submissions / 60s per researcher |
| Budget enforcement | Pre-flight check + per-image worker check; job pauses at cap |
| Image validation | PIL magic-byte verify() + 50 MP decompression-bomb limit, downscaled to 1600px |
| API key safety | Shared keys live in server env vars; per-researcher keys are admin-managed and never returned by profile APIs |
| CORS | Locked to exact frontend domain via CORS_ORIGINS |
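As an illustration of the rate-limiting row, here is a minimal in-process sliding-window limiter using the documented limits (3 submissions per 60 seconds per researcher). The class and method names are hypothetical, and the optional `now` argument exists only to make the sketch easy to test. Because the state lives in process memory, this approach assumes a single backend instance.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_events per researcher within window_seconds."""

    def __init__(self, max_events: int = 3, window_seconds: float = 60.0):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # researcher_id -> recent timestamps

    def allow(self, researcher_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        events = self._events[researcher_id]
        # Drop timestamps that have fallen out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_events:
            return False
        events.append(now)
        return True
```

A rejected call is not recorded, so a researcher who keeps retrying is not locked out beyond the original window.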

Deployment

Requirements

| Service | Purpose | Cost |
| --- | --- | --- |
| Supabase | Database, auth, file storage | Free tier |
| Vercel | Frontend hosting | Free Hobby or Pro |
| Render | Backend API hosting | Free or Starter ($7/mo) |
| AI API key | Gemini / OpenAI (direct or via portal) | Pay-as-you-go |
| Gmail SMTP | Completion email notifications | Included in Google Workspace |

Note on Render tier: The free tier sleeps after 15 min of inactivity (cold start ~30s) and blocks outbound SMTP. The Starter plan ($7/mo) is always-on and unblocks port 465. Use UptimeRobot (free) to keep the free tier warm.

Production Readiness

  • Schema migrations applied from api/supabase/migrations/ in order
  • Storage buckets created: job-images, deliverables (private)
  • Vercel environment variables set
  • Render backend environment variables set
  • CORS_ORIGINS locked to exact Vercel URL
  • End-to-end (E2E) test job completed successfully

Step 1 — Fork and clone

git clone https://github.com/YOUR_USERNAME/scribble.git
cd scribble/app
cp api/.env.example api/.env   # fill in your values

Step 2 — Supabase

  1. Create a project at supabase.com
  2. SQL Editor → run all SQL files in api/supabase/migrations/ in order
  3. Storage → create two private buckets: job-images and deliverables
  4. Auth → Providers → enable Google (primary sign-in); enable Email if you want admin-issued magic links
  5. Auth → Settings → disable "Allow new users to sign up" to keep access invite-only
  6. Note your Project URL, anon key, service_role key, and JWT Secret

Step 3 — Vercel (frontend)

  1. New project → import your fork → Root Directory: web
  2. Add environment variables:
NEXT_PUBLIC_SUPABASE_URL       = your Supabase project URL
NEXT_PUBLIC_SUPABASE_ANON_KEY  = your Supabase anon key
NEXT_PUBLIC_API_URL            = your Render URL (fill after Step 4)
  3. Deploy, then note your Vercel URL

The support address shown in the UI is set in web/src/lib/constants.ts (SUPPORT_EMAIL); admin authorization is gated server-side by ADMIN_EMAIL on Render (Step 4), not in the frontend.

Step 4 — Render (backend)

  1. render.com → New → Blueprint → connect your fork
  2. Render detects render.yaml — fill in secret env vars in the form
  3. Set ADMIN_EMAIL = the email that should have admin access (must match a researcher's login email)
  4. Set CORS_ORIGINS = your Vercel URL (no trailing slash)
  5. Deploy — note your Render URL
  6. Back in Vercel, set NEXT_PUBLIC_API_URL = your Render URL → redeploy
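Since CORS_ORIGINS must be an exact origin with no trailing slash, a startup check can catch a mistyped value early. The sketch below is an assumption about how such a value could be validated, not the backend's actual parsing: the function name is hypothetical, and accepting a comma-separated list is an assumption for illustration.

```python
from urllib.parse import urlsplit


def parse_cors_origins(raw: str) -> list[str]:
    """Split a comma-separated CORS_ORIGINS value and reject inexact origins."""
    origins = []
    for item in raw.split(","):
        origin = item.strip()
        if not origin:
            continue
        parts = urlsplit(origin)
        # Wildcards and scheme-less values are not exact origins.
        if origin == "*" or parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an exact origin: {origin!r}")
        # A trailing slash shows up as a non-empty path.
        if parts.path or parts.query or parts.fragment:
            raise ValueError(f"origin must not have a path or trailing slash: {origin!r}")
        origins.append(origin)
    return origins
```

Browsers send the Origin header without a trailing slash, which is why a value ending in "/" would never match.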

Step 5 — AI keys

Add whichever keys you have to Render's environment:

# Institutional portal (e.g. Harvard HUIT)
HUIT_GEMINI_API_KEY    HUIT_GEMINI_BASE_URL
HUIT_OPENAI_API_KEY    HUIT_OPENAI_BASE_URL

# Direct keys
GOOGLE_API_KEY         # ai.google.dev
OPENAI_API_KEY         # platform.openai.com

Models without a configured key are automatically hidden from the UI — no code changes needed.
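The key-based hiding can be pictured as a simple filter over the discovered model list. This is a sketch under stated assumptions: the environment variable names are the ones listed above, but the provider labels, the `provider` field on each model, and the rule that a HUIT provider needs both its key and base URL are hypothetical.

```python
import os

# Env var names come from the README; requiring both key and base URL for the
# HUIT providers is an assumption made for this sketch.
PROVIDER_REQUIREMENTS = {
    "huit-gemini": ("HUIT_GEMINI_API_KEY", "HUIT_GEMINI_BASE_URL"),
    "huit-openai": ("HUIT_OPENAI_API_KEY", "HUIT_OPENAI_BASE_URL"),
    "google": ("GOOGLE_API_KEY",),
    "openai": ("OPENAI_API_KEY",),
}


def configured_providers(env=os.environ) -> set[str]:
    """Return providers whose required variables are all set and non-empty."""
    return {
        provider
        for provider, names in PROVIDER_REQUIREMENTS.items()
        if all(env.get(name, "").strip() for name in names)
    }


def visible_models(models: list[dict], env=os.environ) -> list[dict]:
    """Keep only models whose provider is configured."""
    enabled = configured_providers(env)
    return [model for model in models if model["provider"] in enabled]
```

Passing a plain dict as `env` makes the filter easy to exercise without touching the real environment.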

Step 6 — Email (optional)

Configure Gmail SMTP in Render:

SMTP_USERNAME    your Google Workspace email
SMTP_PASSWORD    16-character App Password (Google Account → Security → App Passwords)
SMTP_FROM_EMAIL  Scribble <your@email.com>

If SMTP is not configured, jobs complete normally and results appear in the dashboard — email is simply skipped.
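The "skip when unconfigured" behaviour can be sketched with the standard library. The function name, subject line, and message body are hypothetical; the variable names are the ones listed above, and the host and port are Gmail's standard SMTP-over-SSL endpoint (port 465, as mentioned in the Render tier note).

```python
import smtplib
from email.message import EmailMessage

REQUIRED = ("SMTP_USERNAME", "SMTP_PASSWORD", "SMTP_FROM_EMAIL")


def send_completion_email(env: dict, to_addr: str, download_url: str) -> bool:
    """Send a completion email; return False (and do nothing) if SMTP is unset."""
    if not all(env.get(name) for name in REQUIRED):
        return False  # SMTP not configured: the job still completes, email is skipped

    msg = EmailMessage()
    msg["From"] = env["SMTP_FROM_EMAIL"]
    msg["To"] = to_addr
    msg["Subject"] = "Your Scribble transcription is ready"  # illustrative subject
    msg.set_content(f"Download your output package: {download_url}")

    # Implicit TLS on port 465, authenticated with the App Password.
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login(env["SMTP_USERNAME"], env["SMTP_PASSWORD"])
        smtp.send_message(msg)
    return True
```

Returning a boolean rather than raising keeps a missing mail configuration from failing an otherwise successful job.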


Adding a Researcher

  1. Navigate to /admin in the app.
  2. Use Invite Researcher — enter email and set a starting budget.
  3. The backend creates the user in Supabase Auth (auto-confirmed), provisions their profile, and automatically sends an invite email.
  4. The user clicks the Log In with Google link in the invite email to sign in via Google SSO (or requests a magic link).

Important: To enforce invite-only with Google SSO, disable "Allow new users to sign up" in Supabase Auth settings.


Local Development

# Backend
cd api
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000

# Frontend (separate terminal)
cd web
npm install
npm run dev

Requires a running Supabase instance (supabase start) and a filled-in api/.env.


Output Package

Each completed job produces a ZIP download containing:

| File | Contents |
| --- | --- |
| transcriptions/ | Individual .txt (or .csv in Table mode) file per image |
| {label}_collated.txt | All transcriptions concatenated in filename order |
| {label}_inventory.csv | Per-image metadata: confidence, flags, cost, date |
| {label}_metrics.jsonl | Raw API latency and token usage |
| {label}_table_output.csv | Structured CSV (table/both output mode only) |
| {label}_settings.json | Job configuration snapshot (model, mode, prompt, totals) |
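To make the filename-ordered layout concrete, the sketch below assembles just the transcriptions/ folder and the collated file into an in-memory ZIP. It is an illustration only: `build_package`, the "=== name ===" separator, and the plain lexicographic sort are assumptions, and the other files in the table are omitted.

```python
import io
import zipfile
from pathlib import Path


def build_package(label: str, transcriptions: dict[str, str]) -> bytes:
    """Build a ZIP with per-image .txt files and a collated file.

    transcriptions maps the source image filename to its transcribed text.
    """
    ordered = sorted(transcriptions)  # deterministic filename order
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in ordered:
            zf.writestr(f"transcriptions/{Path(name).stem}.txt", transcriptions[name])
        collated = "\n\n".join(
            f"=== {name} ===\n{transcriptions[name]}" for name in ordered
        )
        zf.writestr(f"{label}_collated.txt", collated)
    return buf.getvalue()
```

Note that a plain lexicographic sort places page_10 before page_2, so this sketch relies on zero-padded filenames; a natural sort would be needed otherwise.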

Running Costs

| Service | Cost |
| --- | --- |
| Vercel (Pro) | ~$20/month (existing account) |
| Render (Starter) | $7/month (always-on, SMTP unblocked) |
| Supabase (free tier) | $0 |
| Gmail SMTP | $0 |
| Infrastructure total | ~$7–27/month |
| Gemini Flash family (per image) | ~$0.02–0.05 (pay-as-you-go) |
| OpenAI / frontier models (per image) | higher; varies by model and page density |
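As a worked example of the per-image figures, the range for a whole batch is just the image count multiplied by the low and high per-image rates. The helper below is a hypothetical back-of-the-envelope calculator using the Gemini Flash range from the table; it is not the app's pre-flight estimator, which also adapts to real spend data.

```python
from decimal import Decimal


def batch_cost_range(
    n_images: int, low: str = "0.02", high: str = "0.05"
) -> tuple[Decimal, Decimal]:
    """Return (low, high) estimated API cost in USD for a batch of images."""
    return n_images * Decimal(low), n_images * Decimal(high)


low, high = batch_cost_range(200)
print(f"200 images: ${low} to ${high}")  # prints "200 images: $4.00 to $10.00"
```

For a batch at the ~200-image upper bound mentioned earlier, that works out to roughly $4 to $10 in API spend on the Flash family.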

Design Decisions

| Decision | Rationale |
| --- | --- |
| BackgroundTasks over Redis/Celery | Zero broker overhead for ≤ 20 users |
| Supabase over raw Postgres | Auth + Storage + RLS in one managed service |
| Gmail SMTP over Brevo/SendGrid | Harvard Google Workspace SMTP avoids spam filters |
| Page-by-page PDF streaming | Yields pages sequentially to keep RAM footprint below 512MB on Render |
| Filename sort over filesystem order | Inode/creation-time ordering is non-deterministic |
| Dynamic model fetch over hardcoded list | Models appear/disappear as keys are added in Render |

Scribble — Harvard Library Digital Scholarship Program Matthew Cook · matt_cook@harvard.edu · library.harvard.edu/how-to/digital-scholarship-program
