SchemaForge

ERD

Inspiration

The Database of AI Litigation (DAIL) is a critical public-interest resource tracking legal disputes involving artificial intelligence technologies. However, its backend relied on a low-code system that limited structured querying, scalability, and programmatic access. As AI-related litigation continues to grow in both volume and complexity, the underlying infrastructure risked becoming a bottleneck for research and long-term sustainability.

We were inspired to redesign the backend architecture to support structured legal research, enforce data integrity, and provide scalable API access for future expansion.

What it does

SchemaForge modernizes the DAIL backend by:

Redesigning the database using PostgreSQL
Applying normalization up to Third Normal Form (3NF)
Enforcing primary keys, foreign keys, and referential integrity
Extracting multi-value fields into structured reference tables
Building a clean ETL pipeline for migration and transformation
Deploying a REST API using FastAPI for programmatic access

The system transforms loosely connected spreadsheet data, where relationships were implied rather than enforced, into a fully relational, queryable, and scalable backend suitable for research-grade analysis.

How we built it

We began by analyzing the legacy dataset and identifying structural limitations such as missing primary keys, comma-separated multi-value fields, and relationships that were implied by matching values but never enforced.

We redesigned the schema in PostgreSQL with:

Surrogate primary keys (e.g., case_id)
One-to-many relationships (Cases → Dockets → Documents)
Many-to-many bridge tables (Cases ↔ Issues, Causes, Algorithms, Organizations)
Unique constraints and foreign key enforcement
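
The schema elements listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual DDL: SQLite (from the Python standard library) stands in for PostgreSQL so the example runs without a server, only one bridge table is shown, and every table and column name other than case_id is an assumption.

```python
import sqlite3

# SQLite stands in for PostgreSQL so this sketch runs without a server.
# All names except case_id are illustrative assumptions.
SCHEMA = """
CREATE TABLE cases (
    case_id INTEGER PRIMARY KEY,           -- surrogate primary key
    caption TEXT NOT NULL UNIQUE           -- assumed unique constraint
);
CREATE TABLE dockets (                     -- one-to-many: a case has many dockets
    docket_id INTEGER PRIMARY KEY,
    case_id   INTEGER NOT NULL REFERENCES cases(case_id)
);
CREATE TABLE issues (
    issue_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE
);
CREATE TABLE case_issues (                 -- many-to-many bridge table
    case_id  INTEGER NOT NULL REFERENCES cases(case_id),
    issue_id INTEGER NOT NULL REFERENCES issues(issue_id),
    PRIMARY KEY (case_id, issue_id)        -- one link per (case, issue) pair
);
"""


def create_schema() -> sqlite3.Connection:
    """Create the sketch schema in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    # SQLite leaves foreign key checks off unless asked; PostgreSQL always enforces them.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

The composite primary key on the bridge table is one way to guarantee that the same case and issue cannot be linked twice.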

We then built a Python ETL pipeline to:

Load Excel sheets
Clean missing and inconsistent values
Normalize multi-value columns
Insert base entities first
Insert bridge relationships
Enforce constraints during migration
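
The pipeline steps above might look something like the sketch below. It is a simplified illustration under several assumptions: in-memory dictionaries stand in for rows loaded from the Excel sheets, SQLite stands in for PostgreSQL, and the column names (caption, issues), the table names, and the helpers split_multi and load are hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a small stand-in schema (names are assumptions)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
    CREATE TABLE cases (case_id INTEGER PRIMARY KEY, caption TEXT NOT NULL UNIQUE);
    CREATE TABLE issues (issue_id INTEGER PRIMARY KEY,
                         name TEXT NOT NULL UNIQUE COLLATE NOCASE);
    CREATE TABLE case_issues (
        case_id  INTEGER NOT NULL REFERENCES cases(case_id),
        issue_id INTEGER NOT NULL REFERENCES issues(issue_id),
        PRIMARY KEY (case_id, issue_id));
    """)
    return conn


def split_multi(value):
    """Split a comma-separated cell into trimmed, case-insensitively unique names."""
    if value is None:
        return []
    seen, names = set(), []
    for part in str(value).split(","):
        name = " ".join(part.split())      # trim and collapse inner whitespace
        if name and name.casefold() not in seen:
            seen.add(name.casefold())
            names.append(name)
    return names


def load(conn, rows):
    """Clean rows, insert base entities first, then the bridge relationships."""
    cleaned = []
    for row in rows:
        caption = " ".join(str(row.get("caption") or "").split())
        if caption:                        # skip rows missing the assumed required field
            cleaned.append((caption, split_multi(row.get("issues"))))
    with conn:                             # one transaction; rolled back on any error
        # Pass 1: base entities. OR IGNORE skips rows that already exist
        # (PostgreSQL would use ON CONFLICT DO NOTHING instead).
        for caption, issues in cleaned:
            conn.execute("INSERT OR IGNORE INTO cases (caption) VALUES (?)", (caption,))
            conn.executemany("INSERT OR IGNORE INTO issues (name) VALUES (?)",
                             [(name,) for name in issues])
        # Pass 2: bridge rows, resolving surrogate keys by lookup.
        for caption, issues in cleaned:
            for name in issues:
                conn.execute(
                    "INSERT OR IGNORE INTO case_issues (case_id, issue_id) "
                    "SELECT c.case_id, i.issue_id FROM cases c, issues i "
                    "WHERE c.caption = ? AND i.name = ?", (caption, name))
```

Inserting base entities before bridge rows means every foreign key in a bridge row already has a target when it is written, so the constraints can stay enabled throughout the migration.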

The database was deployed on Supabase, and a REST API was built using FastAPI and SQLAlchemy. The API automatically generates OpenAPI documentation and supports structured retrieval across entities.

The API was deployed using Render to provide a live, testable backend.

Challenges we ran into

The legacy data had no enforced primary keys.
Multi-value fields required careful parsing and normalization.
Missing and inconsistent values required defensive ETL logic.
Mapping documents through dockets required hierarchical modeling.
Cloud deployment required handling environment variables and secure credentials.
Ensuring no duplicate records while preserving data completeness required careful constraint design.
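
To illustrate the hierarchical modeling mentioned above, here is a small sketch of how documents can be reached from a case through its dockets. SQLite again stands in for PostgreSQL, and the sample rows, the court and title columns, and the helper names build_demo and documents_for_case are assumptions made for the example.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create a tiny Cases -> Dockets -> Documents hierarchy with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
    CREATE TABLE cases (case_id INTEGER PRIMARY KEY, caption TEXT NOT NULL);
    CREATE TABLE dockets (
        docket_id INTEGER PRIMARY KEY,
        case_id   INTEGER NOT NULL REFERENCES cases(case_id),
        court     TEXT NOT NULL);
    CREATE TABLE documents (
        document_id INTEGER PRIMARY KEY,
        docket_id   INTEGER NOT NULL REFERENCES dockets(docket_id),
        title       TEXT NOT NULL);
    INSERT INTO cases VALUES (1, 'Example v. Example');
    INSERT INTO dockets VALUES (10, 1, 'Trial court'), (11, 1, 'Appellate court');
    INSERT INTO documents VALUES
        (100, 10, 'Complaint'), (101, 10, 'Order'), (102, 11, 'Brief');
    """)
    return conn


def documents_for_case(conn, case_id):
    """Return (court, title) pairs for every document filed under a case's dockets."""
    return conn.execute(
        """
        SELECT k.court, d.title
        FROM documents AS d
        JOIN dockets AS k ON k.docket_id = d.docket_id
        WHERE k.case_id = ?
        ORDER BY k.docket_id, d.document_id
        """,
        (case_id,),
    ).fetchall()
```

Documents carry no direct case_id in this sketch; the case is always derived through the docket, which avoids storing the same relationship in two places.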

Accomplishments that we're proud of

Normalizing the schema to Third Normal Form.
Enforcing referential integrity at the database level.
Eliminating duplicate and inconsistent classifications.
Successfully migrating legacy Excel data into a structured PostgreSQL schema.
Deploying a scalable API layer with automatic documentation.
Delivering a production-ready cloud deployment on Supabase and Render.

We transformed a low-code backend into a relational architecture designed for long-term sustainability.

What we learned

Data modeling decisions directly impact research usability.
Normalization significantly improves consistency and scalability.
Enforcing integrity at the database layer is more reliable than relying on application logic.
ETL design is as important as schema design.
Cloud deployment introduces operational considerations beyond local development.
Building structured APIs forces clarity in schema design.
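
As a small illustration of the point about enforcing integrity at the database layer, the sketch below shows the database itself rejecting bad rows even though the calling code performs no checks of its own. SQLite stands in for PostgreSQL, and the table layout and the try_insert helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite needs this; PostgreSQL enforces FKs by default
conn.executescript("""
CREATE TABLE cases (
    case_id INTEGER PRIMARY KEY,
    caption TEXT NOT NULL UNIQUE
);
CREATE TABLE dockets (
    docket_id INTEGER PRIMARY KEY,
    case_id   INTEGER NOT NULL REFERENCES cases(case_id)
);
""")


def try_insert(sql, params):
    """Return True if the database accepted the row, False if a constraint rejected it."""
    try:
        with conn:                         # commit on success, roll back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


# No validation happens in Python; every decision below is made by the database.
results = {
    "valid case": try_insert("INSERT INTO cases (caption) VALUES (?)", ("Example v. Example",)),
    "duplicate case": try_insert("INSERT INTO cases (caption) VALUES (?)", ("Example v. Example",)),
    "docket for case 1": try_insert("INSERT INTO dockets (case_id) VALUES (?)", (1,)),
    "orphan docket": try_insert("INSERT INTO dockets (case_id) VALUES (?)", (999,)),
}
```

Because the constraints live in the schema, any client that writes to the database, whether the ETL pipeline, the API, or an ad hoc script, is held to the same rules.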