The Story of Health Lab Report Analyzer (HLRA)

🎯 What Inspired Us

Healthcare is deeply personal, yet managing health records remains surprisingly fragmented. Most people store their lab reports in email, old folders, or worse—lose them entirely. When they need to share results with a new doctor or track trends over time, there's no easy way to organize, extract, or visualize this critical information.

The Problem: Lab reports come in PDFs and scanned images with no standardized format. Extracting key numbers (blood glucose, cholesterol, hemoglobin) is manual and error-prone. Family health histories are even harder to maintain—especially when elderly relatives need their records consolidated.

The Inspiration: What if we could build a tool that:

📄 Instantly reads health reports using OCR (no manual typing)
👥 Manages health data for the whole family
📊 Visualizes trends over time
🔗 Allows secure sharing with doctors or family members
🔐 Keeps sensitive health data private and secure

We wanted to create something that healthcare professionals recommend to patients and that families actually use—not just software that feels like a chore.

📚 What We Learned

1. OCR is Harder Than It Looks

The Learning: Building OCR is a balance between accuracy and performance.

We initially tried a simple approach: load PDF → extract text → done. It failed spectacularly on scanned documents and handwritten notes. We learned:

Raw PDFs ≠ scanned images: PDFs with embedded text extract fine; scanned PDFs are images masquerading as PDFs
Tesseract needs tuning: Different medical forms need different preprocessing—contrast, scaling, deskewing
Image preprocessing matters: We spent hours tuning PIL filters, DPI settings, and character whitelists

Solution: We built a multi-step pipeline:

Try pdfplumber (best for structured PDFs)
Fall back to PyPDF2 for text-based PDFs
Last resort: Convert to images and OCR with Tesseract
Preprocess those images before OCR (grayscale, resize, enhance contrast/sharpness)

This layered approach works ~95% of the time on real medical documents.
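
The fallback order above can be sketched as a small chain of extractors. This is a minimal illustration under stated assumptions, not the project's actual code: `extract_with_fallback`, the `min_chars` threshold, and the three stand-in extractor functions are hypothetical names. In a real pipeline each callable would wrap pdfplumber, PyPDF2, or a Tesseract pass over preprocessed page images.

```python
from typing import Callable, Iterable, Optional, Tuple

# An extractor takes a file path and returns text, or None/"" if it found nothing.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    path: str,
    extractors: Iterable[Tuple[str, Extractor]],
    min_chars: int = 20,
) -> Tuple[str, str]:
    """Try each extractor in order; return (name, text) for the first usable result.

    A result counts as usable when it contains at least `min_chars`
    non-whitespace characters (a hypothetical threshold for this sketch).
    """
    for name, extractor in extractors:
        try:
            text = extractor(path) or ""
        except Exception:
            # One failing extractor should not abort the chain; try the next one.
            continue
        if len("".join(text.split())) >= min_chars:
            return name, text
    raise ValueError(f"No extractor produced usable text for {path!r}")


# Stand-in extractors for demonstration only (hypothetical behavior).
def fake_pdfplumber(path: str) -> Optional[str]:
    return ""  # e.g. a scanned PDF with no embedded text layer


def fake_pypdf2(path: str) -> Optional[str]:
    raise RuntimeError("could not parse document")


def fake_ocr(path: str) -> Optional[str]:
    return "Glucose (fasting): 92 mg/dL\nHemoglobin: 13.8 g/dL"


PIPELINE = [
    ("pdfplumber", fake_pdfplumber),
    ("pypdf2", fake_pypdf2),
    ("ocr", fake_ocr),
]

if __name__ == "__main__":
    print(extract_with_fallback("report.pdf", PIPELINE))
```

Returning the extractor name alongside the text makes it easy to log which stage handled each document, which helps when tuning the slower OCR step.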

2. Async Python is a Game-Changer

The Learning: FastAPI + Motor + async/await makes a huge difference in throughput.

When we first wrote the backend, we used synchronous database calls. For a single user? Fine. For multiple users uploading files simultaneously? Slow. We switched to:

Motor: Async MongoDB driver (non-blocking database calls)
Async services: Made all business logic async
Concurrent file uploads: The server can now handle 10+ simultaneous uploads without blocking

Result: With 5 concurrent users, response times improved from 2–3s per upload to 200–500ms. One of the biggest wins of the project.
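
The effect of non-blocking I/O can be illustrated with a small simulation. This is a sketch only: `handle_upload` is a hypothetical stand-in, and `asyncio.sleep` replaces the awaited database calls, so the timings say nothing about HLRA's real numbers.

```python
import asyncio
import time
from typing import List


async def handle_upload(name: str, io_delay: float = 0.05) -> str:
    """Simulate one upload whose time is dominated by waiting on I/O."""
    # asyncio.sleep stands in for an awaited database or network call.
    # While this coroutine waits, the event loop is free to run other uploads.
    await asyncio.sleep(io_delay)
    return f"{name}: stored"


async def handle_sequentially(names: List[str]) -> List[str]:
    # Each upload finishes before the next one starts.
    return [await handle_upload(n) for n in names]


async def handle_concurrently(names: List[str]) -> List[str]:
    # gather starts all coroutines together and preserves input order.
    return list(await asyncio.gather(*(handle_upload(n) for n in names)))


def timed(coro) -> tuple:
    """Run a coroutine to completion and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    names = [f"report-{i}.pdf" for i in range(5)]
    _, t_seq = timed(handle_sequentially(names))
    _, t_con = timed(handle_concurrently(names))
    print(f"sequential: {t_seq:.2f}s, concurrent: {t_con:.2f}s")
```

The gain only applies to time spent waiting. CPU-heavy work such as OCR still occupies the event loop unless it is moved off it, for example with `asyncio.to_thread` or a worker process.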

3. TypeScript + Zod Saves Bugs

The Learning: Full end-to-end type safety is worth the setup cost.

Early on, we hit runtime errors because the frontend sent a string where the backend expected a number. With TypeScript and Zod validation on the frontend and Pydantic on the backend, these errors disappeared:

Mismatched field names caught at compile time
Type mismatches caught instantly
API contract violations caught before reaching production
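
The string-versus-number failure mentioned above is the kind of error a schema check rejects at the API boundary. The project uses Pydantic for this. To keep the sketch dependency-free, the example below substitutes a plain dataclass with explicit checks, so it shows the idea rather than Pydantic's API. `LabResult`, its fields, and `parse_lab_result` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabResult:
    """Hypothetical payload for one extracted lab value."""

    test_name: str
    value: float
    unit: str

    def __post_init__(self) -> None:
        if not isinstance(self.test_name, str):
            raise TypeError("test_name must be a string")
        if not self.test_name.strip():
            raise ValueError("test_name must not be empty")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.value, bool) or not isinstance(self.value, (int, float)):
            raise TypeError(f"value must be a number, got {type(self.value).__name__}")
        if not isinstance(self.unit, str):
            raise TypeError("unit must be a string")


def parse_lab_result(payload: dict) -> LabResult:
    """Build a LabResult from decoded JSON, failing loudly on bad input."""
    expected = {"test_name", "value", "unit"}
    unknown = set(payload) - expected
    missing = expected - set(payload)
    if unknown or missing:
        # Catches mismatched field names such as "testName" vs "test_name".
        raise KeyError(f"unknown fields: {sorted(unknown)}, missing fields: {sorted(missing)}")
    return LabResult(**payload)


if __name__ == "__main__":
    print(parse_lab_result({"test_name": "glucose", "value": 92, "unit": "mg/dL"}))
```

Passing `"value": "92"` raises `TypeError` here instead of letting the string travel further into the application. A Pydantic model would express the same constraints declaratively.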

🛠️ How We Built It

Technology Choices

Backend: Why FastAPI?

Modern, fast ASGI framework with automatic API docs
Native async/await support
Pydantic for validation (type-safe, intuitive)
Active ecosystem and great docs
Produces beautiful Swagger UI automatically

Frontend: Why React + Vite?

React's component model is intuitive for dashboard-like apps
Vite is blazingly fast (dev server, build, HMR)
TypeScript for safety
Rich ecosystem (TanStack Query, Radix UI, Tailwind)

Database: Why MongoDB?

Flexible schema (health data varies widely by person)
Excellent for rapid iteration (no migrations)
Atlas cloud offering removes DevOps burden
Motor driver gives async support

Deployment: Why Render + Vercel?

Render: Simple Docker deployment, free tier is generous
Vercel: Optimized for React/TypeScript, instant deploys, excellent CDN
Together: Zero DevOps knowledge needed; push to GitHub → auto-deployed

Architecture

┌─────────────────────────────────────────────────────────┐
│                      Frontend (React)                     │
│  Vite, TypeScript, Tailwind, Radix UI, TanStack Query    │
│                    Hosted on Vercel                       │
└────────────────────┬────────────────────────────────────┘
                     │ HTTPS
                     ↓
┌─────────────────────────────────────────────────────────┐
│                   FastAPI Backend                         │
│     uvicorn, async Motor, OCR pipeline, JWT auth        │
│                  Hosted on Render                        │
└──────────────────┬──────────────┬──────────────────────┘
                   │              │
                   ↓              ↓
         ┌──────────────┐  ┌──────────────────┐
         │   MongoDB    │  │   File Upload    │
         │   Atlas      │  │   (local FS)     │
         └──────────────┘  └──────────────────┘

Development Phases

Phase 1: Core Auth & Upload (Weeks 1–3)

User registration/login with JWT
File upload endpoint
Basic file storage

Phase 2: OCR & Extraction (Weeks 4–6)

Tesseract integration
Multi-step PDF/image processing
Text extraction and parsing

Phase 3: Data Management (Weeks 7–9)

Family profile management
Health data models (glucose, cholesterol, etc.)
Sorting, filtering, pagination

Phase 4: Visualization & Sharing (Weeks 10–12)

Charts with Recharts
Report sharing with secure tokens
Notifications system

Phase 5: Polish & Deployment (Weeks 13–16)

Testing (unit + integration)
Performance optimization
Deployment to Render + Vercel
Documentation

🚧 Challenges We Faced

Challenge 1: Tesseract Installation Across Platforms

The Problem: Tesseract is a C++ binary, not a Python package. Installing it on Windows, macOS, and Linux requires different steps. CI/CD pipelines also needed it.

How We Solved It:

Created platform-specific installation scripts in SETUP.md
GitHub Actions workflow installs Tesseract + poppler-utils automatically
Docker image includes Tesseract for production consistency
Added helpful troubleshooting for common errors

Lesson: External dependencies require thorough documentation and testing on all platforms.

Challenge 2: Async/Await Complexity in Tests

The Problem: Writing tests for async code is tricky. A plain test function can't await a coroutine, so we needed pytest-asyncio and @pytest.mark.asyncio to run tests inside an event loop. Mock objects behave differently. Async context managers are confusing.

How We Solved It:

Created a conftest.py with shared fixtures for async tests
Built helper functions for mocking async database calls
Documented patterns in TESTING.md
Used fixtures scoped to session and function appropriately
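
A helper for mocking async database calls might look like the sketch below. It uses only the standard library (`unittest.mock.AsyncMock`) and drives the check with `asyncio.run`. Under pytest-asyncio the same check would be a test function marked with `@pytest.mark.asyncio`. The names `get_report_title`, `make_fake_db`, and the `reports` collection are hypothetical.

```python
import asyncio
from unittest.mock import AsyncMock


async def get_report_title(db, report_id: str) -> str:
    """Hypothetical service function that awaits a single database lookup."""
    doc = await db.reports.find_one({"_id": report_id})
    if doc is None:
        raise LookupError(f"report {report_id} not found")
    return doc["title"]


def make_fake_db(find_one_result):
    """Build a stand-in database whose find_one call can be awaited."""
    db = AsyncMock()
    # Child attributes of an AsyncMock are AsyncMocks too, so awaiting
    # db.reports.find_one(...) yields this configured return value.
    db.reports.find_one.return_value = find_one_result
    return db


async def check_get_report_title() -> None:
    db = make_fake_db({"_id": "r1", "title": "CBC panel"})
    assert await get_report_title(db, "r1") == "CBC panel"
    # Verify the coroutine was awaited exactly once with the expected filter.
    db.reports.find_one.assert_awaited_once_with({"_id": "r1"})


if __name__ == "__main__":
    asyncio.run(check_get_report_title())
    print("ok")
```

Keeping the fake-database builder in one place means each test states only the data it cares about, not the mocking mechanics.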

Lesson: Async Python is powerful but requires discipline; good test infrastructure saves hours.

Challenge 3: CORS and Authentication in Fullstack Apps

The Problem: Frontend and backend are on different domains (Vercel and Render). CORS headers, token refresh, and cookie handling were initially broken. We had:

401 errors on token refresh
CORS blocks on every request
Stale tokens not refreshing automatically

How We Solved It:

Configured CORS properly in FastAPI: allowed origins from .env
Implemented JWT refresh token flow in AuthContext
Stored tokens in localStorage (with HTTPS in production)
Intercepted 401 responses to auto-refresh tokens
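
Reading allowed origins from the environment is a common place for small mistakes such as trailing slashes or a missing scheme. The sketch below validates a comma-separated value before it reaches the CORS configuration. The variable name `ALLOWED_ORIGINS` and the function `parse_allowed_origins` are assumptions made for illustration.

```python
import os
from typing import List, Optional
from urllib.parse import urlsplit


def parse_allowed_origins(raw: Optional[str]) -> List[str]:
    """Turn a comma-separated env value into a validated list of origins."""
    origins: List[str] = []
    for item in (raw or "").split(","):
        origin = item.strip().rstrip("/")
        if not origin:
            continue
        parts = urlsplit(origin)
        # An origin is scheme + host (+ port): no path, query, or fragment.
        if (
            parts.scheme not in ("http", "https")
            or not parts.netloc
            or parts.path
            or parts.query
            or parts.fragment
        ):
            raise ValueError(f"Not a valid origin: {item.strip()!r}")
        origins.append(origin)
    if not origins:
        raise ValueError("No allowed origins configured")
    return origins


if __name__ == "__main__":
    # Hypothetical default so the sketch runs without any environment setup.
    raw = os.environ.get("ALLOWED_ORIGINS", "http://localhost:5173")
    origins = parse_allowed_origins(raw)
    print(origins)
    # In a FastAPI app this list would typically be passed to CORSMiddleware:
    #   app.add_middleware(CORSMiddleware, allow_origins=origins,
    #                      allow_methods=["*"], allow_headers=["*"])
```

Failing at startup on a malformed origin is easier to debug than a browser CORS error that only appears after deployment.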

Lesson: Fullstack security is nuanced; test with actual deployed URLs, not just localhost.

Challenge 4: File Upload Management

The Problem: Users upload large PDFs (5–10 MB). Where do you store them?

Option 1: Local filesystem: Works on Render, but ephemeral (files lost on redeploy)
Option 2: S3 or cloud storage: More work, more cost
Option 3: Temporary storage + extract text only: Lose the original file

We chose Option 3: keep the file in temporary local storage during processing, then delete it after text extraction. For persistence, we store the extracted text and metadata in MongoDB.

How We Solved It:

Created uploads/ folder
Set MAX_FILE_SIZE to 10 MB in config
Clean up temporary files after OCR
Store original metadata in database (filename, upload date, file hash)
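
The steps above can be sketched as one function that enforces the size limit, writes a temporary file, runs extraction, and always cleans up. This is an illustration rather than the project's code: `process_upload`, the `extract_text` callback, and the metadata field names are hypothetical, and the sketch uses the system temp directory instead of an uploads/ folder.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone
from typing import Callable, Dict

MAX_FILE_SIZE = 10 * 1024 * 1024  # 10 MB, matching the limit described above


def process_upload(
    filename: str,
    data: bytes,
    extract_text: Callable[[str], str],
) -> Dict[str, object]:
    """Extract text from an upload and return the record to persist."""
    if len(data) > MAX_FILE_SIZE:
        raise ValueError(f"{filename} exceeds the {MAX_FILE_SIZE} byte limit")

    # Keep the original extension so extractors can tell PDFs from images.
    suffix = os.path.splitext(filename)[1]
    fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        text = extract_text(tmp_path)
    finally:
        # Remove the temporary file whether extraction succeeded or failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)

    return {
        "filename": os.path.basename(filename),
        "uploaded_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        "text": text,
    }


if __name__ == "__main__":
    record = process_upload("cbc.pdf", b"%PDF-1.4 demo", lambda path: "Hemoglobin 13.8")
    print(record["sha256"][:12], record["text"])
```

Storing the file hash makes it possible to detect a re-upload of the same document even though the original bytes are no longer kept.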

Lesson: For MVP apps, simple solutions (local storage + cleanup) beat premature cloud architecture.

Challenge 5: Data Validation Across Frontend & Backend

The Problem: Frontend sends JSON; backend expects specific types and formats. Mismatches caused:

Type errors on field access
Validation errors on create operations
Silent failures where invalid data slipped through

How We Solved It:

Pydantic models for all backend schemas
Zod schemas for all frontend forms
API contracts documented in Swagger
Shared TypeScript types generated from Pydantic

Lesson: Type-driven development eliminates entire categories of bugs.

Challenge 6: Real-Time Notifications

The Problem: When a report is shared with another user, they should see a notification in real-time. Polling is wasteful; WebSockets are complex.

How We Solved It:

For MVP: Store notifications in MongoDB, frontend polls every 30 seconds
Database query is indexed and fast (< 50ms)
Refresh indicators show new notifications
Notifications auto-expire after 30 days
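
The polling query boils down to "everything newer than what this client last saw, excluding anything past the 30-day window". The sketch below expresses that rule over in-memory dictionaries. The function name and the `created_at` field are hypothetical, and in the real system the same filter would run as an indexed MongoDB query.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

NOTIFICATION_TTL = timedelta(days=30)  # expiry window described above


def new_notifications(
    items: List[Dict[str, object]],
    last_seen: datetime,
    now: datetime,
) -> List[Dict[str, object]]:
    """Return unexpired notifications created after `last_seen`, newest first."""
    cutoff = now - NOTIFICATION_TTL
    fresh = [
        n for n in items
        if n["created_at"] > last_seen and n["created_at"] >= cutoff
    ]
    return sorted(fresh, key=lambda n: n["created_at"], reverse=True)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    items = [
        {"id": "expired", "created_at": now - timedelta(days=45)},
        {"id": "recent", "created_at": now - timedelta(seconds=10)},
    ]
    # A client that last polled 30 seconds ago only receives "recent".
    print(new_notifications(items, now - timedelta(seconds=30), now))
```

Sending the last-seen timestamp with each poll keeps responses small, since the server returns only what the client has not already displayed.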

Lesson: Sometimes a "dumb" solution (polling) beats premature optimization (WebSockets). Revisit if polling becomes a bottleneck.

Challenge 7: Testing OCR Quality

The Problem: How do you test that Tesseract correctly reads a PDF? You can't use unit tests alone.

How We Solved It:

Created integration tests with real sample documents
Stored "golden" outputs (expected text) for test PDFs
Compare extracted text similarity (allow ~5% error tolerance)
Included sample health reports in test suite
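
A golden-file comparison with a tolerance can be written with the standard library's `difflib`. This is a sketch under assumptions: the helper names are hypothetical, and `SequenceMatcher.ratio()` is a similarity score rather than an exact character error rate, so the 0.95 threshold only approximates the ~5% tolerance described above.

```python
import difflib


def normalize(text: str) -> str:
    """Collapse whitespace and case so layout differences do not count as errors."""
    return " ".join(text.split()).lower()


def similarity(extracted: str, golden: str) -> float:
    """Return a score in [0, 1]; 1.0 means identical after normalization."""
    matcher = difflib.SequenceMatcher(
        None, normalize(extracted), normalize(golden), autojunk=False
    )
    return matcher.ratio()


def assert_matches_golden(extracted: str, golden: str, min_ratio: float = 0.95) -> float:
    """Fail with a readable message when OCR output drifts too far from the golden text."""
    ratio = similarity(extracted, golden)
    assert ratio >= min_ratio, f"similarity {ratio:.3f} is below {min_ratio:.2f}"
    return ratio


if __name__ == "__main__":
    golden = "Glucose (fasting): 92 mg/dL Hemoglobin: 13.8 g/dL Total cholesterol: 180 mg/dL"
    # One misread character ("l" read as "1") should still pass the threshold.
    ocr_output = "Glucose (fasting): 92 mg/dL  Hemog1obin: 13.8 g/dL\nTotal cholesterol: 180 mg/dL"
    print(f"{assert_matches_golden(ocr_output, golden):.3f}")
```

For lab values specifically, a similarity score is a coarse check: a single wrong digit barely moves the ratio, so asserting on the parsed numbers as well is a sensible complement.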

Lesson: OCR output shifts with scan quality, preprocessing, and Tesseract version, so exact-match assertions are brittle; integration tests with real examples are essential.

Challenge 8: Deployment Configuration

The Problem: Environment variables, secrets, database URLs, CORS origins—all differ between dev, staging, and production. One typo in DATABASE_URL breaks everything.

How We Solved It:

Created .env.example with all required variables
Used pydantic-settings to validate env vars at startup
Deployment guides for Render + Vercel with copy-paste examples
GitHub Actions secrets for CI/CD
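
Startup validation can be sketched without any dependencies. The project uses pydantic-settings for this. The example below substitutes a plain dataclass and manual checks to show the fail-fast idea, not that library's API. `DATABASE_URL` and `MAX_FILE_SIZE` appear in this write-up, while `JWT_SECRET`, `ALLOWED_ORIGINS`, and the `Settings` fields are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Tuple


@dataclass(frozen=True)
class Settings:
    database_url: str
    jwt_secret: str
    allowed_origins: Tuple[str, ...]
    max_file_size: int


def load_settings(env: Mapping[str, str]) -> Settings:
    """Validate configuration once at startup and report every problem together."""
    problems: List[str] = []

    def require(name: str) -> str:
        value = env.get(name, "").strip()
        if not value:
            problems.append(f"{name} is missing or empty")
        return value

    database_url = require("DATABASE_URL")
    if database_url and not database_url.startswith(("mongodb://", "mongodb+srv://")):
        problems.append("DATABASE_URL must start with mongodb:// or mongodb+srv://")

    jwt_secret = require("JWT_SECRET")
    origins = tuple(o.strip() for o in require("ALLOWED_ORIGINS").split(",") if o.strip())

    max_file_size = 0
    try:
        # Default to 10 MB when the variable is not set.
        max_file_size = int(env.get("MAX_FILE_SIZE", str(10 * 1024 * 1024)))
    except ValueError:
        problems.append("MAX_FILE_SIZE must be an integer number of bytes")

    if problems:
        raise RuntimeError("Invalid configuration:\n- " + "\n- ".join(problems))
    return Settings(database_url, jwt_secret, origins, max_file_size)


if __name__ == "__main__":
    try:
        print(load_settings(os.environ))
    except RuntimeError as exc:
        print(exc)
```

Collecting all problems before raising means a misconfigured deployment reports every missing variable in one pass instead of one per restart.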

Lesson: Environment configuration is critical; automate validation early.

🎓 Key Lessons Learned

Start simple, optimize later: We spent weeks on OCR tuning that we didn't need initially. Good enough now beats perfect later.
Type safety pays dividends: TypeScript + Pydantic eliminated ~30% of bugs before they reached testing.
Async is fast: With Motor + FastAPI async, upload response times under 5 concurrent users dropped from 2–3s to 200–500ms compared with our synchronous code, though async code made testing harder.
Documentation saves time: Clear setup, testing, and deployment docs reduced onboarding time from days to hours.
Test infrastructure matters: Good fixtures and helpers make writing tests fun, not tedious.
Real deployment early: Testing locally is not enough. Deploy to Render + Vercel early and often.
Prefer boring technology: FastAPI, React, MongoDB, Vercel—all are mature, well-documented, and widely adopted. This meant faster development.
Full-stack type safety is possible: End-to-end types (frontend → API → backend) eliminate entire classes of bugs.

🚀 Future Roadmap

Short Term (Next 3 Months)

[ ] WebSocket support for real-time notifications
[ ] Advanced parsing for specific health metrics (auto-detect normal/abnormal ranges)
[ ] Export reports to PDF / print-friendly format
[ ] Email notifications for shared reports

Medium Term (Next 6 Months)

[ ] Integration with EHR systems (FHIR standard)
[ ] AI-powered insights (trend detection, anomaly alerts)
[ ] Mobile app (React Native)
[ ] Multi-language support

Long Term (Next Year+)

[ ] Wearable integration (Apple Health, Google Fit)
[ ] Doctor collaboration tools (secure messaging)
[ ] Clinical decision support (AI-powered recommendations)
[ ] Research dataset sharing (anonymized, with consent)

🙏 Acknowledgments

HLRA stands on the shoulders of giants:

Tesseract: 30+ years of OCR excellence
FastAPI: Modern Python web framework
React: UI library that just works
MongoDB: Flexible NoSQL database
Vercel & Render: Hosting made simple

Special thanks to the open-source community for countless packages, tutorials, and StackOverflow answers that made this project possible.

📝 Final Thoughts

Building HLRA taught us that health tech doesn't need to be complicated. A well-designed tool—even if it starts simple—can genuinely improve people's lives.

We built this because we wanted something we would use ourselves. We hope it helps you (and your family) manage health with less friction and more confidence.

If you find HLRA useful, please star the repo, contribute, or share it with someone who might benefit.

Built With

axios
bcrypt
css
fastapi
html
httpx
javascript
mongodb
passlib[bcrypt]
pdf2image
pdfplumber
pillow
poppler-utils
pydantic
pypdf2
pytest
pytest-asyncio
python
python-jose[cryptography]
python-multipart
react
recharts
sonner
tesseract
typescript
uvicorn
vite
zod

Updates

Kaustubh Nimbalkar started this project — Nov 16, 2025 11:07 AM EST
