The Story of Health Lab Report Analyzer (HLRA)

🎯 What Inspired Us

Healthcare is deeply personal, yet managing health records remains surprisingly fragmented. Most people keep their lab reports in email or old folders, or worse, lose them entirely. When they need to share results with a new doctor or track trends over time, there's no easy way to organize, extract, or visualize this critical information.

The Problem: Lab reports come in PDFs and scanned images with no standardized format. Extracting key numbers (blood glucose, cholesterol, hemoglobin) is manual and error-prone. Family health histories are even harder to maintain, especially when elderly relatives need their records consolidated.

The Inspiration: What if we could build a tool that:

  • 📄 Instantly reads health reports using OCR (no manual typing)
  • 👥 Manages health data for the whole family
  • 📊 Visualizes trends over time
  • 🔗 Allows secure sharing with doctors or family members
  • 🔐 Keeps sensitive health data private and secure

We wanted to create something that healthcare professionals recommend to patients and that families actually use, not just software that feels like a chore.


📚 What We Learned

1. OCR is Harder Than It Looks

The Learning: Building OCR is a balance between accuracy and performance.

We initially tried a simple approach: load PDF → extract text → done. It failed spectacularly on scanned documents and handwritten notes. We learned:

  • Raw PDFs ≠ scanned images: PDFs with embedded text extract fine; scanned PDFs are images masquerading as PDFs
  • Tesseract needs tuning: Different medical forms need different preprocessing (contrast, scaling, deskewing)
  • Image preprocessing matters: We spent hours tuning PIL filters, DPI settings, and character whitelists

Solution: We built a multi-step pipeline:

  1. Try pdfplumber (best for structured PDFs)
  2. Fall back to PyPDF2 for text-based PDFs
  3. Last resort: Convert pages to images and preprocess them (grayscale, resize, enhance contrast/sharpness)
  4. OCR the preprocessed images with Tesseract

This layered approach works ~95% of the time on real medical documents.
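
The layered approach above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the function `extract_with_fallback`, the `min_chars` threshold, and the three fake extractors are hypothetical stand-ins for the real pdfplumber, PyPDF2, and Tesseract steps.

```python
from typing import Callable, Iterable, Optional

Extractor = Callable[[str], str]  # takes a file path, returns extracted text


def extract_with_fallback(
    path: str,
    extractors: Iterable[tuple[str, Extractor]],
    min_chars: int = 20,
) -> tuple[Optional[str], str]:
    """Try each extractor in order; return (name, text) of the first usable result."""
    for name, extract in extractors:
        try:
            text = extract(path).strip()
        except Exception:
            # A failing extractor just means "try the next one".
            continue
        if len(text) >= min_chars:
            return name, text
    return None, ""


# Hypothetical stand-ins for the real steps (pdfplumber, PyPDF2, Tesseract).
def fake_pdfplumber(path: str) -> str:
    raise ValueError("no text layer")  # e.g. a scanned PDF


def fake_pypdf2(path: str) -> str:
    return ""  # opened fine, but found no embedded text


def fake_tesseract(path: str) -> str:
    return "Glucose (fasting): 92 mg/dL  Hemoglobin: 13.8 g/dL"


source, text = extract_with_fallback(
    "report.pdf",
    [
        ("pdfplumber", fake_pdfplumber),
        ("pypdf2", fake_pypdf2),
        ("tesseract", fake_tesseract),
    ],
)
print(source, "->", text)
```

Treating "too little text" the same as an exception is one possible design choice: a scanned PDF often opens without error but yields an empty string, so a length check is what actually triggers the OCR fallback.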

2. Async Python is a Game-Changer

The Learning: FastAPI + Motor + async/await makes a huge difference in throughput.

When we first wrote the backend, we used synchronous database calls. For a single user? Fine. For multiple users uploading files simultaneously? Slow. We switched to:

  • Motor: Async MongoDB driver (non-blocking database calls)
  • Async services: Made all business logic async
  • Concurrent file uploads: The server can now handle 10+ simultaneous uploads without blocking

Result: With 5 concurrent users, response times improved from 2–3s per upload to 200–500ms. One of the biggest wins of the project.
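
To illustrate why non-blocking calls help here, the sketch below simulates five uploads that each wait on I/O. It is a toy model under stated assumptions: `handle_upload` is a hypothetical handler, and `asyncio.sleep` stands in for an awaited Motor call; no real database is involved.

```python
import asyncio
import time


async def handle_upload(user_id: int, io_delay: float = 0.2) -> str:
    # asyncio.sleep stands in for awaiting a non-blocking database call.
    await asyncio.sleep(io_delay)
    return f"user-{user_id}: stored"


async def handle_many(n_users: int) -> tuple[list[str], float]:
    start = time.perf_counter()
    # gather() runs the coroutines concurrently on one event loop,
    # so the waits overlap instead of adding up.
    results = await asyncio.gather(*(handle_upload(i) for i in range(n_users)))
    return list(results), time.perf_counter() - start


results, elapsed = asyncio.run(handle_many(5))
print(f"{len(results)} uploads finished in {elapsed:.2f}s")
```

Run sequentially, the five simulated waits would take about one second in total; run concurrently, they take roughly as long as a single wait. This only helps with time spent waiting on I/O; CPU-heavy steps would still need a thread or process pool.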

3. TypeScript + Zod Prevents Bugs

The Learning: Full end-to-end type safety is worth the setup cost.

Early on, we had runtime errors because the frontend sent a string where the backend expected a number. With TypeScript and Zod validation on the frontend and Pydantic on the backend, these errors disappeared:

  • Mismatched field names caught at compile time
  • Type mismatches caught instantly
  • API contract violations caught before reaching production

🛠️ How We Built It

Technology Choices

Backend: Why FastAPI?

  • Modern, fast ASGI framework with automatic API docs
  • Native async/await support
  • Pydantic for validation (type-safe, intuitive)
  • Active ecosystem and great docs
  • Produces beautiful Swagger UI automatically

Frontend: Why React + Vite?

  • React's component model is intuitive for dashboard-like apps
  • Vite is blazingly fast (dev server, build, HMR)
  • TypeScript for safety
  • Rich ecosystem (TanStack Query, Radix UI, Tailwind)

Database: Why MongoDB?

  • Flexible schema (health data varies widely by person)
  • Excellent for rapid iteration (no migrations)
  • Atlas cloud offering removes DevOps burden
  • Motor driver gives async support

Deployment: Why Render + Vercel?

  • Render: Simple Docker deployment, free tier is generous
  • Vercel: Optimized for React/TypeScript, instant deploys, excellent CDN
  • Together: Zero DevOps knowledge needed; push to GitHub → auto-deployed

Architecture

┌──────────────────────────────────────────────────────────┐
│                     Frontend (React)                     │
│   Vite, TypeScript, Tailwind, Radix UI, TanStack Query   │
│                     Hosted on Vercel                     │
└────────────────────────────┬─────────────────────────────┘
                             │ HTTPS
                             ↓
┌──────────────────────────────────────────────────────────┐
│                     FastAPI Backend                      │
│       uvicorn, async Motor, OCR pipeline, JWT auth       │
│                     Hosted on Render                     │
└───────────────┬───────────────────────┬──────────────────┘
                │                       │
                ↓                       ↓
         ┌──────────────┐      ┌──────────────────┐
         │   MongoDB    │      │   File Upload    │
         │   Atlas      │      │   (local FS)     │
         └──────────────┘      └──────────────────┘

Development Phases

Phase 1: Core Auth & Upload (Weeks 1–3)

  • User registration/login with JWT
  • File upload endpoint
  • Basic file storage

Phase 2: OCR & Extraction (Weeks 4–6)

  • Tesseract integration
  • Multi-step PDF/image processing
  • Text extraction and parsing

Phase 3: Data Management (Weeks 7–9)

  • Family profile management
  • Health data models (glucose, cholesterol, etc.)
  • Sorting, filtering, pagination

Phase 4: Visualization & Sharing (Weeks 10–12)

  • Charts with Recharts
  • Report sharing with secure tokens
  • Notifications system

Phase 5: Polish & Deployment (Weeks 13–16)

  • Testing (unit + integration)
  • Performance optimization
  • Deployment to Render + Vercel
  • Documentation

🚧 Challenges We Faced

Challenge 1: Tesseract Installation Across Platforms

The Problem: Tesseract is a C++ binary, not a Python package. Installing it on Windows, macOS, and Linux requires different steps. CI/CD pipelines also needed it.

How We Solved It:

  • Created platform-specific installation scripts in SETUP.md
  • GitHub Actions workflow installs Tesseract + poppler-utils automatically
  • Docker image includes Tesseract for production consistency
  • Added helpful troubleshooting for common errors

Lesson: External dependencies require thorough documentation and testing on all platforms.


Challenge 2: Async/Await Complexity in Tests

The Problem: Writing tests for async code is tricky. A plain test function can't await a coroutine, so you need pytest-asyncio and @pytest.mark.asyncio to run async tests. Ordinary mock objects aren't awaitable, so async calls need async-aware mocks. Async context managers are confusing.

How We Solved It:

  • Created a conftest.py with shared fixtures for async tests
  • Built helper functions for mocking async database calls
  • Documented patterns in TESTING.md
  • Used fixtures scoped to session and function appropriately

Lesson: Async Python is powerful but requires discipline; good test infrastructure saves hours.
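
A helper for mocking async database calls might look something like the sketch below. It uses only the standard library's `unittest.mock.AsyncMock` and drives the test with `asyncio.run` instead of pytest-asyncio, so it runs anywhere. The names `get_report_title`, `make_fake_db`, and the `reports` collection are hypothetical, chosen for illustration.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def get_report_title(db, report_id: str) -> str:
    """Hypothetical service function: fetch a report and return its title."""
    doc = await db.reports.find_one({"_id": report_id})
    if doc is None:
        raise LookupError(report_id)
    return doc["title"]


def make_fake_db(document):
    # find_one is an AsyncMock, so `await db.reports.find_one(...)` works
    # without a real database and resolves to `document`.
    db = MagicMock()
    db.reports.find_one = AsyncMock(return_value=document)
    return db


async def test_returns_title():
    db = make_fake_db({"_id": "r1", "title": "CBC panel"})
    assert await get_report_title(db, "r1") == "CBC panel"
    db.reports.find_one.assert_awaited_once_with({"_id": "r1"})


# With pytest-asyncio this would carry @pytest.mark.asyncio;
# here the coroutine is driven by hand.
asyncio.run(test_returns_title())
```

In a pytest suite, `make_fake_db` would typically live in conftest.py as a fixture so every async test can share it.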


Challenge 3: CORS and Authentication in Fullstack Apps

The Problem: Frontend and backend are on different domains (Vercel and Render). CORS headers, token refresh, and cookie handling were initially broken. We had:

  • 401 errors on token refresh
  • CORS blocks on every request
  • Stale tokens not refreshing automatically

How We Solved It:

  • Configured CORS properly in FastAPI: allowed origins from .env
  • Implemented JWT refresh token flow in AuthContext
  • Stored tokens in localStorage (with HTTPS in production)
  • Intercepted 401 responses to auto-refresh tokens

Lesson: Fullstack security is nuanced; test with actual deployed URLs, not just localhost.
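
The "intercept 401 and refresh" flow boils down to a small piece of control logic. The sketch below shows that logic in Python purely for illustration (the real interceptor would live in the frontend); `Response`, `request_with_refresh`, `send`, and `refresh` are all hypothetical names, and the fake server accepts only the token "fresh".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Response:
    status: int
    body: str = ""


def request_with_refresh(
    send: Callable[[str], Response],
    refresh: Callable[[], str],
    access_token: str,
) -> tuple[Response, str]:
    """Send a request; on 401, refresh the token once and retry."""
    response = send(access_token)
    if response.status != 401:
        return response, access_token
    new_token = refresh()  # may itself fail if the refresh token has expired
    # Retry exactly once so a persistent 401 cannot cause an endless loop.
    return send(new_token), new_token


# Hypothetical fake server: only the token "fresh" is accepted.
calls: list[str] = []


def fake_send(token: str) -> Response:
    calls.append(token)
    return Response(200, "ok") if token == "fresh" else Response(401)


response, token = request_with_refresh(fake_send, lambda: "fresh", "stale")
print(response.status, token, calls)
```

Returning the new token alongside the response lets the caller store it for later requests, which is one way to avoid the "stale token never refreshes" symptom described above.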


Challenge 4: File Upload Management

The Problem: Users upload large PDFs (5–10 MB). Where do you store them?

  • Option 1: Local filesystem: Works on Render, but ephemeral (files lost on redeploy)
  • Option 2: S3 or cloud storage: More work, more cost
  • Option 3: Temporary storage + extract text only: Lose the original file

We chose temporary local storage during processing, then delete after text extraction. For persistence, we store extracted text + metadata in MongoDB.

How We Solved It:

  • Created uploads/ folder
  • Set MAX_FILE_SIZE to 10 MB in config
  • Clean up temporary files after OCR
  • Store original metadata in database (filename, upload date, file hash)

Lesson: For MVP apps, simple solutions (local storage + cleanup) beat premature cloud architecture.
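
The store-temporarily, extract, then delete approach could be sketched like this. It is a minimal sketch under assumptions: `process_upload` and the `extract_text` callback are hypothetical, the system temp directory is used in place of the uploads/ folder, and the returned dict only suggests what a stored MongoDB document might contain.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone

MAX_FILE_SIZE = 10 * 1024 * 1024  # 10 MB, matching the limit described above


def process_upload(filename: str, data: bytes, extract_text) -> dict:
    """Write the upload to a temp file, extract text, then delete the file."""
    if len(data) > MAX_FILE_SIZE:
        raise ValueError(f"{filename} exceeds {MAX_FILE_SIZE} bytes")

    fd, tmp_path = tempfile.mkstemp(suffix=os.path.splitext(filename)[1])
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        text = extract_text(tmp_path)
    finally:
        os.remove(tmp_path)  # always clean up, even if extraction fails

    # Only text and metadata are kept; the original file is gone by now.
    return {
        "filename": filename,
        "uploaded_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "text": text,
    }


# Hypothetical extractor that records which path it was given.
seen_paths: list[str] = []


def fake_extract(path: str) -> str:
    seen_paths.append(path)
    return "Cholesterol: 180 mg/dL"


record = process_upload("report.pdf", b"%PDF-1.4 fake", fake_extract)
print(record["filename"], record["sha256"][:12])
```

Keeping a hash of the original bytes makes it possible to detect a duplicate upload later even though the file itself is no longer stored.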


Challenge 5: Data Validation Across Frontend & Backend

The Problem: Frontend sends JSON; backend expects specific types and formats. Mismatches caused:

  • Type errors on field access
  • Validation errors on create operations
  • Silent failures where invalid data slipped through

How We Solved It:

  • Pydantic models for all backend schemas
  • Zod schemas for all frontend forms
  • API contracts documented in Swagger
  • Shared TypeScript types generated from Pydantic

Lesson: Type-driven development eliminates entire categories of bugs.
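
The kind of check a schema performs at the API boundary can be shown with a standard-library stand-in. This is not Pydantic and not the project's schema: `GlucoseReading`, its fields, and the `parse` method are hypothetical, written with dataclasses so the example has no dependencies.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GlucoseReading:
    """Hypothetical payload; a Pydantic model would declare the same fields."""

    profile_id: str
    value_mg_dl: float

    @classmethod
    def parse(cls, payload: dict) -> "GlucoseReading":
        # Reject misspelled or extra fields instead of silently ignoring them.
        unknown = set(payload) - {"profile_id", "value_mg_dl"}
        if unknown:
            raise ValueError(f"unexpected fields: {sorted(unknown)}")
        profile_id = payload.get("profile_id")
        if not isinstance(profile_id, str) or not profile_id:
            raise ValueError("profile_id must be a non-empty string")
        try:
            value = float(payload["value_mg_dl"])
        except (KeyError, TypeError, ValueError):
            raise ValueError("value_mg_dl must be a number") from None
        if not math.isfinite(value) or value <= 0:
            raise ValueError("value_mg_dl must be a positive, finite number")
        return cls(profile_id, value)


reading = GlucoseReading.parse({"profile_id": "p1", "value_mg_dl": "92"})
print(reading)
```

A numeric string such as "92" is coerced to a float here, while a value like "high" or a misspelled field name raises immediately, which is the difference between a loud validation error and a silent failure.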


Challenge 6: Real-Time Notifications

The Problem: When a report is shared with another user, they should see a notification in real-time. Polling is wasteful; WebSockets are complex.

How We Solved It:

  • For MVP: Store notifications in MongoDB, frontend polls every 30 seconds
  • Database query is indexed and fast (< 50ms)
  • Refresh indicators show new notifications
  • Notifications auto-expire after 30 days

Lesson: Sometimes a "dumb" solution (polling) beats premature optimization (WebSockets). Revisit if polling becomes a bottleneck.


Challenge 7: Testing OCR Quality

The Problem: How do you test that Tesseract correctly reads a PDF? You can't use unit tests alone.

How We Solved It:

  • Created integration tests with real sample documents
  • Stored "golden" outputs (expected text) for test PDFs
  • Compare extracted text similarity (allow ~5% error tolerance)
  • Included sample health reports in test suite

Lesson: OCR output is inexact and sensitive to preprocessing and engine version, so exact-match assertions are brittle; integration tests with real examples are essential.
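
A golden-output comparison with a tolerance can be sketched with the standard library's `difflib`. Mapping "~5% error tolerance" to a similarity ratio of 0.95 is an assumption made for this example, as are the helper names and the sample strings.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    # Collapse whitespace and case so layout differences don't count as errors.
    return " ".join(text.lower().split())


def similarity(extracted: str, golden: str) -> float:
    return SequenceMatcher(None, normalize(extracted), normalize(golden)).ratio()


def assert_matches_golden(extracted: str, golden: str, threshold: float = 0.95) -> None:
    score = similarity(extracted, golden)
    assert score >= threshold, f"similarity {score:.3f} below {threshold}"


# Hypothetical golden text and an OCR result with one misread character
# ("cho1esterol") plus different line breaks.
golden = "Hemoglobin 13.8 g/dL  Glucose (fasting) 92 mg/dL  Total cholesterol 180 mg/dL"
ocr_output = "Hemoglobin 13.8 g/dL\nGlucose (fasting) 92 mg/dL\nTotal cho1esterol 180 mg/dL"

assert_matches_golden(ocr_output, golden)
print(f"similarity: {similarity(ocr_output, golden):.3f}")
```

A character-level ratio treats every character equally, so a single wrong digit in a lab value barely moves the score; a stricter check on the extracted numbers themselves may be worth adding alongside it.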


Challenge 8: Deployment Configuration

The Problem: Environment variables, secrets, database URLs, and CORS origins all differ between dev, staging, and production. One typo in DATABASE_URL breaks everything.

How We Solved It:

  • Created .env.example with all required variables
  • Used pydantic-settings to validate env vars at startup
  • Deployment guides for Render + Vercel with copy-paste examples
  • GitHub Actions secrets for CI/CD

Lesson: Environment configuration is critical; automate validation early.
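
Startup validation can be illustrated without any dependencies. The sketch below is a standard-library stand-in for what pydantic-settings does, not its API: `Settings` and `load_settings` are hypothetical, and apart from DATABASE_URL the variable names (`JWT_SECRET`, `CORS_ORIGINS`) are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    jwt_secret: str
    cors_origins: list[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Fail fast with one message listing every missing or malformed variable."""
    errors = []
    for name in ("DATABASE_URL", "JWT_SECRET", "CORS_ORIGINS"):
        if not env.get(name, "").strip():
            errors.append(f"{name} is missing or empty")

    url = env.get("DATABASE_URL", "")
    if url and not url.startswith(("mongodb://", "mongodb+srv://")):
        errors.append("DATABASE_URL must start with mongodb:// or mongodb+srv://")

    if errors:
        raise RuntimeError("Invalid configuration: " + "; ".join(errors))

    # Comma-separated list, e.g. "https://a.example, https://b.example"
    origins = [o.strip() for o in env["CORS_ORIGINS"].split(",") if o.strip()]
    return Settings(env["DATABASE_URL"], env["JWT_SECRET"], origins)


settings = load_settings(
    {
        "DATABASE_URL": "mongodb+srv://cluster.example/hlra",
        "JWT_SECRET": "change-me",
        "CORS_ORIGINS": "https://a.example, https://b.example",
    }
)
print(settings.cors_origins)
```

Collecting all problems before raising means a misconfigured deploy reports every broken variable in one startup error instead of one per restart.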


🎓 Key Lessons Learned

  1. Start simple, optimize later: We spent weeks on OCR tuning that we didn't need initially. Good enough now beats perfect later.

  2. Type safety pays dividends: TypeScript + Pydantic eliminated ~30% of bugs before they reached testing.

  3. Async is fast: Motor + FastAPI async made uploads about 10x faster under concurrent load (2–3s down to 200–500ms with 5 users); the main added complexity was in testing.

  4. Documentation saves time: Clear setup, testing, and deployment docs reduced onboarding time from days to hours.

  5. Test infrastructure matters: Good fixtures and helpers make writing tests fun, not tedious.

  6. Real deployment early: Testing locally is not enough. Deploy to Render + Vercel early and often.

  7. Prefer boring technology: FastAPI, React, MongoDB, and Vercel are all mature, well-documented, and widely adopted. This meant faster development.

  8. Full-stack type safety is possible: End-to-end types (frontend → API → backend) eliminate entire classes of bugs.


🚀 Future Roadmap

Short Term (Next 3 Months)

  • [ ] WebSocket support for real-time notifications
  • [ ] Advanced parsing for specific health metrics (auto-detect normal/abnormal ranges)
  • [ ] Export reports to PDF / print-friendly format
  • [ ] Email notifications for shared reports

Medium Term (Next 6 Months)

  • [ ] Integration with EHR systems (FHIR standard)
  • [ ] AI-powered insights (trend detection, anomaly alerts)
  • [ ] Mobile app (React Native)
  • [ ] Multi-language support

Long Term (Next Year+)

  • [ ] Wearable integration (Apple Health, Google Fit)
  • [ ] Doctor collaboration tools (secure messaging)
  • [ ] Clinical decision support (AI-powered recommendations)
  • [ ] Research dataset sharing (anonymized, with consent)

🙏 Acknowledgments

HLRA stands on the shoulders of giants:

  • Tesseract: 30+ years of OCR excellence
  • FastAPI: Modern Python web framework
  • React: UI library that just works
  • MongoDB: Flexible NoSQL database
  • Vercel & Render: Hosting made simple

Special thanks to the open-source community for countless packages, tutorials, and StackOverflow answers that made this project possible.


📝 Final Thoughts

Building HLRA taught us that health tech doesn't need to be complicated. A well-designed tool, even if it starts simple, can genuinely improve people's lives.

We built this because we wanted something we would use ourselves. We hope it helps you (and your family) manage health with less friction and more confidence.

If you find HLRA useful, please star the repo, contribute, or share it with someone who might benefit.

