The Story of Health Lab Report Analyzer (HLRA)
What Inspired Us
Healthcare is deeply personal, yet managing health records remains surprisingly fragmented. Most people store their lab reports in email, old folders, or worse, lose them entirely. When they need to share results with a new doctor or track trends over time, there's no easy way to organize, extract, or visualize this critical information.
The Problem: Lab reports come in PDFs and scanned images with no standardized format. Extracting key numbers (blood glucose, cholesterol, hemoglobin) is manual and error-prone. Family health histories are even harder to maintain, especially when elderly relatives need their records consolidated.
The Inspiration: What if we could build a tool that:
- Instantly reads health reports using OCR (no manual typing)
- Manages health data for the whole family
- Visualizes trends over time
- Allows secure sharing with doctors or family members
- Keeps sensitive health data private and secure
We wanted to create something that healthcare professionals recommend to patients and that families actually use, not just software that feels like a chore.
What We Learned
1. OCR is Harder Than It Looks
The Learning: Building OCR is a balance between accuracy and performance.
We initially tried a simple approach: load PDF → extract text → done. It failed spectacularly on scanned documents and handwritten notes. We learned:
- Raw PDFs ≠ scanned images: PDFs with embedded text extract fine; scanned PDFs are images masquerading as PDFs
- Tesseract needs tuning: Different medical forms need different preprocessing (contrast, scaling, deskewing)
- Image preprocessing matters: We spent hours tuning PIL filters, DPI settings, and character whitelists
Solution: We built a multi-step pipeline:
- Try pdfplumber (best for structured PDFs)
- Fall back to PyPDF2 for text-based PDFs
- Last resort: Convert to images and OCR with Tesseract
- Preprocess images (grayscale, resize, enhance contrast/sharpness)
This layered approach works ~95% of the time on real medical documents.
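The layered pipeline can be sketched as a chain of extractors tried in order. This is a minimal sketch under stated assumptions, not the project's actual code: the function names, the MIN_CHARS threshold, the 300 DPI setting, the enhancement factors, and the use of the pytesseract wrapper are all illustrative choices, and the resize step is left out for brevity.

```python
from typing import Callable, Optional, Sequence

MIN_CHARS = 20  # hypothetical threshold: shorter results count as "no text found"


def extract_with_pdfplumber(path: str) -> str:
    import pdfplumber  # imported lazily so the fallback logic loads without it

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None for pages without a text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_with_pypdf2(path: str) -> str:
    from PyPDF2 import PdfReader  # PdfReader is the PyPDF2 2.x+ class name

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_with_ocr(path: str) -> str:
    # Assumes pdf2image, Pillow and pytesseract are installed, along with
    # the Tesseract and poppler binaries they call.
    import pytesseract
    from pdf2image import convert_from_path
    from PIL import ImageEnhance, ImageOps

    texts = []
    for image in convert_from_path(path, dpi=300):
        gray = ImageOps.grayscale(image)
        boosted = ImageEnhance.Contrast(gray).enhance(1.5)
        sharp = ImageEnhance.Sharpness(boosted).enhance(2.0)
        texts.append(pytesseract.image_to_string(sharp))
    return "\n".join(texts)


def extract_text(
    path: str,
    extractors: Optional[Sequence[Callable[[str], str]]] = None,
) -> Optional[str]:
    """Return text from the first extractor that yields enough characters."""
    if extractors is None:
        extractors = (extract_with_pdfplumber, extract_with_pypdf2, extract_with_ocr)
    for extractor in extractors:
        try:
            text = extractor(path).strip()
        except Exception:
            continue  # a failing step should not abort the whole chain
        if len(text) >= MIN_CHARS:
            return text
    return None
```

Passing the extractors as a parameter keeps the ordering logic testable with stand-in functions, without real PDFs or a Tesseract install.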
2. Async Python is a Game-Changer
The Learning: FastAPI + Motor + async/await makes a huge difference in throughput.
When we first wrote the backend, we used synchronous database calls. For a single user? Fine. For multiple users uploading files simultaneously? Slow. We switched to:
- Motor: Async MongoDB driver (non-blocking database calls)
- Async services: Made all business logic async
- Concurrent file uploads: The server can now handle 10+ simultaneous uploads without blocking
Result: With 5 concurrent users, response times improved from 2–3 s per upload to 200–500 ms. This was one of the biggest wins of the project.
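The effect of non-blocking I/O can be illustrated with a small standard-library sketch. Here asyncio.sleep stands in for a database or disk call, and the function names and the 0.1 s delay are assumptions for illustration; this is not the project's upload handler.

```python
import asyncio
import time


async def handle_upload(name: str, io_seconds: float = 0.1) -> str:
    # asyncio.sleep stands in for a non-blocking database or disk call
    await asyncio.sleep(io_seconds)
    return f"{name}: stored"


async def handle_all(names: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and preserves input order
    return await asyncio.gather(*(handle_upload(name) for name in names))


def timed_run(names: list[str]) -> tuple[list[str], float]:
    """Run all uploads and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(handle_all(names))
    return results, time.perf_counter() - start
```

With five uploads, the elapsed time stays close to a single I/O delay rather than five of them, because each coroutine yields control while it waits.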
3. TypeScript + Zod Saves Bugs
The Learning: Full end-to-end type safety is worth the setup cost.
Early on, we had runtime errors because the frontend sent a string where the backend expected a number. With TypeScript and Zod on the frontend plus Pydantic on the backend, these errors disappeared:
- Mismatched field names caught at compile time
- Type mismatches caught instantly
- API contract violations caught before reaching production
How We Built It
Technology Choices
Backend: Why FastAPI?
- Modern, fast ASGI framework with automatic API docs
- Native async/await support
- Pydantic for validation (type-safe, intuitive)
- Active ecosystem and great docs
- Produces beautiful Swagger UI automatically
Frontend: Why React + Vite?
- React's component model is intuitive for dashboard-like apps
- Vite is blazingly fast (dev server, build, HMR)
- TypeScript for safety
- Rich ecosystem (TanStack Query, Radix UI, Tailwind)
Database: Why MongoDB?
- Flexible schema (health data varies widely by person)
- Excellent for rapid iteration (no migrations)
- Atlas cloud offering removes DevOps burden
- Motor driver gives async support
Deployment: Why Render + Vercel?
- Render: Simple Docker deployment, free tier is generous
- Vercel: Optimized for React/TypeScript, instant deploys, excellent CDN
- Together: Zero DevOps knowledge needed; push to GitHub → auto-deployed
Architecture
┌─────────────────────────────────────────────────────────┐
│                    Frontend (React)                     │
│  Vite, TypeScript, Tailwind, Radix UI, TanStack Query   │
│                    Hosted on Vercel                     │
└────────────────────────────┬────────────────────────────┘
                             │ HTTPS
                             ▼
┌─────────────────────────────────────────────────────────┐
│                     FastAPI Backend                     │
│      uvicorn, async Motor, OCR pipeline, JWT auth       │
│                    Hosted on Render                     │
└─────────────┬───────────────────────────┬───────────────┘
              │                           │
              ▼                           ▼
      ┌──────────────┐          ┌──────────────────┐
      │   MongoDB    │          │   File Upload    │
      │    Atlas     │          │    (local FS)    │
      └──────────────┘          └──────────────────┘
Development Phases
Phase 1: Core Auth & Upload (Weeks 1–3)
- User registration/login with JWT
- File upload endpoint
- Basic file storage
Phase 2: OCR & Extraction (Weeks 4–6)
- Tesseract integration
- Multi-step PDF/image processing
- Text extraction and parsing
Phase 3: Data Management (Weeks 7–9)
- Family profile management
- Health data models (glucose, cholesterol, etc.)
- Sorting, filtering, pagination
Phase 4: Visualization & Sharing (Weeks 10–12)
- Charts with Recharts
- Report sharing with secure tokens
- Notifications system
Phase 5: Polish & Deployment (Weeks 13–16)
- Testing (unit + integration)
- Performance optimization
- Deployment to Render + Vercel
- Documentation
Challenges We Faced
Challenge 1: Tesseract Installation Across Platforms
The Problem: Tesseract is a C++ binary, not a Python package. Installing it on Windows, macOS, and Linux requires different steps. CI/CD pipelines also needed it.
How We Solved It:
- Created platform-specific installation scripts in SETUP.md
- GitHub Actions workflow installs Tesseract + poppler-utils automatically
- Docker image includes Tesseract for production consistency
- Added helpful troubleshooting for common errors
Lesson: External dependencies require thorough documentation and testing on all platforms.
Challenge 2: Async/Await Complexity in Tests
The Problem: Writing tests for async code is tricky. A plain test function can't await a coroutine, so you need pytest-asyncio and @pytest.mark.asyncio to run each test inside an event loop. Mock objects behave differently when their methods must be awaited. Async context managers are confusing.
How We Solved It:
- Created a conftest.py with shared fixtures for async tests
- Built helper functions for mocking async database calls
- Documented patterns in TESTING.md
- Used fixtures scoped to session and function appropriately
Lesson: Async Python is powerful but requires discipline; good test infrastructure saves hours.
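A helper for mocking async database calls might look like the sketch below. It uses only the standard library's unittest.mock.AsyncMock; the service function, the helper name, and the document fields are hypothetical and not taken from the project's conftest.py.

```python
import asyncio
from unittest.mock import AsyncMock


async def get_report_title(reports, report_id: str) -> str:
    """Hypothetical service function that awaits a Motor-style find_one call."""
    doc = await reports.find_one({"_id": report_id})
    return doc["title"] if doc else "not found"


def make_fake_collection(document):
    """Helper: a stand-in collection whose find_one can be awaited."""
    collection = AsyncMock()
    collection.find_one.return_value = document
    return collection


def test_get_report_title():
    reports = make_fake_collection({"_id": "r1", "title": "Lipid panel"})
    # asyncio.run drives the coroutine here; under pytest-asyncio the test
    # would instead be declared `async def` and marked @pytest.mark.asyncio.
    assert asyncio.run(get_report_title(reports, "r1")) == "Lipid panel"
    reports.find_one.assert_awaited_once_with({"_id": "r1"})
```

Because the service function receives the collection as an argument, the test never needs a running MongoDB instance.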
Challenge 3: CORS and Authentication in Fullstack Apps
The Problem: Frontend and backend are on different domains (Vercel and Render). CORS headers, token refresh, and cookie handling were initially broken. We had:
- 401 errors on token refresh
- CORS blocks on every request
- Stale tokens not refreshing automatically
How We Solved It:
- Configured CORS properly in FastAPI: allowed origins from .env
- Implemented JWT refresh token flow in AuthContext
- Stored tokens in localStorage (with HTTPS in production)
- Intercepted 401 responses to auto-refresh tokens
Lesson: Fullstack security is nuanced; test with actual deployed URLs, not just localhost.
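Reading allowed origins from the environment could be wired up roughly as follows. This is a sketch, not the project's configuration: the ALLOWED_ORIGINS variable name and both function names are assumptions. CORSMiddleware and its parameters are FastAPI's own.

```python
import os


def parse_allowed_origins(raw: str) -> list[str]:
    """Split a comma-separated env value into clean origin strings."""
    # Browsers send origins without a trailing slash, so strip it here
    return [part.strip().rstrip("/") for part in raw.split(",") if part.strip()]


def create_app():
    # Imported lazily so the parsing helper can be tested without FastAPI
    from fastapi import FastAPI
    from fastapi.middleware.cors import CORSMiddleware

    app = FastAPI()
    app.add_middleware(
        CORSMiddleware,
        # ALLOWED_ORIGINS is a hypothetical variable name
        allow_origins=parse_allowed_origins(os.environ.get("ALLOWED_ORIGINS", "")),
        allow_credentials=True,
        allow_methods=["*"],
        allow_headers=["*"],
    )
    return app
```

Listing explicit origins matters when credentials are allowed, since browsers reject a wildcard origin on credentialed requests.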
Challenge 4: File Upload Management
The Problem: Users upload large PDFs (5β10 MB). Where do you store them?
- Option 1: Local filesystem: Works on Render, but ephemeral (files lost on redeploy)
- Option 2: S3 or cloud storage: More work, more cost
- Option 3: Temporary storage + extract text only: Lose the original file
We chose temporary local storage during processing and delete the file after text extraction. For persistence, we store the extracted text and metadata in MongoDB.
How We Solved It:
- Created uploads/ folder
- Set MAX_FILE_SIZE to 10 MB in config
- Clean up temporary files after OCR
- Store original metadata in database (filename, upload date, file hash)
Lesson: For MVP apps, simple solutions (local storage + cleanup) beat premature cloud architecture.
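The store-temporarily-then-delete approach can be sketched with the standard library alone. The function name, the returned field names, and the choice of SHA-256 for the file hash are assumptions for illustration; only the 10 MB limit comes from the text.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone

MAX_FILE_SIZE = 10 * 1024 * 1024  # 10 MB, matching the limit above


def process_upload(data: bytes, filename: str, extract) -> dict:
    """Write the upload to a temp file, extract text, then always delete the file.

    `extract` is any callable that takes a file path and returns text.
    """
    if len(data) > MAX_FILE_SIZE:
        raise ValueError("file exceeds MAX_FILE_SIZE")

    suffix = os.path.splitext(filename)[1]
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        text = extract(path)
    finally:
        os.remove(path)  # runs even if extraction raises

    # Only text and metadata survive; this dict is what would go to the database
    return {
        "filename": filename,
        "uploaded_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "text": text,
    }
```

Putting the delete in a finally block means a failed OCR run does not leave patient files behind on disk.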
Challenge 5: Data Validation Across Frontend & Backend
The Problem: Frontend sends JSON; backend expects specific types and formats. Mismatches caused:
- Type errors on field access
- Validation errors on create operations
- Silent failures where invalid data slipped through
How We Solved It:
- Pydantic models for all backend schemas
- Zod schemas for all frontend forms
- API contracts documented in Swagger
- Shared TypeScript types generated from Pydantic
Lesson: Type-driven development eliminates entire categories of bugs.
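On the backend side, a Pydantic model rejects a mismatched payload before any business logic runs. The model below is a hypothetical example; the field names and the unit are assumptions, and the project's real schemas may look different.

```python
from pydantic import BaseModel, Field, ValidationError


class GlucoseReading(BaseModel):
    # Field names and the unit are hypothetical
    profile_id: str
    value_mg_dl: float = Field(gt=0)


def parse_reading(payload: dict):
    """Return (reading, None) on success or (None, error_count) on failure."""
    try:
        return GlucoseReading(**payload), None
    except ValidationError as exc:
        return None, len(exc.errors())
```

A payload such as {"profile_id": "p1", "value_mg_dl": "high"} fails validation here instead of surfacing later as a type error on field access.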
Challenge 6: Real-Time Notifications
The Problem: When a report is shared with another user, they should see a notification in real-time. Polling is wasteful; WebSockets are complex.
How We Solved It:
- For MVP: Store notifications in MongoDB, frontend polls every 30 seconds
- Database query is indexed and fast (< 50ms)
- Refresh indicators show new notifications
- Notifications auto-expire after 30 days
Lesson: Sometimes a "dumb" solution (polling) beats premature optimization (WebSockets). Revisit if polling becomes a bottleneck.
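The text does not say how the fast query or the 30-day expiry is implemented. One common approach with MongoDB, assumed here, is a compound index for the polling query plus a TTL index for expiry. The field names user_id and created_at and the function names are hypothetical; create_index and expireAfterSeconds are the standard PyMongo/Motor call and option.

```python
import asyncio
from unittest.mock import AsyncMock

THIRTY_DAYS = 30 * 24 * 60 * 60  # seconds


async def ensure_notification_indexes(notifications) -> None:
    """Create the indexes on a Motor collection (or any awaitable stand-in)."""
    # Compound index serves the polling query: newest notifications per user
    await notifications.create_index([("user_id", 1), ("created_at", -1)])
    # TTL index: MongoDB removes documents once created_at is 30 days old
    await notifications.create_index("created_at", expireAfterSeconds=THIRTY_DAYS)


def demo() -> int:
    """Run against a mock collection and report how many indexes were requested."""
    fake = AsyncMock()
    asyncio.run(ensure_notification_indexes(fake))
    return fake.create_index.await_count
```

A TTL index only works when created_at is stored as a BSON date, so the documents would need a datetime value rather than a string.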
Challenge 7: Testing OCR Quality
The Problem: How do you test that Tesseract correctly reads a PDF? You can't use unit tests alone.
How We Solved It:
- Created integration tests with real sample documents
- Stored "golden" outputs (expected text) for test PDFs
- Compare extracted text similarity (allow ~5% error tolerance)
- Included sample health reports in test suite
Lesson: OCR output is inherently imprecise and varies with scan quality and preprocessing; integration tests with real examples are essential.
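A golden-file comparison with a tolerance can be sketched using difflib from the standard library. The function names and the normalization step are assumptions; the text only states that roughly 5% error is allowed, not which similarity measure was used.

```python
from difflib import SequenceMatcher

TOLERANCE = 0.05  # allow ~5% mismatch, as described above


def normalize(text: str) -> str:
    """Collapse whitespace and case so layout noise does not count as error."""
    return " ".join(text.lower().split())


def similarity(extracted: str, golden: str) -> float:
    """Return a ratio between 0.0 and 1.0 for the two normalized strings."""
    return SequenceMatcher(None, normalize(extracted), normalize(golden)).ratio()


def matches_golden(extracted: str, golden: str) -> bool:
    return similarity(extracted, golden) >= 1.0 - TOLERANCE
```

In an integration test, `extracted` would come from running the OCR pipeline on a sample PDF and `golden` from the stored expected text for that file.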
Challenge 8: Deployment Configuration
The Problem: Environment variables, secrets, database URLs, and CORS origins all differ between dev, staging, and production. One typo in DATABASE_URL breaks everything.
How We Solved It:
- Created .env.example with all required variables
- Used pydantic-settings to validate env vars at startup
- Deployment guides for Render + Vercel with copy-paste examples
- GitHub Actions secrets for CI/CD
Lesson: Environment configuration is critical; automate validation early.
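pydantic-settings handles this declaratively. The sketch below shows the same fail-fast idea in plain Python so it runs without extra packages; it is a stand-in, not the project's settings class. Apart from DATABASE_URL, the variable names are hypothetical.

```python
import os

# Hypothetical variable names; use whatever .env.example actually lists
REQUIRED_VARS = ("DATABASE_URL", "JWT_SECRET", "ALLOWED_ORIGINS")


def load_settings(env=None) -> dict:
    """Return the required settings, or raise one error naming every problem."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))

    settings = {name: env[name].strip() for name in REQUIRED_VARS}
    if not settings["DATABASE_URL"].startswith(("mongodb://", "mongodb+srv://")):
        raise RuntimeError("DATABASE_URL must start with mongodb:// or mongodb+srv://")
    return settings
```

Calling load_settings() once at startup turns a typo in a deployment dashboard into an immediate, readable error instead of a failure on the first database call.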
Key Lessons Learned
Start simple, optimize later: We spent weeks on OCR tuning that we didn't need initially. Good enough now beats perfect later.
Type safety pays dividends: TypeScript + Pydantic eliminated ~30% of bugs before they reached testing.
Async is fast: Motor + FastAPI async gave us roughly 10x better concurrency than sync code, with little extra complexity in application code (async tests were the exception).
Documentation saves time: Clear setup, testing, and deployment docs reduced onboarding time from days to hours.
Test infrastructure matters: Good fixtures and helpers make writing tests fun, not tedious.
Real deployment early: Testing locally is not enough. Deploy to Render + Vercel early and often.
Prefer boring technology: FastAPI, React, MongoDB, and Vercel are all mature, well-documented, and widely adopted. This meant faster development.
Full-stack type safety is possible: End-to-end types (frontend → API → backend) eliminate entire classes of bugs.
Future Roadmap
Short Term (Next 3 Months)
- [ ] WebSocket support for real-time notifications
- [ ] Advanced parsing for specific health metrics (auto-detect normal/abnormal ranges)
- [ ] Export reports to PDF / print-friendly format
- [ ] Email notifications for shared reports
Medium Term (Next 6 Months)
- [ ] Integration with EHR systems (FHIR standard)
- [ ] AI-powered insights (trend detection, anomaly alerts)
- [ ] Mobile app (React Native)
- [ ] Multi-language support
Long Term (Next Year+)
- [ ] Wearable integration (Apple Health, Google Fit)
- [ ] Doctor collaboration tools (secure messaging)
- [ ] Clinical decision support (AI-powered recommendations)
- [ ] Research dataset sharing (anonymized, with consent)
Acknowledgments
HLRA stands on the shoulders of giants:
- Tesseract: 30+ years of OCR excellence
- FastAPI: Modern Python web framework
- React: UI library that just works
- MongoDB: Flexible NoSQL database
- Vercel & Render: Hosting made simple
Special thanks to the open-source community for countless packages, tutorials, and StackOverflow answers that made this project possible.
Final Thoughts
Building HLRA taught us that health tech doesn't need to be complicated. A well-designed tool, even if it starts simple, can genuinely improve people's lives.
We built this because we wanted something we would use ourselves. We hope it helps you (and your family) manage health with less friction and more confidence.
If you find HLRA useful, please star the repo, contribute, or share it with someone who might benefit.
Built With
- axios
- bcrypt
- css
- fastapi
- html
- httpx
- javascript
- mongodb
- passlib[bcrypt]
- pdf2image
- pdfplumber
- pillow
- poppler-utils
- pydantic
- pypdf2
- pytest
- pytest-asyncio
- python
- python-jose[cryptography]
- python-multipart
- react
- recharts
- sonner
- tesseract
- typescript
- uvicorn
- vite
- zod