Inspiration
Every university student knows the feeling — it's 2am before an exam, you need to find where your professor explained a specific concept, and it's buried somewhere across 47 PDFs from four different courses. You open one, Ctrl+F, nothing. Open another. Repeat. Google solved this problem for the entire internet in 1998, but your local course files are still stuck in the dark ages. That frustration was the spark. We wanted to build the Google Search of your university folder — something that treats your lecture slides, textbook chapters, and past papers as a single searchable knowledge base.
What it does
Cross-Course Search is a fully local PDF search engine built for students. You drop all your course PDFs into one folder, run the app, and get an instant search bar that queries every document simultaneously. Results appear in under 10ms, ranked by relevance, grouped by document, with the exact page number and a highlighted snippet so you know what you're clicking before you open anything. It ships with three interfaces depending on how you like to work — a browser-based web UI, a native Windows desktop app, and a rich interactive terminal. Everything runs entirely on your own machine: no files are uploaded, no account is required, and no internet connection is needed after setup, which also makes the tool usable in places with limited connectivity.
How we built it
The core engine is a Python in-memory index built on startup. Every PDF in the folder gets parsed page by page using pdfplumber, with PyMuPDF available as a faster optional engine. Each page becomes a record storing the filename, title, page number, and a pre-lowercased copy of the text for fast comparison. The search function runs multi-term AND matching with frequency scoring — every word in your query must appear on the page, and pages that mention your terms more often rank higher. The whole thing runs in under 5ms for a typical student collection. The web interface is Flask with six REST endpoints and a single HTML frontend that debounces keystrokes and fires fetch requests for instant-as-you-type results. The terminal UI uses raw ANSI escape codes with colorama for Windows compatibility — no third-party terminal library needed. The desktop app is pure tkinter, which ships bundled with Python on Windows, so it installs with zero extra dependencies beyond the PDF libraries.
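The AND-matching and frequency-scoring logic described above can be sketched in a few lines. This is an illustrative reconstruction, not the actual source — the `PageRecord` fields and function names are assumptions based on the description:

```python
from dataclasses import dataclass

@dataclass
class PageRecord:
    filename: str
    page_number: int
    text_lower: str  # pre-lowercased page text, stored once at index time


def search(index, query):
    """Multi-term AND match with frequency scoring.

    Every word in the query must appear on the page (substring match
    against the pre-lowercased text); pages that mention the terms more
    often rank higher.
    """
    terms = query.lower().split()
    scored = []
    for page in index:
        if all(t in page.text_lower for t in terms):
            score = sum(page.text_lower.count(t) for t in terms)
            scored.append((score, page))
    # Highest total term frequency first
    return [page for _, page in sorted(scored, key=lambda sp: -sp[0])]
```

Because the text is lowercased once at index time and the match is a plain substring scan, a linear pass over a few thousand page records easily stays in the single-digit-millisecond range.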
Challenges we ran into
The biggest challenge was the gap between "works on my machine" and "works for someone who just installed Python for the first time." Getting pip recognized in PowerShell, navigating to the right folder before running the server, understanding why the app returned a 404, fixing smart-quote encoding errors that corrupted the source file during copy-paste — none of these are glamorous engineering problems, but they are the real ones that determine whether a tool actually gets used. On the technical side, pdfplumber silently returns empty strings for certain PDF pages that visually contain content, which required building a three-engine fallback chain so the app degrades gracefully rather than missing pages without warning. Scanned PDFs — images masquerading as text documents — are a hard limitation we had to honestly document rather than pretend to solve.
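The fallback chain follows a simple pattern: try each engine in order and accept the first non-empty result. A sketch of the idea, with the extractor callables standing in for wrappers around pdfplumber, PyMuPDF, and pypdf in the real app:

```python
def extract_with_fallback(page_id, extractors):
    """Try each extraction engine in order; return (engine_name, text).

    `extractors` is an ordered list of (name, fn) pairs. An engine that
    raises, or that returns an empty string for a page that visually has
    content, is treated as a miss and the next engine gets a turn.
    """
    for name, extract in extractors:
        try:
            text = extract(page_id)
        except Exception:
            continue  # this engine crashed on this page; fall through
        if text and text.strip():
            return name, text
    # Every engine came up empty: report it so the page isn't silently lost
    return None, ""
```

Returning the engine name alongside the text makes it easy to log which pages needed a fallback, instead of discovering missing content only when a search fails.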
Accomplishments that we're proud of
Getting the full round trip from keystroke to ranked results on screen down to under 200ms. Building three genuinely distinct interfaces — web, desktop, and terminal — each appropriate to a different kind of user, all powered by a single shared indexing engine with no duplicated logic. The terminal UI rendering a fully colored, animated interface natively in Windows PowerShell with nothing beyond colorama — no third-party terminal UI framework. The desktop app requiring no installation beyond the PDF libraries because tkinter is already there. Shipping a complete professional Word documentation file alongside the code — cover page, table of contents, page numbers, troubleshooting table — that addresses the exact real-world errors a first-time user will actually hit. And getting the whole thing from zero to a working, pushed GitHub repo in a single session.
What we learned
The setup experience is the product. Feature quality is irrelevant if someone hits a confusing error in the first two minutes and closes the terminal. We learned that PDF text extraction is far less reliable than it appears — the same document can yield completely different text output across different parsers, and building in silent fallbacks is not optional. We learned that Git merge conflicts and PATH errors are not beginner problems to be embarrassed about — they are the real friction every developer hits when moving from local code to a shareable project. And we learned that shipping three interfaces instead of one forced cleaner architecture — when the same index has to serve a REST API, a tkinter canvas, and an ANSI terminal simultaneously, the separation of concerns has to be real.
What's next for Cross-Course Search
The highest-impact next feature is semantic search — replacing exact keyword matching with sentence embeddings so that searching "matrix transformation" also surfaces pages about "linear maps" and "change of basis" even if those exact words don't appear. After that, incremental indexing so adding a single new PDF doesn't require re-parsing the entire collection from scratch. Then a browser extension that overlays the search UI directly on top of Google Meet, Zoom, or your university's LMS during a live lecture. OCR support via Tesseract for scanned PDFs. And a PyInstaller-packaged .exe so the desktop app can be shared with classmates who have never touched Python — one double-click and it runs.
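The core of the planned semantic search is ranking pages by vector similarity instead of exact keyword overlap. A minimal sketch of that ranking step, assuming pages and queries have already been embedded (in practice by a sentence-embedding model such as one from the sentence-transformers library; the toy vectors below just stand in for real embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, page_vecs, top_k=3):
    """Rank pages by embedding similarity to the query.

    `page_vecs` maps a page label to its embedding vector. Unlike the
    keyword engine, a page can rank highly without sharing a single
    word with the query, as long as its embedding is nearby.
    """
    ranked = sorted(page_vecs.items(), key=lambda kv: -cosine(query_vec, kv[1]))
    return [page for page, _ in ranked[:top_k]]
```

This is why "matrix transformation" could surface a page about "linear maps": their embeddings land close together even though the keyword AND-match would return nothing.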
Built With
- ansi/colorama
- claude
- css3
- flask
- google-fonts
- html5
- javascript
- pdfplumber
- pymupdf
- pypdf
- python
- reportlab
- tkinter