MolecularAI — Structural Analysis Workbench
Inspiration
My background is in chemistry. I spent time as a chemistry major involved in drug discovery research, where I saw firsthand how fragmented the tools were. Researchers would jump between the RCSB Protein Data Bank to find structures, separate literature databases to understand function, and specialized desktop software just to visualize what they were looking at. None of it talked to each other, and none of it was fast.
The core problem in early-stage drug discovery is understanding the relationship between a protein's structure and its function. This is governed by the fundamental principle:
$$\text{Structure} \rightarrow \text{Function} \rightarrow \text{Disease Mechanism} \rightarrow \text{Drug Target}$$
When I started this hackathon, I kept coming back to that frustration from my research days. What if you could type a protein name and immediately get the structure, the science, and the ability to ask questions, all in one place? That's where MolecularAI came from.
What It Does
MolecularAI is a structural analysis workbench for proteins and molecules. You type a gene name, protein name, PDB ID, amino acid sequence, or SMILES string into a single search bar and the app:
- Resolves it to the correct canonical structure
- Renders it in interactive 3D
- Generates a full AI-powered scientific briefing covering function, disease relevance, structural highlights, drug interactions, and key binding sites
- Enables natural language follow-up via a context-aware Research Assistant
- Saves analyzed proteins to a persistent Molecule Library backed by MongoDB Atlas
How We Built It
Frontend
- React + Vite + Tailwind CSS for the UI
- 3Dmol.js for interactive 3D molecular rendering with Cartoon, Surface, and Stick display modes
Backend
- Python + FastAPI hosted on Vultr
- RDKit for cheminformatics descriptor computation from SMILES strings
- BioPython for sequence handling and PDB parsing
Protein Resolution Pipeline
Gene names and protein names are resolved through a multi-step lookup:
$$\text{Query} \xrightarrow{\text{UniProt}} \text{Accession} \xrightarrow{\text{RCSB}} \text{PDB ID} \xrightarrow{\text{3Dmol.js}} \text{3D Structure}$$
Raw amino acid sequences bypass this pipeline and go directly to ESMFold for ab initio structure prediction.
AI Layer
- Google Gemini 1.5 Pro powers both the structural analysis summaries and the multi-turn Research Assistant chat
- Full conversation history is maintained on every request with protein context injected as system context
- Embeddings use Gemini text-embedding-004 (768 dimensions)
Database
- MongoDB Atlas with Motor async driver stores saved molecules
- Search suggestions are cached with a 24-hour TTL to reduce UniProt API calls
Molecular Descriptors
For small molecules, RDKit computes key drug-likeness properties evaluated against Lipinski's Rule of Five:
$$\text{MW} \leq 500 \quad \log P \leq 5 \quad \text{HBD} \leq 5 \quad \text{HBA} \leq 10$$
where MW is molecular weight, $\log P$ is the octanol-water partition coefficient, HBD is hydrogen bond donors, and HBA is hydrogen bond acceptors.
Challenges We Ran Into
Protein Lookup Accuracy
Early versions used fuzzy text search against PDB titles, which returned completely wrong results. Searching BRCA1 would return a paper about a protein called "Next to BRCA1 gene 1" because the name appeared in the title metadata. Switching to UniProt canonical resolution with RCSB cross-referencing fixed accuracy entirely.
MongoDB Atlas TLS Failures
We ran into SSL handshake failures across all three shard nodes:
[SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error
This turned out to be caused by special characters in the connection string password being misinterpreted during URI encoding, not a code-level SSL issue. The fix was setting a clean alphanumeric password and using explicit certifi CA configuration:
client = AsyncIOMotorClient(
uri,
tls=True,
tlsCAFile=certifi.where(),
serverSelectionTimeoutMS=30000
)
Gemini Response Truncation
Getting complete, untruncated summaries from Gemini required careful attention to where generation_config is applied. Setting max_output_tokens on the model instantiation rather than the generate_content() call caused it to be silently ignored. The fix:
response = model.generate_content(
prompt,
generation_config=genai.types.GenerationConfig(
max_output_tokens=4096,
temperature=0.3
)
)
Accessing the response via response.candidates[0].content.parts[0].text instead of response.text also proved more reliable for extracting the full output.
Stale Closure Bug in Chat History
Managing multi-turn conversation state in React without stale closures required a ref-based approach:
const messagesRef = useRef(messages);
useEffect(() => {
messagesRef.current = messages;
}, [messages]);
Reading from messagesRef.current inside async handlers ensures the full conversation history is always sent to the backend, not a snapshot from when the component last rendered.
Accomplishments We're Proud Of
- Getting the full pipeline working end to end, from a plain gene name like
TP53to a rendered 3D structure with a complete Gemini analysis, in a single search - Protein lookup accuracy: UniProt canonical resolution returns the scientifically correct structure every time
- A Research Assistant that genuinely holds context across a multi-turn conversation
- A persistent Molecule Library that survives sessions and reloads a full analysis in one click
What We Learned
- Bioinformatics APIs are powerful but unforgiving. UniProt, RCSB, and ESMFold each have their own data shapes, rate limits, and failure modes
- Gemini's
generation_configmust be passed at the call level, not the model instantiation level - MongoDB Atlas TLS issues are almost always environmental rather than code-level
- Scope aggressively for a hackathon. Every feature that made the demo compelling required cutting two features that sounded good on paper
What's Next for MolecularAI
- Binding pocket detection — automatically highlight druggable sites on the 3D structure using fpocket
- Molecular docking — upload a candidate ligand, run AutoDock Vina on the Vultr GPU backend, visualize ranked docking poses scored by binding affinity $\Delta G$
- Mutation impact prediction — model point mutations and predict stability changes $\Delta\Delta G$ before running wet lab experiments
- Multi-protein comparison — load two structures side by side and ask Gemini to compare binding sites
- Voice narration — guided audio walkthrough of the structural analysis for accessibility
The long-term vision is a platform where a researcher can go from a gene name to a shortlist of candidate drug interactions, computationally, in minutes, without specialized bioinformatics training.
Built With
- fastapi
- gemini-api
- gemini-text-embedding-004
- mongodb
- python
- react
- tailwind
- typescript
- vite
Log in or sign up for Devpost to join the conversation.