BRCA1 model and gemini-powered overview
gemini-powered research assistant along with mongodb-powered protein storage

MolecularAI — Structural Analysis Workbench

Inspiration

My background is in chemistry. I spent time as a chemistry major involved in drug discovery research, where I saw firsthand how fragmented the tools were. Researchers would jump between the RCSB Protein Data Bank to find structures, separate literature databases to understand function, and specialized desktop software just to visualize what they were looking at. None of it talked to each other, and none of it was fast.

The core problem in early-stage drug discovery is understanding the relationship between a protein's structure and its function. This is governed by the fundamental principle:

$$\text{Structure} \rightarrow \text{Function} \rightarrow \text{Disease Mechanism} \rightarrow \text{Drug Target}$$

When I started this hackathon, I kept coming back to that frustration from my research days. What if you could type a protein name and immediately get the structure, the science, and the ability to ask questions, all in one place? That's where MolecularAI came from.

What It Does

MolecularAI is a structural analysis workbench for proteins and molecules. You type a gene name, protein name, PDB ID, amino acid sequence, or SMILES string into a single search bar and the app:

Resolves it to the correct canonical structure
Renders it in interactive 3D
Generates a full AI-powered scientific briefing covering function, disease relevance, structural highlights, drug interactions, and key binding sites
Enables natural language follow-up via a context-aware Research Assistant
Saves analyzed proteins to a persistent Molecule Library backed by MongoDB Atlas

How We Built It

Frontend

React + Vite + Tailwind CSS for the UI
3Dmol.js for interactive 3D molecular rendering with Cartoon, Surface, and Stick display modes

Backend

Python + FastAPI hosted on Vultr
RDKit for cheminformatics descriptor computation from SMILES strings
BioPython for sequence handling and PDB parsing

Protein Resolution Pipeline

Gene names and protein names are resolved through a multi-step lookup:

$$\text{Query} \xrightarrow{\text{UniProt}} \text{Accession} \xrightarrow{\text{RCSB}} \text{PDB ID} \xrightarrow{\text{3Dmol.js}} \text{3D Structure}$$

Raw amino acid sequences bypass this pipeline and go directly to ESMFold for ab initio structure prediction.

AI Layer

Google Gemini 1.5 Pro powers both the structural analysis summaries and the multi-turn Research Assistant chat
Full conversation history is maintained on every request with protein context injected as system context
Embeddings use Gemini text-embedding-004 (768 dimensions)

Database

MongoDB Atlas with Motor async driver stores saved molecules
Search suggestions are cached with a 24-hour TTL to reduce UniProt API calls

Molecular Descriptors

For small molecules, RDKit computes key drug-likeness properties evaluated against Lipinski's Rule of Five:

$$\text{MW} \leq 500 \quad \log P \leq 5 \quad \text{HBD} \leq 5 \quad \text{HBA} \leq 10$$

where MW is molecular weight, $\log P$ is the octanol-water partition coefficient, HBD is hydrogen bond donors, and HBA is hydrogen bond acceptors.

Challenges We Ran Into

Protein Lookup Accuracy

Early versions used fuzzy text search against PDB titles, which returned completely wrong results. Searching BRCA1 would return a paper about a protein called "Next to BRCA1 gene 1" because the name appeared in the title metadata. Switching to UniProt canonical resolution with RCSB cross-referencing fixed accuracy entirely.

MongoDB Atlas TLS Failures

We ran into SSL handshake failures across all three shard nodes:

[SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error

This turned out to be caused by special characters in the connection string password being misinterpreted during URI encoding, not a code-level SSL issue. The fix was setting a clean alphanumeric password and using explicit certifi CA configuration:

client = AsyncIOMotorClient(
    uri,
    tls=True,
    tlsCAFile=certifi.where(),
    serverSelectionTimeoutMS=30000
)

Gemini Response Truncation

Getting complete, untruncated summaries from Gemini required careful attention to where generation_config is applied. Setting max_output_tokens on the model instantiation rather than the generate_content() call caused it to be silently ignored. The fix:

response = model.generate_content(
    prompt,
    generation_config=genai.types.GenerationConfig(
        max_output_tokens=4096,
        temperature=0.3
    )
)

Accessing the response via response.candidates[0].content.parts[0].text instead of response.text also proved more reliable for extracting the full output.

Stale Closure Bug in Chat History

Managing multi-turn conversation state in React without stale closures required a ref-based approach:

const messagesRef = useRef(messages);
useEffect(() => {
    messagesRef.current = messages;
}, [messages]);

Reading from messagesRef.current inside async handlers ensures the full conversation history is always sent to the backend, not a snapshot from when the component last rendered.

Accomplishments We're Proud Of

Getting the full pipeline working end to end, from a plain gene name like TP53 to a rendered 3D structure with a complete Gemini analysis, in a single search
Protein lookup accuracy: UniProt canonical resolution returns the scientifically correct structure every time
A Research Assistant that genuinely holds context across a multi-turn conversation
A persistent Molecule Library that survives sessions and reloads a full analysis in one click

What We Learned

Bioinformatics APIs are powerful but unforgiving. UniProt, RCSB, and ESMFold each have their own data shapes, rate limits, and failure modes
Gemini's generation_config must be passed at the call level, not the model instantiation level
MongoDB Atlas TLS issues are almost always environmental rather than code-level
Scope aggressively for a hackathon. Every feature that made the demo compelling required cutting two features that sounded good on paper

What's Next for MolecularAI

Binding pocket detection — automatically highlight druggable sites on the 3D structure using fpocket
Molecular docking — upload a candidate ligand, run AutoDock Vina on the Vultr GPU backend, visualize ranked docking poses scored by binding affinity $\Delta G$
Mutation impact prediction — model point mutations and predict stability changes $\Delta\Delta G$ before running wet lab experiments
Multi-protein comparison — load two structures side by side and ask Gemini to compare binding sites
Voice narration — guided audio walkthrough of the structural analysis for accessibility

The long-term vision is a platform where a researcher can go from a gene name to a shortlist of candidate drug interactions, computationally, in minutes, without specialized bioinformatics training.

Built With

fastapi
gemini-api
gemini-text-embedding-004
mongodb
python
react
tailwind
typescript
vite

Updates

Israel Mazon started this project — Mar 29, 2026 03:32 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.