MolecularAI — Structural Analysis Workbench

Inspiration

My background is in chemistry. I spent time as a chemistry major involved in drug discovery research, where I saw firsthand how fragmented the tools were. Researchers would jump between the RCSB Protein Data Bank to find structures, separate literature databases to understand function, and specialized desktop software just to visualize what they were looking at. None of it talked to each other, and none of it was fast.

The core problem in early-stage drug discovery is understanding the relationship between a protein's structure and its function. This is governed by the fundamental principle:

$$\text{Structure} \rightarrow \text{Function} \rightarrow \text{Disease Mechanism} \rightarrow \text{Drug Target}$$

When I started this hackathon, I kept coming back to that frustration from my research days. What if you could type a protein name and immediately get the structure, the science, and the ability to ask questions, all in one place? That's where MolecularAI came from.


What It Does

MolecularAI is a structural analysis workbench for proteins and molecules. You type a gene name, protein name, PDB ID, amino acid sequence, or SMILES string into a single search bar and the app:

  1. Resolves it to the correct canonical structure
  2. Renders it in interactive 3D
  3. Generates a full AI-powered scientific briefing covering function, disease relevance, structural highlights, drug interactions, and key binding sites
  4. Enables natural language follow-up via a context-aware Research Assistant
  5. Saves analyzed proteins to a persistent Molecule Library backed by MongoDB Atlas

How We Built It

Frontend

  • React + Vite + Tailwind CSS for the UI
  • 3Dmol.js for interactive 3D molecular rendering with Cartoon, Surface, and Stick display modes

Backend

  • Python + FastAPI hosted on Vultr
  • RDKit for cheminformatics descriptor computation from SMILES strings
  • BioPython for sequence handling and PDB parsing

Protein Resolution Pipeline

Gene names and protein names are resolved through a multi-step lookup:

$$\text{Query} \xrightarrow{\text{UniProt}} \text{Accession} \xrightarrow{\text{RCSB}} \text{PDB ID} \xrightarrow{\text{3Dmol.js}} \text{3D Structure}$$

Raw amino acid sequences bypass this pipeline and go directly to ESMFold for ab initio structure prediction.

AI Layer

  • Google Gemini 1.5 Pro powers both the structural analysis summaries and the multi-turn Research Assistant chat
  • Full conversation history is maintained on every request with protein context injected as system context
  • Embeddings use Gemini text-embedding-004 (768 dimensions)

Database

  • MongoDB Atlas with Motor async driver stores saved molecules
  • Search suggestions are cached with a 24-hour TTL to reduce UniProt API calls

Molecular Descriptors

For small molecules, RDKit computes key drug-likeness properties evaluated against Lipinski's Rule of Five:

$$\text{MW} \leq 500 \quad \log P \leq 5 \quad \text{HBD} \leq 5 \quad \text{HBA} \leq 10$$

where MW is molecular weight, $\log P$ is the octanol-water partition coefficient, HBD is hydrogen bond donors, and HBA is hydrogen bond acceptors.


Challenges We Ran Into

Protein Lookup Accuracy

Early versions used fuzzy text search against PDB titles, which returned completely wrong results. Searching BRCA1 would return a paper about a protein called "Next to BRCA1 gene 1" because the name appeared in the title metadata. Switching to UniProt canonical resolution with RCSB cross-referencing fixed accuracy entirely.

MongoDB Atlas TLS Failures

We ran into SSL handshake failures across all three shard nodes:

[SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error

This turned out to be caused by special characters in the connection string password being misinterpreted during URI encoding, not a code-level SSL issue. The fix was setting a clean alphanumeric password and using explicit certifi CA configuration:

client = AsyncIOMotorClient(
    uri,
    tls=True,
    tlsCAFile=certifi.where(),
    serverSelectionTimeoutMS=30000
)

Gemini Response Truncation

Getting complete, untruncated summaries from Gemini required careful attention to where generation_config is applied. Setting max_output_tokens on the model instantiation rather than the generate_content() call caused it to be silently ignored. The fix:

response = model.generate_content(
    prompt,
    generation_config=genai.types.GenerationConfig(
        max_output_tokens=4096,
        temperature=0.3
    )
)

Accessing the response via response.candidates[0].content.parts[0].text instead of response.text also proved more reliable for extracting the full output.

Stale Closure Bug in Chat History

Managing multi-turn conversation state in React without stale closures required a ref-based approach:

const messagesRef = useRef(messages);
useEffect(() => {
    messagesRef.current = messages;
}, [messages]);

Reading from messagesRef.current inside async handlers ensures the full conversation history is always sent to the backend, not a snapshot from when the component last rendered.


Accomplishments We're Proud Of

  • Getting the full pipeline working end to end, from a plain gene name like TP53 to a rendered 3D structure with a complete Gemini analysis, in a single search
  • Protein lookup accuracy: UniProt canonical resolution returns the scientifically correct structure every time
  • A Research Assistant that genuinely holds context across a multi-turn conversation
  • A persistent Molecule Library that survives sessions and reloads a full analysis in one click

What We Learned

  • Bioinformatics APIs are powerful but unforgiving. UniProt, RCSB, and ESMFold each have their own data shapes, rate limits, and failure modes
  • Gemini's generation_config must be passed at the call level, not the model instantiation level
  • MongoDB Atlas TLS issues are almost always environmental rather than code-level
  • Scope aggressively for a hackathon. Every feature that made the demo compelling required cutting two features that sounded good on paper

What's Next for MolecularAI

  • Binding pocket detection — automatically highlight druggable sites on the 3D structure using fpocket
  • Molecular docking — upload a candidate ligand, run AutoDock Vina on the Vultr GPU backend, visualize ranked docking poses scored by binding affinity $\Delta G$
  • Mutation impact prediction — model point mutations and predict stability changes $\Delta\Delta G$ before running wet lab experiments
  • Multi-protein comparison — load two structures side by side and ask Gemini to compare binding sites
  • Voice narration — guided audio walkthrough of the structural analysis for accessibility

The long-term vision is a platform where a researcher can go from a gene name to a shortlist of candidate drug interactions, computationally, in minutes, without specialized bioinformatics training.

Built With

Share this project:

Updates