Inspiration

Information about cancer and related diseases is scattered across many different places. Some of it lives in structured knowledge graphs like Hetionet, which aren’t easily accessible to most people. Some of it is buried inside research papers and clinical trials that are difficult and time-consuming to navigate. While frontier AI tools can summarize information quickly, they are known to hallucinate or provide answers without clearly showing where the information comes from.

We felt there was a gap between powerful biomedical data sources and accessible, trustworthy question-answering systems.

We wanted to build something that acts as a one-stop, question-driven resource for cancer knowledge — a system that understands meaning rather than just keywords, retrieves information from curated biomedical sources, and clearly points users back to where that information came from.

We built OncoPhrase with that goal in mind. It combines structured graph data, research literature, and clinical trial information into a single semantic search layer, and uses an LLM only to summarize the retrieved evidence. The result is a tool that can serve a wide range of users: individuals curious about cancer biology, patients and families trying to educate themselves, and researchers looking to quickly explore associations between genes, diseases, and treatments.

What it does

OncoPhrase is a domain-grounded semantic discovery engine for oncology. It allows users to ask natural-language questions about cancer-related genes, diseases, drugs, and clinical trials, and retrieves relevant information based on meaning rather than exact keyword matches.

Users can ask questions such as:

  • “genes associated with lung cancer”
  • “drugs used in EGFR-mutated tumors”
  • “clinical trials related to BRCA1”
  • “mechanisms of resistance in melanoma”

Instead of returning a generic answer, the system:

  • surfaces relevant genes, compounds, or trials
  • shows structured biological context
  • displays relevance scores
  • provides summaries grounded in specific retrieved records

This makes exploration faster, more transparent, and more controllable than traditional keyword search or general-purpose chat tools.

How we built it

We began by curating a cancer-focused subset of biomedical data. This included structured entities from Hetionet (genes, diseases, compounds), PubMed abstracts related to oncology, and cancer-relevant clinical trial records. For each item, we created a structured search_text representation that captures meaningful biological context such as known associations, conditions, interventions, and summaries.
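
As a rough illustration of the search_text idea, the sketch below flattens one structured record into a single string. The field names (kind, name, associations, conditions, interventions, summary) and the example record are assumptions made for illustration, not OncoPhrase's actual schema.

```python
def build_search_text(record):
    """Flatten a structured record into one string suitable for embedding."""
    parts = [f"{record['kind']}: {record['name']}"]
    # Hypothetical optional list fields; empty or missing ones are skipped
    # so the text does not fill up with noise.
    labelled_fields = [
        ("Associated with", "associations"),
        ("Conditions", "conditions"),
        ("Interventions", "interventions"),
    ]
    for label, key in labelled_fields:
        values = record.get(key) or []
        if values:
            parts.append(f"{label}: {', '.join(values)}")
    summary = record.get("summary")
    if summary:
        parts.append(f"Summary: {summary}")
    return ". ".join(parts)


# Illustrative record only; not taken from the project's data.
gene = {
    "kind": "Gene",
    "name": "EGFR",
    "associations": ["lung cancer", "glioblastoma"],
}
print(build_search_text(gene))
```

Skipping empty fields is one simple way to keep biological context without padding every record with blank labels.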

We then generated transformer-based vector embeddings for every record. By embedding structured graph data, literature, and clinical trials into the same semantic space, we created a unified retrieval layer across heterogeneous biomedical sources.
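
The sketch below shows only the shape of this step: every record, whatever its source, goes through the same embedding function and keeps its metadata. The function toy_embed is a deliberately simple stand-in (a hashed bag of words), not the transformer model we used, and it does not capture meaning. The record ids and the dimension are likewise assumptions for illustration.

```python
import hashlib
import math

DIM = 64  # illustrative size; chosen only to keep the example small


def toy_embed(text, dim=DIM):
    """Stand-in for a transformer encoder: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # md5 gives a stable bucket across runs (built-in hash() is salted).
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


# Hypothetical records from different sources, embedded by the same function
# so that they all live in one vector space.
records = [
    {"id": "gene:example", "source": "hetionet",
     "search_text": "Gene: EGFR. Associated with: lung cancer"},
    {"id": "trial:example", "source": "clinical_trials",
     "search_text": "ClinicalTrial: example trial. Conditions: melanoma"},
]
index = [{**r, "vector": toy_embed(r["search_text"])} for r in records]
```

Swapping toy_embed for a real sentence encoder would leave the rest of the structure unchanged, which is the point of using one shared space.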

When a user submits a query, it is embedded into the same vector space. Records are then ranked by cosine similarity to the query, and the most semantically relevant matches across all sources are retrieved from our vector database backend, Actian VectorAI DB.
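
To make the ranking step concrete, here is a minimal brute-force sketch of cosine-similarity retrieval in plain Python. It stands in for the vector database, whose API is not shown here; the record ids and three-dimensional vectors are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, index, k=3):
    """Return the k highest-scoring (score, record_id) pairs."""
    scored = [(cosine(query_vec, vec), rec_id) for rec_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


# Toy index mixing record types; a real index would hold model embeddings.
index = {
    "gene:A": [1.0, 0.0, 0.0],
    "trial:B": [0.6, 0.8, 0.0],
    "paper:C": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.0, 0.0], index, k=2)
for score, rec_id in results:
    print(f"{rec_id}\t{score:.2f}")
```

A linear scan like this is fine for a few thousand records; a vector database becomes useful once the index is too large to score exhaustively on every query.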

Finally, the retrieved records are passed to an LLM layer, which generates a concise summary restricted to the retrieved evidence. This retrieval-first design keeps each answer traceable to its sources, reduces hallucination risk, and lets us inspect and control the knowledge sources driving each response.
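
One way to express the retrieval-first idea is in how the prompt is assembled: the model sees only numbered evidence lines and is asked to cite them. The sketch below builds such a prompt. The wording of the instructions, the record fields, and the example record are assumptions for illustration, and the call to the LLM itself is left out.

```python
def build_grounded_prompt(question, records):
    """Assemble a prompt that restricts the model to the retrieved records."""
    evidence = "\n".join(
        f"[{i}] ({r['source']}, {r['id']}) {r['search_text']}"
        for i, r in enumerate(records, start=1)
    )
    return (
        "Answer using ONLY the evidence below. Cite evidence as [n]. "
        "If the evidence is insufficient, say so.\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}"
    )


# Hypothetical retrieved record; in practice this comes from the search step.
retrieved = [
    {"id": "gene:example", "source": "hetionet",
     "search_text": "Gene: EGFR. Associated with: lung cancer"},
]
prompt = build_grounded_prompt("genes associated with lung cancer", retrieved)
print(prompt)
```

Because each evidence line carries its source and id, a citation in the summary can be traced back to a specific record.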

Challenges we ran into

One of the biggest challenges was data completeness and coverage. While gene-disease and compound-disease relationships were well represented in our subset, certain relationships we expected — such as explicit drug–gene targeting links — were not always present. This required us to carefully inspect the underlying data sources and rethink how we structured and augmented our records.

Another major challenge was constructing meaningful text representations from structured biomedical data. Knowledge graphs are powerful, but they are not naturally optimized for semantic retrieval. We had to iteratively refine how we converted graph relationships, literature summaries, and clinical trial metadata into coherent search_text fields that preserved biological context without introducing excessive noise.

Integrating heterogeneous data sources also introduced complexity. We had to ensure consistent schema design, metadata tagging (such as source tracking), and alignment between records and embeddings across multiple corpora.
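
A small sketch of what a shared schema and an alignment check might look like is given below. The Record fields and the check_alignment helper are hypothetical names chosen for this example; the actual schema in OncoPhrase may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str    # unique across all corpora
    kind: str         # e.g. "Gene", "Disease", "Compound", "ClinicalTrial"
    source: str       # source-tracking tag, e.g. "hetionet"
    search_text: str  # text that was embedded


def check_alignment(records, vectors, dim):
    """Raise ValueError if records and embeddings are out of step."""
    if len(records) != len(vectors):
        raise ValueError(f"{len(records)} records but {len(vectors)} vectors")
    ids = [r.record_id for r in records]
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate record ids")
    for record, vec in zip(records, vectors):
        if len(vec) != dim:
            raise ValueError(f"{record.record_id}: expected dim {dim}, got {len(vec)}")


# Illustrative data only.
records = [
    Record("gene:example", "Gene", "hetionet", "Gene: EGFR"),
    Record("trial:example", "ClinicalTrial", "clinical_trials", "ClinicalTrial: example"),
]
vectors = [[0.1, 0.2], [0.3, 0.4]]
check_alignment(records, vectors, dim=2)  # passes silently
```

Running a check like this after every rebuild catches the case where a record list and its embedding file drift apart.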

Finally, handling large biomedical datasets and embeddings introduced practical engineering hurdles — including file size limits, version control constraints, and efficient merging of multi-source vector indices. Addressing these challenges required careful data validation, cleanup, and reproducibility checks to ensure the final system remained reliable and scalable.
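
The merging step can be sketched as follows, assuming each source's index is a plain mapping from record id to vector. The merge_indices function and the source names are hypothetical; the sketch only shows the two checks that matter most when combining indices, namely unique keys and a consistent embedding dimension.

```python
def merge_indices(indices_by_source):
    """Merge per-source {record_id: vector} maps into one index.

    Keys are prefixed with the source name so ids from different
    corpora cannot collide.
    """
    merged = {}
    dim = None
    for source, index in indices_by_source.items():
        for rec_id, vec in index.items():
            if dim is None:
                dim = len(vec)
            elif len(vec) != dim:
                raise ValueError(
                    f"{source}:{rec_id} has dim {len(vec)}, expected {dim}"
                )
            merged[f"{source}:{rec_id}"] = vec
    return merged


# Illustrative per-source indices with made-up ids and vectors.
merged = merge_indices({
    "hetionet": {"g1": [1.0, 0.0]},
    "pubmed": {"p1": [0.0, 1.0]},
    "trials": {"t1": [0.5, 0.5]},
})
print(sorted(merged))
```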

Accomplishments that we're proud of

  • Designed and built a fully functional oncology-focused semantic discovery engine end to end, from raw biomedical data to a working question-answering interface.
  • Unified structured knowledge graph data, research literature, and clinical trial records into a single semantic retrieval layer.
  • Embedded thousands of heterogeneous biomedical records into a shared vector space while preserving source context and metadata.
  • Implemented structured filtering (Gene / Disease / Compound / ClinicalTrial) to support targeted and controllable exploration.
  • Integrated a scalable vector database backend to enable efficient similarity search across multiple data sources.
  • Added a retrieval-first LLM layer that generates summaries grounded strictly in retrieved evidence, improving transparency and reducing hallucination risk.
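
The structured filtering mentioned in the list above can be sketched as a simple restriction on record type applied to scored results. The filter_results function, the field names, and the scores are assumptions for illustration; only the four type names come from the project.

```python
def filter_results(results, kinds=None):
    """Keep only results whose kind is in `kinds`; no filter means keep all."""
    if not kinds:
        return list(results)
    allowed = set(kinds)
    return [r for r in results if r["kind"] in allowed]


# Made-up scored results, already sorted by relevance.
results = [
    {"id": "gene:example", "kind": "Gene", "score": 0.91},
    {"id": "trial:example", "kind": "ClinicalTrial", "score": 0.84},
    {"id": "compound:example", "kind": "Compound", "score": 0.77},
]
only_trials = filter_results(results, kinds=["ClinicalTrial"])
print([r["id"] for r in only_trials])
```

Filtering preserves the original ranking order, so the remaining results are still sorted by relevance.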

Most importantly, we demonstrated that domain-grounded semantic retrieval provides more meaningful, inspectable results than basic keyword search — especially in complex biomedical domains like oncology.

What we learned

We learned how powerful embedding-based retrieval can be in structured scientific domains. We also gained hands-on experience with vector databases, data preprocessing for biomedical knowledge graphs, and the trade-offs between semantic breadth and precision.

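Just as importantly, we found that data quality and coverage matter as much as model choice: gaps in the underlying data, such as missing drug–gene links, limit what retrieval can surface no matter how good the embeddings are.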

What’s next for OncoPhrase

While our current focus is oncology, the architecture behind OncoPhrase is domain-agnostic. A key next step is expanding beyond cancer to support broader disease areas using the same retrieval-first framework.

Within oncology, we plan to enrich the knowledge base with more granular biological information, including isoform-level data and refined drug–gene interaction mappings. Isoforms often play distinct functional roles in disease, and incorporating this level of detail would significantly improve biological precision.

Another major direction is integrating regulatory and treatment updates. We aim to build a continuously updated module that tracks newly FDA-approved therapies and emerging treatment modalities. By periodically ingesting approval data and structured drug information, OncoPhrase could maintain an up-to-date repository of therapies, mechanisms, and clinical nuances.

Long term, we envision OncoPhrase evolving into a continuously updated, domain-grounded AI assistant for biomedical exploration — combining structured knowledge, literature, regulatory updates, and clinical context into a transparent and scalable discovery platform. We also plan to deploy it as a publicly accessible web application to make trustworthy biomedical exploration more widely available.
