BioRAG System - Advanced Protein Cluster Retrieval

A modular, scalable bio-RAG system that uses recursive retrieval to handle megacontext scenarios for protein cluster analysis. Built with LlamaIndex and designed to scale to millions of proteins without context overflow.

📺 Demo Video

🔄 Recursive Retrieval Architecture

The system implements a two-stage retrieval strategy:

Stage 1: Find the most semantically relevant CLUSTER SUMMARIES
Stage 2: Automatically fetch DETAILED PROTEINS only from those clusters

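Because only proteins from the top-ranked clusters reach the prompt, the context stays bounded as the dataset grows, while every cluster remains discoverable through its summary.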
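The two stages can be illustrated without any external libraries. This is a minimal sketch, not the project's implementation: the real system uses LlamaIndex and an embedding model, whereas here `embed` is a toy bag-of-words stand-in, and the cluster IDs, summaries, and the `two_stage_retrieve` helper are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical data: one summary per cluster, plus each cluster's proteins.
CLUSTER_SUMMARIES = {
    "CL:1": "cell cycle kinases and cyclin regulation",
    "CL:2": "dna repair and damage response",
    "CL:3": "lipid metabolism enzymes",
}
CLUSTER_PROTEINS = {
    "CL:1": ["CDK1", "CDK2", "CCNB1"],
    "CL:2": ["BRCA1", "RAD51"],
    "CL:3": ["FASN", "ACACA"],
}

def two_stage_retrieve(query, top_k=1, max_proteins=2):
    query_vec = embed(query)
    # Stage 1: rank cluster summaries by similarity to the query.
    ranked = sorted(
        CLUSTER_SUMMARIES,
        key=lambda cid: cosine(query_vec, embed(CLUSTER_SUMMARIES[cid])),
        reverse=True,
    )
    selected = ranked[:top_k]
    # Stage 2: fetch detailed proteins only from the selected clusters.
    return {cid: CLUSTER_PROTEINS[cid][:max_proteins] for cid in selected}

result = two_stage_retrieve("kinases in the cell cycle")
print(result)
```

The amount of text passed downstream depends only on `top_k` and `max_proteins`, not on how many clusters exist in total.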

📁 Project Structure

backend/
├── bio_rag/                    # Main package
│   ├── __init__.py            # Package exports
│   ├── config.py              # Configuration management
│   ├── data_parsers.py        # STRING database parsers
│   ├── graph_builder.py       # LlamaIndex setup
│   ├── internet_search.py     # Future enhancement hooks
│   ├── rag_system.py          # Main RAG logic
│   ├── utils.py               # Utility functions
│   └── cli.py                 # Interactive interface
├── main.py                    # Entry point
├── requirements.txt           # Dependencies
├── README.md                  # This file
└── 05.py                      # Original monolithic version

🚀 Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Set up Environment

Create a .env file with your OpenAI API key:

OPENAI_API_KEY=your_api_key_here
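How the project itself reads the .env file is not shown in this README. As an illustration only, the following standard-library sketch parses simple KEY=VALUE lines into the process environment; `load_env_file` is a hypothetical helper, not part of the bio_rag package.

```python
import os
from pathlib import Path

def load_env_file(path=".env"):
    """Parse KEY=VALUE lines and export them without overriding existing variables."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded
```

Calling `load_env_file()` before constructing any OpenAI client makes `OPENAI_API_KEY` available through `os.environ`.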

3. Prepare Data Files

Download the data from STRING: https://string-db.org/cgi/download?sessionId=bRhB5OTfIjaY&species_text=Homo+sapiens

Place your STRING database files in the /backend/data/ directory:

9606.clusters.info.v12.0.txt
9606.clusters.tree.v12.0.txt
9606.clusters.proteins.v12.0.txt
9606.protein.info.v12.0.txt
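Before running the system, it can help to confirm that all four files are in place. The check below is a small sketch; the `missing_data_files` helper and the relative `backend/data` default are assumptions for illustration.

```python
from pathlib import Path

# The four STRING files listed above.
REQUIRED_FILES = [
    "9606.clusters.info.v12.0.txt",
    "9606.clusters.tree.v12.0.txt",
    "9606.clusters.proteins.v12.0.txt",
    "9606.protein.info.v12.0.txt",
]

def missing_data_files(data_dir="backend/data"):
    """Return the required file names that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]

if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing STRING files:", ", ".join(missing))
    else:
        print("All STRING files found.")
```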

4. Run the System

Interactive Mode:

python main.py --interactive

Single Query:

python main.py --query "protein kinases involved in cell cycle"

With Custom Settings:

python main.py --interactive --max-clusters 200 --verbose

🔧 Configuration Options

Command Line Arguments

--interactive: Run interactive CLI mode
--query "text": Run a single query and exit
--config [default|mock]: Choose configuration preset
--max-clusters N: Maximum clusters to embed
--max-proteins N: Maximum proteins per cluster
--debug: Enable detailed debug output
--verbose: Enable verbose logging
--internet-search: Enable internet search (placeholder)
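For orientation, the flags above map naturally onto an argparse parser. This is a sketch of how such a parser could look, not the contents of main.py; the defaults shown (`None` for the numeric limits, `default` for the preset) are assumptions.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="BioRAG command line (illustrative sketch)")
    parser.add_argument("--interactive", action="store_true", help="Run interactive CLI mode")
    parser.add_argument("--query", type=str, default=None, help="Run a single query and exit")
    parser.add_argument("--config", choices=["default", "mock"], default="default",
                        help="Configuration preset")
    # Defaults below are assumed; the real defaults live in the project's config.
    parser.add_argument("--max-clusters", type=int, default=None, help="Maximum clusters to embed")
    parser.add_argument("--max-proteins", type=int, default=None, help="Maximum proteins per cluster")
    parser.add_argument("--debug", action="store_true", help="Enable detailed debug output")
    parser.add_argument("--verbose", action="store_true", help="Enable verbose logging")
    parser.add_argument("--internet-search", action="store_true",
                        help="Enable internet search (placeholder)")
    return parser

# Equivalent to: python main.py --interactive --max-clusters 200 --verbose
args = build_parser().parse_args(["--interactive", "--max-clusters", "200", "--verbose"])
print(args.interactive, args.max_clusters, args.verbose)
```

argparse converts dashes to underscores, so `--max-clusters` is read as `args.max_clusters`.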

💡 Usage Examples

Interactive Mode Commands

# Regular queries
Find protein kinases involved in cell cycle regulation
What are the main metabolic enzyme clusters?
Tell me about cluster CL:39184
Proteins involved in DNA repair mechanisms

# Special commands
debug    # Test retrieval system
info     # Show system statistics
help     # Show help message
quit     # Exit
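One simple way to separate special commands from regular queries is an exact match on the trimmed input. The sketch below assumes case-insensitive matching, which the actual CLI may or may not do; `classify_input` is a hypothetical helper.

```python
SPECIAL_COMMANDS = {"debug", "info", "help", "quit"}

def classify_input(line):
    """Return ("command", name) for a special command, otherwise ("query", text)."""
    text = line.strip()
    if text.lower() in SPECIAL_COMMANDS:
        return ("command", text.lower())
    return ("query", text)

print(classify_input("info"))
print(classify_input("Tell me about cluster CL:39184"))
```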

🏗️ Architecture Details

Key Components

Data Parsers (data_parsers.py)
- Parse STRING database files
- Build enriched cluster records
- Handle protein metadata

Graph Builder (graph_builder.py)
- Create LlamaIndex vector stores
- Set up recursive retriever
- Handle index persistence

RAG System (rag_system.py)
- Coordinate query processing
- Manage retrieval pipeline
- Handle error cases

Utilities (utils.py)
- Intelligent cluster sampling
- Text processing functions
- Validation utilities
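Index persistence needs some way to tell whether a saved index still matches its inputs. One plausible approach, sketched here under assumptions, is to fingerprint the source files together with the settings that affect embedding. The function names and the `fingerprint.txt` marker file are hypothetical and do not describe graph_builder.py.

```python
import hashlib
import json
from pathlib import Path

def fingerprint(paths, settings):
    """SHA-256 over file paths, file contents, and the settings dictionary."""
    digest = hashlib.sha256()
    for path in sorted(str(p) for p in paths):
        digest.update(path.encode("utf-8"))
        with open(path, "rb") as handle:
            # Read in 1 MiB chunks so large STRING files are not loaded at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    digest.update(json.dumps(settings, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()

def index_is_stale(cache_dir, paths, settings):
    """True if no fingerprint is stored or it differs from the current inputs."""
    marker = Path(cache_dir) / "fingerprint.txt"
    if not marker.exists():
        return True
    return marker.read_text(encoding="utf-8").strip() != fingerprint(paths, settings)

def record_fingerprint(cache_dir, paths, settings):
    """Store the current fingerprint; call this after the index is rebuilt and saved."""
    marker = Path(cache_dir) / "fingerprint.txt"
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text(fingerprint(paths, settings), encoding="utf-8")
```

Including the settings in the hash means that a change such as a different cluster limit also triggers a rebuild, not only a change to the data files.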

Scalability Features

Intelligent Sampling: Selects the clusters to embed using a 40% importance, 40% diversity, 20% random split
Protein Truncation: Limits the number of proteins per cluster to avoid token overflow
Index Caching: Persists embeddings to disk with change detection
Memory Management: Loads the full dataset but embeds only the sampled subset
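The 40/40/20 split can be sketched as follows. This README does not define "importance" or "diversity", so the example makes two loud assumptions: importance is approximated by cluster size (protein count), and diversity by evenly spaced picks across the remaining size ranking. The `sample_clusters` helper is hypothetical and is not the code in utils.py.

```python
import random

def sample_clusters(cluster_sizes, budget, seed=0):
    """Pick up to `budget` cluster IDs from a dict of cluster_id -> protein count."""
    rng = random.Random(seed)
    budget = min(budget, len(cluster_sizes))
    n_important = budget * 40 // 100
    n_diverse = budget * 40 // 100

    # Importance (assumed): largest clusters first, ties broken by ID.
    ranked = sorted(cluster_sizes, key=lambda cid: (-cluster_sizes[cid], cid))
    chosen = ranked[:n_important]

    # Diversity (assumed): evenly spaced picks across the remaining ranking.
    remaining = ranked[n_important:]
    if n_diverse and remaining:
        step = max(1, len(remaining) // n_diverse)
        chosen += remaining[::step][:n_diverse]

    # Random: fill the rest of the budget from clusters not yet chosen.
    taken = set(chosen)
    pool = [cid for cid in ranked if cid not in taken]
    chosen += rng.sample(pool, budget - len(chosen))
    return chosen

sizes = {f"CL:{i}": i for i in range(1, 101)}
picked = sample_clusters(sizes, budget=10)
print(len(picked), picked[:4])
```

A fixed seed keeps the random portion reproducible, which matters if cached embeddings are keyed on the sampled set.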
