A modular, scalable bio-RAG system that uses recursive retrieval to handle megacontext scenarios for protein cluster analysis. Built with LlamaIndex and designed to scale to millions of proteins without context overflow.
The system implements a two-stage retrieval strategy:
- Stage 1: Find the most semantically relevant CLUSTER SUMMARIES
- Stage 2: Automatically fetch DETAILED PROTEINS only from those clusters
This prevents context overflow while maintaining comprehensive coverage.
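The two stages above can be illustrated with a minimal, dependency-free sketch. It is not the project's LlamaIndex implementation: the bag-of-words `embed` function stands in for a real embedding model, and the cluster IDs, summaries, and the `two_stage_retrieve` helper are hypothetical names chosen for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy data: one short summary and one protein list per cluster.
CLUSTER_SUMMARIES = {
    "CL:1": "protein kinases regulating cell cycle progression",
    "CL:2": "metabolic enzymes of glycolysis",
    "CL:3": "DNA repair and damage response proteins",
}
CLUSTER_PROTEINS = {
    "CL:1": ["CDK1", "CDK2", "PLK1"],
    "CL:2": ["HK1", "PFKM", "PKM"],
    "CL:3": ["BRCA1", "RAD51", "ATM"],
}

def two_stage_retrieve(query, top_k=1, max_proteins=2):
    """Stage 1: rank cluster summaries. Stage 2: expand only the winners."""
    q = embed(query)
    ranked = sorted(
        CLUSTER_SUMMARIES,
        key=lambda cid: cosine(q, embed(CLUSTER_SUMMARIES[cid])),
        reverse=True,
    )
    selected = ranked[:top_k]
    # Only proteins from the selected clusters ever reach the context.
    return {cid: CLUSTER_PROTEINS[cid][:max_proteins] for cid in selected}

result = two_stage_retrieve("kinases involved in cell cycle")
print(result)
```

Because detailed protein records are fetched only for the top-ranked clusters, the size of the context depends on `top_k` and `max_proteins`, not on the total number of proteins in the dataset.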
backend/
├── bio_rag/ # Main package
│ ├── __init__.py # Package exports
│ ├── config.py # Configuration management
│ ├── data_parsers.py # STRING database parsers
│ ├── graph_builder.py # LlamaIndex setup
│ ├── internet_search.py # Future enhancement hooks
│ ├── rag_system.py # Main RAG logic
│ ├── utils.py # Utility functions
│ └── cli.py # Interactive interface
├── main.py # Entry point
├── requirements.txt # Dependencies
├── README.md # This file
└── 05.py # Original monolithic version
pip install -r requirements.txt
Create a .env file with your OpenAI API key:
OPENAI_API_KEY=your_api_key_here
Visit STRING to download the data:
https://string-db.org/cgi/download?sessionId=bRhB5OTfIjaY&species_text=Homo+sapiens
Place your STRING database files in the /backend/data/ directory:
- 9606.clusters.info.v12.0.txt
- 9606.clusters.tree.v12.0.txt
- 9606.clusters.proteins.v12.0.txt
- 9606.protein.info.v12.0.txt
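Before the first run, it can help to confirm that all four files are in place. The following is a small sketch, not part of the package: the `missing_string_files` helper is a hypothetical name, and the relative path `backend/data` in the usage note assumes the script is started from the repository root.

```python
from pathlib import Path

# File names as listed above (STRING v12.0, Homo sapiens, taxon 9606).
REQUIRED_FILES = [
    "9606.clusters.info.v12.0.txt",
    "9606.clusters.tree.v12.0.txt",
    "9606.clusters.proteins.v12.0.txt",
    "9606.protein.info.v12.0.txt",
]

def missing_string_files(data_dir):
    """Return the required file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]
```

Calling `missing_string_files("backend/data")` returns an empty list when everything is present; otherwise it returns the names that still need to be downloaded.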
Interactive Mode:
python main.py --interactive
Single Query:
python main.py --query "protein kinases involved in cell cycle"
With Custom Settings:
python main.py --interactive --max-clusters 200 --verbose

- --interactive: Run interactive CLI mode
- --query "text": Run a single query and exit
- --config [default|mock]: Choose configuration preset
- --max-clusters N: Maximum clusters to embed
- --max-proteins N: Maximum proteins per cluster
- --debug: Enable detailed debug output
- --verbose: Enable verbose logging
- --internet-search: Enable internet search (placeholder)
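For readers who want to see how these flags fit together, here is one way they could be declared with Python's standard `argparse` module. This is a sketch under assumptions, not the code in `main.py` or `cli.py`: the `build_parser` name, the help strings, and the `default` value for `--config` are guesses based only on the option list above.

```python
import argparse

def build_parser():
    """Declare the documented flags; defaults here are assumptions."""
    parser = argparse.ArgumentParser(description="bio-RAG command line (sketch)")
    parser.add_argument("--interactive", action="store_true",
                        help="Run interactive CLI mode")
    parser.add_argument("--query", metavar="TEXT",
                        help="Run a single query and exit")
    parser.add_argument("--config", choices=["default", "mock"], default="default",
                        help="Choose configuration preset")
    parser.add_argument("--max-clusters", type=int, metavar="N",
                        help="Maximum clusters to embed")
    parser.add_argument("--max-proteins", type=int, metavar="N",
                        help="Maximum proteins per cluster")
    parser.add_argument("--debug", action="store_true",
                        help="Enable detailed debug output")
    parser.add_argument("--verbose", action="store_true",
                        help="Enable verbose logging")
    parser.add_argument("--internet-search", action="store_true",
                        help="Enable internet search (placeholder)")
    return parser

# Parse the "With Custom Settings" example from above.
args = build_parser().parse_args(
    ["--interactive", "--max-clusters", "200", "--verbose"]
)
print(args.interactive, args.max_clusters, args.verbose)
```

Note that `argparse` converts dashes in option names to underscores, so `--max-clusters` is read back as `args.max_clusters`.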
# Regular queries
Find protein kinases involved in cell cycle regulation
What are the main metabolic enzyme clusters?
Tell me about cluster CL:39184
Proteins involved in DNA repair mechanisms
# Special commands
debug # Test retrieval system
info # Show system statistics
help # Show help message
quit # Exit
- Data Parsers (data_parsers.py)
  - Parse STRING database files
  - Build enriched cluster records
  - Handle protein metadata
- Graph Builder (graph_builder.py)
  - Create LlamaIndex vector stores
  - Set up recursive retriever
  - Handle index persistence
- RAG System (rag_system.py)
  - Coordinate query processing
  - Manage retrieval pipeline
  - Handle error cases
- Utilities (utils.py)
  - Intelligent cluster sampling
  - Text processing functions
  - Validation utilities
- Intelligent Sampling: Uses 40% importance + 40% diversity + 20% random sampling
- Protein Truncation: Limits proteins per cluster to avoid token overflow
- Index Caching: Persists embeddings to disk with change detection
- Memory Management: Loads full dataset but only embeds sampled subset
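The 40/40/20 sampling split and the per-cluster protein limit can be sketched as follows. This is an illustration under stated assumptions rather than the logic in `utils.py`: "importance" is approximated here by cluster size, "diversity" by round-robin selection across a `category` field, and the function names and record fields (`id`, `size`, `category`) are hypothetical.

```python
import random

def sample_clusters(clusters, budget, seed=0):
    """Pick up to `budget` clusters: 40% importance, 40% diversity, rest random."""
    rng = random.Random(seed)
    budget = min(budget, len(clusters))
    n_importance = int(budget * 0.4)
    n_diversity = int(budget * 0.4)
    chosen = {}

    # Importance (assumption): prefer the largest clusters.
    for c in sorted(clusters, key=lambda c: c["size"], reverse=True)[:n_importance]:
        chosen[c["id"]] = c

    # Diversity (assumption): take clusters round-robin across categories.
    by_category = {}
    for c in clusters:
        if c["id"] not in chosen:
            by_category.setdefault(c["category"], []).append(c)
    added = 0
    while added < n_diversity and any(by_category.values()):
        for pool in by_category.values():
            if added >= n_diversity:
                break
            if pool:
                c = pool.pop(0)
                chosen[c["id"]] = c
                added += 1

    # Random: fill whatever is left of the budget from unchosen clusters.
    remaining = [c for c in clusters if c["id"] not in chosen]
    for c in rng.sample(remaining, budget - len(chosen)):
        chosen[c["id"]] = c
    return list(chosen.values())

def truncate_proteins(proteins, max_proteins):
    """Keep at most max_proteins entries so one cluster cannot flood the context."""
    return proteins[:max_proteins]
```

Only the sampled subset would then be embedded, while the full dataset stays available in memory for lookups; a fixed `seed` keeps the random share reproducible between runs.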