Spectra

Benchmark, compare, and optimize RAG pipelines with rigorous statistical testing.



Quick Start • Features • Architecture • Benchmarks • Contributing


Overview

Spectra is a systematic RAG evaluation toolkit that implements 12 retrieval strategies spanning dense, sparse, hybrid, generative, iterative, and graph-based approaches. It provides automated A/B testing with statistical rigor (Welch's t-test, Mann-Whitney U, bootstrap), 8 quality metrics aligned with the RAGAS framework, and Pareto-optimal pipeline selection powered by Optuna -- everything needed to move from prototype to production RAG.


✨ Features

  • 12 Retrieval Strategies -- Dense, BM25, SPLADE, Hybrid RRF, HyDE, Self-RAG, ColBERT, Multi-hop, Graph-RAG, IRCoT, Chain-of-Note, Cross-encoder Reranker
  • 8 Evaluation Metrics -- Faithfulness, Answer Relevance, Context Relevance, Coherence, Context Recall, Context Precision, Answer Correctness, Latency Score
  • Statistical A/B Testing -- Welch's t-test, Mann-Whitney U, paired t-test, bootstrap permutation with Bonferroni correction
  • Optuna-Powered Optimization -- Automated hyperparameter search across chunking, embedding, retrieval, and re-ranking stages
  • Pareto Frontier Analysis -- Multi-objective selection with NSGA-II crowding distance for quality vs. latency trade-offs
  • 4 Chunking Strategies -- Fixed, Semantic, Recursive, and Document-Aware splitting
  • 3 Vector Store Backends -- FAISS, Qdrant, pgvector
  • Streamlit Dashboard -- Interactive visualization of metrics, comparisons, and optimization results
  • Synthetic Data Generation -- Automatic evaluation dataset creation for rapid iteration

πŸ—οΈ Architecture

graph LR
    A[Documents] --> B[Chunking]
    B --> C[Embedding]
    C --> D[VectorStore]
    D --> E[Retrieval]
    E --> F[Evaluation]
    F --> G[Optimization]

    subgraph Chunking
        B1[Fixed]
        B2[Semantic]
        B3[Recursive]
        B4[Document-Aware]
    end

    subgraph VectorStore
        D1[FAISS]
        D2[Qdrant]
        D3[pgvector]
    end

    subgraph "Retrieval Strategies x12"
        E1[Dense / Sparse / Hybrid]
        E2[HyDE / Self-RAG / ColBERT]
        E3[Multi-hop / Graph-RAG / IRCoT]
        E4[Chain-of-Note / SPLADE / Reranker]
    end

    subgraph Evaluation
        F1[8 Quality Metrics]
        F2[A/B Testing]
        F3[Synthetic Data]
    end

    subgraph Optimization
        G1[Optuna HPO]
        G2[Pareto Frontier]
    end

    B --> B1 & B2 & B3 & B4
    D --> D1 & D2 & D3
    E --> E1 & E2 & E3 & E4
    F --> F1 & F2 & F3
    G --> G1 & G2

πŸ”§ Tech Stack

Python • Pydantic • NumPy • scikit-learn • SciPy • FAISS • Optuna • Streamlit


πŸš€ Quick Start

Installation

# Core
pip install spectra-rag

# With all vector store backends + dashboard
pip install spectra-rag[all]

# Development
pip install spectra-rag[dev]

Build a Pipeline

from spectra.retrieval.base import Document
from spectra.retrieval.dense import DenseRetriever
from spectra.optimization.pipeline import RAGPipeline, PipelineConfig

docs = [
    Document(id="1", content="RAG combines retrieval with generation..."),
    Document(id="2", content="Dense retrieval uses learned embeddings..."),
]

pipeline = RAGPipeline(
    DenseRetriever(),
    PipelineConfig(chunk_size=256, top_k=3),
)
pipeline.ingest(docs)

result = pipeline.query("What is RAG?")
for doc, score in zip(result.retrieved_documents, result.scores):
    print(f"[{score:.3f}] {doc.content[:80]}...")

Evaluate Quality

from spectra.evaluation.metrics import EvaluationSample, compute_all_metrics

sample = EvaluationSample(
    query="What is RAG?",
    answer=result.retrieved_documents[0].content,
    contexts=[d.content for d in result.retrieved_documents],
    ground_truth="RAG combines retrieval with generation.",
    latency_seconds=result.latency_seconds,
)
metrics = compute_all_metrics([sample])
print(metrics.scores)
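
To feed the A/B test below, score an evaluation set once per strategy. A minimal sketch, assuming `queries` and `ground_truths` are parallel lists and `hybrid_pipeline` is a second, already-ingested RAGPipeline built with a different retriever (both are placeholder names, not part of the snippets above):

def collect_samples(pipeline, queries, ground_truths):
    # One EvaluationSample per query, using the same fields as the example above
    samples = []
    for query, truth in zip(queries, ground_truths):
        result = pipeline.query(query)
        samples.append(EvaluationSample(
            query=query,
            answer=result.retrieved_documents[0].content,
            contexts=[d.content for d in result.retrieved_documents],
            ground_truth=truth,
            latency_seconds=result.latency_seconds,
        ))
    return samples

samples_dense = collect_samples(pipeline, queries, ground_truths)
samples_hybrid = collect_samples(hybrid_pipeline, queries, ground_truths)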

A/B Test Strategies

from spectra.evaluation.ab_testing import ABTest, ABTestConfig

ab = ABTest(ABTestConfig(significance_level=0.05))
results = ab.compare(samples_dense, samples_hybrid, "Dense", "Hybrid")

for r in results:
    print(f"{r.metric}: p={r.p_value:.4f}, winner={r.winner}")

Optimize with Optuna

from spectra.optimization.optimizer import PipelineOptimizer, OptimizerConfig

optimizer = PipelineOptimizer(
    documents=docs,
    eval_queries=queries,
    eval_ground_truths=ground_truths,
    config=OptimizerConfig(n_trials=100),
)
result = optimizer.optimize()
print(f"Best config: {result.best_config}")
print(f"Best score:  {result.best_score:.4f}")

Launch Dashboard

spectra
# or
streamlit run src/spectra/dashboard/app.py

🎯 Retrieval Strategies

| # | Strategy | Type | Key Idea | Reference |
|---|----------|------|----------|-----------|
| 1 | Dense (Bi-encoder) | Dense | Embed query & docs into shared vector space | Karpukhin et al., 2020 |
| 2 | BM25 | Sparse | Classic TF-IDF lexical matching | Robertson & Zaragoza, 2009 |
| 3 | SPLADE | Sparse | Learned sparse representations via MLM | Formal et al., 2021 |
| 4 | Hybrid (RRF) | Hybrid | Reciprocal Rank Fusion of dense + sparse | Cormack et al., 2009 |
| 5 | HyDE | Generative | Embed LLM-generated hypothetical answers | Gao et al., 2022 |
| 6 | Self-RAG | Iterative | Reflection tokens for adaptive retrieval | Asai et al., 2023 |
| 7 | ColBERT | Late Interaction | Token-level MaxSim scoring | Khattab & Zaharia, 2020 |
| 8 | Multi-hop | Iterative | Decompose complex questions into retrieval steps | Trivedi et al., 2023 |
| 9 | Graph-RAG | Graph | Knowledge graph traversal for entity-rich queries | -- |
| 10 | IRCoT | Iterative | Interleaved chain-of-thought + retrieval | Trivedi et al., 2023 |
| 11 | Chain-of-Note | Generative | Reading notes for robust retrieval | Yu et al., 2023 |
| 12 | Cross-encoder Reranker | Reranking | Joint query-document scoring as second stage | Nogueira & Cho, 2019 |
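
As one concrete example, the Hybrid (RRF) strategy merges the dense and sparse rankings with Reciprocal Rank Fusion (Cormack et al., 2009): each document scores sum(1 / (k + rank_i)) across the input rankings, with k = 60 as a common default. A minimal, strategy-agnostic sketch (not Spectra's implementation):

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["d3", "d1", "d7", "d2"]   # ids ordered by dense similarity
bm25_ranking  = ["d1", "d3", "d5", "d7"]   # ids ordered by BM25 score
print(rrf_fuse([dense_ranking, bm25_ranking]))   # e.g. ['d3', 'd1', 'd7', ...]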

πŸ“Š Evaluation Metrics

All metrics return scores in [0, 1]:

| Metric | What It Measures | Ground Truth |
|--------|------------------|--------------|
| Faithfulness | Fraction of answer claims supported by context | -- |
| Answer Relevance | Semantic similarity between answer and query | -- |
| Context Relevance | Fraction of retrieved contexts relevant to query | -- |
| Coherence | Logical flow and consistency of the answer | -- |
| Context Recall | Coverage of ground-truth facts in retrieved context | Required |
| Context Precision | Precision of retrieved docs against ground truth | Required |
| Answer Correctness | Token F1 + semantic similarity vs. ground truth | Required |
| Latency Score | Normalized inverse latency (lower latency = higher) | -- |
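
To make the ground-truth metrics concrete, the token-F1 half of Answer Correctness can be sketched as token-bag overlap between answer and ground truth (a simplified illustration, not Spectra's exact scoring):

from collections import Counter

def token_f1(answer, ground_truth):
    """Harmonic mean of token precision and recall against the ground truth."""
    a_tokens = Counter(answer.lower().split())
    g_tokens = Counter(ground_truth.lower().split())
    overlap = sum((a_tokens & g_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a_tokens.values())
    recall = overlap / sum(g_tokens.values())
    return 2 * precision * recall / (precision + recall)

print(token_f1("RAG combines retrieval with generation",
               "RAG combines retrieval and generation"))   # 0.8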

⚑ Benchmarks

Benchmark results on a synthetic corpus of 1,000 documents (avg 500 tokens):

| Strategy | Indexing (s) | Avg Query (ms) | P95 Query (ms) | QPS |
|----------|--------------|----------------|----------------|-----|
| BM25 | 0.05 | 1.2 | 2.1 | 830 |
| Dense | 4.2 | 3.8 | 6.5 | 263 |
| Hybrid | 4.3 | 5.1 | 8.7 | 196 |
| ColBERT | 8.1 | 12.4 | 18.3 | 81 |
| HyDE | 4.2 | 850+ | 1,200+ | 1.2 |
| Self-RAG | 4.2 | 2,500+ | 3,800+ | 0.4 |

HyDE, Self-RAG, IRCoT, and Chain-of-Note latencies are dominated by LLM calls. Actual throughput depends on the LLM provider and concurrency settings.

Reproduce locally:

python benchmarks/strategy_benchmark.py
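
If you want a quick, standalone timing harness for your own pipeline, independent of the bundled benchmark script, something along these lines works (a sketch; `pipeline` and `queries` are whatever you built in the Quick Start):

import time
import numpy as np

def time_queries(pipeline, queries):
    """Measure per-query latency and overall throughput for one pipeline."""
    latencies_ms = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        pipeline.query(q)
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - start
    return {
        "avg_ms": float(np.mean(latencies_ms)),
        "p95_ms": float(np.percentile(latencies_ms, 95)),
        "qps": len(queries) / elapsed,
    }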

πŸ“‚ Project Structure

spectra/
β”œβ”€β”€ src/spectra/
β”‚   β”œβ”€β”€ retrieval/          # 12 retrieval strategy implementations
β”‚   β”œβ”€β”€ chunking/           # Fixed, Semantic, Recursive, Document-Aware
β”‚   β”œβ”€β”€ evaluation/         # 8 metrics, A/B testing, RAGAS, synthetic data
β”‚   β”œβ”€β”€ optimization/       # Optuna optimizer, Pareto frontier analysis
β”‚   β”œβ”€β”€ vectorstores/       # FAISS, Qdrant, pgvector backends
β”‚   β”œβ”€β”€ dashboard/          # Streamlit interactive UI
β”‚   └── utils/              # Embedding models, LLM client
β”œβ”€β”€ examples/               # quickstart, compare_strategies, optimize_pipeline
β”œβ”€β”€ tests/                  # Unit & integration tests
β”œβ”€β”€ benchmarks/             # Performance benchmark harness
└── pyproject.toml

πŸ“š References

  1. HyDE -- Gao et al., "Precise Zero-Shot Dense Retrieval without Relevance Labels", ACL 2023. arXiv:2212.10496
  2. Self-RAG -- Asai et al., "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection", ICLR 2024. arXiv:2310.11511
  3. IRCoT -- Trivedi et al., "Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions", ACL 2023. arXiv:2212.10509
  4. ColBERT -- Khattab & Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT", SIGIR 2020. arXiv:2004.12832
  5. Chain-of-Note -- Yu et al., "Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models", 2023. arXiv:2311.09210
  6. RAGAS -- Es et al., "RAGAS: Automated Evaluation of Retrieval Augmented Generation", EMNLP 2023. arXiv:2309.15217
  7. SPLADE -- Formal et al., "SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking", SIGIR 2021. arXiv:2107.05720
  8. Reciprocal Rank Fusion -- Cormack, Clarke & Buettcher, "Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods", SIGIR 2009.

🀝 Contributing

Contributions are welcome. Please open an issue or submit a pull request.

git clone https://github.com/JiwaniZakir/spectra.git
cd spectra
pip install -e ".[dev]"
pytest tests/ -v

πŸ“„ License

Released under the Apache License 2.0.
