An NLP pipeline that processes 200K consumer complaints from the CFPB database, comparing older and newer techniques for embedding, clustering, and retrieval.
Live Interactive Dashboard: complaint-intelligence-system.streamlit.app
It takes raw complaint text, cleans it, generates embeddings with two different models, clusters the complaints using two different methods, and provides a semantic search interface. Every stage is benchmarked so you can see the actual tradeoffs.
The pipeline compares:
- Embeddings: MiniLM (384d, fast) vs BGE (768d, more accurate)
- Clustering: KMeans (fixed k) vs BERTopic (auto-discovers topics)
- Retrieval: Vector search, BM25, hybrid, and reranked hybrid
graph TD
A[CFPB Raw Data] -->|run_pipeline.py| B[Text Cleaning & Preprocessing]
subgraph Clustering Subsystem
B --> E1[KMeans Clustering]
B --> E2[BERTopic Pipeline]
E2 --> E2a[UMAP Dimension Reduction]
E2a --> E2b[HDBSCAN Density Clustering]
end
subgraph Embedding Subsystem
B --> C1[all-MiniLM-L6-v2]
B --> C2[bge-base-en-v1.5]
end
C1 & C2 --> G[FAISS Dense Index]
B --> H[BM25 Sparse Index]
subgraph Retrieval Pipeline
G & H --> I[Hybrid Search]
I -->|Reciprocal Rank Fusion| J[Candidate Generation]
J -->|Cross-Encoder Reranker| K[Reranked Top Results]
end
K --> L[LLM Summarizer / RAG Context]
L --> M[Streamlit Web Dashboard]
- FAISS (Dense) provides conceptual matching (e.g., searching for "stolen card" retrieves complaints mentioning "unauthorized charge" or "lost wallet") with a median latency of about 35ms. However, it can miss exact keyword matches.
- BM25 (Sparse) reliably matches specific product terms, account numbers, and unique company names, but it misses conceptual synonyms and suffers from Python search overhead (~590ms median).
- Hybrid Reranked Retrieval addresses both weaknesses by using BM25 and FAISS for candidate generation, then reranking the candidates with the cross-encoder ms-marco-MiniLM-L-6-v2. This delivers the highest-precision results at the cost of ~300ms additional latency over plain hybrid search.
- MiniLM (384-dimensions) achieves high throughput (374.6 texts/sec), making it highly scalable for real-time applications.
- BGE (768-dimensions) has a lower throughput (59.7 texts/sec) but achieves significantly higher intra-cluster coherence (0.77 vs. 0.53).
- Alignment Analysis: On average, the two embedding spaces share only 38.5% of their top-10 nearest neighbors for a given complaint. This suggests that the two models capture structurally different semantic relationships.
- KMeans forces every complaint into a group, which guarantees coverage but creates high intra-cluster variance on messy, real-world data.
- BERTopic uses HDBSCAN to keep only dense, high-confidence topic regions and marks 55% of the complaints as outliers/noise. This honest approach leaves more focused topics for the remaining data, which we auto-labeled using Gemini.
All benchmarks were run on 200K complaints using a T4 GPU on Google Colab.
| Model | Dim | Speed | Cosine Sim (mean) | Intra-cluster Coherence |
|---|---|---|---|---|
| MiniLM | 384 | 374.6 texts/sec | 0.41 | 0.53 |
| BGE | 768 | 59.7 texts/sec | 0.72 | 0.77 |
BGE is about 6x slower (59.7 vs 374.6 texts/sec) but produces tighter clusters (coherence 0.77 vs 0.53). The two models only agree on 38.5% of top-10 neighbors, meaning they capture different aspects of the text.
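The top-10 neighbor agreement above can be illustrated with a small sketch. This is not the code from src/embedding_benchmark.py; the function names (`top_k_neighbors`, `neighbor_overlap`) and the toy vectors are assumptions made for illustration, and it uses brute-force cosine similarity in plain Python instead of FAISS.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_neighbors(vectors, index, k):
    """Indices of the k vectors most similar to vectors[index], excluding itself."""
    scored = [
        (cosine(vectors[index], v), j)
        for j, v in enumerate(vectors)
        if j != index
    ]
    # Highest similarity first; ties broken by index so results are deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return {j for _, j in scored[:k]}


def neighbor_overlap(space_a, space_b, k=10):
    """Mean fraction of top-k neighbors shared between two embedding spaces.

    space_a[i] and space_b[i] must be embeddings of the same text.
    """
    if len(space_a) != len(space_b):
        raise ValueError("both spaces must embed the same texts")
    shared = [
        len(top_k_neighbors(space_a, i, k) & top_k_neighbors(space_b, i, k)) / k
        for i in range(len(space_a))
    ]
    return sum(shared) / len(shared)


# Toy data (hypothetical): four "complaints" embedded by two different models.
space_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
space_b = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

same_space = neighbor_overlap(space_a, space_a, k=1)
cross_space = neighbor_overlap(space_a, space_b, k=1)
print(same_space, cross_space)
```

The brute-force loop is quadratic in the number of texts, so at 200K complaints a nearest-neighbor index would be the more practical way to obtain the neighbor sets.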
| Metric | KMeans (k=6) | BERTopic |
|---|---|---|
| Clusters found | 6 | 30 |
| Outliers | 0 | 110,456 (55%) |
| Silhouette | 0.0338 | 0.0301 |
KMeans forces every complaint into a cluster. BERTopic flags 55% as noise, which is honest but means over half the data gets no label. Both have low silhouette scores, which makes sense since complaint text is messy and overlapping.
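As a rough illustration of how the cluster and outlier counts in the table can be derived, the sketch below summarizes a list of cluster labels. It assumes the HDBSCAN convention, also used by BERTopic, of labeling outliers as -1. The function name `summarize_labels` and the toy labels are hypothetical and not taken from the repository.

```python
from collections import Counter


def summarize_labels(labels, noise_label=-1):
    """Count clusters and outliers in a list of cluster assignments.

    Assumes outliers carry `noise_label` (-1 by HDBSCAN convention).
    """
    counts = Counter(labels)
    n_outliers = counts.get(noise_label, 0)
    n_clusters = sum(1 for label in counts if label != noise_label)
    share = n_outliers / len(labels) if labels else 0.0
    return {"clusters": n_clusters, "outliers": n_outliers, "outlier_share": share}


# Toy assignments (hypothetical): KMeans labels every point, HDBSCAN may not.
kmeans_labels = [0, 1, 1, 0, 2, 2]
hdbscan_labels = [0, 0, 1, -1, -1, -1]

kmeans_summary = summarize_labels(kmeans_labels)
hdbscan_summary = summarize_labels(hdbscan_labels)
print(kmeans_summary)
print(hdbscan_summary)
```

Reporting the outlier share next to any quality metric makes it easier to see how much of the data that metric actually covers.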
| Method | p50 (ms) | p95 (ms) |
|---|---|---|
| Vector (FAISS) | 35 | 41 |
| BM25 | 589 | 929 |
| Hybrid (RRF) | 614 | 959 |
| Reranked Hybrid | 911 | 1,356 |
Pure vector search is fast. Adding BM25 and reranking improves result quality but adds significant latency. Whether that tradeoff is worth it depends on the use case.
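The hybrid retriever fuses the FAISS and BM25 rankings with Reciprocal Rank Fusion (RRF). The sketch below shows the general technique, not the code in src/retrievers/hybrid_retriever.py. The function name, the smoothing constant `k=60` (a commonly used default for RRF), and the complaint IDs are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. The value k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))
    return [doc_id for doc_id, _ in ordered[:top_n]]


# Hypothetical result lists for one query, best match first.
vector_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

RRF only needs ranks, not scores, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. Documents found by both retrievers (c1 and c3 here) rise above those found by only one.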
git clone https://github.com/AswaniSahoo/complaint-intelligence-system.git
cd complaint-intelligence-system
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # Linux/Mac
pip install -r requirements.txt
Download the complaints CSV from the CFPB website and put it at data/raw/complaints.csv.
# Full run with both models and benchmarks
python run_pipeline.py --sample-size 200000 --model both --clustering both --benchmark
# Quick run (MiniLM + KMeans only)
python run_pipeline.py
# With LLM summarization (needs GEMINI_API_KEY in .env)
python run_pipeline.py --with-llm
# Launch the dashboard
streamlit run app/app.py
For large runs, use Colab with a T4 GPU. The full pipeline notebook is at notebooks/Full_pipeline_output.ipynb.
!git clone https://github.com/AswaniSahoo/complaint-intelligence-system.git
%cd complaint-intelligence-system
!pip install -r requirements.txt -q
!mkdir -p data/raw
!wget -q -O data/raw/complaints.csv.zip \
"https://files.consumerfinance.gov/ccdb/complaints.csv.zip"
!cd data/raw && unzip -o complaints.csv.zip && rm complaints.csv.zip
!python run_pipeline.py --sample-size 200000 --model both --clustering both --benchmark
| Flag | What it does | Default |
|---|---|---|
| --sample-size N | How many complaints to process | 15000 |
| --model {minilm,bge,both} | Which embedding model(s) | minilm |
| --clustering {kmeans,bertopic,both} | Which clustering method(s) | kmeans |
| --benchmark | Run retrieval latency tests | off |
| --with-llm | Generate LLM summaries | off |
| --provider {gemini,groq,together} | LLM provider | gemini |
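For readers who want to see how the flags above fit together, here is a hypothetical `argparse` reconstruction built only from the table. The actual parser in run_pipeline.py may be defined differently; the help strings and the `build_parser` name are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="run_pipeline.py")
    parser.add_argument("--sample-size", type=int, default=15000, metavar="N",
                        help="How many complaints to process")
    parser.add_argument("--model", choices=["minilm", "bge", "both"],
                        default="minilm", help="Which embedding model(s)")
    parser.add_argument("--clustering", choices=["kmeans", "bertopic", "both"],
                        default="kmeans", help="Which clustering method(s)")
    parser.add_argument("--benchmark", action="store_true",
                        help="Run retrieval latency tests")
    parser.add_argument("--with-llm", action="store_true",
                        help="Generate LLM summaries")
    parser.add_argument("--provider", choices=["gemini", "groq", "together"],
                        default="gemini", help="LLM provider")
    return parser


# Parse an explicit argument list so the example does not read sys.argv.
args = build_parser().parse_args(["--model", "both", "--benchmark"])
print(args.sample_size, args.model, args.clustering, args.benchmark)
```

Flags that are not passed fall back to the defaults in the table, which is why a bare `python run_pipeline.py` corresponds to the quick MiniLM + KMeans run.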
The dashboard has 7 pages:
- Overview - complaint counts, product/issue distributions, time trends
- Clusters - drill into each cluster, see what products/issues it contains
- Complaint Viewer - browse and filter individual complaints
- Semantic Search - search complaints by meaning using FAISS
- Embedding Comparison - MiniLM vs BGE metrics side by side
- Clustering Comparison - KMeans vs BERTopic quality metrics
- Retrieval Benchmark - latency comparison across all four retrievers
├── app/
│ └── app.py # Streamlit dashboard
├── src/
│ ├── preprocess.py # Text cleaning
│ ├── embeddings.py # Embedding generation (MiniLM + BGE)
│ ├── embedding_benchmark.py # Embedding comparison metrics
│ ├── clustering.py # KMeans + BERTopic
│ ├── topic_labeler.py # LLM-based topic labeling
│ ├── rag.py # FAISS-based search
│ ├── llm_utils.py # LLM provider abstraction
│ ├── visualizer.py # UMAP projections
│ ├── retrievers/
│ │ ├── base.py # Retriever interface
│ │ ├── vector_retriever.py # FAISS search
│ │ ├── bm25_retriever.py # BM25 keyword search
│ │ ├── hybrid_retriever.py # Vector + BM25 fusion
│ │ ├── reranker.py # Cross-encoder reranking
│ │ └── reranked_retriever.py # Two-stage retrieval
│ └── evaluation/
│ └── retrieval_benchmark.py # Latency benchmarks
├── data/
│ ├── raw/ # CFPB data (not committed)
│ ├── processed/ # Cleaned data + embeddings
│ └── results/ # Benchmark outputs (JSON)
├── notebooks/
│ └── Full_pipeline_output.ipynb # Complete Colab run
├── tests/
├── run_pipeline.py # Main pipeline script
└── requirements.txt
- sentence-transformers for embeddings
- bertopic, hdbscan, umap-learn for topic modeling
- faiss-cpu for vector search
- rank-bm25 for keyword search
- cross-encoder/ms-marco-MiniLM-L-6-v2 for reranking
- streamlit + plotly for the dashboard
- Google Gemini for topic labeling (optional)
MIT






