About Doc or Who
The Problem
Enterprise documents are scattered across drives, emails, and folders. A traditional search returns 50 PDFs and leaves the user to sift through them. Relationships between documents, people, and contracts remain hidden.
Our Solution
Doc or Who combines three integrated search paradigms into a unified platform:
1. Hybrid Search (Semantic + Keyword)
- BM25 (Whoosh) catches exact keywords and regulatory terms
- Vector embeddings (ChromaDB) understand semantic meaning
- RRF fusion combines both signals intelligently
- Result: Find "supplier contracts 2024" whether documents say "agreement" or "contract"
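To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. The function name `rrf_fuse`, the sample file names, and the constant `k=60` (a commonly used default for RRF) are assumptions for illustration; the project's actual parameters are not stated here.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document receives 1 / (k + rank) from every list it appears in,
    so documents ranked well by both engines rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from the keyword and vector engines.
bm25_hits = ["contract_2024.pdf", "invoice_q4.csv", "memo.txt"]
vector_hits = ["agreement_acme.pdf", "contract_2024.pdf", "memo.txt"]

fused = rrf_fuse([bm25_hits, vector_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Because RRF only looks at ranks, it never has to reconcile BM25 scores with cosine similarities, which live on different scales.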
2. Morphological Normalization
Spanish stemming (Snowball) handles inflected forms:
- reuniones → reunion
- contratación → contrat
- proveedores → proveedor
This recovers results traditional BM25 misses entirely.
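The sketch below illustrates why stemming both the index and the query recovers those matches. To stay dependency-free it uses a deliberately crude suffix stripper, `toy_stem`, as a stand-in: it is not the Snowball algorithm, and its suffix list was chosen only to cover the three forms above. In a real setup the stand-in could be swapped for a Snowball stemmer such as NLTK's `SnowballStemmer("spanish")`. The document names and contents are hypothetical.

```python
# Crude stand-in for a Spanish stemmer; NOT the Snowball algorithm.
SUFFIXES = ("ación", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Return the set of stems for a whitespace-tokenized text."""
    return {toy_stem(token) for token in text.split()}


# Hypothetical documents, stemmed once at index time.
documents = {
    "acta.txt": "reuniones con proveedores",
    "nota.txt": "contratación de personal",
}
index = {name: normalize(body) for name, body in documents.items()}


def search(query):
    """Stem the query the same way and return documents sharing any stem."""
    terms = normalize(query)
    return sorted(name for name, stems in index.items() if terms & stems)


# An exact-token match would miss this: "proveedor" is not a token of acta.txt.
exact_match = "proveedor" in documents["acta.txt"].split()
stemmed_match = search("proveedor")
```

The key design point is that the same normalization runs at index time and at query time; a stemmer applied to only one side would make matches worse, not better.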
3. AI Agent with Function-Calling
Users ask in natural language:
- "How many Q4 invoices from TechCorp?" → Routes to SQL engine over CSV data
- "Who handles supplier relationships?" → Extracts from entity graph + documents
- "Summarize all Q4 contracts" → LLM synthesis with citations
The agent logs every decision, tool call, and failure, so problems are visible immediately when debugging.
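As a rough illustration of logged function-calling, the sketch below dispatches a tool call by name and records the call, its arguments, and any failure. The tool bodies are hypothetical stubs, and the call format (`{"name": ..., "arguments": {...}}`) and the helper `run_tool_call` are assumptions, not the project's actual interface.

```python
import json
import logging

logger = logging.getLogger("agent")


def search_documents(query):
    # Hypothetical stub standing in for the hybrid search engine.
    return [f"doc matching {query!r}"]


def query_data(sql):
    # Hypothetical stub standing in for the SQL engine over CSV data.
    return [{"sql": sql, "rows": 0}]


TOOLS = {"search_documents": search_documents, "query_data": query_data}


def run_tool_call(call):
    """Execute one tool call and log the call, its arguments, and the outcome."""
    name = call["name"]
    args = call.get("arguments", {})
    logger.info("tool_call %s", json.dumps({"tool": name, "arguments": args}))
    tool = TOOLS.get(name)
    if tool is None:
        logger.error("unknown tool %s", name)
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        result = tool(**args)
    except Exception as exc:
        # Log the traceback and report the failure instead of crashing the agent.
        logger.exception("tool %s failed", name)
        return {"ok": False, "error": str(exc)}
    logger.info("tool_result %s ok", name)
    return {"ok": True, "result": result}


outcome = run_tool_call({"name": "search_documents", "arguments": {"query": "Q4"}})
```

Returning a structured error rather than raising lets the model see the failure and choose another tool, while the log keeps the full trace for later inspection.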
What Makes Us Different
| Feature | Doc or Who | Typical Search |
|---|---|---|
| Search Type | Hybrid (semantic + BM25) | Keyword-only or semantic-only |
| Morphology | Spanish stemming built-in | No morphological awareness |
| Entity Graph | Automatic relationship extraction | No relationship view |
| SQL Integration | Query CSVs as tables | Documents only |
| AI Agent | Function-calling with logging | No agent or black-box agent |
| Clustering | 2-level hierarchical with AI labels | Tag-based or no clustering |
| Cascading Filters | 5 dimensions with dynamic counts | Static filters |
Technical Highlights
- Zero external databases: Whoosh + ChromaDB + DuckDB all local
- Fast inference: Groq LLM (70B, sub-1s latency)
- Backward-compatible schema evolution: Gracefully handles old indices
- Structured logging: Agent tracing in data/agent.log for debugging
- Multi-format support: PDF (with OCR), TXT, CSV, XLSX, DOCX
Architecture
Query → Hybrid Search (BM25 + Embeddings + RRF)
→ Entity Graph (spaCy NER + NetworkX)
→ AI Agent (Groq 70B with tools)
├ search_documents
├ query_data (SQL/DuckDB)
├ get_entity_info
└ find_connection
→ Unified Results with Explanations
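One plausible reading of the find_connection tool is a shortest-path lookup over the entity graph. The sketch below uses a plain breadth-first search over an adjacency dictionary as a dependency-free stand-in for a NetworkX shortest-path query; the graph contents, including the person "Ana García", are hypothetical, and the function signature is an assumption.

```python
from collections import deque


def find_connection(graph, source, target):
    """Return the shortest path between two entities, or None if unconnected."""
    if source not in graph or target not in graph:
        return None
    previous = {source: None}  # Maps each visited node to the node it came from.
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the predecessor chain back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical entity graph: people, organizations, and the documents linking them.
graph = {
    "Ana García": {"contract_2024.pdf"},
    "contract_2024.pdf": {"Ana García", "TechCorp"},
    "TechCorp": {"contract_2024.pdf", "invoice_q4.csv"},
    "invoice_q4.csv": {"TechCorp"},
}

path = find_connection(graph, "Ana García", "invoice_q4.csv")
```

Keeping documents as nodes alongside entities means every hop in the returned path names the evidence that links two entities.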
Key Differentiator: Transparency + Function-Calling
Most AI search tools hide how results are produced. Doc or Who shows:
- Which search engine matched (semantic vs. keyword vs. stemming)
- Which fallback was used (fuzzy 3-grams vs. numeric normalization)
- Which agent tools executed and with what arguments
- Why a result appeared (matched fields, confidence scores)
This transparency turns "magic AI" into trustworthy, debuggable search.
Built With
- chromadb
- duckdb
- fastapi
- networkx
- next.js
- nltk
- python
- scikit-learn
- sentence-transformers
- spacy
- typescript