About Doc or Who

The Problem

Enterprise documents are scattered across drives, emails, and folders. A traditional search returns a list of 50 PDFs, which is of little use. Knowledge about how documents, people, and contracts relate to one another stays hidden.

Our Solution

Doc or Who combines three integrated search paradigms into a unified platform:

1. Hybrid Search (Semantic + Keyword)

BM25 (Whoosh) catches exact keywords and regulatory terms
Vector embeddings (ChromaDB) understand semantic meaning
RRF (Reciprocal Rank Fusion) merges the two ranked lists into a single ranking
Result: Find "supplier contracts 2024" whether documents say "agreement" or "contract"
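
As an illustration of the fusion step, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. The function name rrf_fuse, the file names, and the constant k=60 (a commonly used default) are assumptions for the example, not details taken from Doc or Who itself.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from the keyword and vector back ends.
bm25_hits = ["contract_2024.pdf", "invoice_q4.pdf", "memo.txt"]
vector_hits = ["agreement_2024.pdf", "contract_2024.pdf", "memo.txt"]

fused = rrf_fuse([bm25_hits, vector_hits])
print(fused)
```

Documents that appear in both lists (here contract_2024.pdf and memo.txt) rise above documents found by only one signal, which is the behavior hybrid search relies on.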

2. Morphological Normalization

Spanish stemming (Snowball) handles inflected forms:

reuniones → reunion
contratación → contrat
proveedores → proveedor

This recovers results that BM25 without stemming misses entirely.
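
The effect can be sketched with a toy example. The lookup table below only contains the three stems listed above and stands in for a real Snowball stemmer; the helper names normalize and matches, and the sample sentence, are hypothetical.

```python
# Toy stem table built from the examples above. A real system would call
# a Snowball stemmer here instead of a fixed lookup (assumption).
TOY_STEMS = {
    "reuniones": "reunion",
    "contratación": "contrat",
    "proveedores": "proveedor",
}

def normalize(text):
    """Lowercase, split on whitespace, and map each token to its stem."""
    return [TOY_STEMS.get(token, token) for token in text.lower().split()]

def matches(query, document):
    """True if every stemmed query term occurs among the document's stems."""
    doc_terms = set(normalize(document))
    return all(term in doc_terms for term in normalize(query))

doc = "Lista de proveedores aprobados"

# Exact token matching misses the plural form.
exact_hit = "proveedor" in doc.lower().split()
# Matching on stems finds it, because both sides normalize to "proveedor".
stemmed_hit = matches("proveedor", doc)

print(exact_hit, stemmed_hit)
```

The key design point is that the same normalization is applied to documents at index time and to queries at search time.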

3. AI Agent with Function-Calling

Users ask in natural language:

"How many Q4 invoices from TechCorp?" → Routes to SQL engine over CSV data
"Who handles supplier relationships?" → Extracts from entity graph + documents
"Summarize all Q4 contracts" → LLM synthesis with citations

The agent logs every decision, tool call, and failure, so problems surface immediately during debugging.
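
A minimal sketch of how a function-calling agent might dispatch and log tool calls is shown below. The four tool names match those listed under Architecture, but their bodies, their parameters, and the JSON shape of the tool call ("name" plus "arguments") are assumptions made for illustration.

```python
import json
import logging

logger = logging.getLogger("agent")

# Placeholder tool bodies: names follow the architecture list, but the
# parameters and return values are hypothetical.
def search_documents(query):
    return {"tool": "search_documents", "query": query}

def query_data(sql):
    return {"tool": "query_data", "sql": sql}

def get_entity_info(name):
    return {"tool": "get_entity_info", "name": name}

def find_connection(source, target):
    return {"tool": "find_connection", "source": source, "target": target}

TOOLS = {
    "search_documents": search_documents,
    "query_data": query_data,
    "get_entity_info": get_entity_info,
    "find_connection": find_connection,
}

def dispatch(tool_call_json):
    """Run the tool named in a JSON tool call, logging each step and failure."""
    call = json.loads(tool_call_json)
    name = call["name"]
    args = call.get("arguments", {})
    logger.info("tool_call name=%s args=%s", name, args)
    tool = TOOLS.get(name)
    if tool is None:
        logger.error("unknown tool name=%s", name)
        return {"error": f"unknown tool: {name}"}
    try:
        result = tool(**args)
    except Exception as exc:  # record the failure instead of crashing the agent
        logger.exception("tool_failed name=%s", name)
        return {"error": str(exc)}
    logger.info("tool_result name=%s", name)
    return result

result = dispatch('{"name": "query_data", "arguments": {"sql": "SELECT 1"}}')
print(result)
```

Attaching a logging.FileHandler to the "agent" logger would send these records to a file such as data/agent.log.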

What Makes Us Different

Feature	Doc or Who	Typical Search
Search Type	Hybrid (semantic + BM25)	Keyword-only or semantic-only
Morphology	Spanish stemming built-in	No morphological awareness
Entity Graph	Automatic relationship extraction	No relationship view
SQL Integration	Query CSVs as tables	Documents only
AI Agent	Function-calling with logging	No agent or black-box agent
Clustering	2-level hierarchical with AI labels	Tag-based or no clustering
Cascading Filters	5 dimensions with dynamic counts	Static filters
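
To make "dynamic counts" concrete, here is a hedged sketch of how filter counts could be recomputed after each selection. The dimension names (type, year, topic) and the sample records are hypothetical; the five actual dimensions are not named here.

```python
from collections import Counter

# Hypothetical document metadata; real dimensions may differ.
docs = [
    {"type": "pdf", "year": 2024, "topic": "contracts"},
    {"type": "pdf", "year": 2023, "topic": "invoices"},
    {"type": "csv", "year": 2024, "topic": "invoices"},
]

def facet_counts(docs, active):
    """Count values per dimension among documents matching all active filters."""
    selected = [d for d in docs if all(d[k] == v for k, v in active.items())]
    dims = docs[0].keys() if docs else []
    return {dim: Counter(d[dim] for d in selected) for dim in dims}

# After the user picks year=2024, the other dimensions show updated counts.
counts = facet_counts(docs, {"year": 2024})
print(counts["type"], counts["topic"])
```

Values whose count drops to zero simply disappear from the result, so the interface can hide options that would lead to an empty result set.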

Technical Highlights

Zero external databases: Whoosh + ChromaDB + DuckDB all local
Fast inference: Groq LLM (70B, sub-1s latency)
Backward-compatible schema evolution: handles old indices gracefully
Structured logging: Agent tracing in data/agent.log for debugging
Multi-format support: PDF (with OCR), TXT, CSV, XLSX, DOCX
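
The idea of querying CSV data as a table with a local, embedded engine can be sketched as follows. This example deliberately uses Python's built-in sqlite3 as a stand-in for DuckDB so that it runs with no extra dependencies; the column names and rows are invented for illustration.

```python
import csv
import io
import sqlite3

# Invented sample data; in practice this would be read from a CSV file.
CSV_TEXT = """invoice_id,supplier,quarter,amount
1,TechCorp,Q4,1200.50
2,TechCorp,Q3,800.00
3,Acme,Q4,300.00
4,TechCorp,Q4,450.00
"""

def load_csv_as_table(conn, table, text):
    """Create a table from CSV text, using the header row as column names."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)

conn = sqlite3.connect(":memory:")  # local and in-process, no server needed
load_csv_as_table(conn, "invoices", CSV_TEXT)

count = conn.execute(
    "SELECT COUNT(*) FROM invoices WHERE supplier = ? AND quarter = ?",
    ("TechCorp", "Q4"),
).fetchone()[0]
print(count)
```

A question such as "How many Q4 invoices from TechCorp?" then reduces to a single COUNT query over the loaded table.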

Architecture

Query → Hybrid Search (BM25 + Embeddings + RRF)
     → Entity Graph (spaCy NER + NetworkX)
     → AI Agent (Groq 70B with tools)
       ├ search_documents
       ├ query_data (SQL/DuckDB)
       ├ get_entity_info
       └ find_connection
     → Unified Results with Explanations
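
As a rough illustration of what a tool like find_connection could do, the sketch below runs a breadth-first search for the shortest path between two entities. It uses a plain dictionary instead of NetworkX to stay dependency-free, and every entity except TechCorp is made up for the example.

```python
from collections import deque

# Hypothetical entity graph: each node maps to the entities it co-occurs with.
graph = {
    "Ana Ruiz": {"TechCorp"},
    "TechCorp": {"Ana Ruiz", "Contract 2024-01"},
    "Contract 2024-01": {"TechCorp", "Luis Vega"},
    "Luis Vega": {"Contract 2024-01"},
}

def find_connection(graph, source, target):
    """Return the shortest path between two entities, or None if unconnected."""
    if source not in graph or target not in graph:
        return None
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        # Sorted so the result is deterministic when several paths tie.
        for neighbor in sorted(graph[node]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

path = find_connection(graph, "Ana Ruiz", "Luis Vega")
print(" -> ".join(path))
```

The returned path doubles as an explanation: each hop names the entity or document that links the two endpoints.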