About Doc or Who

The Problem

Enterprise documents are scattered across drives, emails, and folders. A traditional search returns a list of 50 PDFs with no indication of which one answers the question. Knowledge about the relationships between documents, people, and contracts stays hidden.

Our Solution

Doc or Who combines three integrated search paradigms into a unified platform:

1. Hybrid Search (Semantic + Keyword)

  • BM25 (Whoosh) catches exact keywords and regulatory terms
  • Vector embeddings (ChromaDB) understand semantic meaning
  • Reciprocal Rank Fusion (RRF) merges the two rankings into a single result list
  • Result: Find "supplier contracts 2024" whether documents say "agreement" or "contract"
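
As a rough illustration of the fusion step, the sketch below applies the standard RRF formula (each document scores the sum of 1 / (k + rank) over the lists it appears in). The function name `rrf_fuse`, the file names, and the constant k = 60 (a commonly used default) are assumptions for the example, not details taken from the Doc or Who code.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. k=60 is an assumed default, not a project setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from the keyword and the vector engine.
bm25_hits = ["contract_2024.pdf", "invoice_17.pdf", "memo.txt"]
vector_hits = ["agreement_acme.pdf", "contract_2024.pdf", "memo.txt"]

fused = rrf_fuse([bm25_hits, vector_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Documents that both engines return (here `contract_2024.pdf` and `memo.txt`) rise above documents found by only one engine, which is the behaviour the hybrid search relies on.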

2. Morphological Normalization

Spanish stemming (Snowball) handles inflected forms:

  • reuniones → reunion
  • contratación → contrat
  • proveedores → proveedor

This recovers matches that BM25 over unstemmed tokens misses entirely.
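
The sketch below shows why stemming both the indexed text and the query recovers those matches. To stay dependency-free it uses `toy_stem`, a deliberately tiny suffix stripper that is NOT the Snowball algorithm; it only reproduces the three examples above. All names and sample documents are hypothetical. In practice the `stem` argument would be a real Snowball Spanish stemmer, for instance from the snowballstemmer or NLTK packages.

```python
# Hypothetical suffix list, checked in order. A stand-in for Snowball, not a port of it.
TOY_SUFFIXES = ("ación", "es", "s")


def toy_stem(token):
    """Strip one known suffix, keeping at least three characters of stem."""
    token = token.lower()
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(docs, stem):
    """Map each stemmed token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(stem(token), set()).add(doc_id)
    return index


def search(index, query, stem):
    """Return the documents that contain every stemmed query term."""
    postings = [index.get(stem(term), set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()


docs = {
    "acta.txt": "reuniones con proveedores",
    "nota.txt": "contratación urgente",
}

stemmed_index = build_index(docs, toy_stem)
exact_index = build_index(docs, str.lower)  # no stemming, only lowercasing

with_stemming = search(stemmed_index, "proveedor", toy_stem)
without_stemming = search(exact_index, "proveedor", str.lower)
```

With stemming, the singular query `proveedor` finds `acta.txt`, which only contains the plural; without it, the same query returns nothing.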

3. AI Agent with Function-Calling

Users ask in natural language:

  • "How many Q4 invoices from TechCorp?" → Routes to SQL engine over CSV data
  • "Who handles supplier relationships?" → Extracts from entity graph + documents
  • "Summarize all Q4 contracts" → LLM synthesis with citations

The agent logs every decision, tool call, and failure, so problems surface immediately during debugging.
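
A minimal sketch of how tool routing with logging could look is shown below. The tool names come from this project's architecture, but the stub bodies, the `dispatch` function, and the tool-call shape (a name plus a JSON-encoded argument string, as in common function-calling APIs) are assumptions for illustration only.

```python
import json
import logging

# A handler such as logging.FileHandler("data/agent.log") could be attached here.
logger = logging.getLogger("agent")


def search_documents(query):
    """Stub: a real implementation would call the hybrid search."""
    return [f"stub hit for {query!r}"]


def query_data(sql):
    """Stub: a real implementation would run the SQL against the CSV tables."""
    return [{"stub": sql}]


TOOLS = {"search_documents": search_documents, "query_data": query_data}


def dispatch(tool_call):
    """Run one tool call requested by the model and log what happened."""
    name = tool_call["name"]
    args = json.loads(tool_call["arguments"])
    logger.info("tool_call name=%s args=%s", name, args)
    tool = TOOLS.get(name)
    if tool is None:
        logger.error("unknown tool %s", name)
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        result = tool(**args)
    except Exception as exc:  # record the failure instead of crashing the agent
        logger.exception("tool %s failed", name)
        return {"ok": False, "error": str(exc)}
    logger.info("tool_result name=%s", name)
    return {"ok": True, "result": result}


good = dispatch({"name": "search_documents", "arguments": '{"query": "Q4 contracts"}'})
unknown = dispatch({"name": "delete_everything", "arguments": "{}"})
bad_args = dispatch({"name": "query_data", "arguments": '{"wrong": 1}'})
```

Returning failures as data rather than raising lets the agent report the error back to the model while the log keeps the full trace.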

What Makes Us Different

Feature           | Doc or Who                          | Typical Search
Search Type       | Hybrid (semantic + BM25)            | Keyword-only or semantic-only
Morphology        | Spanish stemming built-in           | No morphological awareness
Entity Graph      | Automatic relationship extraction   | No relationship view
SQL Integration   | Query CSVs as tables                | Documents only
AI Agent          | Function-calling with logging       | No agent or black-box agent
Clustering        | 2-level hierarchical with AI labels | Tag-based or no clustering
Cascading Filters | 5 dimensions with dynamic counts    | Static filters
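
To make "dynamic counts" concrete, here is one common way to compute them: the counts shown for each filter dimension reflect the selections made on all the other dimensions. This is a sketch under assumptions; the function `facet_counts` and the dimension names `type` and `year` are hypothetical, and the project's five actual dimensions are not specified here.

```python
from collections import Counter


def facet_counts(docs, selected, dimensions):
    """Count values per dimension among docs matching the other active filters."""
    counts = {}
    for dim in dimensions:
        # Apply every selected filter except the one on this dimension,
        # so the user still sees alternatives within the dimension itself.
        others = {d: v for d, v in selected.items() if d != dim}
        matching = [
            doc for doc in docs
            if all(doc.get(d) == v for d, v in others.items())
        ]
        counts[dim] = Counter(doc[dim] for doc in matching if dim in doc)
    return counts


# Hypothetical document metadata.
docs = [
    {"type": "pdf", "year": 2024},
    {"type": "pdf", "year": 2023},
    {"type": "csv", "year": 2024},
]

counts = facet_counts(docs, selected={"type": "pdf"}, dimensions=["type", "year"])
```

After selecting `type = pdf`, the year counts shrink to the two PDF documents, while the type counts still show that one CSV is available.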

Technical Highlights

  • Zero external databases: Whoosh + ChromaDB + DuckDB all local
  • Fast inference: Groq LLM (70B, sub-1s latency)
  • Backward-compatible schema evolution: Gracefully handles old indices
  • Structured logging: Agent tracing in data/agent.log for debugging
  • Multi-format support: PDF (with OCR), TXT, CSV, XLSX, DOCX

Architecture

Query → Hybrid Search (BM25 + Embeddings + RRF)
     → Entity Graph (spaCy NER + NetworkX)
     → AI Agent (Groq 70B with tools)
       ├ search_documents
       ├ query_data (SQL/DuckDB)
       ├ get_entity_info
       └ find_connection
     → Unified Results with Explanations
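
As an illustration of what a tool like find_connection might do, the sketch below runs a breadth-first search for the shortest chain of entities and documents linking two nodes. It uses a plain adjacency dictionary as a stand-in for the NetworkX graph so that it runs without dependencies; the graph contents, the person names, and the function signature are assumptions.

```python
from collections import deque


def find_connection(graph, source, target):
    """Return the shortest path from source to target as a list, or None."""
    if source not in graph or target not in graph:
        return None
    previous = {source: None}  # node -> node it was reached from
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):  # sorted for deterministic output
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical undirected graph: people and companies linked via documents.
graph = {
    "Ana Ruiz": {"contract_2024.pdf"},
    "contract_2024.pdf": {"Ana Ruiz", "TechCorp"},
    "TechCorp": {"contract_2024.pdf"},
    "Luis Gil": set(),
}

linked = find_connection(graph, "Ana Ruiz", "TechCorp")
unlinked = find_connection(graph, "Luis Gil", "TechCorp")
```

The returned path doubles as the explanation: it names the document through which the two entities are related.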

Key Differentiator: Transparency + Function-Calling

Most AI search tools hide why a result was returned. Doc or Who shows:

  • Which search engine matched (semantic vs. keyword vs. stemming)
  • Which fallback was used (fuzzy 3-grams vs. numeric normalization)
  • Which agent tools executed and with what arguments
  • Why a result appeared (matched fields, confidence scores)

This transparency turns "magic AI" into trustworthy, debuggable search.
