About the Project

The Problem That Started It All

I've always been fascinated by the idea validation problem in startups. Entrepreneurs spend months building something only to realize nobody wants it. The signals were always there - buried in Reddit threads, Hacker News discussions, YouTube, and Product Hunt launches. People constantly talk about what frustrates them and what they'd pay for, but this data is scattered and impossible to analyze at scale. I wanted to build something that could surface actual market demand before anyone writes a single line of code.


What I Learned

  • Elasticsearch was completely new to me. Learning hybrid search that combines BM25 keyword matching with dense vector embeddings was fascinating: BM25 handles exact matches well, while vector similarity captures semantic meaning and context.

  • The bigger challenge was Elasticsearch's Open Inference integration with Vertex AI. The two-stage retrieval concept (fast hybrid search for top 100, then AI reranking for best 20) made sense on paper, but getting the inference endpoint configurations working took considerable debugging - service account permissions, endpoint initialization, and graceful fallback handling all needed careful tuning. I haven't fully mastered it yet, but I'll be deep-diving into it after the hackathon to understand it better.

  • Working with Gemini 2.5 Pro taught me about prompt engineering. Being extremely specific about scoring criteria with clear ranges (0-100) and examples helped significantly.

  • The quality scoring system was born from spam frustration. I built a multi-signal detector checking repeated characters, excessive links (>3), emoji spam (>5), and readability metrics. It filters 30-40% of collected posts before indexing.
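The multi-signal detector can be sketched as a small pure function. This is an illustrative version based on the signals listed above; the exact thresholds for repeated characters and the readability heuristic are my assumptions, not the project's actual values.

```typescript
// Illustrative multi-signal spam detector: repeated characters,
// excessive links (>3), emoji spam (>5), and a crude readability proxy.
interface QualityResult {
  spam: boolean;
  reasons: string[];
}

function checkQuality(text: string): QualityResult {
  const reasons: string[] = [];

  // Repeated characters, e.g. "soooooo" or "!!!!!!" (threshold assumed)
  if (/(.)\1{5,}/.test(text)) reasons.push("repeated-characters");

  // Excessive links (>3)
  const links = text.match(/https?:\/\/\S+/g) ?? [];
  if (links.length > 3) reasons.push("excessive-links");

  // Emoji spam (>5)
  const emoji = text.match(/\p{Extended_Pictographic}/gu) ?? [];
  if (emoji.length > 5) reasons.push("emoji-spam");

  // Crude readability proxy: average word length in a sane range (assumed)
  const words = text.split(/\s+/).filter(Boolean);
  const avgLen =
    words.reduce((sum, w) => sum + w.length, 0) / Math.max(words.length, 1);
  if (avgLen < 2 || avgLen > 14) reasons.push("low-readability");

  return { spam: reasons.length > 0, reasons };
}
```

Running posts through a check like this before embedding is what keeps 30-40% of the junk out of the index without any model calls.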

How I Built It

Architecture Decisions

  • Single Database - Elasticsearch only. No Postgres, MongoDB, or Redis. This eliminated sync issues but means Elasticsearch handles everything: full-text search, vector similarity, aggregations, and analytics. The index stores text fields, a 768-dimensional dense vector, nested sentiment/quality objects, and temporal data.
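A sketch of what that single-index mapping can look like. Field names here are illustrative, not the project's actual schema; the `dense_vector` dimensionality matches the 768-dim embeddings mentioned above.

```typescript
// Illustrative Elasticsearch mapping for the single posts index:
// text fields, a 768-dim dense vector, sentiment/quality objects,
// and temporal data, all in one place.
const postsMapping = {
  mappings: {
    properties: {
      title: { type: "text" },
      content: { type: "text" },
      platform: { type: "keyword" },
      url: { type: "keyword" },
      // 768-dim embedding from text-embedding-004
      embedding: {
        type: "dense_vector",
        dims: 768,
        index: true,
        similarity: "cosine",
      },
      sentiment: {
        properties: {
          label: { type: "keyword" },
          score: { type: "float" },
        },
      },
      quality: {
        properties: {
          score: { type: "float" },
          spam: { type: "boolean" },
        },
      },
      created_at: { type: "date" },
    },
  },
} as const;
```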

  • Stateless API - No authentication system (no NextAuth, Clerk, or JWT). Everything is public. Intentional for the MVP - I wanted to focus on core features without user management complexity. Makes deployment simple (just Vercel), though there is no usage tracking or rate limiting yet.

  • Next.js 14 Serverless - Six API routes on Vercel with 60-second timeouts:

    1. /api/search - Hybrid search + optional AI reranking
    2. /api/validate-idea - Automated validation using Gemini
    3. /api/chat - Streaming chat with grounded responses (SSE)
    4. /api/analyze-opportunity - Deep market analysis
    5. /api/collect - Manual data collection trigger
    6. /api/analytics - Elasticsearch aggregations
  • Error handling degrades gracefully - if reranking fails, the system falls back to standard search.

  • Multi-Model Vertex AI - Four models for different use cases:
    • Gemini 2.5 Flash - Fast idea validation and chat
    • Gemini 2.5 Pro - Detailed opportunity analysis
    • text-embedding-004 - Search embeddings (768-dim)
    • semantic-ranker-512 - Result reranking via Elasticsearch

The Data Pipeline

  • After every search, the system automatically collects fresh data in the background without blocking your results. This keeps the index up to date with the latest discussions.

The process:

  • Parallel scraping across YouTube, Reddit, Hacker News, and Product Hunt
  • Quality filtering - Sentiment analysis, spam detection, and readability scoring remove low-quality posts
  • Smart embeddings - Text is converted to 768-dimensional vectors using Vertex AI
  • Bulk indexing - Everything gets indexed into Elasticsearch
  • Getting Open Inference working was tricky: lazy endpoint creation, service account permissions, and graceful fallback when the endpoint is unavailable.
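The orchestration behind these steps can be sketched as follows. This is a minimal version assuming hypothetical connector functions; the key point from the pipeline above is that `Promise.allSettled` keeps one platform's failure from killing the whole job.

```typescript
// Sketch of the collection pipeline: scrape all platforms in
// parallel, drop failed platforms, then quality-filter before the
// expensive embedding/indexing steps. Names are illustrative.
interface SocialPostLike {
  title: string;
  content: string;
}

type Connector = () => Promise<SocialPostLike[]>;

async function collect(
  connectors: Connector[],
  isHighQuality: (post: SocialPostLike) => boolean,
): Promise<SocialPostLike[]> {
  // Promise.allSettled: one platform timing out or rate-limiting
  // doesn't reject the whole collection job
  const settled = await Promise.allSettled(connectors.map((c) => c()));

  const posts = settled
    .filter(
      (r): r is PromiseFulfilledResult<SocialPostLike[]> =>
        r.status === "fulfilled",
    )
    .flatMap((r) => r.value);

  // Quality filtering before embeddings and bulk indexing
  return posts.filter(isHighQuality);
}
```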

Key Features

  • Validation Flow:
    • User enters idea → Gemini generates search query
    • Search Elasticsearch for discussions
    • Gemini scores: demand, problem severity, willingness to pay, competition, timing
    • Returns structured report with citations (10 seconds vs hours of manual searching)
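The "search Elasticsearch" step uses the hybrid retrieval described earlier. A sketch of what such a request body can look like, using Elasticsearch's RRF retriever to fuse the BM25 and kNN legs; index and field names are placeholders, and the reranking stage is omitted.

```typescript
// Sketch of a hybrid search request: BM25 keyword leg + kNN vector
// leg, fused with reciprocal rank fusion (RRF). The top results
// would then go to the semantic reranker.
interface HybridSearchOptions {
  query: string;
  queryVector: number[]; // 768-dim embedding of the query text
  size?: number;
}

function buildHybridQuery({ query, queryVector, size = 100 }: HybridSearchOptions) {
  return {
    size,
    retriever: {
      rrf: {
        retrievers: [
          // BM25 leg: exact keyword matching over title and body
          {
            standard: {
              query: { multi_match: { query, fields: ["title", "content"] } },
            },
          },
          // Vector leg: semantic similarity over the dense_vector field
          {
            knn: {
              field: "embedding",
              query_vector: queryVector,
              k: size,
              num_candidates: size * 4,
            },
          },
        ],
      },
    },
  };
}
```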

  • Analytics Dashboard - Built entirely on Elasticsearch aggregations:
    • Date histograms for trending topics
    • Platform/tag breakdowns
    • Sentiment distribution
    • Engagement heatmaps using Painless scripting for hour-of-day analysis
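The aggregations behind the dashboard can be sketched as one request body. Field names are illustrative; the Painless script is a plausible way to bucket by hour of day, not necessarily the project's exact script.

```typescript
// Sketch of the dashboard aggregations: a date histogram for
// trends, a terms breakdown by platform, and a Painless-scripted
// terms agg for the hour-of-day engagement heatmap.
const analyticsAggs = {
  size: 0, // aggregations only, no hits
  aggs: {
    // Trending topics over time
    posts_over_time: {
      date_histogram: { field: "created_at", calendar_interval: "day" },
    },
    // Platform breakdown
    by_platform: {
      terms: { field: "platform" },
    },
    // Hour-of-day buckets via a Painless script
    by_hour: {
      terms: {
        script: {
          lang: "painless",
          source: "doc['created_at'].value.getHour()",
        },
        size: 24,
      },
    },
  },
} as const;
```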

Challenges I Faced

  • Platform-specific quirks created a normalization nightmare. Product Hunt uses GraphQL with nested topic structures (topics.edges[].node.name), Reddit has inconsistent formats where selftext might be empty and timestamps are in Unix seconds, YouTube requires two API calls (search, then fetch details) with engagement metrics as strings needing parsing, and Hacker News has two separate APIs (Firebase and Algolia) with stories that can be deleted or dead. I solved this with a unified SocialPost interface that normalizes everything - YouTube's likeCount and Reddit's score both map to a generic score field, all timestamp formats convert to JavaScript Date objects, and each connector has a normalize function. Without this abstraction, search and indexing would've been unmaintainable.
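An illustrative version of that unified interface, with one connector's normalize function. The exact field names are my assumptions based on the description; only the Reddit quirks (empty selftext, Unix-second timestamps) are taken from the text above.

```typescript
// Unified shape every connector normalizes into: platform-specific
// engagement metrics map to a generic score, and every timestamp
// format converts to a JavaScript Date.
interface SocialPost {
  platform: "youtube" | "reddit" | "hackernews" | "producthunt";
  title: string;
  content: string;
  url: string;
  score: number; // likes, upvotes, points — all land here
  createdAt: Date;
}

// Reddit connector: selftext may be empty, created_utc is Unix seconds
interface RawRedditPost {
  title: string;
  selftext?: string;
  permalink: string;
  score: number;
  created_utc: number;
}

function normalizeReddit(raw: RawRedditPost): SocialPost {
  return {
    platform: "reddit",
    title: raw.title,
    content: raw.selftext ?? "", // empty selftext becomes empty string
    url: `https://reddit.com${raw.permalink}`,
    score: raw.score,
    createdAt: new Date(raw.created_utc * 1000), // seconds → milliseconds
  };
}
```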

  • Deduplication turned into a bigger problem than expected. The same discussion often appears multiple times on the same platform, sometimes with slight title variations or different URLs pointing to the same content. I built a normalization system that strips protocols and URL parameters from links and normalizes titles by removing punctuation and truncating to 100 characters. Even then, some duplicates slip through when titles are significantly reworded.
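A dedup key along those lines can be sketched as a pure function. The details (lowercasing, trailing-slash handling, the separator) are assumptions; the protocol/parameter stripping and the 100-character title truncation come from the description above.

```typescript
// Sketch of a dedup key: normalized URL plus normalized title.
// Two posts with the same key are treated as duplicates.
function dedupKey(url: string, title: string): string {
  const normalizedUrl = url
    .replace(/^https?:\/\//, "") // strip protocol
    .replace(/\?.*$/, "")        // strip query parameters
    .replace(/\/+$/, "");        // strip trailing slashes (assumed)

  const normalizedTitle = title
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "") // remove punctuation
    .replace(/\s+/g, " ")
    .trim()
    .slice(0, 100); // truncate to 100 characters

  return `${normalizedUrl}|${normalizedTitle}`;
}
```

As noted, this still misses duplicates whose titles are heavily reworded; catching those would need embedding-level similarity rather than string normalization.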

  • API rate limits hit hard once I started scaling data collection. YouTube's quota system is particularly brutal - you burn through the daily limit fast - and Reddit rate-limits your IP very quickly. I implemented Promise.allSettled so one platform failure doesn't kill the entire collection job. Still not ideal for real-time needs; I'll switch to a fresh Reddit account for the hackathon submission.

  • The background collection pipeline needed careful orchestration. Fetching from four platforms, running sentiment analysis, calculating quality scores, generating embeddings, and bulk indexing - all while handling partial failures gracefully. I use Promise.allSettled everywhere, so one platform timeout doesn't break the entire job.

What's Next

  • First priority is expanding platforms - X, Stack Overflow, GitHub issues, and Quora.

  • Validation API performance - The /api/validate-idea endpoint takes nearly a minute to complete. The main bottleneck is sequential Gemini calls and Elasticsearch searches. Solution: Multi-layer caching. Cache Elasticsearch results for identical search queries (1-hour TTL) since social discussions don't change that rapidly. Cache generated embeddings for common keywords to avoid repeated Vertex AI calls. Cache the entire validation report for identical ideas (6-hour TTL) - if someone already validated "AI chatbot for customer service" today, return the cached analysis instantly instead of reprocessing. This could drop response time from 60 seconds to under 2 seconds for cache hits, and even misses would benefit from cached embeddings. Redis with query hash keys would handle this cleanly.
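A minimal in-memory sketch of the caching idea, assuming hash-keyed entries with per-layer TTLs; in production this would be Redis with query hash keys, as noted above, and the cached value types are placeholders.

```typescript
import { createHash } from "node:crypto";

// Minimal TTL cache keyed by a hash of the input string — the
// in-memory stand-in for Redis with query hash keys.
interface Entry<T> {
  value: T;
  expiresAt: number;
}

class TtlCache<T> {
  private store = new Map<string, Entry<T>>();
  constructor(private ttlMs: number) {}

  private key(input: string): string {
    // Hash so arbitrarily long ideas/queries make fixed-size keys
    return createHash("sha256").update(input).digest("hex");
  }

  get(input: string): T | undefined {
    const entry = this.store.get(this.key(input));
    if (!entry || entry.expiresAt < Date.now()) return undefined;
    return entry.value;
  }

  set(input: string, value: T): void {
    this.store.set(this.key(input), {
      value,
      expiresAt: Date.now() + this.ttlMs,
    });
  }
}

// Layer TTLs from the plan: 1 hour for search results,
// 6 hours for full validation reports
const searchCache = new TtlCache<unknown>(60 * 60 * 1000);
const reportCache = new TtlCache<unknown>(6 * 60 * 60 * 1000);
```

A validate-idea request would check `reportCache` first, then `searchCache`, and only fall through to Gemini and Elasticsearch on a full miss.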

  • Export functionality - CSV for raw data and PDF reports with charts would make this shareable with teams and investors.

  • Multi-lingual support would open up non-English markets. The embedding models already handle multiple languages, but sentiment analysis and prompts require reworking.

  • I keep thinking about a browser extension that lets you highlight any discussion and instantly validate if it's a real opportunity - lightweight models client-side, full pipeline for deep analysis.

  • Testing infrastructure - Zero test coverage currently. At a minimum, I need unit tests for the scoring logic and integration tests for the API routes.

  • I'll also deep-dive into improving how data is indexed so the system returns more relevant results.


This project pushed me deep into Elasticsearch infrastructure and AI embeddings territory I didn't expect to navigate. Along the way, I developed strong opinions on embedding models and learned more about search architecture than I anticipated. But the real win? I built a system that automatically surfaces market opportunities from conversations, turning noise into actionable intelligence. That was the whole point.
