Extract product data from any e-commerce site. Export to XLSX, CSV, JSON, or Google Merchant Center feeds.
Getting Started · Features · Dashboard · CLI Reference · API Docs · Architecture
HarvestHub is a TypeScript-first web scraping platform that uses Python's Scrapling library for adaptive, anti-bot product data extraction. It intelligently extracts product information from any e-commerce site using a 3-tier strategy with confidence scoring.
Who is this for?
- E-commerce professionals monitoring competitor pricing
- Data analysts building product datasets
- Developers automating product data pipelines
- Marketers generating Google Merchant Center feeds
Why HarvestHub?
- Works on any e-commerce site: no site-specific templates needed
- Confidence scoring tells you how reliable each data point is
- Premium exports: not just data dumps, but styled XLSX with summary sheets
- Zero cost: no paid APIs, proxies, or services required
- 140 tests passing: production-grade reliability
# 1. Clone & install
git clone https://github.com/SufficientDaikon/harvesthub.git
cd harvesthub
npm install
pip install -r engine/requirements.txt
# 2. Verify everything works
npx tsx src/cli/index.ts status
# 3. Scrape some products
npx tsx src/cli/index.ts scrape --input "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
# 4. Export your data
npx tsx src/cli/index.ts export -f xlsx -o products.xlsx
# 5. Launch the dashboard
npx tsx src/cli/index.ts dashboard -p 4000
# Open http://localhost:4000

Or use the one-command bootstrap on Windows:

.\setup.ps1

HarvestHub doesn't rely on brittle CSS selectors. It uses an intelligent extraction pipeline:
| Tier | Method | Confidence | Speed |
|---|---|---|---|
| 1 | JSON-LD: structured data from `<script type="application/ld+json">` | Highest (90-100%) | Fastest |
| 2 | Microdata: Schema.org attributes (`itemprop`, `itemtype`) | High (70-90%) | Fast |
| 3 | CSS Heuristics: intelligent DOM analysis with scoring | Medium (40-70%) | Moderate |
Each extracted field gets an individual confidence score so you know exactly how reliable your data is.
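To illustrate how per-field confidence can be used downstream, here is a minimal sketch. The field and score shapes are hypothetical, not HarvestHub's actual types:

```typescript
// Hypothetical sketch: filtering extracted fields by confidence.
// These interfaces are illustrative, not HarvestHub's real schema.
interface ExtractedField {
  value: unknown;
  confidence: number; // 0-100, as in the tier table above
}

type Extraction = Record<string, ExtractedField>;

// Keep only fields whose confidence meets a threshold.
function filterByConfidence(extraction: Extraction, minConfidence: number): Extraction {
  return Object.fromEntries(
    Object.entries(extraction).filter(([, field]) => field.confidence >= minConfidence)
  );
}

const result: Extraction = {
  title: { value: "Widget Pro", confidence: 95 }, // e.g. from JSON-LD (tier 1)
  price: { value: 19.99, confidence: 55 },        // e.g. from CSS heuristics (tier 3)
};

const trusted = filterByConfidence(result, 70);
// trusted keeps `title` and drops the low-confidence `price`
```

A consumer can then decide per use case whether tier-3 heuristic values are acceptable or should be re-scraped in stealth mode.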
| Format | Description |
|---|---|
| XLSX | Dark-themed headers, conditional formatting, hyperlinks, summary sheet, auto-filters |
| CSV | UTF-8 with BOM for Excel compatibility |
| JSON | Structured export with metadata header |
| GMC | Google Merchant Center compliant TSV feed with field validation |
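The CSV row above mentions "UTF-8 with BOM for Excel compatibility". The idea can be sketched as follows; this is an illustrative helper, not HarvestHub's actual exporter:

```typescript
// Illustrative sketch: prefix a UTF-8 BOM so Excel detects the encoding
// and renders accented characters and currency symbols correctly.
function toCsvWithBom(rows: string[][]): string {
  const escape = (cell: string) =>
    /[",\n]/.test(cell) ? `"${cell.replace(/"/g, '""')}"` : cell;
  const body = rows.map((row) => row.map(escape).join(",")).join("\n");
  return "\uFEFF" + body; // BOM = U+FEFF, serialized as EF BB BF in UTF-8
}

const csv = toCsvWithBom([
  ["title", "price"],
  ["Café Grinder", "49.99"],
]);
```

Without the BOM, Excel on Windows tends to assume a legacy codepage and mangles non-ASCII product titles.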
A premium SaaS-style web dashboard with:
- Full-text search across products
- Charts and aggregated statistics
- Dark/light theme toggle
- Schedule management UI
- Price change history and trends
- One-click export downloads
- WebSocket live updates during scrapes
Schedule recurring scrapes with standard cron expressions:
# Scrape every day at 9 AM
npx tsx src/cli/index.ts schedule add --name "Daily scrape" --cron "0 9 * * *" --urls products.txt
# List all schedules
npx tsx src/cli/index.ts schedule list
# Start the scheduler (runs in foreground)
npx tsx src/cli/index.ts schedule start

When products are re-scraped, HarvestHub automatically:
- Detects price changes with delta and percentage calculations
- Maintains full price history per product
- Provides trend indicators (up/down/stable)
- Exposes history via API: `GET /api/products/:id/history`
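The delta, percentage, and trend logic above can be sketched like this. The history-entry shape is hypothetical, not HarvestHub's actual record format:

```typescript
// Illustrative price-change detection over a product's history.
// `PricePoint` is a hypothetical shape, not HarvestHub's real type.
interface PricePoint {
  price: number;
  at: string; // ISO timestamp
}

type Trend = "up" | "down" | "stable";

function diffPrices(history: PricePoint[]): { delta: number; percent: number; trend: Trend } {
  if (history.length < 2) return { delta: 0, percent: 0, trend: "stable" };
  const prev = history[history.length - 2].price;
  const curr = history[history.length - 1].price;
  const delta = +(curr - prev).toFixed(2);
  const percent = prev === 0 ? 0 : +((delta / prev) * 100).toFixed(2);
  const trend: Trend = delta > 0 ? "up" : delta < 0 ? "down" : "stable";
  return { delta, percent, trend };
}

const change = diffPrices([
  { price: 50, at: "2024-01-01T09:00:00Z" },
  { price: 45, at: "2024-01-02T09:00:00Z" },
]);
// change = { delta: -5, percent: -10, trend: "down" }
```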
- Rate Limiting: per-domain token bucket prevents blocking
- Retry Engine: exponential backoff with jitter, error classification (transient/permanent/blocked)
- User Agent Rotation: 20 real browser user agents, round-robin
- Proxy Rotation: optional proxy pool with file or env-var configuration
- Stealth Mode: Scrapling's StealthyFetcher for protected sites
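The first two items can be sketched in a few lines, assuming nothing about HarvestHub's internals; a token bucket caps burst traffic per domain, and full-jitter backoff spreads retries out:

```typescript
// Minimal sketch of a token bucket and jittered exponential backoff.
// Capacity/refill numbers below are illustrative, not HarvestHub defaults.
class TokenBucket {
  private tokens: number;
  constructor(private capacity: number, private refillPerSec: number) {
    this.tokens = capacity;
  }
  tryTake(): boolean {
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // caller should wait and refill() as time passes
  }
  refill(elapsedSec: number): void {
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
  }
}

// Full-jitter backoff: random delay in [0, min(cap, base * 2^attempt)].
function backoffMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

const bucket = new TokenBucket(2, 1); // burst of 2, then 1 request/sec
```

Keeping one bucket per domain means a slow, protected site cannot starve scrapes against other domains in the same batch.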
harvest status Show system health & store stats
harvest scrape --urls file.txt Scrape from URL file
harvest scrape --input url1,url2 Scrape specific URLs
harvest scrape --stealth Use stealth mode for protected sites
harvest scrape --dry-run Validate URLs without scraping
harvest scrape --proxies proxies.txt Use proxy rotation
harvest export -f xlsx -o report.xlsx Export to premium Excel
harvest export -f csv -o data.csv Export to CSV
harvest export -f json -o data.json Export to JSON
harvest export -f gmc -o feed.tsv Export Google Merchant Center feed
harvest migrate Import legacy export_all.py data
harvest dashboard -p 4000 Start web dashboard
harvest schedule list List all schedules
harvest schedule add --name "..." ... Create a new schedule
harvest schedule remove <id> Delete a schedule
harvest schedule enable <id> Enable a schedule
harvest schedule disable <id> Disable a schedule
harvest schedule start Start all enabled schedules
| Flag | Default | Description |
|---|---|---|
| `--urls <file>` | – | Path to URL file (one per line) |
| `--input <urls>` | – | Comma-separated URLs |
| `--output <path>` | `data/exports/products.xlsx` | Output file path |
| `--format <fmt>` | `xlsx` | Export format (xlsx, csv, json, gmc) |
| `--retries <n>` | `3` | Max retries per URL |
| `--concurrency <n>` | `3` | Concurrent domain limit |
| `--timeout <ms>` | `30000` | Request timeout |
| `--stealth` | `false` | Bypass bot protection (slower) |
| `--no-export` | – | Scrape and store only |
| `--dry-run` | – | Validate URLs without scraping |
| `--proxies <file>` | – | Proxy list file |
# Lines starting with # are ignored
# One URL per line, blank lines are skipped
https://example.com/product/widget-pro
https://store.example.com/items/gadget-x
https://shop.example.org/p/thingamajig
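Parsing a URL file with those rules (skip blank lines and `#` comments) is straightforward; this is a sketch, not HarvestHub's actual parser:

```typescript
// Split a URL file into usable entries: trim whitespace,
// drop blank lines, and drop lines starting with `#`.
function parseUrlFile(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"));
}

const urls = parseUrlFile(`# nightly watchlist
https://example.com/product/widget-pro

https://store.example.com/items/gadget-x
`);
// urls = ["https://example.com/product/widget-pro",
//         "https://store.example.com/items/gadget-x"]
```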
Launch with `npx tsx src/cli/index.ts dashboard -p 4000` and open http://localhost:4000
| Page | URL | Description |
|---|---|---|
| Dashboard | `/` or `/dashboard` | Product explorer, stats, charts |
| Documentation | `/docs` | Full documentation & API reference |
| Marketing | `/marketing` | Product landing page |
All endpoints available at http://localhost:4000 when the dashboard is running.
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/status` | System health, engine status, product count |
| GET | `/api/products` | Paginated products with search/filter |
| GET | `/api/products/:id/history` | Price history for a product |
| GET | `/api/stats` | Aggregated statistics, top brands/categories |
| GET | `/api/export/:format` | Download export file |
| GET | `/api/jobs` | Last 20 scrape jobs |
| GET | `/api/schedules` | List all schedules |
| POST | `/api/schedules` | Create a schedule |
| PATCH | `/api/schedules/:id` | Update a schedule |
| DELETE | `/api/schedules/:id` | Delete a schedule |
| WS | `/ws` | WebSocket for real-time scrape events |
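Consuming `/ws` from a browser page might look like the sketch below. The event payload shape shown here is hypothetical; check the dashboard source for the real one:

```typescript
// Hypothetical scrape-event shape; only `/ws` itself comes from the docs.
interface ScrapeEvent {
  type: "job:start" | "job:progress" | "job:done";
  jobId: string;
  message?: string;
}

// Defensive parse: WebSocket frames arrive as strings, and a malformed
// frame should not crash the UI.
function parseScrapeEvent(raw: string): ScrapeEvent | null {
  try {
    const data = JSON.parse(raw);
    return typeof data.type === "string" && typeof data.jobId === "string"
      ? (data as ScrapeEvent)
      : null;
  } catch {
    return null;
  }
}

// Usage in a page served by the dashboard:
// const ws = new WebSocket("ws://localhost:4000/ws");
// ws.onmessage = (ev) => {
//   const event = parseScrapeEvent(String(ev.data));
//   if (event) console.log(`[${event.type}] job ${event.jobId}`);
// };
```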
| Param | Type | Description |
|---|---|---|
| `page` | number | Page number (default: 1) |
| `limit` | number | Items per page (default: 50, max: 200) |
| `search` | string | Search title, description, SKU |
| `brand` | string | Filter by brand |
| `availability` | string | Filter by availability status |
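Combining these parameters into a request could look like this; the endpoint and parameter names come from the tables above, while the helper itself is just a sketch:

```typescript
// Build a `/api/products` URL from the documented query parameters.
function productsUrl(
  base: string,
  params: { page?: number; limit?: number; search?: string; brand?: string; availability?: string }
): string {
  const qs = new URLSearchParams();
  for (const [key, value] of Object.entries(params)) {
    if (value !== undefined) qs.set(key, String(value));
  }
  const query = qs.toString();
  return query ? `${base}/api/products?${query}` : `${base}/api/products`;
}

const url = productsUrl("http://localhost:4000", { page: 2, limit: 50, search: "widget" });
// "http://localhost:4000/api/products?page=2&limit=50&search=widget"

// Then fetch it (Node >= 18 has global fetch):
// const res = await fetch(url);
// const body = await res.json();
```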
+---------------------------------------------------------------+
|                       TypeScript Layer                        |
|                                                               |
|   CLI --> URL Parser --> Rate Limiter --> Retry Engine        |
|                                               |               |
|                                         Scrape Bridge         |
|                                          (JSON IPC)           |
|                                               |               |
|   Dashboard <-- API Server <-- Store <-- Normalizer           |
|   (Express)     (REST+WS)     (JSON)     (prices, currency)   |
|                                                               |
|   Exporters: XLSX | CSV | JSON | GMC                          |
|   Scheduler: cron jobs (croner)                               |
|   Price Differ: change detection                              |
+-------------------------------+-------------------------------+
                                |
               +----------------+----------------+
               |          Python Engine          |
               |           (Scrapling)           |
               |                                 |
               |   Tier 1: JSON-LD extraction    |
               |   Tier 2: Microdata extraction  |
               |   Tier 3: CSS heuristics        |
               |                                 |
               |   Confidence scoring per field  |
               +---------------------------------+
harvesthub/
├── engine/
│   ├── scraper.py          # Python Scrapling engine (380 lines)
│   └── requirements.txt    # Python dependencies
├── src/
│   ├── types/              # TypeScript interfaces (Product, Job, Schedule)
│   ├── lib/                # Errors, logger, URL parser/validator
│   ├── core/               # Scrape bridge, rate limiter, retry engine,
│   │                       #   UA pool, proxy pool, scheduler, WS broadcast,
│   │                       #   price differ, job runner
│   ├── pipeline/           # Data normalization
│   ├── store/              # JSON file persistence (atomic writes)
│   ├── export/             # XLSX, CSV, JSON, GMC exporters
│   ├── api/                # Express server + Vercel handler
│   ├── cli/                # Commander.js CLI (6 commands)
│   └── __tests__/          # 14 test files, 140 tests
├── dashboard/
│   ├── index.html          # SaaS dashboard
│   ├── docs.html           # Documentation page
│   └── marketing.html      # Landing page
├── api/
│   └── index.ts            # Vercel serverless entry
├── data/
│   ├── store/              # Product database
│   ├── exports/            # Generated files
│   └── logs/               # Application logs
├── package.json
├── tsconfig.json
├── vercel.json             # Vercel deployment config
└── setup.ps1               # One-command bootstrap
| Technology | Role |
|---|---|
| TypeScript | Core application (strict mode) |
| Python + Scrapling | HTTP fetching + HTML parsing engine |
| Node.js | Runtime |
| Express | API server + static file serving |
| ExcelJS | Premium XLSX export |
| Commander.js | CLI framework |
| Croner | Cron job scheduling |
| WebSocket (ws) | Real-time scrape event broadcasting |
| Pino | Structured JSON logging |
| Zod | Runtime type validation |
| Vitest | Testing framework |
| Nanoid | Unique ID generation |
# Run all 140 tests
npx vitest run
# Run with coverage report
npm run test:coverage
# Run with watch mode
npx vitest
# Type check
npx tsc --noEmit

Test coverage:
- Unit tests: errors, normalizer, URL parser/validator, UA pool, proxy pool, rate limiter, retry engine, price differ, WS broadcast, store, scheduler
- Integration tests: API endpoints (products, stats, jobs, schedules, export, price history), export pipeline (all 4 formats)
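To give a flavor of what a normalizer unit test exercises, here is a sketch with a hypothetical `parsePrice` helper; the real normalizer lives in `src/pipeline/` and its behavior may differ:

```typescript
// Hypothetical price normalization: strip currency symbols, handle both
// "1,299.00" (decimal point) and "1.299,00" (decimal comma) styles.
function parsePrice(raw: string): number | null {
  const cleaned = raw.replace(/[^\d.,-]/g, "");
  if (!cleaned) return null;
  // A trailing ",dd" signals a decimal comma (e.g. "1.299,00").
  const normalized = /,\d{2}$/.test(cleaned)
    ? cleaned.replace(/\./g, "").replace(",", ".")
    : cleaned.replace(/,/g, "");
  const value = Number(normalized);
  return Number.isFinite(value) ? value : null;
}

// Vitest-style expectations, shown here as plain assertions:
console.assert(parsePrice("$1,299.00") === 1299);
console.assert(parsePrice("1.299,00 EUR") === 1299);
console.assert(parsePrice("free") === null);
```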
npx tsx src/cli/index.ts dashboard -p 4000

One-click deploy: the dashboard runs in read-only mode on Vercel (scraping endpoints return 503 since the Python engine isn't available in serverless).
Or deploy manually:
npm i -g vercel
vercel deploy

How it works:
- `api/index.ts`: Vercel serverless adapter wrapping the Express app
- `dashboard/`: served as static files
- Scraping endpoints (`POST /api/batch`) return `503 Scraping unavailable in serverless mode`
- All read-only endpoints (products, stats, exports) work normally
Run the entire platform in a container β no Node.js or Python installation required.
# Build and start with Docker Compose
docker compose up --build
# Or build and run standalone
docker build -t harvesthub .
docker run -p 4000:4000 -v ./data:/app/data harvesthub

The ./data directory is mounted as a volume so scraped products persist across container restarts.
Environment variables:
| Variable | Default | Description |
|---|---|---|
| `NODE_ENV` | `production` | Node environment |
| `PORT` | `4000` | API server port |
Install HarvestHub globally for CLI access:
npm install -g harvest-hub

Then use anywhere:
harvest scrape --input "https://example.com/product"
harvest export -f xlsx -o products.xlsx
harvest dashboard -p 4000

import { createServer, loadProducts } from 'harvest-hub';
import type { Product, ScrapeJob } from 'harvest-hub';
// Start the API server programmatically
const app = createServer(4000);
// Load stored products
const products: Product[] = await loadProducts();

npm login
npm publish

The `prepublishOnly` script ensures `npm run build && npm test` passes before every publish.
The extension/ directory contains a Manifest V3 Chrome extension for one-click product data extraction.
- Open Chrome and navigate to `chrome://extensions/`
- Enable Developer mode (toggle in the top right)
- Click Load unpacked
- Select the `extension/` directory from this repo

Then:

- Navigate to any product page
- Click the HarvestHub extension icon
- Click Scrape This Page
- View extracted product data (title, price, availability, confidence)
- Click Save to HarvestHub to persist

Click Settings in the popup to configure:

- API Endpoint: point to your HarvestHub server (default: `http://localhost:4000`)
- Node.js ≥ 18
- Python ≥ 3.9
- pip packages: `scrapling`, `orjson`, `browserforge`
HarvestHub extracts 16+ fields per product, each with individual confidence scores:
| Field | Type | Description |
|---|---|---|
| `title` | string | Product title |
| `price` | number | Current price |
| `currency` | string | ISO currency code |
| `description` | string | Product description |
| `images` | string[] | Image URLs |
| `availability` | enum | `in_stock`, `out_of_stock`, `pre_order`, `unknown` |
| `brand` | string | Brand name |
| `sku` | string | Stock keeping unit |
| `mpn` | string | Manufacturer part number |
| `gtin` | string | Global trade item number |
| `category` | string | Product category |
| `rating` | number | Average rating |
| `reviewCount` | number | Number of reviews |
| `specifications` | object | Key-value specifications |
| `seller` | string | Seller name |
| `shipping` | string | Shipping info |
We welcome contributions! Please see our Contributing Guide for details on:
- Setting up your development environment
- Code style and conventions
- Testing guidelines
- Pull request process
MIT: use it however you want.
Built by HarvestHub