GitHub - GopalGB/web-scrappie: Desktop GUI tool for scraping product images & metadata from e-commerce websites. Reads URLs from spreadsheets/PDFs, visits pages with a real browser, extracts product images and titles, outputs formatted Excel workbooks.

Desktop GUI tool for scraping product images & metadata from e-commerce websites
Give it a spreadsheet of URLs. Get back a beautifully formatted workbook with images, titles, and more.

What's New in v1.2.0

Fresh features to make scraping smoother and faster.

	Feature	Details
📄	CSV & JSON Export	Export results as CSV or JSON alongside the default Excel workbook
🎨	Dark / Light Theme Toggle	Switch between dark and light themes from the GUI
🔍	Preview URL Count	See how many URLs will be scraped before you hit Start
📂	Auto-Open Output	Automatically open the output file when scraping completes
🗑️	Clear Log Button	One-click clear for the log panel
📥	Drag & Drop File Input	Drop your spreadsheet right onto the window -- no file picker needed

Demo

   +-----------------+          +-------------------+          +------------------+
   |                 |          |                   |          |                  |
   |   INPUT FILE    |  -----→ |   web_scrappie    |  -----→ |     OUTPUT       |
   |                 |          |                   |          |                  |
   |  .xlsx / .ods   |          |  Opens browser    |          |  Excel workbook  |
   |  .xls  / .pdf   |          |  Scrolls pages    |          |  CSV / JSON      |
   |                 |          |  Extracts images  |          |  Downloaded imgs |
   +-----------------+          +-------------------+          +------------------+

Input File (URLs by category) → Scrape (automated browser) → Output (formatted workbook + images)

Screenshots

Clean dark UI -- configure settings, pick your file, hit Start

Live progress tracking with color-coded log output

Why?

I got tired of manually copying product images and titles from retail sites. Now I maintain a spreadsheet of URLs and let this handle the rest.

Features

Extraction & Parsing

Smart input parsing -- reads URLs from .ods, .xlsx, .xls, or .pdf files
3 extraction methods -- preloaded state (React/Next.js), DOM links, then standalone images
Deep page scraping -- auto-scrolls, clicks "Load More", waits for lazy content
Deduplication -- results are deduplicated by image URL per category

Output & Performance

Multiple export formats -- Excel, CSV, and JSON
Excel with thumbnails -- embeds scaled images directly into cells
Formatted output -- alternating row colors, frozen headers, auto-filters, hyperlinks
Parallel image downloads -- configurable thread count for fast downloads

Browser & Network

Anti-detection -- uses undetected-chromedriver to bypass bot protection
Handles SSL issues -- works on restricted/corporate networks
Headless mode -- run Chrome invisibly for faster scraping

Developer Experience

Auto-installs dependencies -- just run it, first launch handles everything
Dark / Light theme -- toggle from the GUI
Drag & drop -- drop files directly onto the window
Live logging -- color-coded progress with clear-log support

Quick Start

Prerequisites

Python 3.8+
Google Chrome installed

Run

# Clone
git clone https://github.com/GopalGB/web-scrappie.git
cd web-scrappie

# Run (dependencies install automatically on first launch)
python web_scrappie.py

Or install dependencies manually first:

pip install -r requirements.txt
python web_scrappie.py

Input File Format

Your spreadsheet needs at least two columns. The tool auto-detects which column has categories and which has URLs based on header names.

Category	URL
Shoes	`https://example.com/shoes`
Bags	`https://example.com/bags`
Watches	`https://example.com/watches`

If headers aren't recognized, it assumes column 1 = category, column 2 = URL
Multiple sheets are supported -- each sheet is processed independently
PDFs: extracts every URL found in text and annotations
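
The header-based detection described above could look roughly like the sketch below. It is not taken from web_scrappie.py: the keyword lists and the function names (`find_column`, `detect_columns`) are assumptions for illustration, and only the column 1 / column 2 fallback comes from this README.

```python
from typing import List, Optional, Tuple

# Hypothetical keyword lists; the header names the real tool recognizes
# are not documented in this README.
CATEGORY_HINTS = ("category", "section", "group")
URL_HINTS = ("url", "link", "href")


def find_column(headers: List[str], hints: Tuple[str, ...]) -> Optional[int]:
    """Return the index of the first header containing any hint, else None."""
    for index, header in enumerate(headers):
        name = str(header).strip().lower()
        if any(hint in name for hint in hints):
            return index
    return None


def detect_columns(headers: List[str]) -> Tuple[int, int]:
    """Return (category_index, url_index) as zero-based column positions."""
    category = find_column(headers, CATEGORY_HINTS)
    url = find_column(headers, URL_HINTS)
    if category is None or url is None or category == url:
        # Documented fallback: column 1 = category, column 2 = URL.
        return 0, 1
    return category, url


print(detect_columns(["Product URL", "Category"]))  # headers recognized
print(detect_columns(["A", "B"]))                   # fallback applies
```

With the headers in swapped order the sketch returns `(1, 0)`, and with unrecognized headers it falls back to `(0, 1)`.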

Settings

All configurable from the GUI:

Setting	Default	Description
Max Scrolls	15	Times to scroll down per page (for lazy-loaded content)
Scroll Pause	2.0s	Wait time between scrolls
Page Wait	8s	Wait time after initial page load
DL Threads	8	Parallel threads for image downloading
Headless	Off	Run Chrome invisibly (faster, but some sites block it)
Download Images	Off	Download images locally & embed thumbnails in Excel
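
To show how Max Scrolls and Scroll Pause interact, here is a minimal sketch of a scroll loop. The `ScrapeSettings` dataclass and `scroll_page` function are hypothetical names, not the tool's own code; the defaults mirror the table above. Stopping early once the page height stops growing is an assumption of this sketch, and clicking "Load More" buttons is left out.

```python
import time
from dataclasses import dataclass


@dataclass
class ScrapeSettings:
    """Defaults mirror the settings table; the field names are illustrative."""
    max_scrolls: int = 15
    scroll_pause: float = 2.0   # seconds to wait between scrolls
    page_wait: float = 8.0      # seconds to wait after the initial page load
    dl_threads: int = 8
    headless: bool = False
    download_images: bool = False


def scroll_page(driver, settings: ScrapeSettings) -> int:
    """Scroll to the bottom up to max_scrolls times and return the scroll count.

    `driver` is any object exposing Selenium's `execute_script(script)` method.
    Assumption: the loop stops early when the page height no longer grows.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    for _ in range(settings.max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        time.sleep(settings.scroll_pause)  # give lazy-loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded by the last scroll
        last_height = new_height
    return scrolls
```

With the defaults, a page that keeps growing costs at most 15 scrolls and about 30 seconds of pauses, on top of the 8 second page wait.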

How the Scraper Works

The scraper tries three methods in order, using the first one that returns results:

Preloaded State -- Many React/Next.js sites embed product data in window.__PRELOADED_STATE__ or window.__NEXT_DATA__. The scraper reads this structured data directly, which makes it the fastest and most reliable method.
Product Links -- Finds all <a> tags that contain <img> elements, then takes the image source and uses the alt text as the product title.
Standalone Images -- Falls back to grabbing every <img> with alt text, filtering out icons and tiny spacer images.
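
The "first method that returns results wins" order can be sketched with the standard library alone. Everything below is illustrative rather than the tool's implementation. The function names are hypothetical, the "name" and "image" JSON keys are assumed (real sites use varying schemas), and only the Next.js `__NEXT_DATA__` script tag is handled. The sketch works on static HTML instead of a live browser, and icon and spacer filtering is omitted.

```python
import json
import re
from html.parser import HTMLParser
from typing import Callable, Dict, List, Sequence

Item = Dict[str, str]

# Next.js serializes page data into <script id="__NEXT_DATA__">...</script>.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def from_preloaded_state(html: str) -> List[Item]:
    """Method 1: walk the embedded JSON for dicts with a name and an image."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return []
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []
    items: List[Item] = []
    stack = [data]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            # Assumed schema: {"name": ..., "image": ...}
            if isinstance(node.get("name"), str) and isinstance(node.get("image"), str):
                items.append({"title": node["name"], "image": node["image"]})
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return items


class _ImageCollector(HTMLParser):
    """Collects <img> tags with src and alt, noting which sit inside an <a>."""

    def __init__(self) -> None:
        super().__init__()
        self.link_depth = 0
        self.linked: List[Item] = []
        self.all_images: List[Item] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag == "img":
            attr = dict(attrs)
            if attr.get("src") and attr.get("alt"):
                item = {"title": attr["alt"], "image": attr["src"]}
                self.all_images.append(item)
                if self.link_depth:
                    self.linked.append(item)

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1


def _collect(html: str) -> _ImageCollector:
    parser = _ImageCollector()
    parser.feed(html)
    return parser


def from_product_links(html: str) -> List[Item]:
    """Method 2: images nested inside links."""
    return _collect(html).linked


def from_standalone_images(html: str) -> List[Item]:
    """Method 3: every image that has alt text."""
    return _collect(html).all_images


DEFAULT_METHODS = (from_preloaded_state, from_product_links, from_standalone_images)


def extract(html: str,
            methods: Sequence[Callable[[str], List[Item]]] = DEFAULT_METHODS) -> List[Item]:
    """Try each method in order and return the first non-empty result."""
    for method in methods:
        items = method(html)
        if items:
            return items
    return []
```

Passing the methods in as a sequence keeps the ordering explicit and makes it easy to test one method in isolation.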

Output

The Excel workbook includes:

Summary sheet -- category names, item counts, generation timestamp
One sheet per category -- columns: #, Title, Image URL, Page URL
Embedded thumbnails (if enabled) -- scaled to fit cells without stretching
Styling -- alternating row colors, frozen header rows, auto-filters, clickable hyperlinks

New in v1.2.0: the tool also exports CSV and JSON files alongside the workbook.
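
As a rough picture of the CSV and JSON export, the sketch below writes both files from a category-to-items mapping using only the standard library. The `export_results` function, the field names, and the file naming are assumptions for illustration; the actual layout the tool writes may differ.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List

# Assumed field names, loosely following the workbook columns listed above.
FIELDS = ["category", "title", "image_url", "page_url"]


def export_results(results: Dict[str, List[dict]],
                   out_dir: Path,
                   stem: str = "results") -> List[Path]:
    """Write results to <stem>.csv and <stem>.json in out_dir; return both paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / f"{stem}.csv"
    json_path = out_dir / f"{stem}.json"

    # CSV: one flat row per item, with the category repeated on each row.
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for category, items in results.items():
            for item in items:
                row = {key: item.get(key, "") for key in FIELDS[1:]}
                row["category"] = category
                writer.writerow(row)

    # JSON: keep the nested category -> items structure as-is.
    with json_path.open("w", encoding="utf-8") as handle:
        json.dump(results, handle, indent=2, ensure_ascii=False)

    return [csv_path, json_path]
```

The CSV flattens everything into one table, while the JSON keeps items grouped by category, which is usually the more convenient shape for further scripting.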

Tips

Getting zero results? Uncheck headless mode -- some sites aggressively block headless browsers
SSL errors? The tool handles certificate issues automatically, useful on corporate networks
Small images skipped -- files under 500 bytes (tracking pixels, spacers) are filtered out
Deduplication -- results are deduplicated by image URL within each category
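
The last two tips amount to two small filters, sketched below. The function names and the `image_url` key are hypothetical; the 500-byte threshold and the per-category scope are the only parts taken from this README.

```python
from typing import Dict, Iterable, List

MIN_IMAGE_BYTES = 500  # threshold stated in the tips above


def dedupe_by_image(items: Iterable[dict]) -> List[dict]:
    """Keep the first item seen for each image URL, preserving order."""
    seen = set()
    unique = []
    for item in items:
        url = item["image_url"]
        if url not in seen:
            seen.add(url)
            unique.append(item)
    return unique


def dedupe_per_category(results: Dict[str, List[dict]]) -> Dict[str, List[dict]]:
    """Deduplicate within each category; the same image may appear in two categories."""
    return {category: dedupe_by_image(items) for category, items in results.items()}


def is_real_image(payload: bytes) -> bool:
    """Return False for downloads under 500 bytes (tracking pixels, spacers)."""
    return len(payload) >= MIN_IMAGE_BYTES
```

Because deduplication is scoped to a category, an image listed under both Shoes and Bags is kept once in each sheet.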

Tech Stack

Component	Library
GUI	customtkinter
Browser automation	selenium + undetected-chromedriver
Spreadsheet I/O	pandas + openpyxl + odfpy
PDF parsing	pdfplumber
Image processing	Pillow
HTTP	requests

License

MIT

Made by Gopal Bagaswar
