GopalGB/web-scrappie



Python 3.8+ | MIT License | CustomTkinter | Chrome | GitHub Stars

Desktop GUI tool for scraping product images & metadata from e-commerce websites
Give it a spreadsheet of URLs. Get back a beautifully formatted workbook with product images, titles, and page links.


What's New in v1.2.0

Fresh features to make scraping smoother and faster.

| Feature | Details |
|---|---|
| 📄 CSV & JSON Export | Export results as CSV or JSON alongside the default Excel workbook |
| 🎨 Dark / Light Theme Toggle | Switch between dark and light themes from the GUI |
| 🔍 Preview URL Count | See how many URLs will be scraped before you hit Start |
| 📂 Auto-Open Output | Automatically open the output file when scraping completes |
| 🗑️ Clear Log Button | One-click clear for the log panel |
| 📥 Drag & Drop File Input | Drop your spreadsheet right onto the window -- no file picker needed |

Demo

   +-----------------+          +-------------------+          +------------------+
   |                 |          |                   |          |                  |
   |   INPUT FILE    |  -----→  |   web_scrappie    |  -----→  |     OUTPUT       |
   |                 |          |                   |          |                  |
   |  .xlsx / .ods   |          |  Opens browser    |          |  Excel workbook  |
   |  .xls  / .pdf   |          |  Scrolls pages    |          |  CSV / JSON      |
   |                 |          |  Extracts images  |          |  Downloaded imgs |
   +-----------------+          +-------------------+          +------------------+

Input File (URLs by category) → Scrape (automated browser) → Output (formatted workbook + images)


Screenshots

Main Window

Clean dark UI -- configure settings, pick your file, hit Start

Scraping in Progress

Live progress tracking with color-coded log output


Why?

I got tired of manually copying product images and titles from retail sites. Now I maintain a spreadsheet of URLs and let this handle the rest.


Features

Extraction & Parsing

  • Smart input parsing -- reads URLs from .ods, .xlsx, .xls, or .pdf files
  • 3 extraction methods -- preloaded state (React/Next.js), DOM links, then standalone images
  • Deep page scraping -- auto-scrolls, clicks "Load More", waits for lazy content
  • Deduplication -- results are deduplicated by image URL per category
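
The deduplication step can be pictured roughly as below. This is a minimal sketch, not the tool's actual code: the function name `dedupe_by_image_url` and the `image_url` / `title` keys are hypothetical, and it assumes the first occurrence of each image URL is the one that is kept.

```python
def dedupe_by_image_url(items_by_category):
    """Drop repeated image URLs within each category, keeping the first hit.

    `items_by_category` maps a category name to a list of dicts that each
    carry an "image_url" key (a hypothetical shape used for illustration).
    """
    result = {}
    for category, items in items_by_category.items():
        seen = set()  # image URLs already kept for this category only
        unique = []
        for item in items:
            url = item["image_url"]
            if url in seen:
                continue
            seen.add(url)
            unique.append(item)
        result[category] = unique
    return result
```

Because the `seen` set is reset for every category, the same image may still appear once in two different categories, which matches the "per category" wording above.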

Output & Performance

  • Multiple export formats -- Excel, CSV, and JSON
  • Excel with thumbnails -- embeds scaled images directly into cells
  • Formatted output -- alternating row colors, frozen headers, auto-filters, hyperlinks
  • Parallel image downloads -- configurable thread count for fast downloads

Browser & Network

  • Anti-detection -- uses undetected-chromedriver to bypass bot protection
  • Handles SSL issues -- works on restricted/corporate networks
  • Headless mode -- run Chrome invisibly for faster scraping

Developer Experience

  • Auto-installs dependencies -- just run it, first launch handles everything
  • Dark / Light theme -- toggle from the GUI
  • Drag & drop -- drop files directly onto the window
  • Live logging -- color-coded progress with clear-log support

Quick Start

Prerequisites

  • Python 3.8+
  • Google Chrome installed

Run

# Clone
git clone https://github.com/GopalGB/web-scrappie.git
cd web-scrappie

# Run (dependencies install automatically on first launch)
python web_scrappie.py

Or install dependencies manually first:

pip install -r requirements.txt
python web_scrappie.py

Input File Format

Your spreadsheet needs at least two columns. The tool auto-detects which column has categories and which has URLs based on header names.

| Category | URL |
|---|---|
| Shoes | https://example.com/shoes |
| Bags | https://example.com/bags |
| Watches | https://example.com/watches |

  • If headers aren't recognized, it assumes column 1 = category, column 2 = URL
  • Multiple sheets are supported -- each sheet is processed independently
  • PDFs: extracts every URL found in text and annotations
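
Header-based column detection with the positional fallback could look something like this. It is only a sketch: the keyword tuples and the name `detect_columns` are assumptions, since the README does not list which header names the tool actually recognizes.

```python
# Hypothetical keyword lists; the real tool may recognize different headers.
URL_HINTS = ("url", "link", "href")
CATEGORY_HINTS = ("category", "section", "group")


def detect_columns(headers):
    """Return (category_index, url_index) for a header row.

    Falls back to column 1 = category, column 2 = URL (indexes 0 and 1)
    when either header is not recognized.
    """
    cat_idx = url_idx = None
    for i, header in enumerate(headers):
        name = str(header).strip().lower()
        if url_idx is None and any(hint in name for hint in URL_HINTS):
            url_idx = i
        elif cat_idx is None and any(hint in name for hint in CATEGORY_HINTS):
            cat_idx = i
    if cat_idx is None or url_idx is None:
        return 0, 1
    return cat_idx, url_idx
```

For example, `detect_columns(["Link", "Category"])` picks the swapped layout, while an unlabeled sheet falls back to the first two columns.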

Settings

All configurable from the GUI:

| Setting | Default | Description |
|---|---|---|
| Max Scrolls | 15 | Times to scroll down per page (for lazy-loaded content) |
| Scroll Pause | 2.0s | Wait time between scrolls |
| Page Wait | 8s | Wait time after initial page load |
| DL Threads | 8 | Parallel threads for image downloading |
| Headless | Off | Run Chrome invisibly (faster, but some sites block it) |
| Download Images | Off | Download images locally & embed thumbnails in Excel |
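
To show how these settings interact, here is a rough sketch of a scroll loop driven by them. The `ScrapeSettings` dataclass, its field names, and `scroll_page` are hypothetical; only the default values come from the table above. `driver` is assumed to be a Selenium WebDriver (only its `execute_script` method is used), and stopping early when the page height stops growing is an assumption, not documented behavior.

```python
import time
from dataclasses import dataclass


@dataclass
class ScrapeSettings:
    # Field names are hypothetical; defaults mirror the settings table.
    max_scrolls: int = 15
    scroll_pause: float = 2.0   # seconds between scrolls
    page_wait: float = 8.0      # seconds after initial page load
    dl_threads: int = 8
    headless: bool = False
    download_images: bool = False


def scroll_page(driver, settings, sleep=time.sleep):
    """Scroll to the bottom up to `max_scrolls` times; return scrolls done."""
    sleep(settings.page_wait)  # let the initial page load settle
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    for _ in range(settings.max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        sleep(settings.scroll_pause)  # give lazy content time to load
        height = driver.execute_script("return document.body.scrollHeight")
        if height == last_height:
            # Assumption: no new content appeared, so stop early.
            break
        last_height = height
    return scrolls
```

With the defaults, a page that keeps growing costs at most 8 + 15 × 2.0 = 38 seconds of waiting, so lowering Max Scrolls or Scroll Pause is the quickest way to speed up a run.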

How the Scraper Works

The scraper tries three methods in order, using the first one that returns results:

  1. Preloaded State -- Many React/Next.js sites embed product data in window.__PRELOADED_STATE__ or window.__NEXT_DATA__. The scraper pulls the structured data directly from these objects, which makes this the fastest and most reliable method.

  2. Product Links -- Finds all <a> tags that contain <img> elements and extracts each image's source URL, using its alt text as the product title.

  3. Standalone Images -- Falls back to grabbing every <img> with alt text, filtering out icons and tiny spacer images.
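
The "first method that returns results wins" chain above can be sketched with the standard library alone. Everything here is illustrative: the function names are hypothetical, method 1 only looks at a Next.js-style `__NEXT_DATA__` script tag and assumes products are dicts with `title` and `image` keys (real sites vary), and the icon/spacer filter from method 3 is left out for brevity.

```python
import json
import re
from html.parser import HTMLParser

NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def _walk(node):
    """Yield every dict nested anywhere inside a decoded JSON value."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from _walk(value)
    elif isinstance(node, list):
        for value in node:
            yield from _walk(value)


def from_preloaded_state(html):
    """Method 1: pull product-like dicts out of the embedded JSON blob."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return []
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []
    # Assumption: a product is any dict with string "title" and "image" keys.
    return [
        {"title": d["title"], "image_url": d["image"]}
        for d in _walk(data)
        if isinstance(d.get("title"), str) and isinstance(d.get("image"), str)
    ]


class _ImageCollector(HTMLParser):
    """Record every <img> with src and alt, noting whether it sits in an <a>."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0
        self.images = []  # list of (item, inside_link) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag == "img":
            attr = dict(attrs)
            src, alt = attr.get("src"), (attr.get("alt") or "").strip()
            if src and alt:
                item = {"title": alt, "image_url": src}
                self.images.append((item, self.link_depth > 0))

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1


def _collect(html):
    parser = _ImageCollector()
    parser.feed(html)
    return parser.images


def from_product_links(html):
    """Method 2: images wrapped in a link, alt text used as the title."""
    return [item for item, inside_link in _collect(html) if inside_link]


def from_standalone_images(html):
    """Method 3: every image that has alt text, linked or not."""
    return [item for item, _ in _collect(html)]


def extract_products(html):
    """Try each method in order; return (method_name, items) for the first hit."""
    for method in (from_preloaded_state, from_product_links, from_standalone_images):
        items = method(html)
        if items:
            return method.__name__, items
    return None, []
```

Returning the method name alongside the items makes it easy to log which strategy worked for a given site, which helps when a page unexpectedly falls through to the noisier image-only fallback.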


Output

The Excel workbook includes:

  • Summary sheet -- category names, item counts, generation timestamp
  • One sheet per category -- columns: #, Title, Image URL, Page URL
  • Embedded thumbnails (if enabled) -- scaled to fit cells without stretching
  • Styling -- alternating row colors, frozen header rows, auto-filters, clickable hyperlinks

New in v1.2.0: also exports CSV and JSON files alongside the workbook.
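
For a sense of what the CSV and JSON side-by-side export might involve, here is a small standard-library sketch. The file names, the flat CSV layout with an extra Category column, and the `title` / `image_url` / `page_url` keys are all assumptions; the README does not describe the actual file layout.

```python
import csv
import json
from pathlib import Path

# Column names taken from the per-category sheet description above.
COLUMNS = ["#", "Title", "Image URL", "Page URL"]


def export_csv_json(items_by_category, out_dir, stem="results"):
    """Write <stem>.csv and <stem>.json into out_dir; return both paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / f"{stem}.csv"
    json_path = out_dir / f"{stem}.json"

    # One flat CSV: a Category column replaces the per-category sheets.
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Category"] + COLUMNS)
        for category, items in items_by_category.items():
            for number, item in enumerate(items, start=1):
                writer.writerow(
                    [category, number, item["title"], item["image_url"], item["page_url"]]
                )

    # JSON keeps the nested category -> items structure as-is.
    with json_path.open("w", encoding="utf-8") as fh:
        json.dump(items_by_category, fh, indent=2, ensure_ascii=False)

    return csv_path, json_path
```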


Tips

  • Getting zero results? Uncheck headless mode -- some sites aggressively block headless browsers
  • SSL errors? The tool handles certificate issues automatically, useful on corporate networks
  • Small images skipped -- files under 500 bytes (tracking pixels, spacers) are filtered out
  • Deduplication -- results are deduplicated by image URL within each category
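
The parallel download and the 500-byte filter mentioned above fit together roughly like this. It is a sketch under assumptions: `download_images` and its `fetch` parameter are hypothetical names, and the fetch callable is injected so the example stays independent of any HTTP library. With `requests`, something like `lambda url: requests.get(url, timeout=15).content` could be passed in.

```python
from concurrent.futures import ThreadPoolExecutor

MIN_IMAGE_BYTES = 500  # smaller files are treated as tracking pixels/spacers


def download_images(urls, fetch, threads=8):
    """Fetch URLs in parallel and return {url: bytes} for usable images.

    `fetch` is any callable that takes a URL and returns the response body
    as bytes. Failed downloads and bodies under 500 bytes are dropped.
    """
    urls = list(urls)

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            return None  # one bad URL should not abort the whole batch

    with ThreadPoolExecutor(max_workers=threads) as pool:
        bodies = list(pool.map(safe_fetch, urls))  # results keep input order

    return {
        url: body
        for url, body in zip(urls, bodies)
        if body is not None and len(body) >= MIN_IMAGE_BYTES
    }
```

The `threads` argument corresponds to the DL Threads setting; since downloads are I/O-bound, a thread pool is usually enough and avoids the overhead of separate processes.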

Tech Stack

| Component | Library |
|---|---|
| GUI | customtkinter |
| Browser automation | selenium + undetected-chromedriver |
| Spreadsheet I/O | pandas + openpyxl + odfpy |
| PDF parsing | pdfplumber |
| Image processing | Pillow |
| HTTP | requests |

License

MIT


Made by Gopal Bagaswar
