Desktop GUI tool for scraping product images & metadata from e-commerce websites
Give it a spreadsheet of URLs. Get back a beautifully formatted workbook with images, titles, and more.
Fresh features to make scraping smoother and faster.
| Feature | Details |
|---|---|
| CSV & JSON Export | Export results as CSV or JSON alongside the default Excel workbook |
| Dark / Light Theme Toggle | Switch between dark and light themes from the GUI |
| Preview URL Count | See how many URLs will be scraped before you hit Start |
| Auto-Open Output | Automatically open the output file when scraping completes |
| Clear Log Button | One-click clear for the log panel |
| Drag & Drop File Input | Drop your spreadsheet right onto the window -- no file picker needed |
```text
+-----------------+        +-------------------+        +------------------+
|                 |        |                   |        |                  |
|   INPUT FILE    | -----> |   web_scrappie    | -----> |      OUTPUT      |
|                 |        |                   |        |                  |
|  .xlsx / .ods   |        |  Opens browser    |        |  Excel workbook  |
|  .xls / .pdf    |        |  Scrolls pages    |        |  CSV / JSON      |
|                 |        |  Extracts images  |        |  Downloaded imgs |
+-----------------+        +-------------------+        +------------------+
```
Input File (URLs by category) → Scrape (automated browser) → Output (formatted workbook + images)
Clean dark UI -- configure settings, pick your file, hit Start
Live progress tracking with color-coded log output
I got tired of manually copying product images and titles from retail sites. Now I maintain a spreadsheet of URLs and let this handle the rest.
Feature areas: Extraction & Parsing, Output & Performance, Browser & Network, Developer Experience.
- Python 3.8+
- Google Chrome installed
```bash
# Clone
git clone https://github.com/GopalGB/web-scrappie.git
cd web-scrappie

# Run (dependencies install automatically on first launch)
python web_scrappie.py
```

Or install dependencies manually first:

```bash
pip install -r requirements.txt
python web_scrappie.py
```

Your spreadsheet needs at least two columns. The tool auto-detects which column has categories and which has URLs based on header names.
| Category | URL |
|---|---|
| Shoes | https://example.com/shoes |
| Bags | https://example.com/bags |
| Watches | https://example.com/watches |
- If headers aren't recognized, it assumes column 1 = category, column 2 = URL
- Multiple sheets are supported -- each sheet is processed independently
- PDFs: extracts every URL found in text and annotations
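
The header detection and its positional fallback can be pictured with a small sketch. The keyword lists and the `detect_columns` name are assumptions made for illustration; the tool's actual matching rules may differ.

```python
from typing import List, Tuple

# Hypothetical keyword lists; the real tool may recognize different header names.
CATEGORY_HINTS = ("category", "categories", "group")
URL_HINTS = ("url", "link", "href", "website")


def detect_columns(headers: List[str]) -> Tuple[int, int]:
    """Return (category_index, url_index) for a header row.

    Falls back to column 1 = category, column 2 = URL (indexes 0 and 1)
    when either header is not recognized.
    """
    category_idx = url_idx = None
    for idx, header in enumerate(headers):
        name = str(header).strip().lower()
        if url_idx is None and any(hint in name for hint in URL_HINTS):
            url_idx = idx
        elif category_idx is None and any(hint in name for hint in CATEGORY_HINTS):
            category_idx = idx
    if category_idx is None or url_idx is None:
        return 0, 1  # positional fallback described above
    return category_idx, url_idx


print(detect_columns(["Product URL", "Category"]))  # (1, 0)
print(detect_columns(["A", "B"]))                   # (0, 1)
```

Matching on substrings keeps headers such as "Product URL" or "Page Link" working without an exact-name list, at the cost of occasional false positives on unusual headers.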
All configurable from the GUI:
| Setting | Default | Description |
|---|---|---|
| Max Scrolls | 15 | Times to scroll down per page (for lazy-loaded content) |
| Scroll Pause | 2.0s | Wait time between scrolls |
| Page Wait | 8s | Wait time after initial page load |
| DL Threads | 8 | Parallel threads for image downloading |
| Headless | Off | Run Chrome invisibly (faster, but some sites block it) |
| Download Images | Off | Download images locally & embed thumbnails in Excel |
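
To show how these settings might interact, here is a minimal sketch of a settings object and a scroll loop. The `ScrapeSettings` field names and the `scroll_page` helper are hypothetical, and stopping early when the page height stops growing is an assumption rather than documented behavior. The only browser call used is Selenium's `execute_script`.

```python
import time
from dataclasses import dataclass


@dataclass
class ScrapeSettings:
    """Defaults mirror the settings table; the field names are hypothetical."""
    max_scrolls: int = 15
    scroll_pause: float = 2.0   # seconds between scrolls
    page_wait: float = 8.0      # seconds after initial page load
    dl_threads: int = 8
    headless: bool = False
    download_images: bool = False


def scroll_page(driver, settings: ScrapeSettings, sleep=time.sleep) -> int:
    """Scroll to the bottom up to max_scrolls times; return scrolls performed.

    `driver` is any object exposing Selenium's execute_script(script) method.
    """
    sleep(settings.page_wait)
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    for _ in range(settings.max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(settings.scroll_pause)
        scrolls += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # assumed early exit: no new lazy-loaded content appeared
        last_height = new_height
    return scrolls
```

Under these assumptions, the worst case per page is roughly `page_wait + max_scrolls * scroll_pause` seconds, or 8 + 15 × 2.0 = 38 seconds with the defaults.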
The scraper tries three methods in order, using the first one that returns results:
- Preloaded State -- Many React/Next.js sites embed product data in `window.__PRELOADED_STATE__` or `window.__NEXT_DATA__`. Pulls structured data directly. Fastest and most reliable.
- Product Links -- Finds all `<a>` tags containing `<img>` elements. Extracts image source and alt text as the product title.
- Standalone Images -- Falls back to grabbing every `<img>` with alt text, filtering out icons and tiny spacer images.
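
The first-non-empty-result fallback can be sketched with the standard library alone. Everything here is illustrative: the function names, the `{"products": [...]}` state layout, and the icon keyword heuristic are assumptions, and the real tool reads pages through a browser rather than from an HTML string.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple

Product = Dict[str, str]
ICON_WORDS = ("icon", "spacer", "pixel")  # hypothetical filter keywords


class _ImageCollector(HTMLParser):
    """Records every <img> with a src, noting whether it sits inside an <a>."""

    def __init__(self) -> None:
        super().__init__()
        self._link_depth = 0
        self.images: List[Tuple[str, str, bool]] = []  # (alt, src, inside_link)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1
        elif tag == "img":
            attr = dict(attrs)
            if attr.get("src"):
                self.images.append(
                    (attr.get("alt") or "", attr["src"], self._link_depth > 0)
                )

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1


def _collect(html: str) -> List[Tuple[str, str, bool]]:
    parser = _ImageCollector()
    parser.feed(html)
    parser.close()
    return parser.images


def from_preloaded_state(state: Optional[dict], html: str) -> List[Product]:
    # Assumes a made-up layout: {"products": [{"title": ..., "image": ...}]}.
    products = (state or {}).get("products", [])
    return [{"title": p["title"], "image_url": p["image"]} for p in products]


def from_product_links(state: Optional[dict], html: str) -> List[Product]:
    return [{"title": alt, "image_url": src}
            for alt, src, linked in _collect(html) if linked]


def from_standalone_images(state: Optional[dict], html: str) -> List[Product]:
    return [{"title": alt, "image_url": src}
            for alt, src, _ in _collect(html)
            if alt and not any(word in src.lower() for word in ICON_WORDS)]


def extract_products(state: Optional[dict], html: str) -> List[Product]:
    """Try each method in order and return the first non-empty result."""
    for method in (from_preloaded_state, from_product_links, from_standalone_images):
        results = method(state, html)
        if results:
            return results
    return []
```

Keeping all three extractors behind the same `(state, html)` signature lets the fallback order be expressed as a plain tuple, so reordering or adding a method is a one-line change.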
The Excel workbook includes:
- Summary sheet -- category names, item counts, generation timestamp
- One sheet per category -- columns: #, Title, Image URL, Page URL
- Embedded thumbnails (if enabled) -- scaled to fit cells without stretching
- Styling -- alternating row colors, frozen header rows, auto-filters, clickable hyperlinks
New in v1.2.0: also exports CSV and JSON files alongside the workbook.
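
For readers who want to post-process results themselves, the CSV and JSON side of the export might look like the sketch below. The `export_results` name, the `FIELDS` column names, and the file stem are assumptions; only the standard-library `csv` and `json` modules are used.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List

FIELDS = ["category", "title", "image_url", "page_url"]  # assumed column names


def export_results(rows: List[Dict[str, str]], out_dir: str,
                   stem: str = "results") -> List[Path]:
    """Write rows to <stem>.csv and <stem>.json in out_dir; return both paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    csv_path, json_path = out / f"{stem}.csv", out / f"{stem}.json"

    # newline="" avoids blank lines on Windows; utf-8-sig helps Excel detect UTF-8.
    with csv_path.open("w", newline="", encoding="utf-8-sig") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)

    with json_path.open("w", encoding="utf-8") as fh:
        json.dump(rows, fh, ensure_ascii=False, indent=2)

    return [csv_path, json_path]
```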
- Getting zero results? Uncheck headless mode -- some sites aggressively block headless browsers
- SSL errors? The tool handles certificate issues automatically, useful on corporate networks
- Small images skipped -- files under 500 bytes (tracking pixels, spacers) are filtered out
- Deduplication -- results are deduplicated by image URL within each category
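
The last two notes are simple to picture in code. This is a minimal sketch, assuming each result is a dict with an `image_url` key; the helper names are hypothetical.

```python
from typing import Dict, Iterable, List

MIN_IMAGE_BYTES = 500  # downloads smaller than this are treated as pixels/spacers


def dedupe_by_image_url(items: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first item for each image URL, preserving order."""
    seen = set()
    unique = []
    for item in items:
        url = item["image_url"]
        if url not in seen:
            seen.add(url)
            unique.append(item)
    return unique


def is_real_image(content: bytes) -> bool:
    """Return False for downloads under the 500-byte threshold."""
    return len(content) >= MIN_IMAGE_BYTES


# Deduplication runs per category, so the same image may appear in two categories.
by_category = {
    "Shoes": [{"title": "Red Shoe", "image_url": "https://example.com/1.jpg"},
              {"title": "Red Shoe (dup)", "image_url": "https://example.com/1.jpg"}],
    "Bags": [{"title": "Tote", "image_url": "https://example.com/1.jpg"}],
}
cleaned = {cat: dedupe_by_image_url(items) for cat, items in by_category.items()}
```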
| Component | Library |
|---|---|
| GUI | customtkinter |
| Browser automation | selenium + undetected-chromedriver |
| Spreadsheet I/O | pandas + openpyxl + odfpy |
| PDF parsing | pdfplumber |
| Image processing | Pillow |
| HTTP | requests |
Made by Gopal Bagaswar

