Welcome to Advanced Web Scraper, your ultimate companion for web data extraction! We provide flexible and powerful solutions tailored to meet various web scraping requirements. Our toolkit consists of two robust scripts: GeneralScraper.py and AdvancedScraper.py. This README will introduce you to their unique features and capabilities.
Here's a feature comparison between the two scripts in a tabular format:
| Feature / Aspect | GeneralScraper.py | AdvancedScraper.py |
|---|---|---|
| Framework | Requests & Beautiful Soup | Playwright |
| Network Resilience | Exponential Backoff | Handled by Playwright |
| Scraping Method | HTML Parser | Browser Automation |
| Relative Links Handling | urljoin() Function | Built-in Functionality |
| Real-time Rendering | No | Yes |
| JavaScript Support | Limited | Full |
| Data Types | Posts, Links, Texts, Query | Enumerated types |
| Custom Tag Selection | No | Yes |
| Error Handling | Print Statements | Print Statements |
| Result Display | Console Printing | Console Printing |
| Saving Results | Not Implemented | Multiple Formats Supported |
| Usage | Simple, Less Powerful | More Complex, More Powerful |
GeneralScraper.py is our beginner-friendly scraper, designed for individuals new to web data extraction. It leverages popular libraries such as `requests`, `BeautifulSoup`, and `urllib` to deliver essential functionality while keeping things simple.
Features:
- User-friendly interface guiding users through each step
- Four primary data extraction methods:
  - Extract post contents (`<p>` tags)
  - Collect internal and external links (`<a>` tags with `href` attributes)
  - Aggregate all visible text contents
  - Perform query-based searches across the entire website
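The extraction methods above boil down to walking the parsed HTML tree and collecting matching elements. GeneralScraper.py uses `requests` and `BeautifulSoup` for this; the sketch below illustrates the same idea (post text from `<p>` tags, absolute links via `urljoin()`) using only Python's standard-library `html.parser`, so it runs without third-party packages. The class name `SimpleExtractor` and the sample HTML are illustrative, not taken from the script.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class SimpleExtractor(HTMLParser):
    """Collects <p> text and <a href> links, resolving relative URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.posts = []   # text content of <p> tags
        self.links = []   # absolute hrefs from <a> tags
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # urljoin() turns relative links into absolute ones
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and data.strip():
            self.posts.append(data.strip())

html = '<p>Hello world</p><a href="/about">About</a><p>Second post</p>'
parser = SimpleExtractor("https://example.com/blog/")
parser.feed(html)
print(parser.posts)  # ['Hello world', 'Second post']
print(parser.links)  # ['https://example.com/about']
```

A query-based search is then just a filter over the collected text, e.g. `[t for t in parser.posts if query.lower() in t.lower()]`.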
Getting Started:
- Make sure you have Python installed.
- Install necessary packages: `pip install requests beautifulsoup4`
- Run the script: `python GeneralScraper.py`
Required Packages: `requests`, `beautifulsoup4`
Usage:
- Enter the target URL when prompted, making sure it starts with 'http://' or 'https://'.
- Choose a data extraction method based on the provided options:
  - 0: Extract post contents (typically `<p>` tags)
  - 1: Gather internal & external links (`<a>` tags with `href` attribute)
  - 2: Collect all visible text contents
  - 3: Conduct a query-based search across the whole website; enter the desired query when requested
- Review the extracted results corresponding to your selection. If no matches are found, don't worry—helpful messages will be displayed!
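The feature table above lists exponential backoff as GeneralScraper.py's network-resilience strategy. The exact retry count and delays in the script are not documented here, so the sketch below is a generic version: a wrapper that retries a flaky fetch callable (standing in for `requests.get`) with delays that double on each failed attempt.

```python
import time

def fetch_with_backoff(fetch, url, retries=4, base_delay=1.0):
    """Retry a flaky fetch callable with exponentially growing delays.

    Delays between attempts are base_delay * 2**attempt: 1s, 2s, 4s, ...
    The last failure is re-raised so callers still see the error.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Demo with a fake fetch that fails twice before succeeding.
calls = {"n": 0}

def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "<html>ok</html>"

print(fetch_with_backoff(flaky, "https://example.com", base_delay=0.01))
```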
Happy responsible scraping! Remember always to abide by site owners' terms and conditions. Explore wisely! 😊✨
AdvancedScraper.py unlocks advanced web scraping capabilities with a solution driven by the acclaimed Playwright library. Crafted for seasoned developers demanding fine-grained project control, this script offers superior performance and versatility.
Features ✨:
- Five comprehensive data extraction techniques:
  - Extract posts (`DataType.POSTS`)
  - Extract all links (`DataType.LINKS`)
  - Extract all texts (`DataType.ALL_TEXTS`)
  - Search by query (`DataType.SEARCH_QUERY`)
  - Custom tag extraction (`DataType.CUSTOM_TAG`)
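The `DataType` names come from the feature list above, but how AdvancedScraper.py maps them to page elements is not shown here. One plausible design is an enum plus a selector table, with the custom-tag case taking the selector from the user; the `SELECTORS` mapping and `scrape()` signature below are assumptions for illustration. Running `scrape()` requires Playwright and its browsers to be installed (`playwright install`).

```python
from enum import Enum

class DataType(Enum):
    POSTS = "posts"
    LINKS = "links"
    ALL_TEXTS = "all_texts"
    SEARCH_QUERY = "search_query"
    CUSTOM_TAG = "custom_tag"

# Hypothetical mapping from data type to a CSS selector Playwright can query.
SELECTORS = {
    DataType.POSTS: "p",
    DataType.LINKS: "a[href]",
    DataType.ALL_TEXTS: "body",
}

def scrape(url, data_type, custom_tag=None):
    """Launch a headless browser and return the inner text of matches."""
    from playwright.sync_api import sync_playwright  # imported lazily

    selector = custom_tag if data_type is DataType.CUSTOM_TAG else SELECTORS[data_type]
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)  # the real browser executes JavaScript here
        texts = page.locator(selector).all_inner_texts()
        browser.close()
        return texts
```

Because a real browser renders the page, JavaScript-generated content is present before the selector runs, which is what gives this approach its "Full" JavaScript support in the comparison table.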
- Multiple output formats available:
  - Print (`OutputFormat.PRINT`)
  - Text file (`OutputFormat.TEXT_FILE`)
  - JSON (`OutputFormat.JSON`)
  - CSV (`OutputFormat.CSV`)
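Supporting multiple output formats usually means one serialization step that branches on the chosen format. The `OutputFormat` names come from the list above; the `render()` helper and its behavior are a sketch of how such a dispatch could look, returning the payload that would be printed or written to disk.

```python
import csv
import io
import json
from enum import Enum

class OutputFormat(Enum):
    PRINT = "print"
    TEXT_FILE = "text"
    JSON = "json"
    CSV = "csv"

def render(results, fmt):
    """Serialize a list of scraped strings for the chosen output format."""
    if fmt is OutputFormat.PRINT:
        return "\n".join(results)
    if fmt is OutputFormat.TEXT_FILE:
        return "\n".join(results) + "\n"
    if fmt is OutputFormat.JSON:
        return json.dumps({"results": results}, indent=2)
    if fmt is OutputFormat.CSV:
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["result"])       # header row
        writer.writerows([r] for r in results)
        return buf.getvalue()

print(render(["first post", "second post"], OutputFormat.JSON))
```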
Requirements 📋:
- Python 3.7 or higher
- `playwright` 1.41.1 package installed
Setup Instructions:
- Fulfill the prerequisites:
- Python installation
- Internet access for downloading the required package
- Install the Playwright library: once inside the activated virtual environment, run `pip install playwright`
- Install required browsers: with the playwright library installed, include the supported browsers via `playwright install`
- Run the script: execute AdvancedScraper.py within the active virtual environment: `python AdvancedScraper.py`

We appreciate any contributions towards improving our web scraping tools! Please familiarize yourself with our contribution guidelines before getting started.
This project is distributed under the MIT License.