PAA Scraper - People Also Ask Questions Extractor

A professional Python tool for extracting "People Also Ask" (PAA) questions from Google search results. This tool provides both a command-line interface and a REST API for easy integration into your workflows.

Features

  • Simple CLI Interface - Easy-to-use command-line tool for quick queries
  • REST API - Flask-based API with proper authentication and error handling
  • Batch Processing - Process multiple queries efficiently
  • Configurable - Environment-based configuration management
  • Type-Safe - Full type hints for better code quality
  • Well-Tested - Comprehensive unit tests included
  • Docker Support - Ready-to-deploy Docker configuration
  • Logging - Built-in logging for debugging and monitoring

Installation

From Source

# Clone the repository
git clone https://github.com/your-username/paa-scraper.git
cd paa-scraper

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements-new.txt

# Install the package
pip install -e .

Using Docker

# Build the image
docker build -f Dockerfile.new -t paa-scraper .

# Run the container
docker run -p 5000:5000 -e PAA_API_KEY=your-secret-key paa-scraper

Using Docker Compose

# Set your API key in .env file
echo "PAA_API_KEY=your-secret-key" > .env

# Start the service
docker-compose up -d

Quick Start

Command Line Usage

# Scrape questions for a single query
paa-scraper scrape "what is python programming"

# Specify country and language
paa-scraper scrape "best restaurants" --country uk --language en

# Save results to a file
paa-scraper scrape "machine learning" --output results.json

# Batch process multiple queries from a file
paa-scraper batch queries.txt --output batch_results.json

Python Library Usage

from paa_scraper import scrape_related_questions

# Scrape questions
questions = scrape_related_questions(
    query="artificial intelligence",
    country="us",
    language="en"
)

print(f"Found {len(questions)} questions:")
for question in questions:
    print(f"  - {question}")
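The batch-processing feature can also be approximated directly in library code. The sketch below is hypothetical (only `scrape_related_questions` comes from the package); the scrape function is passed in as a parameter, so any callable with the same signature works, and a delay between requests keeps the scraper polite:

```python
# Hypothetical batch helper built on top of the library call above.
# Only scrape_related_questions is part of the package; this wrapper is illustrative.
import time


def batch_scrape(queries, scrape_fn, country="us", language="en", delay=1.0):
    """Scrape each query in turn, pausing between requests to avoid rate limits."""
    results = {}
    for i, query in enumerate(queries):
        if i and delay:
            time.sleep(delay)  # no pause before the first request
        results[query] = scrape_fn(query=query, country=country, language=language)
    return results
```

Usage: `batch_scrape(["python", "rust"], scrape_related_questions)` returns a dict mapping each query to its question list.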

REST API Usage

Start the API Server

# Set your API key
export PAA_API_KEY=your-secret-key

# Run the server
paa-api

API Endpoints

Health Check

curl http://localhost:5000/

Scrape Single Query (GET)

curl "http://localhost:5000/api/v1/scrape?query=python&api_key=your-secret-key"

Scrape Single Query (POST)

curl -X POST http://localhost:5000/api/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning",
    "country": "us",
    "language": "en",
    "api_key": "your-secret-key"
  }'

Batch Scrape

curl -X POST http://localhost:5000/api/v1/batch \
  -H "Content-Type: application/json" \
  -d '{
    "queries": ["python", "javascript", "rust"],
    "country": "us",
    "language": "en",
    "api_key": "your-secret-key"
  }'
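The same batch request can be made from Python using only the standard library. This is a minimal client sketch, assuming the server is running on localhost:5000; the helper names (`build_batch_payload`, `batch_scrape_api`) are illustrative, not part of the package:

```python
# Hypothetical stdlib-only client for the batch endpoint shown above.
import json
from urllib import request

API_URL = "http://localhost:5000/api/v1/batch"  # adjust host/port as needed


def build_batch_payload(queries, api_key, country="us", language="en"):
    """Assemble the JSON body expected by the batch endpoint."""
    return {
        "queries": list(queries),
        "country": country,
        "language": language,
        "api_key": api_key,
    }


def batch_scrape_api(queries, api_key):
    """POST the queries and return the decoded JSON response."""
    body = json.dumps(build_batch_payload(queries, api_key)).encode("utf-8")
    req = request.Request(API_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```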

Configuration

Configuration is managed through environment variables. Copy .env.example to .env and customize the values:

cp .env.example .env

Environment Variables

Variable               Description                   Default
PAA_DEFAULT_COUNTRY    Default country code          us
PAA_DEFAULT_LANGUAGE   Default language code         en
PAA_REQUEST_TIMEOUT    Request timeout in seconds    10
PAA_API_HOST           API server host               0.0.0.0
PAA_API_PORT           API server port               5000
PAA_API_KEY            API authentication key        None
PAA_LOG_LEVEL          Logging level                 INFO
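One common way such variables are consumed is a settings object with fallbacks matching the defaults in the table. This is a sketch only; the actual config.py may be structured differently:

```python
# Minimal sketch of environment-based configuration, mirroring the table above.
# Note: defaults are read once, at class-definition time.
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    default_country: str = os.getenv("PAA_DEFAULT_COUNTRY", "us")
    default_language: str = os.getenv("PAA_DEFAULT_LANGUAGE", "en")
    request_timeout: int = int(os.getenv("PAA_REQUEST_TIMEOUT", "10"))
    api_host: str = os.getenv("PAA_API_HOST", "0.0.0.0")
    api_port: int = int(os.getenv("PAA_API_PORT", "5000"))
    api_key: Optional[str] = os.getenv("PAA_API_KEY")
    log_level: str = os.getenv("PAA_LOG_LEVEL", "INFO")
```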

Project Structure

paa-scraper/
├── src/
│   └── paa_scraper/
│       ├── __init__.py           # Package initialization
│       ├── scraper.py            # Core scraping logic
│       ├── text_utils.py         # Text processing utilities
│       ├── config.py             # Configuration management
│       ├── api/
│       │   ├── __init__.py
│       │   └── flask_app.py      # Flask REST API
│       └── cli/
│           ├── __init__.py
│           └── main.py           # CLI interface
├── tests/
│   ├── __init__.py
│   ├── test_scraper.py           # Scraper tests
│   └── test_api.py               # API tests
├── .env.example                  # Example environment config
├── .gitignore                    # Git ignore rules
├── docker-compose.yml            # Docker Compose config
├── Dockerfile.new                # Docker configuration
├── README.md                     # This file
├── requirements-new.txt          # Production dependencies
├── requirements-dev.txt          # Development dependencies
└── setup.py                      # Package setup

Development

Setup Development Environment

# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
pytest

# Run tests with coverage
pytest --cov=src/paa_scraper tests/

# Format code
black src/ tests/

# Lint code
flake8 src/ tests/
pylint src/

Running Tests

# Run all tests
python -m pytest

# Run specific test file
python -m pytest tests/test_scraper.py

# Run with verbose output
python -m pytest -v

# Run with coverage report
python -m pytest --cov=src/paa_scraper --cov-report=html

API Response Format

Successful Response

{
  "success": true,
  "data": {
    "query": "python programming",
    "country": "us",
    "language": "en",
    "questions": [
      "What is Python used for?",
      "Is Python easy to learn?",
      "How do I start learning Python?"
    ],
    "count": 3
  }
}

Error Response

{
  "success": false,
  "error": "Scraping failed",
  "message": "Failed to fetch search results: Network timeout"
}
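Given the two response shapes above, a client can branch on the `success` flag. The helper below is a hypothetical example of handling both cases, using only the fields shown in the documented responses:

```python
# Hypothetical helper for consuming the documented API response formats.
def extract_questions(payload: dict) -> list:
    """Return the question list on success; raise with the API's message on failure."""
    if payload.get("success"):
        return payload["data"]["questions"]
    raise RuntimeError(payload.get("message", payload.get("error", "Unknown error")))
```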

Troubleshooting

Common Issues

1. No questions found

  • Google may not always show PAA questions for every query
  • Try different queries or check if Google is accessible from your location

2. Rate limiting

  • Google may rate-limit requests from the same IP
  • Consider adding delays between requests or using proxies

3. API authentication errors

  • Ensure PAA_API_KEY is set correctly
  • Check that the API key matches in both server and client
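One generic way to handle intermittent failures such as rate limiting is retrying with exponential backoff and jitter. The wrapper below is a sketch, not part of the package; it accepts any zero-argument callable:

```python
# Generic retry-with-backoff sketch for mitigating rate limits and timeouts.
import random
import time


def with_backoff(fn, retries=3, base_delay=1.0):
    """Call fn(), retrying on exceptions with exponentially growing, jittered delays."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; propagate the last error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Usage: `with_backoff(lambda: scrape_related_questions(query="python"))` retries up to three times before giving up.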

Contributing

Contributions are welcome! Please follow these guidelines:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Code Style

  • Follow PEP 8 guidelines
  • Use type hints for all functions
  • Write docstrings for all public functions and classes
  • Add tests for new features

License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer

This tool is for educational and research purposes only. Please respect Google's Terms of Service and robots.txt. Use responsibly and consider implementing rate limiting and caching in production environments.

Author

Pushkar Singh

Acknowledgments

  • Built with Beautiful Soup for HTML parsing
  • Uses Flask for the REST API
  • Inspired by the need for structured PAA data extraction
