PAA Scraper - People Also Ask Questions Extractor

A professional Python tool for extracting "People Also Ask" (PAA) questions from Google search results. This tool provides both a command-line interface and a REST API for easy integration into your workflows.

Features

  • Simple CLI Interface - Easy-to-use command-line tool for quick queries
  • REST API - Flask-based API with proper authentication and error handling
  • Batch Processing - Process multiple queries efficiently
  • Configurable - Environment-based configuration management
  • Type-Safe - Full type hints for better code quality
  • Well-Tested - Comprehensive unit tests included
  • Docker Support - Ready-to-deploy Docker configuration
  • Logging - Built-in logging for debugging and monitoring

Installation

From Source

# Clone the repository
git clone https://github.com/your-username/paa-scraper.git
cd paa-scraper

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements-new.txt

# Install the package
pip install -e .

Using Docker

# Build the image
docker build -f Dockerfile.new -t paa-scraper .

# Run the container
docker run -p 5000:5000 -e PAA_API_KEY=your-secret-key paa-scraper

Using Docker Compose

# Set your API key in .env file
echo "PAA_API_KEY=your-secret-key" > .env

# Start the service
docker-compose up -d

Quick Start

Command Line Usage

# Scrape questions for a single query
paa-scraper scrape "what is python programming"

# Specify country and language
paa-scraper scrape "best restaurants" --country uk --language en

# Save results to a file
paa-scraper scrape "machine learning" --output results.json

# Batch process multiple queries from a file
paa-scraper batch queries.txt --output batch_results.json

Python Library Usage

from paa_scraper import scrape_related_questions

# Scrape questions
questions = scrape_related_questions(
    query="artificial intelligence",
    country="us",
    language="en"
)

print(f"Found {len(questions)} questions:")
for question in questions:
    print(f"  - {question}")
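The batch-processing feature can also be approximated directly in library code. The sketch below is hypothetical (only `scrape_related_questions` comes from the package); the scrape function is passed in as a parameter, so any callable with the same signature works, and a delay between requests keeps the scraper polite:

```python
# Hypothetical batch helper built on top of the library call above.
# Only scrape_related_questions is part of the package; this wrapper is illustrative.
import time


def batch_scrape(queries, scrape_fn, country="us", language="en", delay=1.0):
    """Scrape each query in turn, pausing between requests to avoid rate limits."""
    results = {}
    for i, query in enumerate(queries):
        if i and delay:
            time.sleep(delay)  # no pause before the first request
        results[query] = scrape_fn(query=query, country=country, language=language)
    return results
```

Usage: `batch_scrape(["python", "rust"], scrape_related_questions)` returns a dict mapping each query to its question list.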

REST API Usage

Start the API Server

# Set your API key
export PAA_API_KEY=your-secret-key

# Run the server
paa-api

API Endpoints

Health Check

curl http://localhost:5000/

Scrape Single Query (GET)

curl "http://localhost:5000/api/v1/scrape?query=python&api_key=your-secret-key"

Scrape Single Query (POST)

curl -X POST http://localhost:5000/api/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning",
    "country": "us",
    "language": "en",
    "api_key": "your-secret-key"
  }'

Batch Scrape

curl -X POST http://localhost:5000/api/v1/batch \
  -H "Content-Type: application/json" \
  -d '{
    "queries": ["python", "javascript", "rust"],
    "country": "us",
    "language": "en",
    "api_key": "your-secret-key"
  }'
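The same batch request can be made from Python using only the standard library. This is a minimal client sketch, assuming the server is running on localhost:5000; the helper names (`build_batch_payload`, `batch_scrape_api`) are illustrative, not part of the package:

```python
# Hypothetical stdlib-only client for the batch endpoint shown above.
import json
from urllib import request

API_URL = "http://localhost:5000/api/v1/batch"  # adjust host/port as needed


def build_batch_payload(queries, api_key, country="us", language="en"):
    """Assemble the JSON body expected by the batch endpoint."""
    return {
        "queries": list(queries),
        "country": country,
        "language": language,
        "api_key": api_key,
    }


def batch_scrape_api(queries, api_key):
    """POST the queries and return the decoded JSON response."""
    body = json.dumps(build_batch_payload(queries, api_key)).encode("utf-8")
    req = request.Request(API_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```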

Configuration

Configuration is managed through environment variables. Copy .env.example to .env and customize the values:

cp .env.example .env

Environment Variables

Variable               Description                   Default
PAA_DEFAULT_COUNTRY    Default country code          us
PAA_DEFAULT_LANGUAGE   Default language code         en
PAA_REQUEST_TIMEOUT    Request timeout in seconds    10
PAA_API_HOST           API server host               0.0.0.0
PAA_API_PORT           API server port               5000
PAA_API_KEY            API authentication key        None
PAA_LOG_LEVEL          Logging level                 INFO
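One common way such variables are consumed is a settings object with fallbacks matching the defaults in the table. This is a sketch only; the actual config.py may be structured differently:

```python
# Minimal sketch of environment-based configuration, mirroring the table above.
# Note: defaults are read once, at class-definition time.
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    default_country: str = os.getenv("PAA_DEFAULT_COUNTRY", "us")
    default_language: str = os.getenv("PAA_DEFAULT_LANGUAGE", "en")
    request_timeout: int = int(os.getenv("PAA_REQUEST_TIMEOUT", "10"))
    api_host: str = os.getenv("PAA_API_HOST", "0.0.0.0")
    api_port: int = int(os.getenv("PAA_API_PORT", "5000"))
    api_key: Optional[str] = os.getenv("PAA_API_KEY")
    log_level: str = os.getenv("PAA_LOG_LEVEL", "INFO")
```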

Project Structure

paa-scraper/
├── src/
│   └── paa_scraper/
│       ├── __init__.py           # Package initialization
│       ├── scraper.py            # Core scraping logic
│       ├── text_utils.py         # Text processing utilities
│       ├── config.py             # Configuration management
│       ├── api/
│       │   ├── __init__.py
│       │   └── flask_app.py      # Flask REST API
│       └── cli/
│           ├── __init__.py
│           └── main.py           # CLI interface
├── tests/
│   ├── __init__.py
│   ├── test_scraper.py           # Scraper tests
│   └── test_api.py               # API tests
├── .env.example                  # Example environment config
├── .gitignore                    # Git ignore rules
├── docker-compose.yml            # Docker Compose config
├── Dockerfile.new                # Docker configuration
├── README.md                     # This file
├── requirements-new.txt          # Production dependencies
├── requirements-dev.txt          # Development dependencies
└── setup.py                      # Package setup

Development

Setup Development Environment

# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
pytest

# Run tests with coverage
pytest --cov=src/paa_scraper tests/

# Format code
black src/ tests/

# Lint code
flake8 src/ tests/
pylint src/

Running Tests

# Run all tests
python -m pytest

# Run specific test file
python -m pytest tests/test_scraper.py

# Run with verbose output
python -m pytest -v

# Run with coverage report
python -m pytest --cov=src/paa_scraper --cov-report=html

API Response Format

Successful Response

{
  "success": true,
  "data": {
    "query": "python programming",
    "country": "us",
    "language": "en",
    "questions": [
      "What is Python used for?",
      "Is Python easy to learn?",
      "How do I start learning Python?"
    ],
    "count": 3
  }
}

Error Response

{
  "success": false,
  "error": "Scraping failed",
  "message": "Failed to fetch search results: Network timeout"
}
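Given the two response shapes above, a client can branch on the `success` flag. The helper below is a hypothetical example of handling both cases, using only the fields shown in the documented responses:

```python
# Hypothetical helper for consuming the documented API response formats.
def extract_questions(payload: dict) -> list:
    """Return the question list on success; raise with the API's message on failure."""
    if payload.get("success"):
        return payload["data"]["questions"]
    raise RuntimeError(payload.get("message", payload.get("error", "Unknown error")))
```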

Troubleshooting

Common Issues

1. No questions found

  • Google may not always show PAA questions for every query
  • Try different queries or check if Google is accessible from your location

2. Rate limiting

  • Google may rate-limit requests from the same IP
  • Consider adding delays between requests or using proxies

3. API authentication errors

  • Ensure PAA_API_KEY is set correctly
  • Check that the API key matches in both server and client
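One generic way to handle intermittent failures such as rate limiting is retrying with exponential backoff and jitter. The wrapper below is a sketch, not part of the package; it accepts any zero-argument callable:

```python
# Generic retry-with-backoff sketch for mitigating rate limits and timeouts.
import random
import time


def with_backoff(fn, retries=3, base_delay=1.0):
    """Call fn(), retrying on exceptions with exponentially growing, jittered delays."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; propagate the last error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Usage: `with_backoff(lambda: scrape_related_questions(query="python"))` retries up to three times before giving up.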

Contributing

Contributions are welcome! Please follow these guidelines:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Code Style

  • Follow PEP 8 guidelines
  • Use type hints for all functions
  • Write docstrings for all public functions and classes
  • Add tests for new features

License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer

This tool is for educational and research purposes only. Please respect Google's Terms of Service and robots.txt. Use responsibly and consider implementing rate limiting and caching in production environments.

Author

Pushkar Singh

Acknowledgments

  • Built with Beautiful Soup for HTML parsing
  • Uses Flask for the REST API
  • Inspired by the need for structured PAA data extraction
