GitHub Crawler

A production-grade GitHub repository crawler built with FastAPI, SQLAlchemy, and GraphQL.

Features

  • Crawls GitHub repositories via the GitHub GraphQL API
  • Stores repository data in PostgreSQL
  • Provides both REST and GraphQL APIs
  • Built with clean/hexagonal architecture
  • Includes observability and health checks
  • Containerized with Docker
  • CI/CD pipeline with GitHub Actions

Tech Stack

  • Python 3.11+
  • FastAPI - Modern web framework
  • SQLAlchemy - SQL toolkit and ORM
  • Alembic - Database migrations
  • Strawberry GraphQL - GraphQL server
  • Neon.tech PostgreSQL - Cloud Postgres service
  • uv - Fast Python package manager
  • Docker - Containerization

Getting Started

Prerequisites

  • Python 3.11+
  • PostgreSQL database
  • GitHub API token with repo access
  • uv package manager (install with curl -LsSf https://astral.sh/uv/install.sh | sh)

Local Development Setup

  1. Clone the repository:
git clone https://github.com/yourusername/github-crawler.git
cd github-crawler
  2. Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies:
uv pip install -r requirements.txt
  4. Create a .env file with the required environment variables (see the settings sketch after this list):
GITHUB_TOKEN=your_github_token
DATABASE_URL=postgresql://user:password@localhost:5432/github_crawler
  5. Run database migrations:
alembic upgrade head
  6. Start the application:
uvicorn app.main:app --reload
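
For reference, here is a minimal sketch of how GITHUB_TOKEN and DATABASE_URL might be loaded into typed settings, assuming the project uses pydantic-settings (a common pattern in FastAPI apps; the actual module layout in app/ may differ):

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Reads GITHUB_TOKEN and DATABASE_URL from the environment or .env.
    model_config = SettingsConfigDict(env_file=".env")

    github_token: str
    database_url: str

settings = Settings()  # raises a validation error if either variable is missing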

Using Docker

  1. Build the Docker image:
docker build -t github-crawler .
  2. Run the container:
docker run -p 8000:8000 \
  -e GITHUB_TOKEN=your_github_token \
  -e DATABASE_URL=postgresql://user:password@host:5432/github_crawler \
  github-crawler

API Endpoints

REST API

  • GET /api/repositories/trending?limit=10 - Get trending repositories by star count
  • POST /api/crawler/run?count=100 - Trigger a crawler run
  • GET /healthz - Liveness probe
  • GET /readyz - Readiness probe
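
Once the server is running, the trending and crawler endpoints can be exercised from Python; a quick sketch using httpx (the JSON response shape is assumed):

import httpx

BASE_URL = "http://localhost:8000"

# Fetch the ten most-starred repositories stored so far.
resp = httpx.get(f"{BASE_URL}/api/repositories/trending", params={"limit": 10})
resp.raise_for_status()
print(resp.json())

# Trigger a crawl of 100 repositories.
httpx.post(f"{BASE_URL}/api/crawler/run", params={"count": 100}).raise_for_status()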

GraphQL API

  • /graphql - GraphQL endpoint
  • Example query:
    query {
      trendingRepositories(limit: 10) {
        githubId
        name
        owner
        starCount
      }
    }
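
The same query can also be issued programmatically; a minimal sketch posting to the local endpoint with httpx (httpx is assumed here for illustration):

import httpx

query = """
query {
  trendingRepositories(limit: 10) {
    githubId
    name
    owner
    starCount
  }
}
"""

resp = httpx.post("http://localhost:8000/graphql", json={"query": query})
resp.raise_for_status()
print(resp.json()["data"]["trendingRepositories"])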

Running the Crawler

python3 dev.py 100

This crawls the first 100 repositories matching the criteria defined in the GraphQL query (see fetch_trending_repositories() in app/services/github_client.py) and stores the results in the database. An illustrative version of such a query is sketched below.
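
The exact query lives in the file above; purely for orientation, here is a hypothetical sketch of a call against GitHub's GraphQL API. The field names follow GitHub's public schema, but the function body is illustrative, not the project's actual code:

import httpx

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Illustrative query: top repositories by stars via GitHub's search connection.
QUERY = """
query($first: Int!) {
  search(query: "stars:>10000", type: REPOSITORY, first: $first) {
    nodes {
      ... on Repository {
        databaseId
        name
        owner { login }
        stargazerCount
      }
    }
  }
}
"""

def fetch_trending_repositories(token: str, count: int = 100) -> list[dict]:
    resp = httpx.post(
        GITHUB_GRAPHQL_URL,
        json={"query": QUERY, "variables": {"first": min(count, 100)}},  # search caps first at 100
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    return resp.json()["data"]["search"]["nodes"]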

Database Schema

CREATE TABLE repos (
  github_id BIGINT PRIMARY KEY,
  name TEXT NOT NULL,
  owner TEXT NOT NULL,
  star_count INT NOT NULL,
  last_crawled TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX ix_repos_star_count ON repos (star_count);
CREATE INDEX ix_repos_last_crawled ON repos (last_crawled);
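
For reference, this table maps to a straightforward ORM model; a sketch in SQLAlchemy 2.0 declarative style (the actual model name and module in app/ are assumptions):

from datetime import datetime

from sqlalchemy import BigInteger, DateTime, Integer, Text, func
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Repo(Base):
    __tablename__ = "repos"

    github_id: Mapped[int] = mapped_column(BigInteger, primary_key=True)
    name: Mapped[str] = mapped_column(Text, nullable=False)
    owner: Mapped[str] = mapped_column(Text, nullable=False)
    # index=True generates ix_repos_star_count / ix_repos_last_crawled as above.
    star_count: Mapped[int] = mapped_column(Integer, nullable=False, index=True)
    last_crawled: Mapped[datetime] = mapped_column(
        DateTime(timezone=True), nullable=False, server_default=func.now(), index=True
    )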

Scaling to 500+ Million Repos

To reach that scale we would move to a distributed work queue. Using Kafka or RabbitMQ, we can shard the repo IDs across consumers and run hundreds of workers in parallel while keeping each token under GitHub's 100-concurrent-request cap. The raw GitHub events can then be stored in a data lake such as S3 + Athena or BigQuery for historical analysis. A rough sketch of the sharding and concurrency control follows.
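
A minimal sketch of the idea, with the Kafka/RabbitMQ wiring omitted; the partitioning function and the per-token semaphore are the point, and the per-repo REST fetch stands in for the real GraphQL batching:

import asyncio
import httpx

NUM_SHARDS = 256          # illustrative consumer count
MAX_CONCURRENCY = 100     # GitHub's concurrent-request cap per token

def shard_for(repo_id: int) -> int:
    # Repo IDs are dense integers, so modulo spreads them evenly
    # across queue partitions / consumers.
    return repo_id % NUM_SHARDS

async def crawl_shard(repo_ids: list[int], token: str) -> None:
    # Each consumer bounds its own concurrency so one token never
    # exceeds the 100-concurrent-request limit.
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with httpx.AsyncClient(
        headers={"Authorization": f"Bearer {token}"}
    ) as client:
        async def crawl_one(repo_id: int) -> None:
            async with sem:
                # Hypothetical per-repo fetch; the real crawler would
                # batch repos into GraphQL queries instead.
                resp = await client.get(
                    f"https://api.github.com/repositories/{repo_id}"
                )
                resp.raise_for_status()

        await asyncio.gather(*(crawl_one(rid) for rid in repo_ids))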
