GitHub Crawler

A production-grade GitHub repository crawler built with FastAPI, SQLAlchemy, and GraphQL.

Features

  • Crawls GitHub repositories via the GitHub GraphQL API
  • Stores repository data in PostgreSQL
  • Provides both REST and GraphQL APIs
  • Built with clean/hexagonal architecture
  • Includes observability and health checks
  • Containerized with Docker
  • CI/CD pipeline with GitHub Actions

Tech Stack

  • Python 3.11+
  • FastAPI - Modern web framework
  • SQLAlchemy - SQL toolkit and ORM
  • Alembic - Database migrations
  • Strawberry GraphQL - GraphQL server
  • Neon.tech PostgreSQL - Cloud Postgres service
  • uv - Fast Python package manager
  • Docker - Containerization

Getting Started

Prerequisites

  • Python 3.11+
  • PostgreSQL database
  • GitHub API token with repo access
  • uv package manager (install with curl -LsSf https://astral.sh/uv/install.sh | sh)

Local Development Setup

  1. Clone the repository:
git clone https://github.com/yourusername/github-crawler.git
cd github-crawler
  2. Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies:
uv pip install -r requirements.txt
  4. Create a .env file with the required environment variables (see the settings sketch after this list):
GITHUB_TOKEN=your_github_token
DATABASE_URL=postgresql://user:password@localhost:5432/github_crawler
  5. Run database migrations:
alembic upgrade head
  6. Start the application:
uvicorn app.main:app --reload
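
For reference, here is a minimal sketch of how GITHUB_TOKEN and DATABASE_URL might be loaded into typed settings, assuming the project uses pydantic-settings (a common pattern in FastAPI apps; the actual module layout in app/ may differ):

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Reads GITHUB_TOKEN and DATABASE_URL from the environment or .env.
    model_config = SettingsConfigDict(env_file=".env")

    github_token: str
    database_url: str

settings = Settings()  # raises a validation error if either variable is missing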

Using Docker

  1. Build the Docker image:
docker build -t github-crawler .
  2. Run the container:
docker run -p 8000:8000 \
  -e GITHUB_TOKEN=your_github_token \
  -e DATABASE_URL=postgresql://user:password@host:5432/github_crawler \
  github-crawler

API Endpoints

REST API

  • GET /api/repositories/trending?limit=10 - Get trending repositories by star count
  • POST /api/crawler/run?count=100 - Trigger a crawler run
  • GET /healthz - Liveness probe
  • GET /readyz - Readiness probe
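
Once the server is running, the trending and crawler endpoints can be exercised from Python; a quick sketch using httpx (the JSON response shape is assumed):

import httpx

BASE_URL = "http://localhost:8000"

# Fetch the ten most-starred repositories stored so far.
resp = httpx.get(f"{BASE_URL}/api/repositories/trending", params={"limit": 10})
resp.raise_for_status()
print(resp.json())

# Trigger a crawl of 100 repositories.
httpx.post(f"{BASE_URL}/api/crawler/run", params={"count": 100}).raise_for_status()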

GraphQL API

  • /graphql - GraphQL endpoint
  • Example query:
    query {
      trendingRepositories(limit: 10) {
        githubId
        name
        owner
        starCount
      }
    }
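
The same query can also be issued programmatically; a minimal sketch posting to the local endpoint with httpx (httpx is assumed here for illustration):

import httpx

query = """
query {
  trendingRepositories(limit: 10) {
    githubId
    name
    owner
    starCount
  }
}
"""

resp = httpx.post("http://localhost:8000/graphql", json={"query": query})
resp.raise_for_status()
print(resp.json()["data"]["trendingRepositories"])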

Running the Crawler

python3 dev.py 100

This crawls the first 100 repositories matching the criteria defined in the GraphQL query (see fetch_trending_repositories() in app/services/github_client.py) and stores the results in the database. An illustrative version of such a query is sketched below.
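
The exact query lives in the file above; purely for orientation, here is a hypothetical sketch of a call against GitHub's GraphQL API. The field names follow GitHub's public schema, but the function body is illustrative, not the project's actual code:

import httpx

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Illustrative query: top repositories by stars via GitHub's search connection.
QUERY = """
query($first: Int!) {
  search(query: "stars:>10000", type: REPOSITORY, first: $first) {
    nodes {
      ... on Repository {
        databaseId
        name
        owner { login }
        stargazerCount
      }
    }
  }
}
"""

def fetch_trending_repositories(token: str, count: int = 100) -> list[dict]:
    resp = httpx.post(
        GITHUB_GRAPHQL_URL,
        json={"query": QUERY, "variables": {"first": min(count, 100)}},  # search caps first at 100
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    return resp.json()["data"]["search"]["nodes"]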

Database Schema

CREATE TABLE repos (
  github_id BIGINT PRIMARY KEY,
  name TEXT NOT NULL,
  owner TEXT NOT NULL,
  star_count INT NOT NULL,
  last_crawled TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX ix_repos_star_count ON repos (star_count);
CREATE INDEX ix_repos_last_crawled ON repos (last_crawled);
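
For reference, this table maps to a straightforward ORM model; a sketch in SQLAlchemy 2.0 declarative style (the actual model name and module in app/ are assumptions):

from datetime import datetime

from sqlalchemy import BigInteger, DateTime, Integer, Text, func
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Repo(Base):
    __tablename__ = "repos"

    github_id: Mapped[int] = mapped_column(BigInteger, primary_key=True)
    name: Mapped[str] = mapped_column(Text, nullable=False)
    owner: Mapped[str] = mapped_column(Text, nullable=False)
    # index=True generates ix_repos_star_count / ix_repos_last_crawled as above.
    star_count: Mapped[int] = mapped_column(Integer, nullable=False, index=True)
    last_crawled: Mapped[datetime] = mapped_column(
        DateTime(timezone=True), nullable=False, server_default=func.now(), index=True
    )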

Scaling to 500+ Million Repos

To reach that scale we would move to a distributed work queue. Using Kafka or RabbitMQ, we can shard the repo IDs across consumers and run hundreds of workers in parallel while keeping each token under GitHub's 100-concurrent-request cap. The raw GitHub events can then be stored in a data lake such as S3 + Athena or BigQuery for historical analysis. A rough sketch of the sharding and concurrency control follows.
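
A minimal sketch of the idea, with the Kafka/RabbitMQ wiring omitted; the partitioning function and the per-token semaphore are the point, and the per-repo REST fetch stands in for the real GraphQL batching:

import asyncio
import httpx

NUM_SHARDS = 256          # illustrative consumer count
MAX_CONCURRENCY = 100     # GitHub's concurrent-request cap per token

def shard_for(repo_id: int) -> int:
    # Repo IDs are dense integers, so modulo spreads them evenly
    # across queue partitions / consumers.
    return repo_id % NUM_SHARDS

async def crawl_shard(repo_ids: list[int], token: str) -> None:
    # Each consumer bounds its own concurrency so one token never
    # exceeds the 100-concurrent-request limit.
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with httpx.AsyncClient(
        headers={"Authorization": f"Bearer {token}"}
    ) as client:
        async def crawl_one(repo_id: int) -> None:
            async with sem:
                # Hypothetical per-repo fetch; the real crawler would
                # batch repos into GraphQL queries instead.
                resp = await client.get(
                    f"https://api.github.com/repositories/{repo_id}"
                )
                resp.raise_for_status()

        await asyncio.gather(*(crawl_one(rid) for rid in repo_ids))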
