A production-grade GitHub repository crawler built with FastAPI, SQLAlchemy, and GraphQL.
- Crawls GitHub repositories via the GitHub GraphQL API
- Stores repository data in PostgreSQL
- Provides both REST and GraphQL APIs
- Built with clean/hexagonal architecture
- Includes observability and health checks
- Containerized with Docker
- CI/CD pipeline with GitHub Actions
- Python 3.11+
- FastAPI - Modern web framework
- SQLAlchemy - SQL toolkit and ORM
- Alembic - Database migrations
- Strawberry GraphQL - GraphQL server
- Neon.tech PostgreSQL - Cloud Postgres service
- UV - Fast Python package manager
- Docker - Containerization
- Python 3.11+
- PostgreSQL database
- GitHub API token with repo access
- UV package manager, installable with:
curl -LsSf https://astral.sh/uv/install.sh | sh
- Clone the repository:
git clone https://github.com/yourusername/github-crawler.git
cd github-crawler
- Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
- Install dependencies:
uv pip install -r requirements.txt
- Create a .env file with the required environment variables (see the settings sketch after these steps):
GITHUB_TOKEN=your_github_token
DATABASE_URL=postgresql://user:password@localhost:5432/github_crawler
- Run migrations:
alembic upgrade head
- Start the application:
uvicorn app.main:app --reload
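Configuration is read from the environment (or the .env file). As a rough sketch of how those two variables might be modeled, assuming a pydantic-settings based config module (the actual settings code in the app may differ):

```python
# Hypothetical settings module -- a sketch, not the repository's actual config code.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    # Field names map to GITHUB_TOKEN and DATABASE_URL in the environment / .env file.
    github_token: str
    database_url: str

    model_config = SettingsConfigDict(env_file=".env")


settings = Settings()  # fails fast with a validation error if a required variable is missing
```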
docker build -t github-crawler .
- Run the container:
docker run -p 8000:8000 \
  -e GITHUB_TOKEN=your_github_token \
  -e DATABASE_URL=postgresql://user:password@host:5432/github_crawler \
  github-crawler
- GET /api/repositories/trending?limit=10 - Get trending repositories by star count
- POST /api/crawler/run?count=100 - Trigger a crawler run
- GET /healthz - Liveness probe
- GET /readyz - Readiness probe
- /graphql - GraphQL endpoint
- Example query:
query { trendingRepositories(limit: 10) { githubId name owner starCount } }
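Both APIs can be exercised from Python as a quick smoke test. A minimal sketch using httpx, assuming the app is running locally on port 8000 (the response shapes shown are illustrative, not guaranteed):

```python
# Exercise the REST and GraphQL endpoints of a locally running instance.
# Assumes the server was started with `uvicorn app.main:app` on port 8000.
import httpx

BASE_URL = "http://localhost:8000"

with httpx.Client(base_url=BASE_URL, timeout=120.0) as client:
    # Liveness / readiness probes.
    print(client.get("/healthz").status_code, client.get("/readyz").status_code)

    # Trigger a crawler run (may take a while for 100 repositories).
    client.post("/api/crawler/run", params={"count": 100}).raise_for_status()

    # REST: top 10 repositories by star count.
    trending = client.get("/api/repositories/trending", params={"limit": 10})
    trending.raise_for_status()
    print(trending.json())

    # GraphQL: the example query above, sent to the Strawberry endpoint.
    query = "query { trendingRepositories(limit: 10) { githubId name owner starCount } }"
    gql = client.post("/graphql", json={"query": query})
    gql.raise_for_status()
    print(gql.json()["data"]["trendingRepositories"])
```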
python3 dev.py 100
This will crawl the first 100 repositories (using the selection criteria in the GraphQL query; see fetch_trending_repositories() in app/services/github_client.py) and store the results in the database.
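Purely as an illustration of what a trending-repositories query against the GitHub GraphQL API typically looks like (this sketch is an assumption, not the repository's actual query, which lives in fetch_trending_repositories()):

```python
# Illustrative only -- the real query lives in app/services/github_client.py.
import os

import httpx

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Search the most-starred repositories; GitHub caps `first` at 100 per page.
QUERY = """
query($count: Int!) {
  search(query: "stars:>1000 sort:stars-desc", type: REPOSITORY, first: $count) {
    nodes {
      ... on Repository {
        databaseId        # maps to github_id
        name
        owner { login }
        stargazerCount    # maps to star_count
      }
    }
  }
}
"""


def fetch_trending(count: int = 100) -> list[dict]:
    headers = {"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"}
    payload = {"query": QUERY, "variables": {"count": count}}
    resp = httpx.post(GITHUB_GRAPHQL_URL, json=payload, headers=headers, timeout=60.0)
    resp.raise_for_status()
    return resp.json()["data"]["search"]["nodes"]
```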
CREATE TABLE repos (
github_id BIGINT PRIMARY KEY,
name TEXT NOT NULL,
owner TEXT NOT NULL,
star_count INT NOT NULL,
last_crawled TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ix_repos_star_count ON repos (star_count);
CREATE INDEX ix_repos_last_crawled ON repos (last_crawled);

We might consider using a distributed work queue to crawl repositories in parallel. With Kafka or RabbitMQ, we can shard the repo IDs across consumers and launch hundreds of concurrent workers while keeping each token within its 100-concurrent-request cap. The raw GitHub events can then be stored in a data lake such as S3 + Athena or BigQuery for historical analysis.
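Until such a queue exists, the fan-out pattern can be approximated in-process: shard the repo IDs into batches handled by async workers, with a semaphore enforcing the per-token concurrency cap. A rough sketch (endpoint, names, and shard sizes are illustrative):

```python
# In-process approximation of the sharded-worker idea described above.
# A real deployment would replace the in-memory shard with Kafka/RabbitMQ partitions.
import asyncio

import httpx

MAX_CONCURRENT_REQUESTS = 100  # per-token cap mentioned above
semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)


async def crawl_repo(client: httpx.AsyncClient, repo_id: int) -> dict:
    # The semaphore keeps at most 100 requests in flight for this token.
    async with semaphore:
        resp = await client.get(f"https://api.github.com/repositories/{repo_id}")
        resp.raise_for_status()
        return resp.json()


async def crawl_shard(repo_ids: list[int], token: str) -> list[dict]:
    headers = {"Authorization": f"bearer {token}"}
    async with httpx.AsyncClient(headers=headers, timeout=30.0) as client:
        return await asyncio.gather(*(crawl_repo(client, rid) for rid in repo_ids))


# Example: one consumer processing its shard of repo IDs.
# results = asyncio.run(crawl_shard(list(range(1, 1001)), token="your_github_token"))
```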