ray074/hpc-pagerank


High-Performance PageRank on Power-Law Graphs

Overview

In this High Performance Computing (HPC) engineering project, I built, benchmarked and tested three variants of PageRank on large, sparse power-law graphs: a baseline pure-Python implementation, a SciPy sparse linear-algebra implementation, and a custom Just-In-Time (JIT) compiled Numba implementation.

Key features

  • Three implementations: baseline Python, SciPy sparse linear algebra, and a custom JIT Numba kernel
  • Correctness tests against NetworkX
  • Command line interface (CLI) for benchmarking, with CSV output
  • Fixed-iteration vs. tolerance-based (tol) benchmarking
  • Comprehensive results suite (see RESULTS.md)
  • Plots and profiling data (see docs/profiling.md)
  • HPC-specific notes (see docs/hpc-notes.md)
  • Development hygiene: Continuous Integration (CI) and tests

What’s implemented

Implementations

  • baseline: pure Python loops over incoming CSR (Compressed Sparse Row)
  • sparse_total: SciPy sparse approach including matrix build and iteration
  • sparse_build: a one-off cost to build the transition matrix (this is reported separately)
  • sparse_cached: SciPy iteration-only time (for a fairer comparison)
  • numba: custom CSR traversal loop, JIT-compiled with Numba (a warmup is performed before timing)
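
All three variants compute the same damped power iteration over incoming edges. A minimal sketch of that update, assuming `indptr`/`indices` arrays for the incoming CSR and a precomputed `out_degree` (names here are illustrative, not the repo's actual API):

```python
import numpy as np

def pagerank_incoming_csr(indptr, indices, out_degree, n, d=0.85, iters=100):
    """Damped power iteration over an incoming-edge CSR graph (illustrative)."""
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        # Mass sitting on dangling nodes (no outgoing edges) is spread uniformly.
        dangling = r[out_degree == 0].sum()
        new_r = np.full(n, (1.0 - d) / n + d * dangling / n)
        for v in range(n):
            # Incoming neighbours of v live in indices[indptr[v]:indptr[v+1]].
            for u in indices[indptr[v]:indptr[v + 1]]:
                new_r[v] += d * r[u] / out_degree[u]
        r = new_r
    return r
```

The Numba variant JIT-compiles essentially this loop nest, while the SciPy variants replace it with a sparse matrix-vector product.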

Benchmarking

  • CLI generates graphs and writes them as .npz (incoming CSR)
  • CLI benchmarks implementations and appends to results/results.csv
  • Plot script reads CSV data and generates figures
  • Correctness: parity tests vs NetworkX and cross-implementation comparisons
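
The runner pattern described above (repeat, take the median, append a CSV row) can be sketched as follows; the function and column names are illustrative, not the repo's actual bench_core API:

```python
import csv
import statistics
import time
from pathlib import Path

def bench_median(fn, repeats=5):
    """Time fn() `repeats` times and return the median wall-clock seconds."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

def append_result(csv_path, row, fieldnames=("impl", "n", "nnz", "median_s")):
    """Append one benchmark row, writing a header only if the file is new."""
    path = Path(csv_path)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Using the median rather than the mean keeps one-off scheduler or cache hiccups from skewing a run.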

Project Visualisation

An end-to-end view of the project pipeline, from graph generation through benchmarking to plots and results.

```mermaid
flowchart LR
  A["generate_graph CLI"] --> X[".npz<br/>(incoming CSR)"] --> B["CSRIncoming"]

  B --> C1["baseline<br/>Python loops"]
  B --> C2["sparse<br/>SciPy"]
  B --> C3["numba<br/>JIT CSR loop"]

  C2 --> J["build once"] --> D1["sparse_build<br/>(A matrix)"]
  D1 --> D2["sparse_cached<br/>(A * r loop)"]

  C1 --> E["pagerank_bench"]
  C2 --> E
  C3 --> E
  D2 --> E

  E --> F["results/results.csv<br/>median of repeats"]
  F --> G["bench/plot_results.py"]
  G --> H["plots in results/"]
  H --> I["RESULTS.md + README snapshot"]
```

Repo structure

  • src/hpc_pagerank/
    • graphs.py - graph generation (BA/SBM), conversion to incoming CSR, save/load .npz
    • pagerank.py - baseline, SciPy sparse, Numba implementations
    • bench_core.py - benchmark runner (repeats, median) + CSV append
  • src/generate_graph/ - python -m generate_graph ...
  • src/pagerank_bench/ - python -m pagerank_bench ...
  • bench/plot_results.py - generate plots from CSV data
  • tests/ - correctness + parity tests
  • docs/assets/ - committed “paper figures” for README/RESULTS
  • RESULTS.md - full methodology, tables, and plots
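
The .npz interchange format for the incoming-CSR graph can be sketched with plain NumPy. The array keys and helper names below are my assumption, not necessarily what graphs.py actually uses:

```python
import numpy as np

def edges_to_incoming_csr(edges, n):
    """Convert directed (u, v) edges into CSR over the *incoming* edges of each v."""
    counts = np.zeros(n, dtype=np.int64)
    for _, v in edges:
        counts[v] += 1
    indptr = np.concatenate(([0], np.cumsum(counts)))
    indices = np.empty(indptr[-1], dtype=np.int64)
    cursor = indptr[:-1].copy()  # next free slot per destination node
    for u, v in edges:
        indices[cursor[v]] = u
        cursor[v] += 1
    return indptr, indices

def save_incoming_csr(path, indptr, indices, n):
    """Persist the incoming-CSR structure as a compressed .npz archive."""
    np.savez_compressed(path, indptr=indptr, indices=indices, n=np.array(n))

def load_incoming_csr(path):
    """Load the CSR arrays back; returns (indptr, indices, n)."""
    with np.load(path) as data:
        return data["indptr"], data["indices"], int(data["n"])
```

Storing the graph in incoming-CSR form up front means every implementation iterates the same memory layout, so the benchmark compares compute strategies rather than data conversion.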

Results snapshot (fixed iteration, tol=0, iters=300, median of 5 runs)

Benchmarked on Barabási–Albert (BA) graphs. Full details + methodology in RESULTS.md.

| n (nodes) | nnz | baseline (s) | sparse_cached (s) | numba (s) | speedup baseline/sparse_cached | speedup baseline/numba |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 5,000 | 39,968 | 16.866 | 0.0412 | 0.0491 | 409.8× | 343.6× |
| 10,000 | 79,968 | 32.748 | 0.0935 | 0.1120 | 350.1× | 292.4× |
| 15,000 | 119,968 | 50.310 | 0.1386 | 0.1487 | 362.9× | 338.4× |
| 25,000 | 199,968 | 84.597 | 0.2394 | 0.3001 | 353.5× | 281.9× |
| 40,000 | 319,968 | 135.674 | 0.4047 | 0.4717 | 335.3× | 287.6× |

Plots

Runtime vs n (linear scale)

Runtime vs n (logarithmic scale)

Runtime vs n (zoomed in: numba vs sparse_cached)

Speedup vs n

Install

Standard pip installation

```bash
pip install -r requirements.txt
pip install -e .
```

Run tests

```bash
pytest -q
```

Quick start

1) Generate a graph (BA)

```bash
python -m generate_graph --type ba -n 10000 -m 4 --seed 0 --out data/ba_10k.npz
```

2) Benchmark all implementations (a fixed iteration count is best for scaling plots)

tol=0 disables early convergence so every run does exactly iters iterations.

```bash
python -m pagerank_bench --graph data/ba_10k.npz --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv
```
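
The tol semantics can be sketched as a standard early-exit check on the change between iterations: with tol=0 the stopping condition can never fire (a norm is non-negative), so exactly iters iterations run. This loop is illustrative and assumes an L1 criterion, which may differ from the repo's actual check:

```python
import numpy as np

def iterate_with_tol(step, r0, iters, tol):
    """Apply `step` up to `iters` times, stopping early if the L1 change < tol."""
    r = r0
    done = 0
    for _ in range(iters):
        new_r = step(r)
        done += 1
        if np.abs(new_r - r).sum() < tol:  # never true when tol == 0
            r = new_r
            break
        r = new_r
    return r, done
```

This is why tol=0 gives clean scaling plots (every run does identical work) while a positive tol measures time-to-accuracy instead.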

3) Accuracy-based benchmarking

```bash
python -m pagerank_bench --graph data/ba_10k.npz --impl all --iters 300 --tol 1e-8 --repeats 5 --out results/results.csv
```

4) Generate plots from the CSV

Examples (script supports filtering and a logarithmic scale; the filenames are tagged to avoid overwriting):

```bash
python bench/plot_results.py --csv results/results.csv --outdir results --tol 0 --impls baseline,numba,sparse_cached --logy --tag log
python bench/plot_results.py --csv results/results.csv --outdir results --tol 0 --impls numba,sparse_cached --tag zoom
```

Reproducing the sweep used in RESULTS.md

I've evaluated the implementation on graph sizes from 5k to 40k nodes. While these provide a clear baseline on local hardware, I recommend generating graphs of up to 200k nodes (or more) so that you can clearly identify scaling bottlenecks and see more pronounced results.

```bash
python -m generate_graph --type ba -n 5000  -m 4 --seed 0 --out data/ba_5k.npz
python -m pagerank_bench --graph data/ba_5k.npz  --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv

python -m generate_graph --type ba -n 10000 -m 4 --seed 0 --out data/ba_10k.npz
python -m pagerank_bench --graph data/ba_10k.npz --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv

python -m generate_graph --type ba -n 15000 -m 4 --seed 0 --out data/ba_15k.npz
python -m pagerank_bench --graph data/ba_15k.npz --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv

python -m generate_graph --type ba -n 25000 -m 4 --seed 0 --out data/ba_25k.npz
python -m pagerank_bench --graph data/ba_25k.npz --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv

python -m generate_graph --type ba -n 40000 -m 4 --seed 0 --out data/ba_40k.npz
python -m pagerank_bench --graph data/ba_40k.npz --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv
```
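
The sweep is five near-identical command pairs, so if you extend it to larger sizes it may be easier to generate the command lines programmatically. A small helper (purely illustrative; the repo runs the commands by hand):

```python
def sweep_commands(sizes=(5_000, 10_000, 15_000, 25_000, 40_000)):
    """Build the generate/benchmark command lines for a node-count sweep."""
    cmds = []
    for n in sizes:
        graph = f"data/ba_{n // 1000}k.npz"
        cmds.append(
            f"python -m generate_graph --type ba -n {n} -m 4 --seed 0 --out {graph}"
        )
        cmds.append(
            f"python -m pagerank_bench --graph {graph} --impl all "
            f"--iters 300 --tol 0 --repeats 5 --out results/results.csv"
        )
    return cmds
```

Each pair could then be run via subprocess or written to a shell script.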

Further reading

  • RESULTS.md - full benchmarking methodology, tables, and plots.
  • docs/profiling.md - profiling evidence and my interpretation.
  • docs/hpc-notes.md - HPC-specific notes, future directions for this repo, and what I would change for supercomputer architectures.
