In this High Performance Computing (HPC) engineering project, I built, benchmarked, and tested three variants of PageRank on large, sparse power-law graphs: a baseline pure-Python implementation, a SciPy sparse linear-algebra version, and finally a custom Just-In-Time (JIT) compiled Numba implementation.
- Three implementations: baseline Python, SciPy linear algebra, and custom JIT Numba
- Correctness tests vs NetworkX
- Command-line interface (CLI) for benchmarking, producing CSV output
- Fixed-iteration vs. tolerance (`tol`) benchmarking
- Comprehensive results suite (see `RESULTS.md`)
- Plots and profiling data (see `docs/profiling.md`)
- HPC-specific notes (see `docs/hpc-notes.md`)
- Development hygiene: Continuous Integration (CI) and tests
- baseline: pure Python loops over incoming CSR (Compressed Sparse Row)
- sparse_total: SciPy sparse approach including matrix build and iteration
- sparse_build: a one-off cost to build the transition matrix (this is reported separately)
- sparse_cached: SciPy iteration-only time (for a fairer comparison)
- numba: custom CSR traversal loop, JIT-compiled with Numba (a warmup is performed before timing)
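To make the variants concrete, here is a minimal sketch of power iteration over an incoming-edge CSR, in the spirit of the baseline/Numba loop. This is illustrative only, not the repo's actual code; the function name, array layout, and dangling-node handling are assumptions:

```python
import numpy as np

def pagerank_incoming_csr(indptr, indices, out_deg, d=0.85, iters=100):
    """Sketch of PageRank over an incoming-edge CSR (NumPy arrays).

    indptr/indices: CSR arrays where row v lists the in-neighbours of v.
    out_deg: out-degree of each node, used to normalise contributions.
    """
    n = len(indptr) - 1
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        # Each node pushes rank/out_deg along its outgoing edges.
        contrib = np.where(out_deg > 0, r / np.maximum(out_deg, 1), 0.0)
        # Mass on dangling nodes (no out-edges) is spread uniformly.
        dangling = r[out_deg == 0].sum() / n
        new_r = np.empty(n)
        for v in range(n):
            # Sum the contributions of v's in-neighbours.
            s = contrib[indices[indptr[v]:indptr[v + 1]]].sum()
            new_r[v] = (1.0 - d) / n + d * (s + dangling)
        r = new_r
    return r
```

The inner Python loop over `v` is exactly the kind of hot loop that stays slow in the baseline and that Numba can JIT-compile to near-C speed.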
- CLI generates graphs and writes them as `.npz` (incoming CSR)
- CLI benchmarks implementations and appends to `results/results.csv`
- Plot script reads CSV data and generates figures
- Correctness: parity tests vs NetworkX and cross-implementation comparisons
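The `.npz` exchange format can be sketched as storing the CSR arrays directly in one compressed archive. The array names below are illustrative, not necessarily those used by the repo:

```python
import numpy as np

def save_incoming_csr(path, indptr, indices, out_deg):
    # Store the incoming-CSR arrays in one compressed .npz file.
    np.savez_compressed(path, indptr=indptr, indices=indices, out_deg=out_deg)

def load_incoming_csr(path):
    # Reload the three arrays by name.
    with np.load(path) as f:
        return f["indptr"], f["indices"], f["out_deg"]
```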
An end-to-end visualisation of the pathway I took to complete the project.
```mermaid
flowchart LR
A["generate_graph CLI"] --> X[".npz<br/>(incoming CSR)"] --> B["CSRIncoming"]
B --> C1["baseline<br/>Python loops"]
B --> C2["sparse<br/>SciPy"]
B --> C3["numba<br/>JIT CSR loop"]
C2 --> J["build once"] --> D1["sparse_build<br/>(A matrix)"]
D1 --> D2["sparse_cached<br/>(A * r loop)"]
C1 --> E["pagerank_bench"]
C2 --> E
C3 --> E
D2 --> E
E --> F["results/results.csv<br/>median of repeats"]
F --> G["bench/plot_results.py"]
G --> H["plots in results/"]
H --> I["RESULTS.md + README snapshot"]
```
- `src/hpc_pagerank/graphs.py` - graph generation (BA/SBM), conversion to incoming CSR, `.npz` save/load
- `src/hpc_pagerank/pagerank.py` - baseline, SciPy sparse, and Numba implementations
- `src/hpc_pagerank/bench_core.py` - benchmark runner (repeats, median) + CSV append
- `src/generate_graph/` - `python -m generate_graph ...`
- `src/pagerank_bench/` - `python -m pagerank_bench ...`
- `bench/plot_results.py` - generate plots from CSV data
- `tests/` - correctness + parity tests
- `docs/assets/` - committed “paper figures” for README/RESULTS
- `RESULTS.md` - full methodology, tables, and plots
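The benchmark-runner pattern (repeat the run, take the median, append a row to CSV) can be sketched as follows. Function names and the CSV schema here are illustrative, not the repo's exact API:

```python
import csv
import statistics
import time
from pathlib import Path

def bench(fn, repeats=5):
    """Time fn() `repeats` times and return the median wall-clock seconds."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    # Median is robust to one-off outliers (GC pauses, OS jitter).
    return statistics.median(times)

def append_row(csv_path, row, header):
    """Append one result row, writing the header only if the file is new."""
    path = Path(csv_path)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerow(row)
```

Appending rather than overwriting is what lets repeated CLI invocations accumulate into one `results.csv` for plotting.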
Benchmarked on Barabási–Albert (BA) graphs. Full details + methodology in RESULTS.md.
| n (nodes) | nnz | baseline (s) | sparse_cached (s) | numba (s) | speedup baseline/sparse_cached | speedup baseline/numba |
|---|---|---|---|---|---|---|
| 5,000 | 39,968 | 16.866 | 0.0412 | 0.0491 | 409.8× | 343.6× |
| 10,000 | 79,968 | 32.748 | 0.0935 | 0.1120 | 350.1× | 292.4× |
| 15,000 | 119,968 | 50.310 | 0.1386 | 0.1487 | 362.9× | 338.4× |
| 25,000 | 199,968 | 84.597 | 0.2394 | 0.3001 | 353.5× | 281.9× |
| 40,000 | 319,968 | 135.674 | 0.4047 | 0.4717 | 335.3× | 287.6× |
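As a sanity check on the table, the nnz column is consistent with each undirected BA edge being stored in both directions: a BA graph with parameter `m` has roughly `m * (n - m)` edges, giving `nnz = 2 * m * (n - m)`. This formula is an inference from the table values, not the repo's documented derivation:

```python
def expected_nnz(n, m=4):
    # A BA graph attaches m edges per new node (after the m seed nodes),
    # so it has m * (n - m) undirected edges; storing both directions
    # in the incoming CSR doubles that count.
    return 2 * m * (n - m)
```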
```bash
pip install -r requirements.txt
pip install -e .
pytest -q
```

Generate a graph:

```bash
python -m generate_graph --type ba -n 10000 -m 4 --seed 0 --out data/ba_10k.npz
```

`tol=0` disables early convergence, so every run does exactly `iters` iterations:

```bash
python -m pagerank_bench --graph data/ba_10k.npz --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv
python -m pagerank_bench --graph data/ba_10k.npz --impl all --iters 300 --tol 1e-8 --repeats 5 --out results/results.csv
```

Examples (the plot script supports filtering and a logarithmic scale; filenames are tagged to avoid overwriting):

```bash
python bench/plot_results.py --csv results/results.csv --outdir results --tol 0 --impls baseline,numba,sparse_cached --logy --tag log
python bench/plot_results.py --csv results/results.csv --outdir results --tol 0 --impls numba,sparse_cached --tag zoom
```

I've evaluated the implementation on graph sizes from 5k to 40k nodes. While these provide a clear baseline on local hardware, I recommend generating graphs of up to 200k nodes (or more) to clearly identify scaling bottlenecks and see more pronounced results.
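The `--tol` behaviour can be sketched as an early-exit check inside the power iteration: with `tol > 0` the loop stops once the L1 change between successive iterates drops below `tol`, while `tol = 0` forces all `iters` iterations. This is a hypothetical sketch; the repo's actual norm and stopping rule may differ:

```python
import numpy as np

def power_iter(step, r0, iters=300, tol=0.0):
    """Apply `step` up to `iters` times; stop early when the L1 change
    falls below tol (tol=0 never triggers the early exit)."""
    r = r0
    for k in range(iters):
        r_new = step(r)
        if tol > 0 and np.abs(r_new - r).sum() < tol:
            return r_new, k + 1  # converged early
        r = r_new
    return r, iters
```

Pinning `tol=0` makes every implementation do identical work per run, which is what makes the fixed-iteration timings directly comparable.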
```bash
python -m generate_graph --type ba -n 5000 -m 4 --seed 0 --out data/ba_5k.npz
python -m pagerank_bench --graph data/ba_5k.npz --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv
python -m generate_graph --type ba -n 10000 -m 4 --seed 0 --out data/ba_10k.npz
python -m pagerank_bench --graph data/ba_10k.npz --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv
python -m generate_graph --type ba -n 15000 -m 4 --seed 0 --out data/ba_15k.npz
python -m pagerank_bench --graph data/ba_15k.npz --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv
python -m generate_graph --type ba -n 25000 -m 4 --seed 0 --out data/ba_25k.npz
python -m pagerank_bench --graph data/ba_25k.npz --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv
python -m generate_graph --type ba -n 40000 -m 4 --seed 0 --out data/ba_40k.npz
python -m pagerank_bench --graph data/ba_40k.npz --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv
```

- `RESULTS.md` - benchmarking methodology, tables, and plots
- `docs/profiling.md` - profiling evidence and my interpretation
- `docs/hpc-notes.md` - HPC-specific notes, the future of this repo, and what I would do on supercomputer architecture



