In this High Performance Computing (HPC) engineering project, I built, benchmarked, and tested three variants of PageRank on large, sparse power-law graphs: a baseline pure-Python implementation, a SciPy sparse linear-algebra version, and finally a custom Just-In-Time (JIT) compiled Numba implementation.
- Three implementations: baseline Python, SciPy linear algebra, and custom JIT Numba
- Correctness tests vs NetworkX
- Command-line interface (CLI) for benchmarking, producing CSV output
- Fixed-iteration vs. tolerance (`tol`) benchmarking
- Comprehensive results suite (see `RESULTS.md`)
- Plots and profiling data (see `docs/profiling.md`)
- HPC-specific notes (see `docs/hpc-notes.md`)
- Development hygiene: Continuous Integration (CI) and tests
- baseline: pure Python loops over incoming CSR (Compressed Sparse Row)
- sparse_total: SciPy sparse approach including matrix build and iteration
- sparse_build: a one-off cost to build the transition matrix (this is reported separately)
- sparse_cached: SciPy iteration-only time (for a fairer comparison)
- numba: custom CSR traversal loop, JIT-compiled with Numba (a warmup is performed before timing)
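To make the variants concrete, here is a minimal sketch of power iteration over an incoming-edge CSR, in the spirit of the baseline/Numba loop. This is illustrative only, not the repo's actual code; the function name, array layout, and dangling-node handling are assumptions:

```python
import numpy as np

def pagerank_incoming_csr(indptr, indices, out_deg, d=0.85, iters=100):
    """Sketch of PageRank over an incoming-edge CSR (NumPy arrays).

    indptr/indices: CSR arrays where row v lists the in-neighbours of v.
    out_deg: out-degree of each node, used to normalise contributions.
    """
    n = len(indptr) - 1
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        # Each node pushes rank/out_deg along its outgoing edges.
        contrib = np.where(out_deg > 0, r / np.maximum(out_deg, 1), 0.0)
        # Mass on dangling nodes (no out-edges) is spread uniformly.
        dangling = r[out_deg == 0].sum() / n
        new_r = np.empty(n)
        for v in range(n):
            # Sum the contributions of v's in-neighbours.
            s = contrib[indices[indptr[v]:indptr[v + 1]]].sum()
            new_r[v] = (1.0 - d) / n + d * (s + dangling)
        r = new_r
    return r
```

The inner Python loop over `v` is exactly the kind of hot loop that stays slow in the baseline and that Numba can JIT-compile to near-C speed.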
- CLI generates graphs and writes them as `.npz` (incoming CSR)
- CLI benchmarks implementations and appends to `results/results.csv`
- Plot script reads CSV data and generates figures
- Correctness: parity tests vs NetworkX and cross-implementation comparisons
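The `.npz` exchange format can be sketched as storing the CSR arrays directly in one compressed archive. The array names below are illustrative, not necessarily those used by the repo:

```python
import numpy as np

def save_incoming_csr(path, indptr, indices, out_deg):
    # Store the incoming-CSR arrays in one compressed .npz file.
    np.savez_compressed(path, indptr=indptr, indices=indices, out_deg=out_deg)

def load_incoming_csr(path):
    # Reload the three arrays by name.
    with np.load(path) as f:
        return f["indptr"], f["indices"], f["out_deg"]
```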
An end-to-end visualisation of the pathway I took to complete the project.
```mermaid
flowchart LR
A["generate_graph CLI"] --> X[".npz<br/>(incoming CSR)"] --> B["CSRIncoming"]
B --> C1["baseline<br/>Python loops"]
B --> C2["sparse<br/>SciPy"]
B --> C3["numba<br/>JIT CSR loop"]
C2 --> J["build once"] --> D1["sparse_build<br/>(A matrix)"]
D1 --> D2["sparse_cached<br/>(A * r loop)"]
C1 --> E["pagerank_bench"]
C2 --> E
C3 --> E
D2 --> E
E --> F["results/results.csv<br/>median of repeats"]
F --> G["bench/plot_results.py"]
G --> H["plots in results/"]
H --> I["RESULTS.md + README snapshot"]
```
- `src/hpc_pagerank/graphs.py` - graph generation (BA/SBM), conversion to incoming CSR, `.npz` save/load
- `src/hpc_pagerank/pagerank.py` - baseline, SciPy sparse, and Numba implementations
- `src/hpc_pagerank/bench_core.py` - benchmark runner (repeats, median) + CSV append
- `src/generate_graph/` - `python -m generate_graph ...`
- `src/pagerank_bench/` - `python -m pagerank_bench ...`
- `bench/plot_results.py` - generate plots from CSV data
- `tests/` - correctness + parity tests
- `docs/assets/` - committed “paper figures” for README/RESULTS
- `RESULTS.md` - full methodology, tables, and plots
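The benchmark-runner pattern (repeat the run, take the median, append a row to CSV) can be sketched as follows. Function names and the CSV schema here are illustrative, not the repo's exact API:

```python
import csv
import statistics
import time
from pathlib import Path

def bench(fn, repeats=5):
    """Time fn() `repeats` times and return the median wall-clock seconds."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    # Median is robust to one-off outliers (GC pauses, OS jitter).
    return statistics.median(times)

def append_row(csv_path, row, header):
    """Append one result row, writing the header only if the file is new."""
    path = Path(csv_path)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerow(row)
```

Appending rather than overwriting is what lets repeated CLI invocations accumulate into one `results.csv` for plotting.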
Benchmarked on Barabási–Albert (BA) graphs. Full details + methodology in RESULTS.md.
| n (nodes) | nnz | baseline (s) | sparse_cached (s) | numba (s) | speedup baseline/sparse_cached | speedup baseline/numba |
|---|---|---|---|---|---|---|
| 5,000 | 39,968 | 16.866 | 0.0412 | 0.0491 | 409.8× | 343.6× |
| 10,000 | 79,968 | 32.748 | 0.0935 | 0.1120 | 350.1× | 292.4× |
| 15,000 | 119,968 | 50.310 | 0.1386 | 0.1487 | 362.9× | 338.4× |
| 25,000 | 199,968 | 84.597 | 0.2394 | 0.3001 | 353.5× | 281.9× |
| 40,000 | 319,968 | 135.674 | 0.4047 | 0.4717 | 335.3× | 287.6× |
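As a sanity check on the table, the nnz column is consistent with each undirected BA edge being stored in both directions: a BA graph with parameter `m` has roughly `m * (n - m)` edges, giving `nnz = 2 * m * (n - m)`. This formula is an inference from the table values, not the repo's documented derivation:

```python
def expected_nnz(n, m=4):
    # A BA graph attaches m edges per new node (after the m seed nodes),
    # so it has m * (n - m) undirected edges; storing both directions
    # in the incoming CSR doubles that count.
    return 2 * m * (n - m)
```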
```bash
pip install -r requirements.txt
pip install -e .
pytest -q
```

Generate a graph:

```bash
python -m generate_graph --type ba -n 10000 -m 4 --seed 0 --out data/ba_10k.npz
```

`tol=0` disables early convergence, so every run does exactly `iters` iterations:

```bash
python -m pagerank_bench --graph data/ba_10k.npz --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv
python -m pagerank_bench --graph data/ba_10k.npz --impl all --iters 300 --tol 1e-8 --repeats 5 --out results/results.csv
```

Examples (the plot script supports filtering and a logarithmic scale; filenames are tagged to avoid overwriting):

```bash
python bench/plot_results.py --csv results/results.csv --outdir results --tol 0 --impls baseline,numba,sparse_cached --logy --tag log
python bench/plot_results.py --csv results/results.csv --outdir results --tol 0 --impls numba,sparse_cached --tag zoom
```

I've evaluated the implementation on graph sizes from 5k to 40k nodes. While these provide a clear baseline on local hardware, I recommend generating graphs of up to 200k nodes (or more) to clearly identify scaling bottlenecks and see more pronounced results.
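The `--tol` behaviour can be sketched as an early-exit check inside the power iteration: with `tol > 0` the loop stops once the L1 change between successive iterates drops below `tol`, while `tol = 0` forces all `iters` iterations. This is a hypothetical sketch; the repo's actual norm and stopping rule may differ:

```python
import numpy as np

def power_iter(step, r0, iters=300, tol=0.0):
    """Apply `step` up to `iters` times; stop early when the L1 change
    falls below tol (tol=0 never triggers the early exit)."""
    r = r0
    for k in range(iters):
        r_new = step(r)
        if tol > 0 and np.abs(r_new - r).sum() < tol:
            return r_new, k + 1  # converged early
        r = r_new
    return r, iters
```

Pinning `tol=0` makes every implementation do identical work per run, which is what makes the fixed-iteration timings directly comparable.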
```bash
python -m generate_graph --type ba -n 5000 -m 4 --seed 0 --out data/ba_5k.npz
python -m pagerank_bench --graph data/ba_5k.npz --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv
python -m generate_graph --type ba -n 10000 -m 4 --seed 0 --out data/ba_10k.npz
python -m pagerank_bench --graph data/ba_10k.npz --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv
python -m generate_graph --type ba -n 15000 -m 4 --seed 0 --out data/ba_15k.npz
python -m pagerank_bench --graph data/ba_15k.npz --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv
python -m generate_graph --type ba -n 25000 -m 4 --seed 0 --out data/ba_25k.npz
python -m pagerank_bench --graph data/ba_25k.npz --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv
python -m generate_graph --type ba -n 40000 -m 4 --seed 0 --out data/ba_40k.npz
python -m pagerank_bench --graph data/ba_40k.npz --impl all --iters 300 --tol 0 --repeats 5 --out results/results.csv
```

- `RESULTS.md` - benchmarking methodology, tables, and plots
- `docs/profiling.md` - profiling evidence and my interpretation
- `docs/hpc-notes.md` - HPC-specific notes, the future of this repo, and what I would do on supercomputer architecture



