MPHF-Experiments (fork with PHast+ support)

This is a fork of MPHF-Experiments that supports more PHast variants, including PHast+. MPHF-Experiments is a framework for comparing a wide range of different minimal perfect hash functions (MPHFs). From the results, it can generate comprehensive plots, such as Pareto plots, as well as the simple comparison tables used in several papers.

This fork was used to conduct experiments for the paper PHast - Perfect Hashing made fast by Piotr Beling and Peter Sanders, accepted for the ALENEX26 conference. This document contains detailed instructions for reproducing the results of these experiments.

The framework provides a unified interface for testing essentially all modern MPHF constructions that are currently available.

Cloning the Repository

This repository contains submodules. To clone it including the submodules, use the following command:

git clone --recursive https://github.com/beling/MPHF-Experiments.git
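If the repository was already cloned without `--recursive`, the submodules can still be fetched afterwards with `git submodule update --init --recursive`. A minimal self-contained sketch (the throwaway `demo` repository stands in for the real checkout):

```shell
# 'demo' stands in for an existing MPHF-Experiments checkout; in the real
# clone you would run only the submodule command, from the repository root.
git init -q demo
# Fetch any submodules that a plain (non-recursive) clone skipped:
git -C demo submodule update --init --recursive
echo "submodules initialised"
```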

Compiling and Running the Experiments Directly

C++ and Rust compilers are required to build the project. The easiest way to obtain the Rust compiler along with other necessary tools (like cargo) is to use rustup. Please follow the instructions at https://www.rust-lang.org/tools/install.

Once the compilers are installed, compilation works as with any CMake project:

cmake -B ./build -DCMAKE_BUILD_TYPE=Release
cmake --build ./build -j

This might take about 5-15 minutes because of the large number of competitors. You can then run one of the benchmarks, for example ./build/TablePHast --help or ./build/Comparison --help.

If you encounter problems, see the section on compilation issues.

Reproducing results from the paper PHast - Perfect Hashing made fast

(Piotr Beling, Peter Sanders, PHast -- Perfect Hashing made fast, SIAM Symposium on Algorithm Engineering and Experiments ALENEX26, 2026; preprint available on arXiv)

First, clone the repository, compile the software (as described above), and change to the build directory:

cd ./build

Data for the plots, for 50M keys (--numKeys 50M) and a single thread (--numThreads 1), can be obtained using the Comparison program. It is run separately for each algorithm (note that all of these calculations together may take several days; see also the section on faster reproduction of results below):

./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --rustPHastPlusWrapEF
./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --rustPHastPlusWrapC
./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --rustPHastPlusEF
./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --rustPHastPlusC

./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --rustPtrHashGx

./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --rustFmph
./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --rustFmphGo

./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --minimalOnly --loadFactor 0.9 --pthash
./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --minimalOnly --loadFactor 0.9 --partitionedPthash
./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --minimalOnly --loadFactor 0.95 --pthash
./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --minimalOnly --loadFactor 0.95 --partitionedPthash
./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --minimalOnly --loadFactor 0.97 --pthash
./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --minimalOnly --loadFactor 0.97 --partitionedPthash
./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --densePartitionedPtHash

./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --bipartiteShockHashFlat
./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --bipartiteShockHash

./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --minimalOnly --loadFactor 0.9 --sichash
./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --minimalOnly --loadFactor 0.95 --sichash
./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --minimalOnly --loadFactor 0.97 --sichash

./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --simdrecsplit

./Comparison --seed 1234 --numKeys 50M --numThreads 1 --numQueries 3 --fiPS

Notes: thanks to --seed 1234 (another seed can also be used), each algorithm works on exactly the same set of pseudo-random keys; --numQueries 3 averages the query time over 3 runs, each over all keys.

To obtain results for multiple threads, simply change the --numThreads parameter to the desired number of threads.

The results of the runs can be saved to files using standard shell mechanisms (by adding >> filename.txt or | tee -a filename.txt at the end of each line); we recommend ST50M.txt and MT50M.txt as filenames for the single-threaded and multi-threaded results, respectively.
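As an illustration of this mechanism (with echo standing in for an actual ./Comparison run, and ST50M.txt as the suggested filename):

```shell
# 'echo' stands in for a real ./Comparison invocation; each run
# appends its output to the shared results file via tee -a.
echo "run 1 results" | tee -a ST50M.txt
echo "run 2 results" | tee -a ST50M.txt
# The file accumulates one block of output per run:
cat ST50M.txt
```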

Data for the tables (for 50M and 500M keys) can be obtained using the TablePHast program, either separately for individual algorithms (run ./TablePHast --help for details) or for all of them at once with:

./TablePHast --seed 1234 --numKeys 50M --numThreads 1 --numQueries 10 --all
./TablePHast --seed 1234 --numKeys 500M --numThreads 1 --numQueries 10 --all

Again, the results can be placed in files using standard shell mechanisms.

The measured times can be averaged over several runs with different seeds (in the paper, we used the seeds 1234, 2345, 3456 and 4567) using the result_avg.py Python script, which can be found in the scripts directory:

python ../scripts/result_avg.py results_50M_1234.txt results_50M_2345.txt results_50M_3456.txt results_50M_4567.txt > results_50M_avg.txt
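The idea behind this averaging step can be sketched with awk on tiny made-up result files (the file names, algorithm names and numbers here are invented for illustration; result_avg.py is the authoritative implementation and works on the real output format):

```shell
# Two made-up per-seed result files, one row per algorithm,
# with a single numeric column to be averaged:
printf 'phast 10\nfmph 20\n' > results_50M_1234.txt
printf 'phast 14\nfmph 22\n' > results_50M_2345.txt
# Average the second column row-by-row across all input files;
# ARGC-1 is the number of files passed to awk.
awk '{ sum[FNR] += $2; name[FNR] = $1 }
     END { for (i = 1; i <= FNR; i++) printf "%s %g\n", name[i], sum[i]/(ARGC-1) }' \
    results_50M_1234.txt results_50M_2345.txt
```

This prints `phast 12` and `fmph 21`, i.e. the per-row averages over both seeds.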

The remaining experiments were performed using the mphf_benchmark program. Its documentation describes how to reproduce them.

The obtained results can be formatted into LaTeX files using sqlplot-tools together with the script and templates contained in the scripts/phast folder. The README file of sqlplot-tools contains instructions for downloading and compiling it. To use our scripts, copy the sqlplot-tools executable to scripts/phast.

To create LaTeX files with plots presenting the benchmark results and compile them into PDF files:

  • put the single-threaded results in the ST50M.txt file and the multithreaded results in MT50M.txt (place both of these files, as well as scripts/dockerVolume/_competitorNames.txt, in the scripts/phast folder);
  • (in the scripts/phast folder) run the plot-phast.sh script (which updates the ST50M.tex and MT50M.tex files).

To create a LaTeX file with a table presenting the benchmark results and compile it into a PDF file:

  • put the single-threaded results in the table-ST.txt file and the multithreaded results (for the same number of keys) in table-MT.txt (place both files in the scripts/phast folder);
  • (in the scripts/phast folder) run the table-phast.sh script (which uses and updates the table-phast.tex file).

Software version notes: We conducted our experiments on Debian 12 bookworm (running on a computer with AMD Ryzen 5600G and 64GB of RAM), with gcc 12.2.0, rust 1.84.1, CMake 3.31.6, tbb 2021.8.0, sqlplot-tools commit 7d2202b at 2024-08-15. However, many other configurations are also supported. For example, experiments were successfully reproduced on: Ubuntu 22.04.5 LTS (using Intel(R) Xeon(R) Gold 6148 CPU) and also ArchLinux (rolling updated to 2025-10-06; using Intel Xeon Gold 6230 CPU ×2) with gcc 15.2.1, rust 1.90.0, CMake 4.1.2, oneTBB 2022.2.0 (with workaround), sqlplot-tools commit a0c3ba6 at 2025-05-10.

Faster reproduction of results

To obtain experiment results faster, but at the cost of greater noise (greater randomness in the measured running time), you can:

  1. Change --numQueries to 1.

  2. Remove usleep calls from lib/Contender.cpp or decrease the sleep times.

  3. Change --numKeys from 50M to, for example, 20M or even 10M. (This has a great impact on the measured times, but should not significantly affect the size of the MPHFs in bits/key.)

  4. Run processes in parallel. This can be done with the GNU parallel program (see 1 and 2) as follows: place the commands to be executed in the file_with_processes_to_run file and run (changing 8 to the number of cores in your CPU):

    parallel -k -j8 < file_with_processes_to_run > output_file.txt

    or (to see the output also in terminal):

    parallel -k -j8 < file_with_processes_to_run | tee output_file.txt

    (thanks to -k, process outputs will not be mixed up in output_file.txt)

  5. Skip some contenders (the corresponding curve will simply not appear on the plot).

Code Structure

The main comparison code can be found in the src directory. This includes the tabular comparisons used in different papers, as well as the more general Pareto plot in src/Comparison.cpp. To add a new competitor to the framework, have a look at the contenders directory. Each contender has two files: (1) a general wrapper header class that unifies the interface of the competitor, and (2) a cpp file that tests a wide range of configurations for the general Pareto plot. The cpp file should contain all meaningful configurations in order to cover all possible trade-offs. After adding a contender, make sure to re-run cmake. If you want to add a new comparison table, also adapt the CMakeLists.txt file accordingly.

Running the Experiments with Docker

For easier reproducibility and less setup overhead, we provide a docker image to run some of the experiments (but none from the PHast paper). However, for the measurements in the papers, we ran the code directly and with more data points. We refer to Docker.md for details on how to use this repository with Docker.

Compilation problems

If cmake cannot find the tbb library even though it is installed on your (Linux) system (and therefore cannot compile ShockHash), download https://github.com/justusc/FindTBB/blob/master/FindTBB.cmake (a copy is also included in extlib/workaround) into the extlib/ShockHash directory and modify extlib/ShockHash/CMakeLists.txt as follows:

 if(NOT TARGET simpleRibbon)
     set(IPS2RA_DISABLE_PARALLEL ON CACHE PATH "ips2ra's FindTBB greps a file that does not exist in recent TBB versions")
     add_subdirectory(extlib/simpleRibbon SYSTEM EXCLUDE_FROM_ALL)
+    list(APPEND CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/")
     find_package(TBB)
     target_compile_options(ips2ra INTERFACE -D_REENTRANT)
-    target_link_libraries(ips2ra INTERFACE pthread atomic TBB::tbb)
+    target_link_libraries(ips2ra INTERFACE pthread atomic tbb)
 endif()

License

This code is licensed under the GPLv3.
