DiskVecLab: A Deployment-Realistic Evaluation Framework for Disk-Based Vector Search

DiskVecLab is a modular evaluation framework for disk-based vector search that decouples core components and enables controlled ablations and end-to-end comparisons across diverse storage environments, concurrency regimes, and query distributions.

Used Datasets

All datasets are publicly available; the download links are listed below:

Dataset	Dimensionality	Download Link	Note
LAION-T2I/I2I	512/768	Official Site (512 dim.)/Official Site (768 dim.)	Both the 512- and 768-dimensional versions provide text-based and image-based vectors. In our experiments, we use the 512-dimensional version for text-to-image search (out-of-distribution) and the 768-dimensional version for image-to-image search (in-distribution).
DEEP	96	Link from Yandex/Base Set/Query Set
Text2Image	200	Link from Yandex/Base Set/Query Set	A text-to-image search (out-of-distribution) dataset in which the base set contains image vectors and the query set contains text vectors.
SIFT	128	Official Site
SpaceV	100	Official Repo

Evaluated Methods

We evaluated six state-of-the-art methods: DiskANN, Starling, MARGO, PipeANN, Gorgeous, and SPANN. For details of each method, please refer to the corresponding paper below:

[NeurIPS'19] DiskANN: Fast accurate billion-point nearest neighbor search on a single node
[SIGMOD'24] Starling: An I/O-efficient disk-resident graph index framework for high-dimensional vector similarity search on data segment
[VLDB'25] (MARGO) Select Edges Wisely: Monotonic Path Aware Graph Layout Optimization for Disk-Based ANN Search
[OSDI'25] (PipeANN) Achieving Low-Latency Graph-Based Vector Search via Aligning Best-First Search Algorithm with SSD
[arXiv preprint] Gorgeous: Revisiting the Data Layout for Disk-Resident High-Dimensional Vector Search
[NeurIPS'21] SPANN: Highly-efficient billion-scale approximate nearest neighborhood search

Usage

Our experiments were conducted in the following environment:

Ubuntu 24.04
GCC 13.3.0
CMake 3.28.3
Python 3.12
Boost 1.83.0

Example Usage

Segmentation, in-segment optimization, and quantization can be configured independently. For a usage example, see ./test/search_segments.py.

Before running the example, set up the environment and use CMake to build the algorithm sources with the corresponding configuration (e.g., CMakeLists.txt).
The segmentation step is configured through the params class of the chosen segmentation method, which is passed to the index building step.
In-segment optimization and quantization are configured through the index configuration file (e.g., config_local.sh); the parameters set there are passed to the index building step and applied to all segments.
Quantization codes are generated during the index building step. At search time, the codes to use are selected according to the index configuration file, and this choice can be overridden in the search step.
For more details on the parameters, refer to the argument descriptions in the source code.
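The layering described above, where values from the configuration file act as defaults and search-time overrides take precedence, can be pictured as a simple dictionary merge. The following is a minimal sketch under stated assumptions, not DiskVecLab's actual implementation: it assumes the configuration file consists of plain KEY=VALUE shell assignments, and the helper names (parse_shell_config, effective_config) and the EXAMPLE_* keys are hypothetical.

```python
def parse_shell_config(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and full-line comments.

    Assumption: the config file uses plain shell-style assignments.
    """
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop optional surrounding double quotes around the value.
        cfg[key.strip()] = value.strip().strip('"')
    return cfg


def effective_config(file_cfg, overrides):
    """Return a new dict in which search-time overrides win over file values."""
    merged = dict(file_cfg)  # copy, so the file-level defaults stay untouched
    merged.update({key: str(value) for key, value in overrides.items()})
    return merged


# Hypothetical keys, for illustration only.
file_cfg = parse_shell_config('# index settings\nEXAMPLE_QUANT_BYTES=32\nEXAMPLE_LAYOUT="default"\n')
search_cfg = effective_config(file_cfg, {"EXAMPLE_QUANT_BYTES": 16})
print(search_cfg)
```

Keeping the file-level dictionary unmodified means the same built index can be searched under several override sets without rebuilding or re-reading the configuration.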

def run_natural_segmentation_example():
    NAME = "example_natural_segmentation"

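    # NOTE: G, DS, DS_NAME, and DATA_PATH are not defined in this snippet; they are assumed
    # to be set up beforehand (see ./test/search_segments.py).

    # Instantiate the partition and build parameters, then run the partition and build experiment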
    P = NaturalParams(
        split_name=NAME,
        out_dir=f"{DATA_PATH}/data_split/{NAME}/", # Path to save the segments
        num_shards=40,                             # Number of segments to split into
        input_fmt="fvecs",                         # Input format of the dataset (e.g., fvecs)
    )

    B = BuildParams(
        generate_config_dataset=DS_NAME,
        config_local_path="config_local.sh",       # Path to the index configuration file (e.g., config_local.sh)
    )

    create_and_build_experiment(
        name=f"exp_{NAME}",
        global_cfg=G,
        dataset=DS,
        partition=P,
        build=B,
    )

    # Instantiate search parameters for segmented search, then run the search experiment
    S = SearchParams(
        build_type="release",
        mode="search_split",           # Search mode for segmented search
        algo="knn",                    # Search algorithm
        shard_id=0,                    # Segment (shard) ID
        config_local_overrides={       # Search-time overrides for the index configuration file
            ...
        },
    )

    search_experiment(
        name=f"exp_{NAME}",
        global_cfg=G,
        dataset=DS,
        search=S,
    )
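The example above sets input_fmt="fvecs" and num_shards=40. For readers unfamiliar with the format, an fvecs file stores each vector as a little-endian 32-bit integer giving the dimension, followed by that many 32-bit floats. The sketch below reads and writes such files using only the standard library and splits the vectors into contiguous, near-equal shards. The helper names are hypothetical, and the contiguous split is an assumption made for illustration; NaturalParams may partition the data differently.

```python
import struct


def read_fvecs(path):
    """Read an fvecs file into a list of float lists."""
    vectors = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break  # clean end of file
            if len(header) != 4:
                raise ValueError("truncated dimension header")
            (dim,) = struct.unpack("<i", header)
            payload = f.read(4 * dim)
            if len(payload) != 4 * dim:
                raise ValueError("truncated vector payload")
            vectors.append(list(struct.unpack(f"<{dim}f", payload)))
    return vectors


def write_fvecs(path, vectors):
    """Write vectors in fvecs layout: int32 dimension, then float32 values."""
    with open(path, "wb") as f:
        for vec in vectors:
            f.write(struct.pack("<i", len(vec)))
            f.write(struct.pack(f"<{len(vec)}f", *vec))


def split_contiguous(vectors, num_shards):
    """Split into num_shards contiguous shards whose sizes differ by at most one."""
    base, extra = divmod(len(vectors), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)
        shards.append(vectors[start:start + size])
        start += size
    return shards
```

For large base sets, a memory-mapped or NumPy-based reader would be preferable; the pure-Python version is only meant to make the file layout explicit.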
