[NEW] Add structured dataset support to valkey-benchmark #2765

Description

@VoletiRam

Currently, valkey-benchmark only supports synthetic data generation through placeholders such as __rand_int__ and __data__. This limits realistic performance testing, because synthetic data does not reflect the usage patterns, data distributions, or content characteristics that applications actually work with. We need this capability for our full-text search (FTS) work and believe it would also benefit other use cases such as JSON operations, vector similarity search (VSS), and general data modeling.

Proposed Solution

Add a --dataset option to valkey-benchmark that loads structured data from files and introduces field-based placeholders:


valkey-benchmark --dataset products.jsonl -n 50000 \
  HSET product:__field:id__ name "__field:name__" price __field:price__

New Placeholder Syntax

__field:columnname__: Replaced with the value of the named column (columnname) from a row of the dataset file.
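
To make the intended substitution concrete, here is a minimal Python sketch of how a command template could be expanded against one dataset row. It is not the valkey-benchmark implementation (which is C); the names FIELD_RE and expand_fields, and the choice to raise an error on an unknown column, are assumptions made for illustration.

```python
import re

# Hypothetical pattern for the proposed __field:columnname__ placeholder.
FIELD_RE = re.compile(r"__field:(.+?)__")


def expand_fields(arg: str, row: dict) -> str:
    """Replace every __field:name__ in one command argument with row[name]."""

    def lookup(match):
        name = match.group(1)
        if name not in row:
            # Assumption: an unknown column is an error rather than left as-is.
            raise KeyError(f"dataset has no column named {name!r}")
        return str(row[name])

    return FIELD_RE.sub(lookup, arg)


# One row as it might be loaded from products.jsonl (sample values are made up).
row = {"id": "42", "name": "Desk Lamp", "price": "19.99"}
template = ["HSET", "product:__field:id__", "name", "__field:name__",
            "price", "__field:price__"]
command = [expand_fields(arg, row) for arg in template]
print(command)
```

Each argument is expanded independently, so a placeholder can appear inside a larger token such as product:__field:id__.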

Supported file formats

CSV: Header row defines the field names, e.g. title,content,category

TSV: Tab-separated, with a header row, e.g. title\tcontent\tcategory

Parquet: Columnar binary format, intended for FTS datasets (requires an external library)

JSONL: Each line is one JSON object, e.g. {"title": "...", "content": "...", "embedding": [...]} (requires an external library)
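
As a rough illustration of how the text formats above map onto named fields, the following Python sketch parses CSV, TSV, and JSONL into a list of rows keyed by field name. The function name load_rows and the fmt argument are hypothetical, and Parquet is left out because it needs a third-party library.

```python
import csv
import io
import json


def load_rows(text: str, fmt: str) -> list:
    """Parse dataset text into a list of dicts keyed by field name."""
    if fmt in ("csv", "tsv"):
        delimiter = "," if fmt == "csv" else "\t"
        # The header row supplies the field names.
        return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
    if fmt == "jsonl":
        # One JSON object per non-empty line; the object keys are the field names.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError(f"unsupported format: {fmt}")


csv_rows = load_rows("title,content,category\nValkey,In-memory store,db\n", "csv")
tsv_rows = load_rows("title\tcontent\tcategory\nValkey\tIn-memory store\tdb\n", "tsv")
jsonl_rows = load_rows('{"title": "Valkey", "embedding": [0.1, 0.2]}\n', "jsonl")
print(csv_rows[0]["title"], tsv_rows[0]["category"], jsonl_rows[0]["embedding"])
```

Note that CSV and TSV values arrive as strings, while JSONL values keep their JSON types (for example, the embedding is a list of numbers), so an implementation would need to decide how non-string values are serialized into a command.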

Details

  • Pre-load dataset into memory during initialization
  • Thread-safe row selection using atomic counters
  • Extends existing placeholder system in valkey-benchmark.c
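
The row-selection idea can be sketched as a shared counter that each worker thread increments to claim its next row. In C this would likely be an atomic fetch-and-add; the Python sketch below uses a lock to stand in for that. The class name RowCursor is hypothetical, and wrapping around when the request count exceeds the number of rows is an assumption, since the proposal does not specify that behavior.

```python
import threading


class RowCursor:
    """Hands out preloaded rows to worker threads in round-robin order."""

    def __init__(self, rows):
        if not rows:
            raise ValueError("dataset is empty")
        self._rows = rows
        self._next = 0
        self._lock = threading.Lock()  # stands in for an atomic counter

    def next_row(self):
        # Equivalent of fetch-and-add: read the counter, then bump it.
        with self._lock:
            index = self._next
            self._next += 1
        # Assumption: wrap around once every row has been used.
        return self._rows[index % len(self._rows)]


rows = [{"id": str(i)} for i in range(10)]
cursor = RowCursor(rows)
seen = []


def worker():
    for _ in range(250):
        seen.append(cursor.next_row()["id"])


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(seen))
```

Because every index is claimed exactly once, the 1000 requests here are spread evenly over the 10 rows regardless of how the threads interleave.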

Use Cases

# FTS with real Wikipedia data
valkey-benchmark --dataset wikipedia.csv -n 100000 \
  FT.SEARCH articles "@title:__field:title__"

# E-commerce product catalog
valkey-benchmark --dataset products.csv -n 50000 \
  HSET product:__field:id__ name "__field:name__" category "__field:category__"

Labels: enhancement (New feature or request)