Add structured datasets loading capability in valkey benchmark by VoletiRam · Pull Request #2823 · valkey-io/valkey

VoletiRam · 2025-11-10T22:14:33Z

Background

Add structured datasets loading capability. Support CSV and TSV file formats. Use __field:fieldname__ placeholders to replace the corresponding fields from the dataset file. Support natural content size of varying length. Allow mixed placeholder usage combining dataset fields with random generators. Enable automatic field discovery from CSV/TSV headers. Use --maxdocs to limit the dataset loading.

Rather than modifying the existing placeholder system, we detect field placeholders and switch to a separate code path that builds commands from scratch using valkeyFormatCommandArgv(). This ensures:

Zero impact on existing functionality
Full support for variable-size content
Thread-safe atomic record iteration
Compatible with pipelining and threading modes

Usage examples

# Strings - Simple key-value with dataset fields
./valkey-benchmark --dataset products.csv -n 10000 SET product:__rand_int__ "__field:name__"

# Sets - Unique collections from dataset
./valkey-benchmark --dataset categories.csv -n 10000 SADD tags:__rand_int__ "__field:category__"

# CSV dataset with document limit
./valkey-benchmark --dataset wiki.csv --maxdocs 100000 -n 50000 HSET doc:__rand_int__ title "__field:title__" body "__field:abstract__"

# Mixed placeholders (dataset + random)
./valkey-benchmark --dataset terms.csv -r 5000000 -n 50000 HSET search:__rand_int__ term "__field:term__" score __rand_1st__

Full-Text Search Benchmarking

# Search hit scenarios (existing terms)
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__"

# Search miss scenarios (non-existent terms)
./valkey-benchmark --dataset miss_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__"

# Query variations
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "@title:__field:term__"
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__*"

Benchmark Results

Test environment:
Instance: AWS c7i.16xlarge, 64 vCPU

Test Dataset: 5M+ Wikipedia XML documents, 5.8GB memory

Configuration	Throughput	CPU Usage	Wall Time	Memory Peak
Single-threaded, P1	93,295 RPS	99%	71.4s	5.8GB
Multi-threaded (10), P1	93,332 RPS	137%	71.5s	5.8GB
Single-threaded, P10	274,499 RPS	96%	36.1s	5.8GB
Multi-threaded (4), P10	344,589 RPS	161%	32.4s	5.8GB

Add structured datasets loading capability. Support XML/CSV/TSV file formats. Use `__field:fieldname__' placeholders to replace the corresponding fields from the dataset file. Support natural content size of varying length. Allow mixed placeholder usage combining dataset fields with random generators. Enable automatic field discovery from CSV/TSV headers and XML tags. Use `--maxdocs` to limit the dataset loading. Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

codecov · 2025-11-10T22:32:50Z

Codecov Report

❌ Patch coverage is 84.57143% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.76%. Comparing base (fb655db) to head (e2a3421).
⚠️ Report is 56 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/valkey-benchmark-dataset.c	85.54%	37 Missing ⚠️
src/valkey-benchmark.c	81.91%	17 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2823      +/-   ##
============================================
+ Coverage     76.51%   76.76%   +0.24%     
============================================
  Files           157      158       +1     
  Lines         79025    79358     +333     
============================================
+ Hits          60463    60916     +453     
+ Misses        18562    18442     -120

Files with missing lines	Coverage Δ
src/valkey-benchmark.c	`72.66% <81.91%> (+10.95%)`	⬆️
src/valkey-benchmark-dataset.c	`85.54% <85.54%> (ø)`

... and 24 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Fix memory leak in memory reporting Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

JimB123

Need a full documentation file. Need details on the new file format. Examples.

Maybe seed the file with --help data for the existing cases. This can be an area for future improvement.

JimB123 · 2025-11-13T18:16:51Z

+        " --dataset <file>   Path to CSV/TSV/XML dataset file for field placeholder replacement.\n"
+        "                    All fields auto-detected with natural content lengths.\n",


Have we reached the point where we need some actual documentation for valkey-benchmark?

The benchmark tool has relied on --help for increasingly complex configurations. Now, we're introducing a dataset configuration file with very limited description. Is it likely that developers will understand how to use this without examining the code?

I suggest that a new documentation file be created (benchmark.md) which can provide details and examples for using valkey-benchmark, including details about this new configuration file.

Thank you for the suggestion. I agree --help is not clear enough for most of the configurations benchmark support. Will add the detailed explanation with examples.

Add documentation for valkey-benchmark. Fix field discovery. Improve field discovery in xml and validate field before loading the data. fix failed tcl test in ubuntu. Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Fix memory leak in xml scan Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

JimB123 · 2025-11-21T20:53:24Z

Nice. I appreciate the benchmark.md file. This is an excellent start and provides a place for future documentation.

JimB123

I really like the new benchmark.md file

Load fields only we care about from dataset. Parse the fields from benchmark command and load only those field values from dataset Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Remove warning log for incomplete document Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

rainsupreme · 2025-12-16T00:06:34Z

Have we considered putting this dataset related code in a separate file, maybe valkey-benchmark-datsets.c? It seems like it should be fairly self-contained.

rainsupreme

Finally finished reviewing - sorry for the delay! I still think having the dataset stuff in a separate file is a good idea. We can write unit tests for all the parsing and replacement stuff too. :)

I don't think the perf concerns are necessarily blockers, but I want to make sure we've thought about it and are reasonably sure it won't be an issue.

rainsupreme · 2025-11-21T22:03:07Z

This doc is great! I noticed --warmup and --duration options weren't documented. Are there other options not covered? I'd like to either document all of them or have a list of undocumented options here. :)

This file is good, but it doesn't really make sense organizationally. We have documentation here: https://valkey.io/topics/benchmark/ which has it's authoritative source in valkey-doc.

Oh right, tbh I forgot about that. Organization is good, but we should fix that doc. It's out of date, for both these changes and changes I made myself previously. I guess we should make an issue about it or something? Should the valkey-benchmark help text reference that documentation perhaps? 🤔

An issue is reasonable. I just don't want two copies of the same information. We can probably move most of this content as is over. A readme in the source tree is not a good place for end user docs.

I created issue valkey-io/valkey-doc#414 for the update of documentation. I will own it. Please assign it to me.

rainsupreme · 2026-01-01T23:50:38Z

+    size_t before_len = field_pos - arg;
+    const char *after_start = field_end + FIELD_SUFFIX_LEN;
+
+    sds result = sdsnewlen(arg, before_len);


I dunno about the perf here. This is run for every argument in every command sent, and I notice it calls strstr() and sdsnewlen() which would typically have O(n) runtime. One can argue that this is effectively constant runtime if the user is reasonable about argument length, but it's still a lot of work to be doing in an inner loop. Also, this appears to be redundant work. My ideas:

The template we're given doesn't change. We could preprocess more and only do the minimum work here for each command sent.

There is a lot of allocator work going on here - the caller duplicates the template array for every command plus here we replace some number of those allocations again. The benchmark would run faster if we avoided so many allocations. Perhaps we could allocate a large block of memory that is large enough for the longest command we expect and keep reusing that?

If we don't make this more efficient the user will have to be more careful that the benchmark doesn't get too bottlenecked by valkey-benchmark itself, instead of valkey-server. Users have to do this anyway to some extent, but dataset mode demands much more resources than valkey-benchmark usually requires.

Appreciate the concern on performance. I think about this and definitely gonna be 20-30% improvements in terms of performance with your suggestions. The performance with current change is good enough for full text search needs and saturating the server as desired. I can follow up as future improvements for template array based pre-processing of commands for performance when we know it is a bottleneck.

rainsupreme · 2026-01-01T23:51:07Z

+
+        test {benchmark: dataset XML with field placeholders} {
+            # Create test XML dataset matching Wikipedia structure
+            set xml_data "<doc><title>XML Title 1</title><abstract>XML Abstract 1</abstract><url>http://example1.com</url><links><sublink><anchor>test1</anchor><link>http://test1.com</link></sublink></links></doc>\n<doc><title>XML Title 2</title><abstract>XML Abstract 2</abstract><url>http://example2.com</url><links><sublink><anchor>test2</anchor><link>http://test2.com</link></sublink></links></doc>"


Could we add whitespace formatting to this so it's more readable?

Separate dataset changes into new file. Add unit test coverage. Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Fix build Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Fix cmake build issue for latest ubuntu-cmake Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Replace placeholders with in same field. Update documentation Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

rainsupreme

I had a couple questions about the makefile stuff, but mainly the copyright/license boilerplate headers need to be fixed.

The refactor to a separate file looks good to me though. :)

rainsupreme · 2026-01-20T20:29:42Z

@@ -23,7 +23,7 @@ add_library(valkeylib STATIC ${VALKEY_SERVER_SRCS})
 target_compile_options(valkeylib PRIVATE "${COMPILE_FLAGS}")
 target_compile_definitions(valkeylib PRIVATE "${COMPILE_DEFINITIONS}")

-add_executable(valkey-unit-tests ${UNIT_TEST_SRCS})
+add_executable(valkey-unit-tests ${UNIT_TEST_SRCS} ${CMAKE_SOURCE_DIR}/src/valkey-benchmark-dataset.c)


I'm not familiar with cmake, but his seems a bit odd and I'm confused. Shouldn't it be part of a list of files like UNIT_TEST_SRCS or something? And why added to valkey-unit-tests but not valkey-benchmark?

rainsupreme · 2026-01-20T20:31:32Z

@@ -0,0 +1,115 @@
+/* Unit tests for valkey-benchmark dataset module
+ *
+ * Copyright (c) 2024, Redis Ltd.


we're not redis

looks like copyright isn't fixed yet? It's an easy fix but licensing and copyright are important :p

rainsupreme · 2026-01-20T20:34:41Z


 # valkey-unit-tests
-$(ENGINE_UNIT_TESTS): $(ENGINE_TEST_OBJ) $(ENGINE_LIB_NAME)
+$(ENGINE_UNIT_TESTS): $(ENGINE_TEST_OBJ) $(ENGINE_LIB_NAME) valkey-benchmark-dataset.o


similar to the cmake file, I feel like this should be part of a obj list or something, right? Is there a good reason it's not?

For unit testing, we only use APis in valkey-benchmark-dataset and we don't need benchmark binary (or rest of the objects). This is the minimal addition to add unit testing for dataset.

rainsupreme · 2026-01-20T20:34:58Z

@@ -0,0 +1,86 @@
+/* Dataset support for valkey-benchmark
+ *
+ * Copyright (c) 2009-2012, Redis Ltd.


we are not redis

rainsupreme · 2026-01-20T20:35:09Z

@@ -0,0 +1,776 @@
+/* Dataset support for valkey-benchmark
+ *
+ * Copyright (c) 2009-2012, Redis Ltd.


we are not redis

rainsupreme · 2026-01-20T20:39:56Z

+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.


This license/copyright boilerplate is unnecessary - you can use a short header like the one in hashtable.c. This applies to all the other files as well.

/* * Copyright Valkey Contributors. * All rights reserved. * SPDX-License-Identifier: BSD 3-Clause */

rainsupreme · 2026-01-20T20:47:39Z

+}
+
+static sds getXmlFieldValue(const char *xml_doc, const char *field_name) {
+    char start_tag_prefix[128], end_tag[128];


So there's limit on field length. Is there input checking for this and a uni test? If this length comes up other places then it should be a #define constant.

Will increase this limt as well.

rainsupreme · 2026-01-20T20:49:36Z

+    return sdsnewlen(content_start, content_len);
+}
+
+static int csvDiscoverFields(dataset *ds) {


you can use bool return type if you #include <stdbool.h> - this helps to clarify what the return value means. (and then of course use true and false instead of 1 and 0). Best to update the whole file with bool for consistency.

Ack. Will update.

rainsupreme · 2026-01-20T20:52:20Z

+}
+
+static int scanXmlFields(const char *doc_start, const char *doc_end, dataset *ds, const char *start_root_tag, const char *end_root_tag) {
+    char field_names[MAX_DATASET_FIELDS][64];


now there's a 64 character size limit here? Feels like magic numbers. I'd prefer to see a central #define for these size limits.

I see other 64 character limits below too

I want to increase this limit just to be future proof for large field testing.

Address the comments on cmake file, field size and count limitations. Update return values to use bool. Increase limits to be future proof. Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

rainsupreme

The new changes look good to me. Just one or two quick fixes away now! :)

rainsupreme · 2026-01-29T18:12:17Z

@@ -0,0 +1,115 @@
+/* Unit tests for valkey-benchmark dataset module
+ *
+ * Copyright (c) 2024, Redis Ltd.


looks like copyright isn't fixed yet? It's an easy fix but licensing and copyright are important :p

rainsupreme · 2026-01-29T22:17:42Z

    zmalloc.o
 ENGINE_BENCHMARK_NAME=$(ENGINE_NAME)-benchmark$(PROG_SUFFIX)
+# Dataset module shared between benchmark and unit tests
+BENCHMARK_DATASET_OBJ=valkey-benchmark-dataset.o


I guess I had imagined adding it to ENGINE_BENCHMARK_OBJ, but I see now that it is used in UTs too. I'm not sure a list with a single item makes sense - perhaps we should just add it to both ENGINE_BENCHMARK_OBJ and ENGINE_UNIT_TESTS instead? 🤔

Like before?

Not quite what I meant - before in cmake we had this:
add_executable(valkey-unit-tests ${UNIT_TEST_SRCS} ${CMAKE_SOURCE_DIR}/src/valkey-benchmark-dataset.c)
My suggestion is to also add it to the UNIT_TEST_SRCS list.

For the makefile, this is the previous line:
$(ENGINE_UNIT_TESTS): $(ENGINE_TEST_OBJ) $(ENGINE_LIB_NAME) valkey-benchmark-dataset.o
Could we put it in ENGINE_TEST_OBJ?

Let me know if there's a reason that's impractical or if you disagree, but that would seem better organized to me. Also, feel free to get a second opinion 🙂

Took some time to understand the suggestion. Makes sense. I update the PR and along with cluster mode enhancement.

Address comments on license and make file Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Fix test for ubuntu-latest Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Copilot

Pull request overview

Adds dataset-driven command generation to valkey-benchmark so benchmarks can use real CSV/TSV/XML content via __field:<name>__ placeholders, enabling more realistic workloads (e.g., full-text search inputs) while keeping the existing synthetic placeholder path intact.

Changes:

Introduces --dataset/--maxdocs/--xml-root-element and a new dataset loader + command generator (valkeyFormatCommandArgv()-based) for variable-length field substitution.
Updates benchmark send path to generate pipelined commands from dataset rows (thread-safe record selection) and adds/extends integration + unit tests.
Wires the new dataset module into Make/CMake builds and adds user documentation for benchmark usage.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
tests/integration/valkey-benchmark.tcl	Adds integration coverage for CSV/TSV/XML datasets, maxdocs, mixed placeholders, and error cases (incl. cluster hash-tag scenario).
src/valkey-benchmark.c	Adds CLI options, template parsing, dataset initialization, dataset-mode command generation in the write path, and refactors cluster `{tag}` scanning into a helper.
src/valkey-benchmark-dataset.h	Declares dataset API and constants for field placeholders and dataset formats.
src/valkey-benchmark-dataset.c	Implements dataset parsing (CSV/TSV/XML), field mapping optimization, and RESP command generation with placeholder processing.
src/unit/test_files.h	Registers the new dataset unit test suite.
src/unit/test_dataset.c	Adds unit tests for field mapping optimization, quoted CSV fields, XML field limits, and max field count behavior.
src/unit/CMakeLists.txt	Ensures unit tests build includes `valkey-benchmark-dataset.c`.
src/Makefile	Adds dataset object to benchmark build; adjusts unit-test linkage/deps (incl. Lua).
docs/benchmark.md	Adds benchmark documentation including dataset usage and behavior.
cmake/Modules/SourceFiles.cmake	Adds dataset source to CMake benchmark sources list.

Copilot · 2026-02-17T23:31:47Z

+        if (!parseCommandTemplate(argc, argv)) {
+            exit(1);
+        }
+


When --dataset is provided and argv contains a command, the code currently proceeds even if no field:* placeholders are present (has_field_placeholders==0). That means the dataset can be fully loaded (and memory reported) but then silently ignored because dataset command generation only runs when has_field_placeholders is true. If dataset mode is intended to require field placeholders (per the error message/docs), add a validation after parseCommandTemplate() to exit with a clear error when no field placeholders were found.

Suggested change

/* Dataset mode requires at least one field placeholder in the command

* template. Without it, the dataset would be initialized and reported

* but never used for command generation. */

if (!config.has_field_placeholders) {

fprintf(stderr,

"Error: Dataset mode requires at least one field placeholder\n");

fprintf(stderr,

"Example: SET doc:__rand_int__ \"__field:content__\"\n");

exit(1);

}

Copilot · 2026-02-17T23:31:47Z

+            exit(1);
+        }
+
+        datasetReportMemory(config.current_dataset);


datasetReportMemory() is called unconditionally during startup, which prints to stdout. This can interfere with scripted consumption of valkey-benchmark output (notably --csv) by adding non-result lines before the CSV header/rows. Consider printing dataset info to stderr, or gating it behind !config.csv / !config.quiet.

Suggested change

datasetReportMemory(config.current_dataset);

if (!config.csv && !config.quiet) {

datasetReportMemory(config.current_dataset);

}

Copilot · 2026-02-17T23:31:47Z

+    char *cmd;
+    int len = valkeyFormatCommandArgv(&cmd, template_argc, (const char **)processed_argv, NULL);


datasetGenerateCommand() passes an uninitialized cmd pointer to valkeyFormatCommandArgv() and then unconditionally uses/free()s it. If valkeyFormatCommandArgv() returns -1 (e.g. OOM), cmd is undefined and this will crash or free garbage. Initialize cmd to NULL and handle the -1 return path before constructing the sds result.

Suggested change

char *cmd;

int len = valkeyFormatCommandArgv(&cmd, template_argc, (const char **)processed_argv, NULL);

char *cmd = NULL;

int len = valkeyFormatCommandArgv(&cmd, template_argc, (const char **)processed_argv, NULL);

if (len == -1) {

for (int i = 0; i < template_argc; i++) {

sdsfree(processed_argv[i]);

}

zfree(processed_argv);

if (cmd) {

free(cmd);

}

return NULL;

}

Copilot · 2026-02-17T23:31:47Z

+bool datasetLoad(dataset *ds, const char *xml_root_element) {
+    if (!ds) return false;
+    (void)xml_root_element; /* Use stored value in ds */
+
+    if (ds->format == DATASET_FORMAT_XML) {


datasetLoad() is a non-static function but it's not declared in the public header and does not appear to be used elsewhere. This unnecessarily exports a symbol and can lead to drift between declarations/definitions. Consider removing it if unused, or making it static / adding it to the header if it’s intended to be part of the dataset API.

Copilot · 2026-02-17T23:31:48Z

+    char buffer[1024];
+    sds current_doc = sdsempty();
+
+    while (fgets(buffer, sizeof(buffer), fp)) {
+        current_doc = sdscat(current_doc, buffer);
+
+        const char *tag_prefix_pos = strstr(current_doc, start_tag_prefix);
+        if (!tag_prefix_pos) continue;


scanXmlFieldsFromFile() appends every line into current_doc until the first ... is found. If the provided --xml-root-element is wrong (or the file is malformed), this accumulates the entire file into memory and can OOM before returning an error. Consider switching this to a bounded sliding window / streaming scan similar to loadXmlDataset(), or capping current_doc growth and failing once it exceeds a reasonable limit.

I don't want to limit the doc growth as I expect the code to work with larger datasets. Will leave the handling of malformed datasets and potential system constraints to users.

Neverthless, this comment helped me identify potential reuse of stream scan logic in scan too which I addressed.

Copilot · 2026-02-17T23:31:48Z

+        }
+    }
+    sds size_str = formatBytes(total_memory);
+    printf("Dataset: %zu documents (%s)\n", ds->record_count, size_str);


datasetReportMemory() prints dataset information to stdout. This can pollute benchmark output modes that are intended to be machine-readable (e.g. --csv) and makes it harder to redirect results cleanly. Consider printing to stderr and/or only emitting this when not in CSV/quiet mode.

Suggested change

printf("Dataset: %zu documents (%s)\n", ds->record_count, size_str);

fprintf(stderr, "Dataset: %zu documents (%s)\n", ds->record_count, size_str);

Copilot · 2026-02-17T23:31:48Z

+    printf("Loading XML dataset from %s...\n", ds->filename);
+
+    if (ds->field_names && ds->field_count > 0) {
+        fields_discovered = true;
+        printf("Using %d fields: ", ds->field_count);
+        for (int i = 0; i < ds->field_count; i++) {


loadXmlDataset() writes progress/status to stdout (e.g. "Loading XML dataset..." and the periodic "Loaded ..." updates). This can interleave with benchmark results output and break tools that parse stdout. Consider routing progress output to stderr and/or suppressing it when --csv or -q is enabled.

… of PR Reuse stream buffer logic in loading too for xml and address comments of PR Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Fix unit test Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

VoletiRam · 2026-02-20T18:10:40Z

Failure appears to be unrelated.


*** [TIMEOUT]: replica do not write the reply to the replication link - PSYNC (_addReplyToBufferOrList) in tests/integration/replication.tcl
Cleanup: may take some time... OK
make[1]: *** [Makefile:808: lcov] Error 1
make[1]: Leaving directory '/home/runner/work/valkey/valkey/src'
make: *** [Makefile:6: lcov] Error 2
Error: Process completed with exit code 2.

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Fix github wf issues with build and clang Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

rainsupreme

LGTM! I realize that this feature can be used to test GET/SET and other commands too, not just full text search. I look forward to the documentation being updated on the Valkey website 😉

madolson

Meta comment is do we really need all of the complex parsing? Can we just use XML and then do conversion in a higher level language like python?

madolson · 2026-02-17T22:12:25Z

@@ -0,0 +1,201 @@
+#include <stdio.h>


Do we need this unit test? This feels like a file that will primarily benefit from integration testing.

Unit tests work well here because they test parsing logic in isolation without server overhead. They're fast and great for edge cases like validating 513-character field name rejection or the 1000/1001 field boundary—scenarios that are hard to test reliably through integration tests.

We can leave them. I think I was originally concerned about all the custom parsing, less about the tests.

Some of the limits enforced before are only on XML parsing which lead to writing unit tests as it is not possible to test them with integration tests.

Non-XML related unit tests were field_mapping_optimization and field_count_correctness are covered implicitly by all existing dataset integration tests that verify field values are correctly inserted. Felt like good to have's so removed them.

csv_quoted_fields - this was the one genuine edge case not covered by integration before. It has been now added to tests/integration/valkey-benchmark.tcl as benchmark: dataset CSV with quoted field values containing commas, which verifies the value "Book, Part 1" is stored intact end-to-end.

The unit test file also introduced a non-trivial build dependency: valkey-benchmark-dataset.o had to be compiled separately and linked into the gtest binary, adding coupling between the benchmark build and the server unit test infrastructure. Removing it simplifies both src/unit/Makefile and src/unit/CMakeLists.txt

madolson · 2026-02-18T18:25:28Z

An issue is reasonable. I just don't want two copies of the same information. We can probably move most of this content as is over. A readme in the source tree is not a good place for end user docs.

Address comments on PR to remove memory report, fix makefile lua addition, and make explicit building of dataset in unit test instead Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Fix codecov Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

VoletiRam · 2026-03-25T03:55:28Z

the test failure in codecov is unrelated

Remove XML support and related code Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

VoletiRam · 2026-03-31T03:57:31Z

All the test failures seem unrelated

madolson · 2026-04-01T00:14:34Z

There are some test failures, so going to wait to merge until after they're resolved for the moment. @VoletiRam can you update the top level comment?

github-actions Bot assigned VoletiRam Nov 10, 2025

Fix memory leak in memory reporting

fa33f69

Fix memory leak in memory reporting Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

zuiderkwast linked an issue Nov 11, 2025 that may be closed by this pull request

[NEW] Add structured dataset support to valkey-benchmark #2765

Closed

JimB123 requested changes Nov 13, 2025

View reviewed changes

Ram Prasad Voleti added 3 commits November 20, 2025 00:10

Merge branch 'unstable' into dataset-support

933ee94

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Fix memory leak in xml scan

be05541

Fix memory leak in xml scan Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

JimB123 approved these changes Nov 21, 2025

View reviewed changes

Ram Prasad Voleti added 2 commits December 12, 2025 20:38

Load fields only we care about from dataset

1833c6e

Load fields only we care about from dataset. Parse the fields from benchmark command and load only those field values from dataset Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Remove warning log for incomplete document

b3a4516

Remove warning log for incomplete document Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

rainsupreme self-requested a review December 15, 2025 23:58

zuiderkwast added the needs-review Maintainer attention label Dec 16, 2025

rainsupreme reviewed Jan 1, 2026

View reviewed changes

Ram Prasad Voleti added 4 commits January 8, 2026 00:40

Separate dataset changes into new file

8053c4d

Separate dataset changes into new file. Add unit test coverage. Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Merge remote-tracking branch 'fork/unstable' into dataset-support

f98d01d

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Fix build

6d3c7fc

Fix build Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Fix cmake build issue for latest ubuntu-cmake

4562857

Fix cmake build issue for latest ubuntu-cmake Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

VoletiRam force-pushed the dataset-support branch from 3c9ff5f to 4562857 Compare January 8, 2026 02:29

rainsupreme self-requested a review January 12, 2026 21:05

Replace placeholders with in same field

d088b22

Replace placeholders with in same field. Update documentation Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

VoletiRam force-pushed the dataset-support branch from 153ba26 to d088b22 Compare January 13, 2026 02:03

rainsupreme suggested changes Jan 20, 2026

View reviewed changes

Ram Prasad Voleti added 2 commits January 28, 2026 23:31

Address the comments on cmake file, field size and count limitation

f98630f

Address the comments on cmake file, field size and count limitations. Update return values to use bool. Increase limits to be future proof. Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Merge remote-tracking branch 'origin/unstable' into dataset-support

6ff88c9

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

rainsupreme suggested changes Jan 29, 2026

View reviewed changes

Address comments on license and make file

bc3432c

Address comments on license and make file Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Fix test for ubuntu-latest

7a63dbd

Fix test for ubuntu-latest Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

madolson requested a review from Copilot February 17, 2026 23:17

Copilot started reviewing on behalf of madolson February 17, 2026 23:17 View session

Copilot AI reviewed Feb 17, 2026

View reviewed changes

Ram Prasad Voleti added 3 commits February 20, 2026 08:34

Reuse stream buffer logic in loading too for xml and address comments…

2272f70

… of PR Reuse stream buffer logic in loading too for xml and address comments of PR Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Merge remote-tracking branch 'origin/unstable' into dataset-support

fd8ddc9

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Fix unit test

f7ca875

Fix unit test Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

VoletiRam requested a review from madolson February 20, 2026 19:10

Ram Prasad Voleti added 2 commits March 19, 2026 22:42

Merge remote-tracking branch 'origin/unstable' into dataset-support

2ca7829

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Fix github wf issues with build and clang

2672e68

Fix github wf issues with build and clang Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

rainsupreme approved these changes Mar 20, 2026

View reviewed changes

madolson reviewed Mar 23, 2026

View reviewed changes

Ram Prasad Voleti added 2 commits March 24, 2026 23:15

Address comments on PR to remove memory report

e553552

Address comments on PR to remove memory report, fix makefile lua addition, and make explicit building of dataset in unit test instead Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Fix codecov

4d4a17f

Fix codecov Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Ram Prasad Voleti added 2 commits March 31, 2026 00:33

Remove XML support and related code

3542d7e

Remove XML support and related code Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

Merge remote-tracking branch 'origin/unstable' into dataset-support

d98142d

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>

madolson approved these changes Apr 1, 2026

View reviewed changes

madolson added the release-notes This issue should get a line item in the release notes label Apr 1, 2026

VoletiRam added 2 commits March 31, 2026 17:46

Merge branch 'unstable' into dataset-support

bd52dd2

Merge branch 'unstable' into dataset-support

e2a3421

Andriciuc mentioned this pull request Apr 29, 2026

Update Benchmark topic with latest valkey-benchmark.c changes valkey-io/valkey-doc#435

Closed

madolson merged commit 39036c7 into valkey-io:unstable Apr 29, 2026
58 of 59 checks passed

madolson added this to Valkey 10 Apr 29, 2026

github-project-automation Bot moved this to Done in Valkey 10 Apr 29, 2026

madolson moved this to Merged in Valkey 10 May 10, 2026

Andriciuc mentioned this pull request May 19, 2026

Update Benchmark topic with latest valkey-benchmark.c changes valkey-io/valkey-doc#441

Open

		" --dataset <file> Path to CSV/TSV/XML dataset file for field placeholder replacement.\n"
		" All fields auto-detected with natural content lengths.\n",

+        /* Dataset mode requires at least one field placeholder in the command
+         * template. Without it, the dataset would be initialized and reported
+         * but never used for command generation. */
+        if (!config.has_field_placeholders) {
+            fprintf(stderr,
+                    "Error: Dataset mode requires at least one field placeholder\n");
+            fprintf(stderr,
+                    "Example: SET doc:__rand_int__ \"__field:content__\"\n");
+            exit(1);
+        }

		char *cmd;
		int len = valkeyFormatCommandArgv(&cmd, template_argc, (const char **)processed_argv, NULL);

-    char *cmd;
-    int len = valkeyFormatCommandArgv(&cmd, template_argc, (const char **)processed_argv, NULL);
+    char *cmd = NULL;
+    int len = valkeyFormatCommandArgv(&cmd, template_argc, (const char **)processed_argv, NULL);
+    if (len == -1) {
+        for (int i = 0; i < template_argc; i++) {
+            sdsfree(processed_argv[i]);
+        }
+        zfree(processed_argv);
+        if (cmd) {
+            free(cmd);
+        }
+        return NULL;
+    }

	printf("Dataset: %zu documents (%s)\n", ds->record_count, size_str);
	fprintf(stderr, "Dataset: %zu documents (%s)\n", ds->record_count, size_str);

Uh oh!

Conversation

VoletiRam commented Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Background

Uh oh!

codecov Bot commented Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

JimB123 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JimB123 commented Nov 21, 2025

Uh oh!

JimB123 left a comment

Choose a reason for hiding this comment

Uh oh!

rainsupreme commented Dec 16, 2025

Uh oh!

rainsupreme left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rainsupreme left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

VoletiRam commented Nov 10, 2025 •

edited

Loading

codecov Bot commented Nov 10, 2025 •

edited

Loading