Fix map_query_sql benchmark duplicate key error#18427
Merged
Jefffrey merged 3 commits into apache:main on Nov 17, 2025
Conversation
The build_keys() function was generating 1000 random keys from the range 0..9999,
which could result in duplicate keys due to the birthday paradox. The map()
function requires unique keys, causing the benchmark to fail with:
Execution("map key must be unique, duplicate key found: {key}")
This fix ensures all generated keys are unique by:
- Using a HashSet to track seen keys
- Only adding keys to the result if they haven't been seen before
- Continuing to generate until exactly 1000 unique keys are produced
Fixes apache#18421
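The fix described above can be sketched as follows. This is a minimal, self-contained illustration assuming the shape of the original build_keys() in map_query_sql.rs; a tiny inline LCG stands in for the rand crate (the real benchmark calls rng.random_range(0..9999)), so the Lcg type and its constants are purely illustrative.

```rust
use std::collections::HashSet;

// Stand-in PRNG so the sketch compiles without the `rand` crate.
// The benchmark itself uses `rng.random_range(0..9999)`.
struct Lcg(u64);

impl Lcg {
    fn next_in(&mut self, bound: u64) -> u64 {
        // Standard 64-bit LCG step; statistical quality is irrelevant here.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 33) % bound
    }
}

fn build_keys(rng: &mut Lcg) -> Vec<String> {
    let mut keys = Vec::with_capacity(1000);
    let mut seen = HashSet::with_capacity(1000);
    // Keep sampling until exactly 1000 distinct keys are produced.
    while keys.len() < 1000 {
        let key = rng.next_in(9999).to_string();
        // HashSet::insert returns true only for values not already present.
        if seen.insert(key.clone()) {
            keys.push(key);
        }
    }
    keys
}

fn main() {
    let keys = build_keys(&mut Lcg(42));
    let distinct: HashSet<_> = keys.iter().collect();
    assert_eq!(keys.len(), 1000);
    assert_eq!(distinct.len(), 1000);
}
```

The loop terminates because the range holds 9999 possible values, comfortably more than the 1000 unique keys requested.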
Contributor
LGTM.
Jefffrey
reviewed
Nov 5, 2025
Contributor
Jefffrey
left a comment
Looks like you need to resolve some conflicts.
```rust
let mut keys = vec![];
for _ in 0..1000 {
    keys.push(rng.random_range(0..9999).to_string());
let mut seen = HashSet::with_capacity(1000);
```
Contributor
We could also make keys a HashSet and just keep inserting into it until it reaches 1000, instead of having both keys and seen.
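The reviewer's suggestion could look something like the sketch below. The key source is abstracted as a closure here (the benchmark itself draws from rng.random_range(0..9999)), so the next_key parameter is an assumption for illustration only.

```rust
use std::collections::HashSet;

// Sketch of the suggestion: a single HashSet replaces the separate
// `keys` vector and `seen` set. `next_key` stands in for the RNG.
fn build_keys(mut next_key: impl FnMut() -> u64) -> Vec<String> {
    let mut keys = HashSet::with_capacity(1000);
    while keys.len() < 1000 {
        // Inserting a duplicate is a no-op, so the set only grows on new keys.
        keys.insert(next_key().to_string());
    }
    keys.into_iter().collect()
}

fn main() {
    // A simple counter modulo 1500 produces plenty of duplicates.
    let mut i: u64 = 0;
    let keys = build_keys(|| {
        i += 1;
        i % 1500
    });
    assert_eq!(keys.len(), 1000);
}
```

One trade-off: iterating a HashSet yields keys in arbitrary order, whereas the two-collection approach preserves insertion order, which may matter if the benchmark depends on it.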
Contributor
Took the liberty of pushing some commits to get this PR over the line.
Jefffrey
approved these changes
Nov 17, 2025
Contributor
Thanks @atheendre130505 for initiating this.
logan-keede pushed a commit to logan-keede/datafusion that referenced this pull request on Nov 23, 2025
Fix map_query_sql benchmark duplicate key error
Description
The build_keys() function was generating 1000 random keys from range
0..9999, which could result in duplicate keys due to the birthday
paradox. The map() function requires unique keys, causing the benchmark
to fail with: Execution("map key must be unique, duplicate key found:
{key}")
This fix ensures all generated keys are unique by:
- Using a HashSet to track seen keys
- Only adding keys to the result if they haven't been seen before
- Continuing to generate until exactly 1000 unique keys are produced
Fixes apache#18421
Which issue does this PR close?
Closes apache#18421
Rationale for this change
The benchmark was non-deterministic: it could pass or fail depending on
random key generation. With 1000 keys drawn from a range of 9999 values,
a collision is all but certain (the probability that all 1000 draws are
distinct is on the order of 1e-23), making the benchmark unreliable.
This change ensures uniqueness so the benchmark consistently succeeds
and accurately measures map function performance.
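As a back-of-envelope check on the rationale above, the probability that 1000 uniform draws from 9999 values are all distinct can be computed directly. This standalone snippet is not part of the benchmark; it only illustrates the birthday-paradox arithmetic.

```rust
fn main() {
    // P(all distinct) = product over i = 0..999 of (1 - i/9999),
    // the classic birthday-paradox survival probability.
    let p_all_distinct: f64 = (0..1000u32)
        .map(|i| 1.0 - f64::from(i) / 9999.0)
        .product();
    println!("P(no duplicate) = {:e}", p_all_distinct);
    // The product is on the order of 1e-23, so the old benchmark hit a
    // duplicate key in essentially every run.
    assert!(p_all_distinct > 0.0 && p_all_distinct < 1e-20);
}
```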
What changes are included in this PR?
- Added use std::collections::HashSet; import
- Modified build_keys() to:
  - Track generated keys using a HashSet
  - Only add keys if they are unique
  - Continue generating until exactly 1000 unique keys are produced
File changed: datafusion/core/benches/map_query_sql.rs
Code changes:
- Added HashSet import at the top of the file
- Replaced the simple loop with uniqueness-checking logic in the build_keys() function
Are these changes tested?
The fix was verified by:
- Logic review: the HashSet approach guarantees uniqueness
- Code review: changes follow Rust best practices
- No linter errors
The benchmark itself serves as the test: running cargo bench -p
datafusion --bench map_query_sql should now complete without errors.
Before this fix, the benchmark failed with duplicate key errors on
virtually every run.
Are there any user-facing changes?
No user-facing changes. This is an internal benchmark fix that ensures
the map_query_sql benchmark runs reliably. It does not affect the public
API or any runtime behavior of DataFusion.
---------
Co-authored-by: Jefffrey <jeffrey.vo.australia@gmail.com>