<algorithm>: Optimize sample() and shuffle() with Lemire's algorithm#5735
Merged
StephanTLavavej merged 12 commits intomicrosoft:mainfrom Sep 25, 2025
Merged
Conversation
Contributor
|
12th Gen Intel(R) Core(TM) i5-1235U (1.30 GHz) P-cores
E-cores
|
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
Benchmark | Before | After | Speedup ------------------------------------------------------|------------|------------|-------- `bm_sample<uint8_t, alg_type::std_fn>/1048576/32768` | 4395155 ns | 3719749 ns | 1.18 `bm_sample<uint16_t, alg_type::std_fn>/1048576/32768` | 4422082 ns | 3740169 ns | 1.18 `bm_sample<uint32_t, alg_type::std_fn>/1048576/32768` | 4429234 ns | 3545233 ns | 1.25 `bm_sample<uint64_t, alg_type::std_fn>/1048576/32768` | 4421113 ns | 3564755 ns | 1.24 `bm_sample<uint8_t, alg_type::rng>/1048576/32768` | 4407065 ns | 3926217 ns | 1.12 `bm_sample<uint16_t, alg_type::rng>/1048576/32768` | 4408346 ns | 3531568 ns | 1.25 `bm_sample<uint32_t, alg_type::rng>/1048576/32768` | 4409194 ns | 3527145 ns | 1.25 `bm_sample<uint64_t, alg_type::rng>/1048576/32768` | 4454799 ns | 3556353 ns | 1.25 ------------------------------------------------------|------------|------------|----- `bm_shuffle<uint8_t, alg_type::std_fn>/1048576` | 5750192 ns | 3733152 ns | 1.54 `bm_shuffle<uint16_t, alg_type::std_fn>/1048576` | 5907846 ns | 4186912 ns | 1.41 `bm_shuffle<uint32_t, alg_type::std_fn>/1048576` | 6043149 ns | 4383343 ns | 1.38 `bm_shuffle<uint64_t, alg_type::std_fn>/1048576` | 6117446 ns | 4582720 ns | 1.33 `bm_shuffle<uint8_t, alg_type::rng>/1048576` | 5701609 ns | 3719451 ns | 1.53 `bm_shuffle<uint16_t, alg_type::rng>/1048576` | 5735311 ns | 4229480 ns | 1.36 `bm_shuffle<uint32_t, alg_type::rng>/1048576` | 5771077 ns | 4391066 ns | 1.31 `bm_shuffle<uint64_t, alg_type::rng>/1048576` | 5794527 ns | 4587342 ns | 1.26
…local constant.
…)`, `shuffle()`, `ranges::sample`, `ranges::shuffle`.
This was comparing `_Rem < _Index`, where `_Rem` is `_Udiff & _Udiff` (almost always `_Udiff`) while we have potentially signed `_Diff _Index`.
… __int64', signed/unsigned mismatch `_Urng::result_type` must be unsigned, but `result_type - result_type` performs the usual arithmetic conversions, promoting tiny types to int. Therefore, we need to `static_cast<_Udiff>`.
…quires a narrowing conversion Except for iterators with exotic difference types: For x64 algorithms, `_Diff` is naturally 64-bit, so `is_same_v<_Udiff, uint64_t>` is naturally taken. But for x86 algorithms, `_Diff` is naturally 32-bit. If the URBG is 32-bit (or smaller), then `_Udiff` will be `uint32_t`, and the `else` branch will be taken. In that case, the error correctly complains that we're using braces to convert from signed `_Diff _Index` to unsigned `_Uprod`, which is narrowing. We should directly `static_cast` instead.
Again, `(_Urng::max) () - (_Urng::min) ()` is `result_type - result_type`, subject to the usual arithmetic conversions. When compared against `_Udiff _Bmask_local`, this can emit signed/unsigned mismatch warnings. (This repro is x86-specific because MSVC emits warning C4018 at level 3 for `int < unsigned int`. `int < unsigned long long` emits an off-by-default warning C4388 "signed/unsigned mismatch". So even though the warning behavior is architecture-neutral, the varying size of `_Udiff` causes the warning to be x86-specific.) I'm avoiding extracting the `max - min` as a constant to avoid VSO-2580691 "`/analyze` emits bogus warning C6295 (Loop executes infinitely) for a loop that immediately finishes".
519a98d to
01090a7
Compare
This comment was marked as resolved.
This comment was marked as resolved.
Member
Author
|
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
davidmrdavid
approved these changes
Sep 25, 2025
Member
davidmrdavid
left a comment
There was a problem hiding this comment.
LGTM 🦕
Left 2 small comments/questions, non-blocking.
Comment on lines
+57
to
+65
| using result_type = uint16_t; // N5014 [rand.req.urng]/3 | ||
| static constexpr result_type min() { | ||
| return 3; | ||
| } | ||
| static constexpr bool max() { | ||
| return true; | ||
| static constexpr result_type max() { | ||
| return 1729; | ||
| } | ||
| bool operator()() & { | ||
| return false; | ||
| result_type operator()() & { | ||
| return 4; |
Member
There was a problem hiding this comment.
just to be triple sure - these numbers are completely arbitrary, right? I'd be a fan of calling that out, but I won't insist.
Member
Author
There was a problem hiding this comment.
Yes, it's a compile-time only test, and the actual values involved are not relevant.
Member
Author
There was a problem hiding this comment.
I've made a note to add comments to this generator in a followup PR.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
🗺️ Overview
Exactly 3 years ago, @MattStephanson's #3012 implemented @lemire's algorithm "Fast Random Integer Generation in an Interval" for TR1
uniform_intand Standarduniform_int_distribution, shipped in VS 2022 17.5.This PR extends the use of the algorithm to
sample(),shuffle(),ranges::sample, andranges::shuffle. It's a behavioral change (the results of sampling and shuffling will be different), but the significant speedup is worth it.We have to fix some warnings in
_Rng_from_urng_v2because it was never previously exposed to signed difference types.My #5712 very recently split TR1
uniform_intfrom Standarduniform_int_distribution. The former still switches between my old_Rng_from_urngand the new Lemire-powered_Rng_from_urng_v2depending on whether it senses the presence ofstatic constexprmin()andmax()from the engine, as Lemire's algorithm greatly benefits from having them as compile-time constants. After #5712, the Standard distribution now assumes it's being used with Standard engines (as it would be highly inconsistent for a program to be mixing the Standard distribution with a TR1 engine; both spellings should be updated simultaneously).As
shuffle()was C++11 andsample()was C++17, it is slightly more conceivable for those algorithms to be used with old TR1 engines. For the moment, to preserve source compatibility, I am using the same "switch between v1 and v2" logic that TR1uniform_intuses (and that Standarduniform_int_distributionindirectly used between #3012 and #5712).In the near future, I plan to purge TR1 entirely, making use of v2 unconditional, but I want to land this improvement separately.
⚙️ Commits
sample()andshuffle()._Rng_from_urng_v2up to<algorithm>, no other changes._Has_static_min_maxup to<algorithm>, no other changes._Rng_from_urng_v1_or_v2._Rng_from_urng_v2:static constexpr=>constexprfor a local constant.min()andmax()must beconstexpr.result_type, must be unsigned integer._Rng_from_urng_v1_or_v2to optimizesample(),shuffle(),ranges::sample,ranges::shuffle._Rem < _Index, where_Remis_Udiff & _Udiff(almost always_Udiff) while we have potentially signed_Diff _Index.'int'to'unsigned __int64', signed/unsigned mismatch_Urng::result_typemust be unsigned, butresult_type - result_typeperforms the usual arithmetic conversions, promoting tiny types toint. Therefore, we need tostatic_cast<_Udiff>.'const int'to'unsigned __int64'requires a narrowing conversion_Diffis naturally 64-bit, sois_same_v<_Udiff, uint64_t>is naturally taken._Diffis naturally 32-bit. If the URBG is 32-bit (or smaller), then_Udiffwill beuint32_t, and theelsebranch will be taken._Diff _Indexto unsigned_Uprod, which is narrowing. We should directlystatic_castinstead.(_Urng::max) () - (_Urng::min) ()isresult_type - result_type, subject to the usual arithmetic conversions. When compared against_Udiff _Bmask_local, this can emit signed/unsigned mismatch warnings.int < unsigned int.int < unsigned long longemits an off-by-default warning C4388 "signed/unsigned mismatch". So even though the warning behavior is architecture-neutral, the varying size of_Udiffcauses the warning to be x86-specific.)max - minas a constant to avoid VSO-2580691 "/analyzeemits bogus warning C6295 (Loop executes infinitely) for a loop that immediately finishes".⏱️ Benchmark results
On my 5950X:
bm_sample<uint8_t, alg_type::std_fn>/1048576/32768bm_sample<uint16_t, alg_type::std_fn>/1048576/32768bm_sample<uint32_t, alg_type::std_fn>/1048576/32768bm_sample<uint64_t, alg_type::std_fn>/1048576/32768bm_sample<uint8_t, alg_type::rng>/1048576/32768bm_sample<uint16_t, alg_type::rng>/1048576/32768bm_sample<uint32_t, alg_type::rng>/1048576/32768bm_sample<uint64_t, alg_type::rng>/1048576/32768bm_shuffle<uint8_t, alg_type::std_fn>/1048576bm_shuffle<uint16_t, alg_type::std_fn>/1048576bm_shuffle<uint32_t, alg_type::std_fn>/1048576bm_shuffle<uint64_t, alg_type::std_fn>/1048576bm_shuffle<uint8_t, alg_type::rng>/1048576bm_shuffle<uint16_t, alg_type::rng>/1048576bm_shuffle<uint32_t, alg_type::rng>/1048576bm_shuffle<uint64_t, alg_type::rng>/1048576