Skip to content

Add Neon implementation of std::swap_ranges#5819

Merged
StephanTLavavej merged 7 commits intomicrosoft:mainfrom
hazzlim:swap-ranges-neon
Nov 5, 2025
Merged

Add Neon implementation of std::swap_ranges#5819
StephanTLavavej merged 7 commits intomicrosoft:mainfrom
hazzlim:swap-ranges-neon

Conversation

@hazzlim
Copy link
Contributor

@hazzlim hazzlim commented Oct 31, 2025

Add an implementation of std::swap_ranges using Neon intrinsics.

@hazzlim hazzlim requested a review from a team as a code owner October 31, 2025 11:02
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Oct 31, 2025
@hazzlim
Copy link
Contributor Author

hazzlim commented Oct 31, 2025

@microsoft-github-policy-service agree company="Arm"

@hazzlim
Copy link
Contributor Author

hazzlim commented Oct 31, 2025

Hopefully I’ve done something reasonable here - please let me know if you would prefer a different approach when adding new implementations to vector_algorithms.cpp

The performance numbers are below - looks good apart from the case of size(1), which I think is due to the added overhead of the function call and the conditional checks now std::swap_ranges is not being inlined into the benchmark.

  MSVC Speedup
std_swap_ranges<uint8_t, highly_aligned_allocator>/1 0.5x
std_swap_ranges<uint8_t, highly_aligned_allocator>/5 0.9x
std_swap_ranges<uint8_t, highly_aligned_allocator>/15 1.9x
std_swap_ranges<uint8_t, highly_aligned_allocator>/26 4.1x
std_swap_ranges<uint8_t, highly_aligned_allocator>/38 4.8x
std_swap_ranges<uint8_t, highly_aligned_allocator>/60 7.3x
std_swap_ranges<uint8_t, highly_aligned_allocator>/125 12.3x
std_swap_ranges<uint8_t, highly_aligned_allocator>/800 21.9x
std_swap_ranges<uint8_t, highly_aligned_allocator>/3000 22.4x
std_swap_ranges<uint8_t, highly_aligned_allocator>/9000 22.5x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/1 0.5x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/5 1x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/15 1.9x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/26 3.9x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/38 4.9x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/60 7.5x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/125 11.8x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/800 14.6x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/3000 15.1x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/9000 14.7x
   
   
  clang-cl Speedup
std_swap_ranges<uint8_t, highly_aligned_allocator>/1 0.6x
std_swap_ranges<uint8_t, highly_aligned_allocator>/5 1.1x
std_swap_ranges<uint8_t, highly_aligned_allocator>/15 1.5x
std_swap_ranges<uint8_t, highly_aligned_allocator>/26 1.4x
std_swap_ranges<uint8_t, highly_aligned_allocator>/38 1.4x
std_swap_ranges<uint8_t, highly_aligned_allocator>/60 1.5x
std_swap_ranges<uint8_t, highly_aligned_allocator>/125 1.5x
std_swap_ranges<uint8_t, highly_aligned_allocator>/800 1x
std_swap_ranges<uint8_t, highly_aligned_allocator>/3000 1x
std_swap_ranges<uint8_t, highly_aligned_allocator>/9000 1x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/1 0.6x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/5 1.2x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/15 1.5x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/26 1x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/38 1.2x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/60 1.4x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/125 1.4x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/800 1x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/3000 1x
std_swap_ranges<uint8_t, not_highly_aligned_allocator>/9000 1.1x

Add an implementation of std::swap_ranges using Neon intrinsics.
@StephanTLavavej StephanTLavavej added performance Must go faster ARM64 Related to the ARM64 architecture labels Oct 31, 2025
AlexGuteniev

This comment was marked as resolved.

@hazzlim

This comment was marked as resolved.

@StephanTLavavej StephanTLavavej self-assigned this Oct 31, 2025
@StephanTLavavej
Copy link
Member

Thanks @hazzlim, this is awesome! 😻 I pushed some commits, can you rerun your benchmark measurements? We finally have ARM64 runtime testing in PR checks, but I won't be able to gather perf measurements myself until Feb 2026-ish. The changes to override /Os for ARM64, and to eliminate unnecessary loops, should improve performance but perhaps by an unobservable amount.

@StephanTLavavej StephanTLavavej removed their assignment Nov 4, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Nov 4, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Nov 4, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@hazzlim
Copy link
Contributor Author

hazzlim commented Nov 5, 2025

Thanks @hazzlim, this is awesome! 😻 I pushed some commits, can you rerun your benchmark measurements? We finally have ARM64 runtime testing in PR checks, but I won't be able to gather perf measurements myself until Feb 2026-ish. The changes to override /Os for ARM64, and to eliminate unnecessary loops, should improve performance but perhaps by an unobservable amount.

Nice, thanks for doing this! Not sure how I missed that we could remove the loops for len < 64, nice one 😺

I will re-run the perf and report, it may well be unobservable but I think it should also improve the LDP generation so that's a win :)

@StephanTLavavej StephanTLavavej merged commit 8784104 into microsoft:main Nov 5, 2025
41 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Nov 5, 2025
@StephanTLavavej
Copy link
Member

Thanks for working towards #813 and congratulations on your first microsoft/STL commit! 😻 🎉 🦾

This will ship in the MSVC Build Tools 14.51 in a future update to Visual Studio 2026.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ARM64 Related to the ARM64 architecture performance Must go faster

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants