perf: SIMD string escaping and batch integer formatting optimizations#2605
Merged
Conversation
Profiling revealed that string serialization was performing redundant scanning: first calling fast_needs_escaping() (a SIMD scan to check IF escaping is needed), then find_next_json_quotable_character() (a scalar byte-by-byte scan to find WHERE).

Profile data from the Twitter benchmark showed:
- 54.76% of time in atom<std::string> (string serialization)
- 24.85% of time in find_next_json_quotable_character (scalar position finding)

This optimization unifies both operations into a single SIMD pass that directly locates the first quotable character position:
- NEON (ARM64): uses vceqq_u8/vcltq_u8 for character detection, then extracts the position via __builtin_ctzll on 64-bit vector lanes
- SSE2 (x86-64): uses _mm_cmpeq_epi8/_mm_subs_epu8 for detection, then _mm_movemask_epi8 + __builtin_ctz for position extraction

The write_string_escaped function now uses the position finder directly, eliminating the separate fast_needs_escaping check.

Benchmark results (ARM64, Apple Silicon via Docker with p2996 clang):
- Twitter (string-heavy): 4330 -> 5723 MB/s (+32%)
- CITM (numeric-heavy): ~neutral (expected, few strings)
Profiling with perf annotate revealed that the integer-to-string
conversion loop was a significant hotspot in numeric-heavy workloads.
The original implementation processed 2 digits per iteration:

```cpp
while (pv >= 100) {
  memcpy(write_pointer - 1, &decimal_table[(pv % 100) * 2], 2);
  write_pointer -= 2;
  pv /= 100;
}
```
Profile data from CITM benchmark showed:
- 31.20% time in atom<unsigned long> (integer formatting)
- 39.07% of integer formatting time in the 2-byte store instruction
(sturh on ARM64)
- CITM integers average 8.8 digits, meaning 4+ store operations per number
This optimization processes 4 digits per iteration, reducing both store
operations and division count by approximately half for large numbers:

```cpp
while (pv >= 10000) {
  q = pv / 10000;
  r = pv % 10000;
  r_hi = r / 100;  // High 2 digits
  r_lo = r % 100;  // Low 2 digits
  memcpy(write_pointer - 1, &decimal_table[r_lo * 2], 2);
  memcpy(write_pointer - 3, &decimal_table[r_hi * 2], 2);
  write_pointer -= 4;
  pv = q;
}
```
The division by 10000 compiles to an efficient multiply-high instruction
(umulh on ARM64). Applied to both unsigned and signed integer paths.
Benchmark results (ARM64, Apple Silicon via Docker with p2996 clang):
- Twitter: ~neutral (few integers)
- CITM (numeric-heavy): 2912 -> 3086 MB/s (+6%)
MSVC does not have __builtin_ctz/__builtin_ctzll. Use _BitScanForward and _BitScanForward64 from <intrin.h> on MSVC instead. This fixes the build on all Windows configurations (x64, ARM64, Win32).
Remove references to specific benchmarks and previous implementations from code comments. Comments now describe what the code does rather than historical context.
Member
Verified. Merging.
I realize this is closed but I have a question related to There is very similar code in ruby/json. However, we use something similar to (minus where we apply the mask): It seems like this would eliminate an additional Admittedly I tried both versions of the code in Is there a reason the
Summary
Performance optimizations for the JSON serialization path in the static reflection builder API, achieving significant improvements on string-heavy workloads.
All measurements on ARM64 (Apple Silicon via Docker with p2996 clang).
Profiling Methodology
Tools Used
perf annotate on ARM64 (see the commit messages above).
Key Findings
Twitter benchmark (before optimization):
- 54.76% of time in atom<std::string> (string serialization)
- 24.85% of time in find_next_json_quotable_character (scalar position finding)
CITM benchmark (before optimization):
- 31.20% of time in atom<unsigned long> (integer formatting)
- 39.07% of integer formatting time in the 2-byte store instruction (sturh on ARM64)
Optimization 1: SIMD-Accelerated Position Finding
Problem
The serialization path performed redundant scanning:
- fast_needs_escaping() - SIMD scan to check IF any quotable character exists
- find_next_json_quotable_character() - scalar byte-by-byte scan to find WHERE
Solution
Unified both operations into a single SIMD pass that directly locates the first quotable character. Modified write_string_escaped to use position finding directly.
Architecture Support
- NEON (ARM64): vceqq_u8/vcltq_u8 for detection, then __builtin_ctzll on 64-bit vector lanes
- SSE2 (x86-64): _mm_cmpeq_epi8/_mm_subs_epu8 for detection, then _mm_movemask_epi8 + __builtin_ctz
Optimization 2: Batch Integer Formatting
Problem
The original integer-to-string conversion processed 2 digits per iteration. Profiling showed 39% of CITM serialization time spent on the 2-byte store instruction (sturh on ARM64). With CITM integers averaging 8.8 digits, this meant 4+ store operations per number.
Solution
Process 4 digits per iteration, reducing store count and division operations by half. Division by 10000 compiles to an efficient multiply-high instruction (umulh on ARM64).
Test Plan