
perf: SIMD string escaping and batch integer formatting optimizations#2605

Merged
lemire merged 4 commits into master from francisco/micro-optimizations on Feb 16, 2026

Conversation

@FranciscoThiesen
Member

Summary

Performance optimizations for the JSON serialization path in the static reflection builder API, achieving significant improvements on string-heavy workloads.

| Benchmark | Baseline | Optimized | Improvement |
| --- | --- | --- | --- |
| Twitter (string-heavy) | 4,330 MB/s | 5,647 MB/s | +30.4% |
| CITM Catalog (numeric-heavy) | 2,912 MB/s | 3,086 MB/s | +6.0% |

All measurements on ARM64 (Apple Silicon via Docker with p2996 clang).


Profiling Methodology

Tools Used

  1. perf record with DWARF call graphs for sampling
  2. perf annotate for instruction-level cycle attribution
  3. perf report for function-level analysis

Key Findings

Twitter benchmark (before optimization):

54.76%  atom<std::string>                      # String serialization dominant
24.85%  find_next_json_quotable_character      # Scalar position finding
10.90%  atom<unsigned long>                    # Integer formatting

CITM benchmark (before optimization):

31.20%  atom<unsigned long>         # Integer formatting dominant  
26.55%  atom<CITMArea>              # Struct with large int arrays

Optimization 1: SIMD-Accelerated Position Finding

Problem

The serialization path performed redundant scanning:

  1. fast_needs_escaping() - SIMD scan to check IF any quotable character exists
  2. find_next_json_quotable_character() - Scalar byte-by-byte scan to find WHERE

Solution

Unified both operations into a single SIMD pass that directly locates the first quotable character:

// NEON implementation - finds exact position in one pass
// (v34, v92, v32 are vdupq_n_u8 splats of '"', '\\', and 0x20)
while (remaining >= 16) {
  uint8x16_t word = vld1q_u8(ptr);
  uint8x16_t needs_escape = vceqq_u8(word, v34);  // '"'
  needs_escape = vorrq_u8(needs_escape, vceqq_u8(word, v92));  // '\\'
  needs_escape = vorrq_u8(needs_escape, vcltq_u8(word, v32));  // control chars

  if (vmaxvq_u32(vreinterpretq_u32_u8(needs_escape)) != 0) {
    // Extract position using trailing zero count on 64-bit lanes
    uint64x2_t as64 = vreinterpretq_u64_u8(needs_escape);
    uint64_t lo = vgetq_lane_u64(as64, 0);
    if (lo != 0) {
      return offset + __builtin_ctzll(lo) / 8;
    } else {
      return offset + 8 + __builtin_ctzll(vgetq_lane_u64(as64, 1)) / 8;
    }
  }
  ptr += 16;
  offset += 16;
  remaining -= 16;
}
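The lane-extraction arithmetic can be sanity-checked without NEON hardware by modelling the 16-byte match vector as two little-endian 64-bit words, with 0xFF in each lane that needs escaping (a portable sketch of the arithmetic, not the shipped code):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Scalar model of the extraction above: `match` holds 0xFF for each byte
// of the 16-byte chunk that needs escaping, 0x00 otherwise. Returns the
// index of the first matching byte, or 16 if none match. Assumes a
// little-endian target, which matches the ARM64/x86-64 builds here.
size_t first_match_index(const uint8_t match[16]) {
  uint64_t lo, hi;
  std::memcpy(&lo, match, 8);      // lanes 0..7
  std::memcpy(&hi, match + 8, 8);  // lanes 8..15
  if (lo != 0) return __builtin_ctzll(lo) / 8;  // 8 bits per lane
  if (hi != 0) return 8 + __builtin_ctzll(hi) / 8;
  return 16;
}
```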

Modified write_string_escaped to use position finding directly:

// Before: Two scans
if (!fast_needs_escaping(input)) { ... }
size_t location = find_next_json_quotable_character(input, 0);

// After: Single scan
size_t location = find_next_json_quotable_character(input, 0);
if (location == input.size()) {
  memcpy(out, input.data(), input.size());  // Fast path
  return input.size();
}

Architecture Support

  • NEON (ARM64): Full SIMD implementation
  • SSE2 (x86-64): Full SIMD implementation using _mm_movemask_epi8 + __builtin_ctz
  • Fallback: Scalar loop for other platforms
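The scalar fallback's contract can be sketched as follows (an illustrative reconstruction of the semantics described above, not the exact shipped code): return the index of the first character JSON requires escaping, or the string length when the input can be copied verbatim.

```cpp
#include <cstddef>
#include <string_view>

// Scalar fallback sketch: first index at or after `start` holding a
// character that JSON must escape ('"', '\\', or a control byte < 0x20);
// returns input.size() if there is none, enabling the memcpy fast path.
size_t find_next_json_quotable_character(std::string_view input, size_t start) {
  for (size_t i = start; i < input.size(); i++) {
    unsigned char c = static_cast<unsigned char>(input[i]);
    if (c == '"' || c == '\\' || c < 0x20) return i;
  }
  return input.size();
}
```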

Optimization 2: Batch Integer Formatting

Problem

The original integer-to-string conversion processed 2 digits per iteration. Profiling showed 39% of the integer-formatting time spent in the 2-byte store instruction (sturh on ARM64). With CITM integers averaging 8.8 digits, this meant 4+ store operations per number.

Solution

Process 4 digits per iteration, reducing store count and division operations by half:

// Before: 2 digits per iteration
while (pv >= 100) {
  memcpy(write_pointer - 1, &decimal_table[(pv % 100) * 2], 2);
  write_pointer -= 2;
  pv /= 100;
}

// After: 4 digits per iteration for large numbers
while (pv >= 10000) {
  unsigned_type q = pv / 10000;
  unsigned_type r = pv % 10000;
  unsigned_type r_hi = r / 100;  // High 2 digits
  unsigned_type r_lo = r % 100;  // Low 2 digits
  memcpy(write_pointer - 1, &decimal_table[r_lo * 2], 2);
  memcpy(write_pointer - 3, &decimal_table[r_hi * 2], 2);
  write_pointer -= 4;
  pv = q;
}
// Falls through to 2-digit loop for remaining 1-4 digits

Division by 10000 compiles to efficient multiply-high instruction (umulh on ARM64).
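Putting the two loops together, a self-contained sketch of the scheme looks like this (the function name is illustrative; the `decimal_table` layout, "00" through "99" packed contiguously, matches the snippets above):

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// "00".."99" packed as 200 contiguous characters.
static const char decimal_table[201] =
    "00010203040506070809101112131415161718192021222324"
    "25262728293031323334353637383940414243444546474849"
    "50515253545556575859606162636465666768697071727374"
    "75767778798081828384858687888990919293949596979899";

// Illustrative batch formatter: peel 4 digits per iteration, then fall
// through to the 2-digit loop for the remaining 1-4 digits.
std::string format_u64(uint64_t pv) {
  char buf[21];                          // uint64_t needs at most 20 digits
  char* write_pointer = buf + 20;        // points at the last digit slot
  while (pv >= 10000) {
    uint64_t q = pv / 10000, r = pv % 10000;
    uint64_t r_hi = r / 100, r_lo = r % 100;
    std::memcpy(write_pointer - 1, &decimal_table[r_lo * 2], 2);
    std::memcpy(write_pointer - 3, &decimal_table[r_hi * 2], 2);
    write_pointer -= 4;
    pv = q;
  }
  while (pv >= 100) {                    // 2 digits per iteration
    std::memcpy(write_pointer - 1, &decimal_table[(pv % 100) * 2], 2);
    write_pointer -= 2;
    pv /= 100;
  }
  if (pv >= 10) {
    std::memcpy(write_pointer - 1, &decimal_table[pv * 2], 2);
    write_pointer -= 2;
  } else {
    *write_pointer = static_cast<char>('0' + pv);
    write_pointer -= 1;
  }
  return std::string(write_pointer + 1, buf + 21);
}
```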


Test Plan

  • All existing unit tests pass
  • Benchmark output matches byte-for-byte (JSON correctness)
  • Tested on both ARM64 and x86-64 architectures

Profiling revealed that string serialization was performing redundant
scanning: first calling fast_needs_escaping() (SIMD scan to check IF
escape needed), then find_next_json_quotable_character() (scalar
byte-by-byte scan to find WHERE).

Profile data from Twitter benchmark showed:
- 54.76% time in atom<std::string> (string serialization)
- 24.85% time in find_next_json_quotable_character (scalar position finding)

This optimization unifies both operations into a single SIMD pass that
directly locates the first quotable character position:

- NEON (ARM64): Uses vceqq_u8/vcltq_u8 for character detection, then
  extracts position via __builtin_ctzll on 64-bit vector lanes
- SSE2 (x86-64): Uses _mm_cmpeq_epi8/_mm_subs_epu8 for detection, then
  _mm_movemask_epi8 + __builtin_ctz for position extraction

The write_string_escaped function now uses the position finder directly,
eliminating the separate fast_needs_escaping check.

Benchmark results (ARM64, Apple Silicon via Docker with p2996 clang):
- Twitter (string-heavy): 4330 -> 5723 MB/s (+32%)
- CITM (numeric-heavy): ~neutral (expected, few strings)

Profiling with perf annotate revealed that the integer-to-string
conversion loop was a significant hotspot in numeric-heavy workloads.
The original implementation processed 2 digits per iteration:

    while (pv >= 100) {
        memcpy(write_pointer - 1, &decimal_table[(pv % 100) * 2], 2);
        write_pointer -= 2;
        pv /= 100;
    }

Profile data from CITM benchmark showed:
- 31.20% time in atom<unsigned long> (integer formatting)
- 39.07% of integer formatting time in the 2-byte store instruction
  (sturh on ARM64)
- CITM integers average 8.8 digits, meaning 4+ store operations per number

This optimization processes 4 digits per iteration, reducing both store
operations and division count by approximately half for large numbers:

    while (pv >= 10000) {
        q = pv / 10000;
        r = pv % 10000;
        r_hi = r / 100;  // High 2 digits
        r_lo = r % 100;  // Low 2 digits
        memcpy(write_pointer - 1, &decimal_table[r_lo * 2], 2);
        memcpy(write_pointer - 3, &decimal_table[r_hi * 2], 2);
        write_pointer -= 4;
        pv = q;
    }

The division by 10000 compiles to an efficient multiply-high instruction
(umulh on ARM64). Applied to both unsigned and signed integer paths.

Benchmark results (ARM64, Apple Silicon via Docker with p2996 clang):
- Twitter: ~neutral (few integers)
- CITM (numeric-heavy): 2912 -> 3086 MB/s (+6%)

MSVC does not have __builtin_ctz/__builtin_ctzll. Use _BitScanForward
and _BitScanForward64 from <intrin.h> on MSVC instead.

This fixes the build on all Windows configurations (x64, ARM64, Win32).
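A minimal portable wrapper along these lines (the name is illustrative; this shows only the 64-bit case, since _BitScanForward64 is unavailable on Win32, where the value would be split across two _BitScanForward calls):

```cpp
#include <cstdint>
#ifdef _MSC_VER
#include <intrin.h>
#endif

// Portable trailing-zero count for a non-zero 64-bit value.
// MSVC lacks __builtin_ctzll, so use _BitScanForward64 there.
inline int trailing_zeroes(uint64_t x) {
#ifdef _MSC_VER
  unsigned long index;
  _BitScanForward64(&index, x);
  return static_cast<int>(index);
#else
  return __builtin_ctzll(x);
#endif
}
```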

Remove references to specific benchmarks and previous implementations
from code comments. Comments now describe what the code does rather
than historical context.
@lemire
Member

lemire commented Feb 16, 2026

Verified. Merging.

@lemire lemire merged commit af2a436 into master Feb 16, 2026
158 of 159 checks passed
@lemire lemire deleted the francisco/micro-optimizations branch February 16, 2026 16:50
@samyron

samyron commented Feb 21, 2026

I realize this is closed, but I have a question related to Optimization 1.

<snip>
  if (vmaxvq_u32(vreinterpretq_u32_u8(needs_escape)) != 0) {
    // Extract position using trailing zero count on 64-bit lanes
    uint64x2_t as64 = vreinterpretq_u64_u8(needs_escape);
    uint64_t lo = vgetq_lane_u64(as64, 0);
    if (lo != 0) {
      return offset + __builtin_ctzll(lo) / 8;
    } else {
      return offset + 8 + __builtin_ctzll(vgetq_lane_u64(as64, 1)) / 8;
    }
  }
<snip>

There is very similar code in ruby/json. However, we use something similar to (minus where we apply the mask):

<snip>
  const uint8x8_t res = vshrn_n_u16(vreinterpretq_u16_u8(needs_escape), 4);
  uint64_t mask = vget_lane_u64(vreinterpret_u64_u8(res), 0);
  if (mask != 0) {
    mask &= 0x8888888888888888ull;
    return offset + (__builtin_ctzll(mask) >> 2);
  }
<snip>

It seems like this would eliminate an additional fmov and branch in the case there is an escape character.

Admittedly, I tried both versions of the code in ruby/json and, at least on my M1 MacBook Air, I don't see much of a difference in the very coarse benchmarking I've performed using activitypub.json.

Is there a reason the shrn approach is inferior?
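For reference, the shrn narrowing can be modelled in scalar code to check that it recovers the same index: vshrn_n_u16(x, 4) maps each input byte p to mask bits [4p, 4p+3], so after keeping the top bit of each nibble, ctz(mask) >> 2 is the byte index (a portable model of the arithmetic, not NEON output):

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of vshrn_n_u16(x, 4) plus the nibble-mask trick: each
// input byte p (0xFF if it needs escaping) contributes mask bits
// [4p, 4p+3], so ctz(mask & 0x888...8) >> 2 is the byte index.
// Returns 16 when no byte matches.
size_t first_match_via_shrn_model(const uint8_t match[16]) {
  uint64_t mask = 0;
  for (size_t i = 0; i < 8; i++) {
    // vshrn_n_u16: shift each 16-bit lane right by 4, keep the low 8 bits.
    uint16_t lane = static_cast<uint16_t>(match[2 * i] | (match[2 * i + 1] << 8));
    mask |= static_cast<uint64_t>((lane >> 4) & 0xFF) << (8 * i);
  }
  if (mask == 0) return 16;
  mask &= 0x8888888888888888ull;
  return __builtin_ctzll(mask) >> 2;
}
```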
