
perf: SIMD string escaping and batch integer formatting optimizations#2605

Merged
lemire merged 4 commits into master from francisco/micro-optimizations on Feb 16, 2026

Conversation

@FranciscoThiesen
Member

Summary

Performance optimizations for the JSON serialization path in the static reflection builder API, achieving significant improvements on string-heavy workloads.

| Benchmark | Baseline | Optimized | Improvement |
| --- | --- | --- | --- |
| Twitter (string-heavy) | 4,330 MB/s | 5,647 MB/s | +30.4% |
| CITM Catalog (numeric-heavy) | 2,912 MB/s | 3,086 MB/s | +6.0% |

All measurements on ARM64 (Apple Silicon via Docker with p2996 clang).


Profiling Methodology

Tools Used

  1. perf record with DWARF call graphs for sampling
  2. perf annotate for instruction-level cycle attribution
  3. perf report for function-level analysis

Key Findings

Twitter benchmark (before optimization):

54.76%  atom<std::string>                      # String serialization dominant
24.85%  find_next_json_quotable_character      # Scalar position finding
10.90%  atom<unsigned long>                    # Integer formatting

CITM benchmark (before optimization):

31.20%  atom<unsigned long>         # Integer formatting dominant  
26.55%  atom<CITMArea>              # Struct with large int arrays

Optimization 1: SIMD-Accelerated Position Finding

Problem

The serialization path performed redundant scanning:

  1. fast_needs_escaping() - SIMD scan to check IF any quotable character exists
  2. find_next_json_quotable_character() - Scalar byte-by-byte scan to find WHERE

Solution

Unified both operations into a single SIMD pass that directly locates the first quotable character:

// NEON implementation - finds exact position in one pass
// (v34, v92, v32 are vdupq_n_u8 splats of '"', '\\', and 0x20)
while (remaining >= 16) {
  uint8x16_t word = vld1q_u8(ptr);
  uint8x16_t needs_escape = vceqq_u8(word, v34);  // '"'
  needs_escape = vorrq_u8(needs_escape, vceqq_u8(word, v92));  // '\\'
  needs_escape = vorrq_u8(needs_escape, vcltq_u8(word, v32));  // control chars

  if (vmaxvq_u32(vreinterpretq_u32_u8(needs_escape)) != 0) {
    // Extract position using trailing zero count on 64-bit lanes
    uint64x2_t as64 = vreinterpretq_u64_u8(needs_escape);
    uint64_t lo = vgetq_lane_u64(as64, 0);
    if (lo != 0) {
      return offset + __builtin_ctzll(lo) / 8;
    } else {
      return offset + 8 + __builtin_ctzll(vgetq_lane_u64(as64, 1)) / 8;
    }
  }
  ptr += 16;
  offset += 16;
  remaining -= 16;
}
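The lane-extraction arithmetic can be sanity-checked without NEON hardware by modelling the 16-byte match vector as two little-endian 64-bit words, with 0xFF in each lane that needs escaping (a portable sketch of the arithmetic, not the shipped code):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Scalar model of the extraction above: `match` holds 0xFF for each byte
// of the 16-byte chunk that needs escaping, 0x00 otherwise. Returns the
// index of the first matching byte, or 16 if none match. Assumes a
// little-endian target, which matches the ARM64/x86-64 builds here.
size_t first_match_index(const uint8_t match[16]) {
  uint64_t lo, hi;
  std::memcpy(&lo, match, 8);      // lanes 0..7
  std::memcpy(&hi, match + 8, 8);  // lanes 8..15
  if (lo != 0) return __builtin_ctzll(lo) / 8;  // 8 bits per lane
  if (hi != 0) return 8 + __builtin_ctzll(hi) / 8;
  return 16;
}
```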

Modified write_string_escaped to use position finding directly:

// Before: Two scans
if (!fast_needs_escaping(input)) { ... }
size_t location = find_next_json_quotable_character(input, 0);

// After: Single scan
size_t location = find_next_json_quotable_character(input, 0);
if (location == input.size()) {
  memcpy(out, input.data(), input.size());  // Fast path
  return input.size();
}

Architecture Support

  • NEON (ARM64): Full SIMD implementation
  • SSE2 (x86-64): Full SIMD implementation using _mm_movemask_epi8 + __builtin_ctz
  • Fallback: Scalar loop for other platforms
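The scalar fallback's contract can be sketched as follows (an illustrative reconstruction of the semantics described above, not the exact shipped code): return the index of the first character JSON requires escaping, or the string length when the input can be copied verbatim.

```cpp
#include <cstddef>
#include <string_view>

// Scalar fallback sketch: first index at or after `start` holding a
// character that JSON must escape ('"', '\\', or a control byte < 0x20);
// returns input.size() if there is none, enabling the memcpy fast path.
size_t find_next_json_quotable_character(std::string_view input, size_t start) {
  for (size_t i = start; i < input.size(); i++) {
    unsigned char c = static_cast<unsigned char>(input[i]);
    if (c == '"' || c == '\\' || c < 0x20) return i;
  }
  return input.size();
}
```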

Optimization 2: Batch Integer Formatting

Problem

The original integer-to-string conversion processed 2 digits per iteration. Profiling showed 39% of the integer-formatting time spent in the 2-byte store instruction (sturh on ARM64). With CITM integers averaging 8.8 digits, this meant 4+ store operations per number.

Solution

Process 4 digits per iteration, reducing store count and division operations by half:

// Before: 2 digits per iteration
while (pv >= 100) {
  memcpy(write_pointer - 1, &decimal_table[(pv % 100) * 2], 2);
  write_pointer -= 2;
  pv /= 100;
}

// After: 4 digits per iteration for large numbers
while (pv >= 10000) {
  unsigned_type q = pv / 10000;
  unsigned_type r = pv % 10000;
  unsigned_type r_hi = r / 100;  // High 2 digits
  unsigned_type r_lo = r % 100;  // Low 2 digits
  memcpy(write_pointer - 1, &decimal_table[r_lo * 2], 2);
  memcpy(write_pointer - 3, &decimal_table[r_hi * 2], 2);
  write_pointer -= 4;
  pv = q;
}
// Falls through to 2-digit loop for remaining 1-4 digits

Division by 10000 compiles to efficient multiply-high instruction (umulh on ARM64).
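Putting the two loops together, a self-contained sketch of the scheme looks like this (the function name is illustrative; the `decimal_table` layout, "00" through "99" packed contiguously, matches the snippets above):

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// "00".."99" packed as 200 contiguous characters.
static const char decimal_table[201] =
    "00010203040506070809101112131415161718192021222324"
    "25262728293031323334353637383940414243444546474849"
    "50515253545556575859606162636465666768697071727374"
    "75767778798081828384858687888990919293949596979899";

// Illustrative batch formatter: peel 4 digits per iteration, then fall
// through to the 2-digit loop for the remaining 1-4 digits.
std::string format_u64(uint64_t pv) {
  char buf[21];                          // uint64_t needs at most 20 digits
  char* write_pointer = buf + 20;        // points at the last digit slot
  while (pv >= 10000) {
    uint64_t q = pv / 10000, r = pv % 10000;
    uint64_t r_hi = r / 100, r_lo = r % 100;
    std::memcpy(write_pointer - 1, &decimal_table[r_lo * 2], 2);
    std::memcpy(write_pointer - 3, &decimal_table[r_hi * 2], 2);
    write_pointer -= 4;
    pv = q;
  }
  while (pv >= 100) {                    // 2 digits per iteration
    std::memcpy(write_pointer - 1, &decimal_table[(pv % 100) * 2], 2);
    write_pointer -= 2;
    pv /= 100;
  }
  if (pv >= 10) {
    std::memcpy(write_pointer - 1, &decimal_table[pv * 2], 2);
    write_pointer -= 2;
  } else {
    *write_pointer = static_cast<char>('0' + pv);
    write_pointer -= 1;
  }
  return std::string(write_pointer + 1, buf + 21);
}
```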


Test Plan

  • All existing unit tests pass
  • Benchmark output matches byte-for-byte (JSON correctness)
  • Tested on both ARM64 and x86-64 architectures

Profiling revealed that string serialization was performing redundant
scanning: first calling fast_needs_escaping() (SIMD scan to check IF
escape needed), then find_next_json_quotable_character() (scalar
byte-by-byte scan to find WHERE).

Profile data from Twitter benchmark showed:
- 54.76% time in atom<std::string> (string serialization)
- 24.85% time in find_next_json_quotable_character (scalar position finding)

This optimization unifies both operations into a single SIMD pass that
directly locates the first quotable character position:

- NEON (ARM64): Uses vceqq_u8/vcltq_u8 for character detection, then
  extracts position via __builtin_ctzll on 64-bit vector lanes
- SSE2 (x86-64): Uses _mm_cmpeq_epi8/_mm_subs_epu8 for detection, then
  _mm_movemask_epi8 + __builtin_ctz for position extraction

The write_string_escaped function now uses the position finder directly,
eliminating the separate fast_needs_escaping check.

Benchmark results (ARM64, Apple Silicon via Docker with p2996 clang):
- Twitter (string-heavy): 4330 -> 5723 MB/s (+32%)
- CITM (numeric-heavy): ~neutral (expected, few strings)

Profiling with perf annotate revealed that the integer-to-string
conversion loop was a significant hotspot in numeric-heavy workloads.
The original implementation processed 2 digits per iteration:

    while (pv >= 100) {
        memcpy(write_pointer - 1, &decimal_table[(pv % 100) * 2], 2);
        write_pointer -= 2;
        pv /= 100;
    }

Profile data from CITM benchmark showed:
- 31.20% time in atom<unsigned long> (integer formatting)
- 39.07% of integer formatting time in the 2-byte store instruction
  (sturh on ARM64)
- CITM integers average 8.8 digits, meaning 4+ store operations per number

This optimization processes 4 digits per iteration, reducing both store
operations and division count by approximately half for large numbers:

    while (pv >= 10000) {
        q = pv / 10000;
        r = pv % 10000;
        r_hi = r / 100;  // High 2 digits
        r_lo = r % 100;  // Low 2 digits
        memcpy(write_pointer - 1, &decimal_table[r_lo * 2], 2);
        memcpy(write_pointer - 3, &decimal_table[r_hi * 2], 2);
        write_pointer -= 4;
        pv = q;
    }

The division by 10000 compiles to an efficient multiply-high instruction
(umulh on ARM64). Applied to both unsigned and signed integer paths.

Benchmark results (ARM64, Apple Silicon via Docker with p2996 clang):
- Twitter: ~neutral (few integers)
- CITM (numeric-heavy): 2912 -> 3086 MB/s (+6%)

MSVC does not have __builtin_ctz/__builtin_ctzll. Use _BitScanForward
and _BitScanForward64 from <intrin.h> on MSVC instead.

This fixes the build on all Windows configurations (x64, ARM64, Win32).
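A minimal portable wrapper along these lines (the name is illustrative; this shows only the 64-bit case, since _BitScanForward64 is unavailable on Win32, where the value would be split across two _BitScanForward calls):

```cpp
#include <cstdint>
#ifdef _MSC_VER
#include <intrin.h>
#endif

// Portable trailing-zero count for a non-zero 64-bit value.
// MSVC lacks __builtin_ctzll, so use _BitScanForward64 there.
inline int trailing_zeroes(uint64_t x) {
#ifdef _MSC_VER
  unsigned long index;
  _BitScanForward64(&index, x);
  return static_cast<int>(index);
#else
  return __builtin_ctzll(x);
#endif
}
```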

Remove references to specific benchmarks and previous implementations
from code comments. Comments now describe what the code does rather
than historical context.
@lemire
Member

lemire commented Feb 16, 2026

Verified. Merging.

@lemire lemire merged commit af2a436 into master Feb 16, 2026
158 of 159 checks passed
@lemire lemire deleted the francisco/micro-optimizations branch February 16, 2026 16:50
@samyron

samyron commented Feb 21, 2026

I realize this is closed, but I have a question related to Optimization 1.

<snip>
  if (vmaxvq_u32(vreinterpretq_u32_u8(needs_escape)) != 0) {
    // Extract position using trailing zero count on 64-bit lanes
    uint64x2_t as64 = vreinterpretq_u64_u8(needs_escape);
    uint64_t lo = vgetq_lane_u64(as64, 0);
    if (lo != 0) {
      return offset + __builtin_ctzll(lo) / 8;
    } else {
      return offset + 8 + __builtin_ctzll(vgetq_lane_u64(as64, 1)) / 8;
    }
  }
<snip>

There is very similar code in ruby/json. However, we use something similar to (minus where we apply the mask):

<snip>
  const uint8x8_t res = vshrn_n_u16(vreinterpretq_u16_u8(needs_escape), 4);
  uint64_t mask = vget_lane_u64(vreinterpret_u64_u8(res), 0);
  if (mask != 0) {
    mask &= 0x8888888888888888ull;
    return offset + (__builtin_ctzll(mask) >> 2);
  }
<snip>

It seems like this would eliminate an additional fmov and branch in the case there is an escape character.

Admittedly, I tried both versions of the code in ruby/json and, at least on my M1 MacBook Air, I don't see much of a difference in the very coarse benchmarking I've performed using activitypub.json.

Is there a reason the shrn approach is inferior?
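For reference, the shrn narrowing can be modelled in scalar code to check that it recovers the same index: vshrn_n_u16(x, 4) maps each input byte p to mask bits [4p, 4p+3], so after keeping the top bit of each nibble, ctz(mask) >> 2 is the byte index (a portable model of the arithmetic, not NEON output):

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of vshrn_n_u16(x, 4) plus the nibble-mask trick: each
// input byte p (0xFF if it needs escaping) contributes mask bits
// [4p, 4p+3], so ctz(mask & 0x888...8) >> 2 is the byte index.
// Returns 16 when no byte matches.
size_t first_match_via_shrn_model(const uint8_t match[16]) {
  uint64_t mask = 0;
  for (size_t i = 0; i < 8; i++) {
    // vshrn_n_u16: shift each 16-bit lane right by 4, keep the low 8 bits.
    uint16_t lane = static_cast<uint16_t>(match[2 * i] | (match[2 * i + 1] << 8));
    mask |= static_cast<uint64_t>((lane >> 4) & 0xFF) << (8 * i);
  }
  if (mask == 0) return 16;
  mask &= 0x8888888888888888ull;
  return __builtin_ctzll(mask) >> 2;
}
```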
