Sped up serialization by 10..70%-ish #2408

Merged: lemire merged 4 commits into simdjson:master from toughengineer:sped_up_serialization on Aug 7, 2025

@toughengineer
Contributor

tl;dr:

This PR speeds up serialization by 10..70%-ish on Windows and Linux using MSVC, Clang and GCC.

Methodology and measurements

To measure performance I ran bench_dom_api with the following options:

bench_dom_api --benchmark_filter=serialize --benchmark_min_warmup_time=0.1

--benchmark_min_warmup_time=0.1 helped make the results less noisy and more consistent.

Everything was compiled as x64 and run on an i5-13600K CPU.

I ultimately compared the "Gigabytes" figures reported by the benchmarks.
Here is the overall comparison across the compilers and standard libraries:

| Compiler | std::vector<char> (base), G/s | vector with small buffer, G/s | diff |
|---|---|---|---|
| MSVC | 1.004 | 1.790 | +78% |
| clang-cl | 1.487 | 1.860 | +25% |
| clang | 1.831 | 2.239 | +22% |
| gcc | 1.677 | 2.314 | +38% |
| clang libstdc++ | 1.744 | 1.966 | +13% |
| clang libc++ | 1.786 | 2.100 | +18% |

Note

Here and onward I refer to MSVC, clang-cl and clang with MSVC's standard library as used on Windows,
gcc (with libstdc++), clang libstdc++ and clang libc++ as used on Ubuntu inside WSL.

Four benchmarks were run for every entry:

serialize_twitter
serialize_big_string_to_string
serialize_twitter_to_string
serialize_twitter_string_builder

Results of serialize_twitter, serialize_twitter_to_string and serialize_twitter_string_builder should be consistent.
I took measurements from serialize_twitter_string_builder for the table above.
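For reference, here is a small sketch of how the "Gigabytes" figure appears to relate to the per-iteration CPU time in the raw outputs further down. It assumes the serialized twitter document is about 466,906 bytes, a size inferred from the MSVC baseline row (933.812M/s at 2k docs/s) and not stated anywhere in this PR. The function name is hypothetical.

```cpp
#include <cassert>

// Bytes per nanosecond is numerically the same as 10^9 bytes per second,
// so dividing the bytes produced per iteration by the CPU time per
// iteration (in ns) gives the throughput in G/s.
// Assumption: the counter is a rate over CPU time, which is what the
// figures in this thread are consistent with.
inline double gigabytes_per_second(double bytes_per_iteration,
                                   double cpu_ns_per_iteration) {
  assert(cpu_ns_per_iteration > 0.0);
  return bytes_per_iteration / cpu_ns_per_iteration;
}
```

With the inferred document size, gigabytes_per_second(466906, 464965) comes out at about 1.004, which is the MSVC std::vector<char> figure for serialize_twitter_string_builder.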

The performance of serialize_big_string_to_string does not depend significantly on these implementation changes, so it is here as a kind of control.

Feel free to modify it as you see fit, or to take it over completely in a different PR.

For more details, read the first few comments.

Here are the copy/pasted results.

MSVC:

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    506887 ns       500975 ns         1154 Gigabytes=931.995M/s docs=1.99611k/s
serialize_big_string_to_string        18528 ns        18415 ns        37333 Gigabytes=5.43036G/s docs=54.3025k/s
serialize_twitter_to_string          502887 ns       500000 ns         1000 Gigabytes=933.812M/s docs=2k/s
serialize_twitter_string_builder     472722 ns       464965 ns         1445 Gigabytes=1.00417G/s docs=2.1507k/s
vector with small buffer
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    272907 ns       272213 ns         2583 Gigabytes=1.71523G/s docs=3.6736k/s
serialize_big_string_to_string        20597 ns        20403 ns        34462 Gigabytes=4.90136G/s docs=49.0126k/s
serialize_twitter_to_string          273711 ns       269938 ns         2489 Gigabytes=1.72968G/s docs=3.70456k/s
serialize_twitter_string_builder     264431 ns       260911 ns         2635 Gigabytes=1.78952G/s docs=3.83273k/s

clang-cl:

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    351991 ns       350249 ns         1606 Gigabytes=1.33307G/s docs=2.85511k/s
serialize_big_string_to_string        19095 ns        19043 ns        34462 Gigabytes=5.25146G/s docs=52.5135k/s
serialize_twitter_to_string          352775 ns       352926 ns         1948 Gigabytes=1.32296G/s docs=2.83345k/s
serialize_twitter_string_builder     315356 ns       313895 ns         2240 Gigabytes=1.48746G/s docs=3.18578k/s
vector with small buffer
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    255904 ns       251074 ns         2676 Gigabytes=1.85963G/s docs=3.98288k/s
serialize_big_string_to_string        18403 ns        18415 ns        37333 Gigabytes=5.43036G/s docs=54.3025k/s
serialize_twitter_to_string          257489 ns       256696 ns         2800 Gigabytes=1.8189G/s docs=3.89565k/s
serialize_twitter_string_builder     248245 ns       251088 ns         2987 Gigabytes=1.85953G/s docs=3.98267k/s

clang:

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    296844 ns       294335 ns         1858 Gigabytes=1.58631G/s docs=3.39749k/s
serialize_big_string_to_string        18935 ns        19252 ns        37333 Gigabytes=5.19426G/s docs=51.9416k/s
serialize_twitter_to_string          298236 ns       298187 ns         2358 Gigabytes=1.56582G/s docs=3.3536k/s
serialize_twitter_string_builder     258256 ns       254981 ns         2635 Gigabytes=1.83114G/s docs=3.92186k/s
vector with small buffer
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    217882 ns       217686 ns         3230 Gigabytes=2.14486G/s docs=4.59378k/s
serialize_big_string_to_string        18651 ns        18834 ns        37333 Gigabytes=5.30969G/s docs=53.0958k/s
serialize_twitter_to_string          217446 ns       219727 ns         3200 Gigabytes=2.12494G/s docs=4.55111k/s
serialize_twitter_string_builder     208399 ns       208575 ns         3446 Gigabytes=2.23855G/s docs=4.79443k/s

gcc:

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    297414 ns       297414 ns         2352 Gigabytes=1.56989G/s docs=3.36232k/s
serialize_big_string_to_string        18369 ns        18368 ns        38098 Gigabytes=5.44423G/s docs=54.4412k/s
serialize_twitter_to_string          297653 ns       297653 ns         2352 Gigabytes=1.56862G/s docs=3.35962k/s
serialize_twitter_string_builder     278383 ns       278380 ns         2514 Gigabytes=1.67722G/s docs=3.59221k/s
vector with small buffer
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    207352 ns       207351 ns         3266 Gigabytes=2.25176G/s docs=4.82274k/s
serialize_big_string_to_string        18300 ns        18300 ns        38249 Gigabytes=5.46468G/s docs=54.6457k/s
serialize_twitter_to_string          208075 ns       208074 ns         3366 Gigabytes=2.24394G/s docs=4.80598k/s
serialize_twitter_string_builder     201757 ns       201754 ns         3460 Gigabytes=2.31423G/s docs=4.95652k/s

clang libstdc++:

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    288758 ns       288757 ns         2385 Gigabytes=1.61695G/s docs=3.46313k/s
serialize_big_string_to_string        18928 ns        18928 ns        37002 Gigabytes=5.28321G/s docs=52.831k/s
serialize_twitter_to_string          288598 ns       288599 ns         2413 Gigabytes=1.61784G/s docs=3.46502k/s
serialize_twitter_string_builder     267786 ns       267783 ns         2594 Gigabytes=1.7436G/s docs=3.73436k/s
vector with small buffer
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    243550 ns       243550 ns         2872 Gigabytes=1.91708G/s docs=4.10593k/s
serialize_big_string_to_string        18469 ns        18469 ns        37833 Gigabytes=5.41463G/s docs=54.1452k/s
serialize_twitter_to_string          244292 ns       244292 ns         2865 Gigabytes=1.91126G/s docs=4.09346k/s
serialize_twitter_string_builder     237531 ns       237528 ns         2932 Gigabytes=1.96569G/s docs=4.21003k/s

clang libc++:

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    279578 ns       279577 ns         2491 Gigabytes=1.67004G/s docs=3.57683k/s
serialize_big_string_to_string        19321 ns        19321 ns        37134 Gigabytes=5.17585G/s docs=51.7575k/s
serialize_twitter_to_string          280844 ns       280844 ns         2490 Gigabytes=1.66251G/s docs=3.5607k/s
serialize_twitter_string_builder     261383 ns       261383 ns         2663 Gigabytes=1.78629G/s docs=3.8258k/s
vector with small buffer
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    229063 ns       229062 ns         2989 Gigabytes=2.03834G/s docs=4.36563k/s
serialize_big_string_to_string        18271 ns        18271 ns        38194 Gigabytes=5.47325G/s docs=54.7314k/s
serialize_twitter_to_string          228027 ns       228025 ns         3062 Gigabytes=2.04761G/s docs=4.38548k/s
serialize_twitter_string_builder     222314 ns       222310 ns         3160 Gigabytes=2.10025G/s docs=4.49823k/s

I omitted the headers above. Here are examples of them for completeness.

benchmark result headers

Windows:

2025-08-07T17:12:29+03:00
Running bench_dom_api.exe
Run on (20 X 3494 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 2048 KiB (x10)
  L3 Unified 24576 KiB (x1)

Ubuntu inside WSL:

2025-08-07T15:31:57+03:00
Running ./bench_dom_api
Run on (20 X 3494.4 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 2048 KiB (x10)
  L3 Unified 24576 KiB (x1)
Load Average: 0.22, 0.20, 0.11

@toughengineer
Contributor Author

toughengineer commented Aug 7, 2025

Part 1: a baffling conundrum

Oh boy, here we go.

By random chance I looked at this line:

std::vector<char> buffer{}; // not ideal!

and thought: std::string would really be better here.

So I replaced it and also replaced

buffer.insert(buffer.end(), begin, end);

with

buffer.append(begin, end);
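As a quick sanity check that the swap is purely a change of container, here is a minimal standalone sketch showing that both calls append the same bytes. The function names are hypothetical and this is not code from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Appends [begin, end) to a std::vector<char>, as the baseline code does.
inline void append_to_vector(std::vector<char> &buffer,
                             const char *begin, const char *end) {
  assert(begin <= end);
  buffer.insert(buffer.end(), begin, end);
}

// Appends the same range to a std::string, as in the experiment.
inline void append_to_string(std::string &buffer,
                             const char *begin, const char *end) {
  assert(begin <= end);
  buffer.append(begin, end);
}

// Helper (hypothetical): true when both buffers hold identical bytes.
inline bool same_bytes(const std::vector<char> &v, const std::string &s) {
  return v.size() == s.size() && std::equal(v.begin(), v.end(), s.begin());
}
```

The observable result is the same in both cases; only the growth and allocation behaviour of the underlying container differs, and that is what the benchmarks end up measuring.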

Results on Windows

MSVC is roughly 80-90% faster (+89% on serialize_twitter_string_builder)! Neat! Looks very promising.

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    506887 ns       500975 ns         1154 Gigabytes=931.995M/s docs=1.99611k/s
serialize_big_string_to_string        18528 ns        18415 ns        37333 Gigabytes=5.43036G/s docs=54.3025k/s
serialize_twitter_to_string          502887 ns       500000 ns         1000 Gigabytes=933.812M/s docs=2k/s
serialize_twitter_string_builder     472722 ns       464965 ns         1445 Gigabytes=1.00417G/s docs=2.1507k/s
std::string
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    276464 ns       276483 ns         2091 Gigabytes=1.68874G/s docs=3.61686k/s
serialize_big_string_to_string        17670 ns        17997 ns        37333 Gigabytes=5.55665G/s docs=55.5654k/s
serialize_twitter_to_string          274764 ns       276215 ns         2489 Gigabytes=1.69037G/s docs=3.62036k/s
serialize_twitter_string_builder     246423 ns       245536 ns         2800 Gigabytes=1.90158G/s docs=4.07273k/s

clang-cl and clang are even faster in absolute figures!
But the relative improvement is less spectacular because they were relatively fast to begin with.

clang-cl +57%:

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    351991 ns       350249 ns         1606 Gigabytes=1.33307G/s docs=2.85511k/s
serialize_big_string_to_string        19095 ns        19043 ns        34462 Gigabytes=5.25146G/s docs=52.5135k/s
serialize_twitter_to_string          352775 ns       352926 ns         1948 Gigabytes=1.32296G/s docs=2.83345k/s
serialize_twitter_string_builder     315356 ns       313895 ns         2240 Gigabytes=1.48746G/s docs=3.18578k/s
std::string
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    233365 ns       231222 ns         2230 Gigabytes=2.0193G/s docs=4.32485k/s
serialize_big_string_to_string        16831 ns        16881 ns        40727 Gigabytes=5.92405G/s docs=59.2393k/s
serialize_twitter_to_string          232248 ns       230164 ns         2987 Gigabytes=2.02858G/s docs=4.34473k/s
serialize_twitter_string_builder     199652 ns       199507 ns         3446 Gigabytes=2.3403G/s docs=5.01236k/s

clang +28%:

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    296844 ns       294335 ns         1858 Gigabytes=1.58631G/s docs=3.39749k/s
serialize_big_string_to_string        18935 ns        19252 ns        37333 Gigabytes=5.19426G/s docs=51.9416k/s
serialize_twitter_to_string          298236 ns       298187 ns         2358 Gigabytes=1.56582G/s docs=3.3536k/s
serialize_twitter_string_builder     258256 ns       254981 ns         2635 Gigabytes=1.83114G/s docs=3.92186k/s
std::string
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    231096 ns       230736 ns         2641 Gigabytes=2.02355G/s docs=4.33395k/s
serialize_big_string_to_string        16864 ns        16881 ns        40727 Gigabytes=5.92405G/s docs=59.2393k/s
serialize_twitter_to_string          231125 ns       230164 ns         2987 Gigabytes=2.02858G/s docs=4.34473k/s
serialize_twitter_string_builder     200409 ns       199507 ns         3446 Gigabytes=2.3403G/s docs=5.01236k/s

OK, let's see what happens on Linux.

Results on Linux

gcc shows a 25% decrease in performance (i.e. -25%) when std::vector<char> is switched to std::string... What??

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    297414 ns       297414 ns         2352 Gigabytes=1.56989G/s docs=3.36232k/s
serialize_big_string_to_string        18369 ns        18368 ns        38098 Gigabytes=5.44423G/s docs=54.4412k/s
serialize_twitter_to_string          297653 ns       297653 ns         2352 Gigabytes=1.56862G/s docs=3.35962k/s
serialize_twitter_string_builder     278383 ns       278380 ns         2514 Gigabytes=1.67722G/s docs=3.59221k/s
std::string
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    387009 ns       387006 ns         1794 Gigabytes=1.20646G/s docs=2.58394k/s
serialize_big_string_to_string        18365 ns        18365 ns        38071 Gigabytes=5.44531G/s docs=54.452k/s
serialize_twitter_to_string          387054 ns       387052 ns         1805 Gigabytes=1.20631G/s docs=2.58363k/s
serialize_twitter_string_builder     370782 ns       370780 ns         1887 Gigabytes=1.25925G/s docs=2.69702k/s

The clang libstdc++ combination performs well overall, and its performance does not change significantly when the implementation is changed.

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    288758 ns       288757 ns         2385 Gigabytes=1.61695G/s docs=3.46313k/s
serialize_big_string_to_string        18928 ns        18928 ns        37002 Gigabytes=5.28321G/s docs=52.831k/s
serialize_twitter_to_string          288598 ns       288599 ns         2413 Gigabytes=1.61784G/s docs=3.46502k/s
serialize_twitter_string_builder     267786 ns       267783 ns         2594 Gigabytes=1.7436G/s docs=3.73436k/s
std::string
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    295324 ns       295321 ns         2322 Gigabytes=1.58101G/s docs=3.38615k/s
serialize_big_string_to_string        18446 ns        18446 ns        37986 Gigabytes=5.4213G/s docs=54.2119k/s
serialize_twitter_to_string          295262 ns       295257 ns         2372 Gigabytes=1.58135G/s docs=3.38688k/s
serialize_twitter_string_builder     276829 ns       276826 ns         2523 Gigabytes=1.68664G/s docs=3.61238k/s

What's happening here?


The clang libc++ combination shows a 35% decrease in performance (i.e. -35%)!!

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    279578 ns       279577 ns         2491 Gigabytes=1.67004G/s docs=3.57683k/s
serialize_big_string_to_string        19321 ns        19321 ns        37134 Gigabytes=5.17585G/s docs=51.7575k/s
serialize_twitter_to_string          280844 ns       280844 ns         2490 Gigabytes=1.66251G/s docs=3.5607k/s
serialize_twitter_string_builder     261383 ns       261383 ns         2663 Gigabytes=1.78629G/s docs=3.8258k/s
std::string
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    428105 ns       428100 ns         1626 Gigabytes=1.09065G/s docs=2.3359k/s
serialize_big_string_to_string        17829 ns        17829 ns        39374 Gigabytes=5.60909G/s docs=56.0898k/s
serialize_twitter_to_string          429024 ns       429013 ns         1625 Gigabytes=1.08833G/s docs=2.33093k/s
serialize_twitter_string_builder     404238 ns       404236 ns         1723 Gigabytes=1.15503G/s docs=2.4738k/s

It does not make any sense to me!

Midway conclusion

Here are the overall results so far:

| Compiler | std::vector<char> (base), G/s | std::string, G/s | diff |
|---|---|---|---|
| MSVC | 1.004 | 1.902 | +89% |
| clang-cl | 1.487 | 2.340 | +57% |
| clang | 1.831 | 2.340 | +28% |
| gcc | 1.677 | 1.259 | -25% |
| clang libstdc++ | 1.744 | 1.687 | -3% * |
| clang libc++ | 1.786 | 1.155 | -35% |

* Oopsie! Because of a copy/paste error, the result for clang libstdc++ and std::string was first reported as 0%; the correct figure is -3%.

Approximately none of the results make sense to me.
It's a baffling conundrum for me.
Why is std::string so much faster on Windows compared to std::vector<char>?
Why is std::string from both libstdc++ and libc++ so much slower with gcc and clang respectively?

At this point it "nerd sniped" me, or rather I "nerd sniped" myself, and I thought: surely I can write a vector with a small buffer that's faster, right?

@toughengineer
Contributor Author

toughengineer commented Aug 7, 2025

Part 2: I'm gonna go build my own vector, with blackjack and small buffer!

Instead of investigating why switching to std::string gives such bizarre results, I decided to implement a vector with a small buffer and see what happens.

It's pretty ugly, but also kinda elegant... somewhat... you know...

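vector_with_small_buffer implementation
// Requires <algorithm>, <cstddef>, <string_view> and <utility>.
// simdjson_inline is simdjson's forced-inlining macro.
// Not safe to copy: the implicit copy operations would alias the buffer pointer.
struct vector_with_small_buffer {
  ~vector_with_small_buffer() { free_buffer(); }

  // Drops the contents and any heap allocation, going back to the inline array.
  void clear() {
    size = 0;
    capacity = StaticCapacity;
    free_buffer();
    buffer = array;
  }

  // Appends one character, doubling the capacity when the buffer is full.
  simdjson_inline void push_back(char c) {
    if (capacity < size + 1)
      grow(capacity * 2);
    buffer[size++] = c;
  }

  // Appends [begin, end), growing to at least double the current capacity.
  simdjson_inline void append(const char *begin, const char *end) {
    const size_t new_size = size + (end - begin);
    if (capacity < new_size)
      // std::max(new_size, capacity * 2); is broken in tests on Windows
      grow(new_size < capacity * 2 ? capacity * 2 : new_size);
    std::copy(begin, end, buffer + std::exchange(size, new_size));
  }

  // Non-owning view of the bytes written so far.
  std::string_view str() const { return std::string_view(buffer, size); }

private:
  // Frees the heap buffer, if one was allocated.
  void free_buffer() {
    if (buffer != array)
      delete[] buffer;
  }
  // Moves the contents into a new heap buffer of new_capacity bytes.
  void grow(size_t new_capacity) {
    auto new_buffer = new char[new_capacity];
    std::copy(buffer, buffer + size, new_buffer);
    free_buffer();
    buffer = new_buffer;
    capacity = new_capacity;
  }

  static const size_t StaticCapacity = 64; // size of the inline buffer
  char array[StaticCapacity];              // inline storage used before the first allocation
  char *buffer = array;                    // points at array or at a heap block
  size_t size = 0;
  size_t capacity = StaticCapacity;
};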

It has a small buffer, just as std::string has the small string optimization, to offset the relatively large cost of reallocations for the first few appended characters.

Marking push_back() and append() as simdjson_inline turned out to be crucial for good performance; for the rest of the methods it makes no significant difference.
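To make the small-buffer behaviour easier to poke at outside of simdjson, here is a simplified standalone sketch of the same idea. It is not the code from this PR: it uses plain member functions instead of simdjson_inline, disables copying, and adds capacity() and on_heap() accessors purely for illustration. All names in it are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Standalone sketch of a character buffer with 64 bytes of inline storage.
class small_buffer_sketch {
public:
  small_buffer_sketch() = default;
  small_buffer_sketch(const small_buffer_sketch &) = delete;
  small_buffer_sketch &operator=(const small_buffer_sketch &) = delete;
  ~small_buffer_sketch() { release(); }

  // Appends one character, doubling the capacity when full.
  void push_back(char c) {
    if (capacity_ < size_ + 1)
      grow(capacity_ * 2);
    data_[size_++] = c;
  }

  // Appends [begin, end), growing to at least double the capacity.
  void append(const char *begin, const char *end) {
    assert(begin <= end);
    const std::size_t new_size = size_ + static_cast<std::size_t>(end - begin);
    if (capacity_ < new_size)
      grow(new_size < capacity_ * 2 ? capacity_ * 2 : new_size);
    std::copy(begin, end, data_ + size_);
    size_ = new_size;
  }

  std::string_view str() const { return std::string_view(data_, size_); }
  std::size_t capacity() const { return capacity_; }
  // True once the contents have moved out of the inline storage.
  bool on_heap() const { return data_ != inline_storage_; }

private:
  void release() {
    if (on_heap())
      delete[] data_;
  }
  void grow(std::size_t new_capacity) {
    char *fresh = new char[new_capacity];
    std::copy(data_, data_ + size_, fresh);
    release();
    data_ = fresh;
    capacity_ = new_capacity;
  }

  static constexpr std::size_t inline_capacity = 64;
  char inline_storage_[inline_capacity];
  char *data_ = inline_storage_;
  std::size_t size_ = 0;
  std::size_t capacity_ = inline_capacity;
};
```

In this sketch the first 64 characters never touch the allocator; the 65th triggers a single allocation of 128 bytes, and the capacity doubles from there on.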

Conclusion

To make an already long story shorter, here are the overall results:

| Compiler | std::vector<char> (base), G/s | std::string, G/s | diff | vector with small buffer, G/s | diff |
|---|---|---|---|---|---|
| MSVC | 1.004 | 1.902 | +89% | 1.790 | +78% |
| clang-cl | 1.487 | 2.340 | +57% | 1.860 | +25% |
| clang | 1.831 | 2.340 | +28% | 2.239 | +22% |
| gcc | 1.677 | 1.259 | -25% | 2.314 | +38% |
| clang libstdc++ | 1.744 | 1.687 | -3% * | 1.966 | +13% |
| clang libc++ | 1.786 | 1.155 | -35% | 2.100 | +18% |

* Oopsie! Because of a copy/paste error, the result for clang libstdc++ and std::string was first reported as 0%; the correct figure is -3%.

Figures in diff columns are differences with respect to figures in base measurement column.
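For anyone who wants to recompute the diff columns, here is a minimal sketch of the arithmetic. It assumes the percentages are rounded to the nearest whole number, which matches the figures in the tables; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Relative difference of a measurement with respect to the base, in percent.
inline double percent_diff(double base, double measurement) {
  assert(base > 0.0);
  return (measurement / base - 1.0) * 100.0;
}

// Same value rounded to a whole percent, as shown in the diff columns.
inline long rounded_percent_diff(double base, double measurement) {
  return std::lround(percent_diff(base, measurement));
}
```

For example, the MSVC row gives rounded_percent_diff(1.004, 1.790) == 78, and the gcc std::string row gives rounded_percent_diff(1.677, 1.259) == -25.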

My vector_with_small_buffer has trade-offs which, in my opinion, are acceptable in this case.
It is not as fast as std::string with clang and clang-cl, or even MSVC, which suggests that it could in theory be tuned further: maybe the allocation pattern, maybe something else.

But at this point in time it is faster across the board compared to the baseline of std::vector<char>, and my work here is done for now.

All the measurements

MSVC:

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    506887 ns       500975 ns         1154 Gigabytes=931.995M/s docs=1.99611k/s
serialize_big_string_to_string        18528 ns        18415 ns        37333 Gigabytes=5.43036G/s docs=54.3025k/s
serialize_twitter_to_string          502887 ns       500000 ns         1000 Gigabytes=933.812M/s docs=2k/s
serialize_twitter_string_builder     472722 ns       464965 ns         1445 Gigabytes=1.00417G/s docs=2.1507k/s
std::string
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    276464 ns       276483 ns         2091 Gigabytes=1.68874G/s docs=3.61686k/s
serialize_big_string_to_string        17670 ns        17997 ns        37333 Gigabytes=5.55665G/s docs=55.5654k/s
serialize_twitter_to_string          274764 ns       276215 ns         2489 Gigabytes=1.69037G/s docs=3.62036k/s
serialize_twitter_string_builder     246423 ns       245536 ns         2800 Gigabytes=1.90158G/s docs=4.07273k/s
vector with small buffer
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    272907 ns       272213 ns         2583 Gigabytes=1.71523G/s docs=3.6736k/s
serialize_big_string_to_string        20597 ns        20403 ns        34462 Gigabytes=4.90136G/s docs=49.0126k/s
serialize_twitter_to_string          273711 ns       269938 ns         2489 Gigabytes=1.72968G/s docs=3.70456k/s
serialize_twitter_string_builder     264431 ns       260911 ns         2635 Gigabytes=1.78952G/s docs=3.83273k/s

clang-cl:

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    351991 ns       350249 ns         1606 Gigabytes=1.33307G/s docs=2.85511k/s
serialize_big_string_to_string        19095 ns        19043 ns        34462 Gigabytes=5.25146G/s docs=52.5135k/s
serialize_twitter_to_string          352775 ns       352926 ns         1948 Gigabytes=1.32296G/s docs=2.83345k/s
serialize_twitter_string_builder     315356 ns       313895 ns         2240 Gigabytes=1.48746G/s docs=3.18578k/s
std::string
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    233365 ns       231222 ns         2230 Gigabytes=2.0193G/s docs=4.32485k/s
serialize_big_string_to_string        16831 ns        16881 ns        40727 Gigabytes=5.92405G/s docs=59.2393k/s
serialize_twitter_to_string          232248 ns       230164 ns         2987 Gigabytes=2.02858G/s docs=4.34473k/s
serialize_twitter_string_builder     199652 ns       199507 ns         3446 Gigabytes=2.3403G/s docs=5.01236k/s
vector with small buffer
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    255904 ns       251074 ns         2676 Gigabytes=1.85963G/s docs=3.98288k/s
serialize_big_string_to_string        18403 ns        18415 ns        37333 Gigabytes=5.43036G/s docs=54.3025k/s
serialize_twitter_to_string          257489 ns       256696 ns         2800 Gigabytes=1.8189G/s docs=3.89565k/s
serialize_twitter_string_builder     248245 ns       251088 ns         2987 Gigabytes=1.85953G/s docs=3.98267k/s

clang:

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    296844 ns       294335 ns         1858 Gigabytes=1.58631G/s docs=3.39749k/s
serialize_big_string_to_string        18935 ns        19252 ns        37333 Gigabytes=5.19426G/s docs=51.9416k/s
serialize_twitter_to_string          298236 ns       298187 ns         2358 Gigabytes=1.56582G/s docs=3.3536k/s
serialize_twitter_string_builder     258256 ns       254981 ns         2635 Gigabytes=1.83114G/s docs=3.92186k/s
std::string
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    231096 ns       230736 ns         2641 Gigabytes=2.02355G/s docs=4.33395k/s
serialize_big_string_to_string        16864 ns        16881 ns        40727 Gigabytes=5.92405G/s docs=59.2393k/s
serialize_twitter_to_string          231125 ns       230164 ns         2987 Gigabytes=2.02858G/s docs=4.34473k/s
serialize_twitter_string_builder     200409 ns       199507 ns         3446 Gigabytes=2.3403G/s docs=5.01236k/s
vector with small buffer
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    217882 ns       217686 ns         3230 Gigabytes=2.14486G/s docs=4.59378k/s
serialize_big_string_to_string        18651 ns        18834 ns        37333 Gigabytes=5.30969G/s docs=53.0958k/s
serialize_twitter_to_string          217446 ns       219727 ns         3200 Gigabytes=2.12494G/s docs=4.55111k/s
serialize_twitter_string_builder     208399 ns       208575 ns         3446 Gigabytes=2.23855G/s docs=4.79443k/s

gcc:

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    297414 ns       297414 ns         2352 Gigabytes=1.56989G/s docs=3.36232k/s
serialize_big_string_to_string        18369 ns        18368 ns        38098 Gigabytes=5.44423G/s docs=54.4412k/s
serialize_twitter_to_string          297653 ns       297653 ns         2352 Gigabytes=1.56862G/s docs=3.35962k/s
serialize_twitter_string_builder     278383 ns       278380 ns         2514 Gigabytes=1.67722G/s docs=3.59221k/s
std::string
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    387009 ns       387006 ns         1794 Gigabytes=1.20646G/s docs=2.58394k/s
serialize_big_string_to_string        18365 ns        18365 ns        38071 Gigabytes=5.44531G/s docs=54.452k/s
serialize_twitter_to_string          387054 ns       387052 ns         1805 Gigabytes=1.20631G/s docs=2.58363k/s
serialize_twitter_string_builder     370782 ns       370780 ns         1887 Gigabytes=1.25925G/s docs=2.69702k/s
vector with small buffer
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    207352 ns       207351 ns         3266 Gigabytes=2.25176G/s docs=4.82274k/s
serialize_big_string_to_string        18300 ns        18300 ns        38249 Gigabytes=5.46468G/s docs=54.6457k/s
serialize_twitter_to_string          208075 ns       208074 ns         3366 Gigabytes=2.24394G/s docs=4.80598k/s
serialize_twitter_string_builder     201757 ns       201754 ns         3460 Gigabytes=2.31423G/s docs=4.95652k/s

clang libstdc++:

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    288758 ns       288757 ns         2385 Gigabytes=1.61695G/s docs=3.46313k/s
serialize_big_string_to_string        18928 ns        18928 ns        37002 Gigabytes=5.28321G/s docs=52.831k/s
serialize_twitter_to_string          288598 ns       288599 ns         2413 Gigabytes=1.61784G/s docs=3.46502k/s
serialize_twitter_string_builder     267786 ns       267783 ns         2594 Gigabytes=1.7436G/s docs=3.73436k/s
std::string
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    295324 ns       295321 ns         2322 Gigabytes=1.58101G/s docs=3.38615k/s
serialize_big_string_to_string        18446 ns        18446 ns        37986 Gigabytes=5.4213G/s docs=54.2119k/s
serialize_twitter_to_string          295262 ns       295257 ns         2372 Gigabytes=1.58135G/s docs=3.38688k/s
serialize_twitter_string_builder     276829 ns       276826 ns         2523 Gigabytes=1.68664G/s docs=3.61238k/s
vector with small buffer
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    243550 ns       243550 ns         2872 Gigabytes=1.91708G/s docs=4.10593k/s
serialize_big_string_to_string        18469 ns        18469 ns        37833 Gigabytes=5.41463G/s docs=54.1452k/s
serialize_twitter_to_string          244292 ns       244292 ns         2865 Gigabytes=1.91126G/s docs=4.09346k/s
serialize_twitter_string_builder     237531 ns       237528 ns         2932 Gigabytes=1.96569G/s docs=4.21003k/s

clang libc++:

std::vector<char>
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    279578 ns       279577 ns         2491 Gigabytes=1.67004G/s docs=3.57683k/s
serialize_big_string_to_string        19321 ns        19321 ns        37134 Gigabytes=5.17585G/s docs=51.7575k/s
serialize_twitter_to_string          280844 ns       280844 ns         2490 Gigabytes=1.66251G/s docs=3.5607k/s
serialize_twitter_string_builder     261383 ns       261383 ns         2663 Gigabytes=1.78629G/s docs=3.8258k/s
std::string
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    428105 ns       428100 ns         1626 Gigabytes=1.09065G/s docs=2.3359k/s
serialize_big_string_to_string        17829 ns        17829 ns        39374 Gigabytes=5.60909G/s docs=56.0898k/s
serialize_twitter_to_string          429024 ns       429013 ns         1625 Gigabytes=1.08833G/s docs=2.33093k/s
serialize_twitter_string_builder     404238 ns       404236 ns         1723 Gigabytes=1.15503G/s docs=2.4738k/s
vector with small buffer
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
serialize_twitter                    229063 ns       229062 ns         2989 Gigabytes=2.03834G/s docs=4.36563k/s
serialize_big_string_to_string        18271 ns        18271 ns        38194 Gigabytes=5.47325G/s docs=54.7314k/s
serialize_twitter_to_string          228027 ns       228025 ns         3062 Gigabytes=2.04761G/s docs=4.38548k/s
serialize_twitter_string_builder     222314 ns       222310 ns         3160 Gigabytes=2.10025G/s docs=4.49823k/s

@lemire
Member

lemire commented Aug 7, 2025

std::exchange is C++14, we support C++11.
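
For illustration, a C++11-compatible stand-in for std::exchange could look roughly like the sketch below. The name exchange_cxx11 is hypothetical and is not taken from simdjson or from #2409; it only shows that the same behavior is expressible with C++11 features.

```cpp
#include <cassert>
#include <utility>

// Hypothetical C++11 stand-in for std::exchange (which requires C++14).
// The name exchange_cxx11 is illustrative only.
template <typename T, typename U = T>
T exchange_cxx11(T &obj, U &&new_value) {
  T old_value = std::move(obj);      // remember the previous value
  obj = std::forward<U>(new_value);  // store the replacement
  return old_value;                  // hand the previous value back
}
```

A typical use is resetting a pointer member in a move constructor while returning the old pointer in the same expression.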

@lemire lemire mentioned this pull request Aug 7, 2025
@lemire
Member

lemire commented Aug 7, 2025

See #2409

@toughengineer
Contributor Author

I wanted to come back to it a bit later, but feel free to continue the work within #2409 and close this one.

@lemire
Member

lemire commented Aug 7, 2025

I am going to close this PR and go with the other one, which is basically your work with very minor fixes.

@lemire lemire merged commit 662e3d9 into simdjson:master Aug 7, 2025
7 of 75 checks passed
@lemire
Member

lemire commented Aug 7, 2025

@toughengineer This work will be in our next release. I am not sure I understand why a short-string optimization helps here, but empirically, it definitely does help in some instances.

@toughengineer
Contributor Author

The small buffer optimization is just a cherry on top: it helps a lot for short strings and a little even for considerably long ones, all for relatively little overhead, so why not have it?

64 bytes is a reasonable size for the small buffer: not so short that its effect goes unnoticed, and not so long that its size adds overhead that negates the benefit.
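
To make the idea concrete, here is a minimal sketch of an append-only char buffer with 64 bytes of inline storage. It is not the code from this PR: the class name small_char_buffer, its members, and the doubling growth policy are all assumptions chosen for illustration, written to compile as C++11.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical append-only buffer: the first 64 bytes live inside the object,
// and the heap is only touched once the content outgrows that inline storage.
class small_char_buffer {
public:
  static constexpr std::size_t inline_capacity = 64;

  small_char_buffer() : data_(inline_), size_(0), capacity_(inline_capacity) {}
  // Copying is disabled to keep the sketch short (data_ may point into *this).
  small_char_buffer(const small_char_buffer &) = delete;
  small_char_buffer &operator=(const small_char_buffer &) = delete;

  void append(const char *src, std::size_t len) {
    if (len == 0) { return; }
    if (len > capacity_ - size_) { grow(size_ + len); }
    std::memcpy(data_ + size_, src, len);
    size_ += len;
  }
  void push_back(char c) { append(&c, 1); }

  const char *data() const { return data_; }
  std::size_t size() const { return size_; }
  bool on_heap() const { return data_ != inline_; }

private:
  // Assumed growth policy: double the capacity, or more if the request needs it.
  void grow(std::size_t min_capacity) {
    std::size_t new_capacity = capacity_ * 2;
    if (new_capacity < min_capacity) { new_capacity = min_capacity; }
    std::unique_ptr<char[]> bigger(new char[new_capacity]);
    std::memcpy(bigger.get(), data_, size_);  // copy before the old block is freed
    heap_ = std::move(bigger);                // releases the previous heap block, if any
    data_ = heap_.get();
    capacity_ = new_capacity;
  }

  char inline_[inline_capacity];  // small buffer stored inside the object
  std::unique_ptr<char[]> heap_;  // empty until the inline storage overflows
  char *data_;                    // points at inline_ or at the heap block
  std::size_t size_;
  std::size_t capacity_;
};
```

With this layout, serializing a short document never allocates, while a longer one pays for the 64 unused inline bytes once and then behaves like an ordinary growable buffer.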

My best guess as to why this code is mostly faster than std::vector<char> and std::string is that it's very simple (e.g. better suited for inlining?) and doesn't have to make the trade-offs of the other two. Or maybe I just got lucky and compilers can optimize it better (though apparently not as well as MSVC's std::string implementation).
Who knows?

Maybe it can be even faster if you replace std::copy() with memcpy() for example. That's an idea for future exploration.
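
As a sketch of what that swap would look like, the two hypothetical helpers below (the names append_with_copy and append_with_memcpy are made up for illustration) copy the same range into a destination that is assumed to have enough room. Standard libraries commonly lower std::copy over char pointers to memmove already, so whether memcpy() gains anything is something to measure rather than assume.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helpers: both copy [begin, end) to dst and return the new write position.
// The caller must guarantee that dst has room and does not overlap the source.
inline char *append_with_copy(char *dst, const char *begin, const char *end) {
  return std::copy(begin, end, dst);
}

inline char *append_with_memcpy(char *dst, const char *begin, const char *end) {
  const std::size_t len = static_cast<std::size_t>(end - begin);
  if (len != 0) { std::memcpy(dst, begin, len); }  // skip the call for empty ranges
  return dst + len;
}
```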

@toughengineer
Contributor Author

toughengineer commented Aug 27, 2025

So I made kinda a boo-boo: I measured performance in relwithdebinfo mode. I rechecked everything in release mode and my changes are still faster than std::vector<char>. Phew! 😅

By default, CMake uses -O3 in release mode and -O2 in relwithdebinfo mode for GCC and Clang, so in release mode the figures grew slightly but stayed roughly the same proportionately, i.e. the current "vector with small buffer" implementation is still faster than the previous std::vector<char> one.

On Windows, CMake by default explicitly passes -Ob2, which controls inline function expansion (-Ob2 being the default for -O2), in release mode, and -Ob1 (basically less aggressive expansion) in relwithdebinfo mode.
-Ob2 results in more performant code, but again it just increased the figures proportionately; "vector with small buffer" is still faster than std::vector<char>.

I also rechecked everything using std::string as the buffer, and overall the situation is very similar to the one described above: std::string is fast with MSVC's standard library, on par with std::vector<char> with libstdc++, and I made a whole issue out of std::string being slower in libc++.

Curiously with MSVC compiler using std::string is faster with -Ob1 rather than with -Ob2. Go figure.


Also I learned that to properly append to std::string one should use either std::string_view or pointer and size, i.e.

const char *begin, *end;
//...
std::string s;
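// parentheses plus a cast avoid a narrowing conversion (ptrdiff_t to size_t) inside braces;
// since C++20 this can be just std::string_view{begin, end}
s.append(std::string_view(begin, static_cast<std::size_t>(end - begin)));
s.append(begin, static_cast<std::size_t>(end - begin)); // pointer and size overload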

When passing pointers like this

s.append(begin, end);

they are treated as iterators and implementations create a temporary std::string object to convert the [begin, end) range to characters and only then append them.
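
Since the library supports C++11, where std::string_view is not available, the pointer-and-size form is the one that applies everywhere. A minimal sketch, with append_range as a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: appends [begin, end) through the (pointer, count) overload,
// so the arguments are not interpreted as a generic iterator pair.
inline void append_range(std::string &out, const char *begin, const char *end) {
  out.append(begin, static_cast<std::size_t>(end - begin));
}
```

Both forms produce the same string; the difference is only in which overload gets selected and how much work the implementation does on the way.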

@lemire
Member

lemire commented Aug 27, 2025

@toughengineer To be clear, I ran my own benchmarks. I did not rely on your numbers. I verified your claims.

@toughengineer
Contributor Author

@lemire, I hoped so. Still I wanted to add that for completeness.
