Sped up serialization by 10..70%-ish #2408
Part 1: a baffling conundrum

O boy, here we go. By random chance I looked at this line: and thought: std::string would really be better here.
So I replaced it and also replaced buffer.insert(buffer.end(), begin, end); with buffer.append(begin, end);

Results on Windows

Note
Here and onward I refer to MSVC, clang-cl and clang with MSVC's standard library as used on Windows,
gcc (with libstdc++), clang libstdc++ and clang libc++ as used on Ubuntu inside WSL.

MSVC is 80%-ish faster! Neat! Looks very promising.
std::vector<char>
std::string

clang-cl and clang are even faster in absolute figures!
clang-cl +57%:
std::vector<char>
std::string
clang +28%:
std::vector<char>
std::string

OK, let's see what happens on Linux.

Results on Linux

gcc gives a 25% decrease (i.e. -25%) in performance when
std::vector<char>
std::string

The combination clang libstdc++ gives overall good performance which does not significantly change when the implementation is changed.
std::vector<char>
std::string
What's happening here?

The combination clang libc++ gives a 35% decrease (i.e. -35%) in performance!!
std::vector<char>
std::string
It does not make any sense to me!

Midway conclusion

Here are the overall results so far:
* oopsie! Because of a copy/paste error, the result for clang libstdc++ and

Approximately none of the results make sense to me.
At this point it "nerd sniped" me, or rather I "nerd sniped" myself, and thought: surely I can write a vector with small buffer that's faster, right?
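For reference, the one-line change described in Part 1 can be sketched as two overloads that append the same character range to either container. This is only an illustration: the helper name `append_range` is hypothetical and does not appear in the PR.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Baseline: append a character range to a std::vector<char>.
inline void append_range(std::vector<char> &buffer, const char *begin, const char *end) {
  assert(begin <= end);
  buffer.insert(buffer.end(), begin, end);
}

// Replacement: the same operation on a std::string.
inline void append_range(std::string &buffer, const char *begin, const char *end) {
  assert(begin <= end);
  buffer.append(begin, end);
}
```

Both overloads leave the buffer with the same bytes, so any performance difference comes from the container implementation rather than from the result.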
Part 2: I'm gonna go build my own vector, with blackjack and small buffer!

Instead of investigating why switching to

It's pretty ugly, but also kinda elegant... somewhat... you know...

vector_with_small_buffer implementation

    struct vector_with_small_buffer {
      ~vector_with_small_buffer() { free_buffer(); }
      void clear() {
        size = 0;
        capacity = StaticCapacity;
        free_buffer();
        buffer = array;
      }
      simdjson_inline void push_back(char c) {
        if (capacity < size + 1)
          grow(capacity * 2);
        buffer[size++] = c;
      }
      simdjson_inline void append(const char *begin, const char *end) {
        const size_t new_size = size + (end - begin);
        if (capacity < new_size)
          // std::max(new_size, capacity * 2); is broken in tests on Windows
          grow(new_size < capacity * 2 ? capacity * 2 : new_size);
        std::copy(begin, end, buffer + std::exchange(size, new_size));
      }
      std::string_view str() const { return std::string_view(buffer, size); }
    private:
      void free_buffer() {
        if (buffer != array)
          delete[] buffer;
      }
      void grow(size_t new_capacity) {
        auto new_buffer = new char[new_capacity];
        std::copy(buffer, buffer + size, new_buffer);
        free_buffer();
        buffer = new_buffer;
        capacity = new_capacity;
      }
      static const size_t StaticCapacity = 64;
      char array[StaticCapacity];
      char *buffer = array;
      size_t size = 0;
      size_t capacity = StaticCapacity;
    };

It has small buffer just like Specifying

Conclusion

To make an already long story shorter, here are the overall results:
* oopsie! Because of a copy/paste error, the result for clang libstdc++ and

Note
Here and onward I refer to MSVC, clang-cl and clang with MSVC's standard library as used on Windows,
gcc (with libstdc++), clang libstdc++ and clang libc++ as used on Ubuntu inside WSL.

Figures in diff columns are differences with respect to figures in base measurement column.

My

But at this point in time it is faster across the board compared to the baseline of

All the measurements

MSVC:
std::vector<char>
std::string
vector with small buffer
clang-cl:
std::vector<char>
std::string
vector with small buffer
clang:
std::vector<char>
std::string
vector with small buffer
gcc:
std::vector<char>
std::string
vector with small buffer
clang libstdc++:
std::vector<char>
std::string
vector with small buffer
clang libc++:
std::vector<char>
std::string
vector with small buffer
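The growth rule inside `append` in the vector_with_small_buffer code can be read in isolation. The sketch below lifts the ternary into a free function so the arithmetic is easy to check; the name `next_capacity` is hypothetical, and the starting capacity of 64 in the usage note is the StaticCapacity from the struct.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper mirroring the ternary in append(): double the capacity,
// unless even the doubled capacity is too small for the requested size.
inline std::size_t next_capacity(std::size_t capacity, std::size_t new_size) {
  assert(capacity < new_size); // only meaningful when the buffer must grow
  return new_size < capacity * 2 ? capacity * 2 : new_size;
}
```

Starting from 64 bytes, a request for 65 bytes grows the buffer to 128, while a request for 300 bytes grows it to exactly 300.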
std::exchange is C++14, we support C++11.
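One way to keep the same append logic under C++11 is a small stand-in for std::exchange. This is a sketch under assumptions, not necessarily what #2409 ended up doing: `exchange_cxx11` and `append_unchecked` are hypothetical names, and the destination buffer is assumed to be large enough already.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// C++11 stand-in for std::exchange: store new_value in obj and return the
// value obj held before the call.
template <class T, class U>
T exchange_cxx11(T &obj, U &&new_value) {
  T old_value = std::move(obj);
  obj = std::forward<U>(new_value);
  return old_value;
}

// Copy [begin, end) to the end of a buffer that is assumed to have room,
// updating size in the same single expression style as the PR's append().
inline void append_unchecked(char *buffer, std::size_t &size,
                             const char *begin, const char *end) {
  assert(begin <= end);
  const std::size_t new_size = size + static_cast<std::size_t>(end - begin);
  std::copy(begin, end, buffer + exchange_cxx11(size, new_size));
}
```

A plainer alternative is to save the old size in a local variable, assign the new size, and copy to `buffer + old_size`.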

See #2409

I wanted to come back to it a bit later, but feel free to continue the work within #2409 and close this one.

I am going to close this PR and go with the other one, which is basically your work with very minor fixes.

@toughengineer This work will be in our next release. I am not sure I understand why a short-string optimization helps here, but empirically, it definitely does help in some instances.

Small buffer optimization is just a cherry on top: it helps a lot for short strings, and helps a little bit even for considerably long strings, all for relatively little overhead, so why not have it?
64 bytes is a reasonable size for a small buffer: not so short that its effect goes unnoticed, and not so long that its size starts to have nasty overhead that negates the positive effects.

My best guess as to why this code is mostly faster than

Maybe it can be even faster if you replace
So I made kinda a boo-boo: I measured performance in
By default for GCC and Clang CMake uses
On Windows by default CMake explicitly uses
I also rechecked everything using
Curiously with MSVC compiler using
Also I learned that to properly append to

    const char *begin, *end;
    //...
    std::string s;
    // parentheses here: with braces the ptrdiff_t to size_t conversion triggers a narrowing diagnostic
    s.append(std::string_view(begin, end - begin)); // just std::string_view{begin, end} since C++20
    s.append(begin, end - begin);

When passing pointers like this

    s.append(begin, end);

they are treated as iterators and implementations create a temporary
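The two call forms from the comment above can be put side by side. The sketch below only shows that they select different std::string::append overloads and produce the same contents; it makes no claim about their relative speed, and the helper names `append_counted` and `append_iterators` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Pointer + count overload: std::string::append(const char*, size_type).
inline void append_counted(std::string &s, const char *begin, const char *end) {
  assert(begin <= end);
  s.append(begin, static_cast<std::size_t>(end - begin));
}

// Iterator-pair overload: two const char* arguments are deduced as a pair of
// generic input iterators.
inline void append_iterators(std::string &s, const char *begin, const char *end) {
  assert(begin <= end);
  s.append(begin, end);
}
```

Both helpers compile as C++11, unlike the std::string_view form, which needs C++17.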

@toughengineer To be clear, I ran my own benchmarks. I did not rely on your numbers. I verified your claims.

@lemire, I hoped so. Still I wanted to add that for completeness. |
tl;dr:
This PR speeds up serialization by 10..70%-ish on Windows and Linux using MSVC, Clang and GCC.
Methodology and measurements
To measure performance I ran bench_dom_api with the following options:
--benchmark_min_warmup_time=0.1 helped make results less fussy and more consistent.
Everything was compiled as x64 and run on an i5 13600k CPU.
I ultimately looked at "Gigabytes" figures that are output by the tests.
Here is the overall comparison across the compilers and standard libraries:
Note
Here and onward I refer to MSVC, clang-cl and clang with MSVC's standard library as used on Windows,
gcc (with libstdc++), clang libstdc++ and clang libc++ as used on Ubuntu inside WSL.
Four benchmarks were run for every entry:
Results of serialize_twitter, serialize_twitter_to_string and serialize_twitter_string_builder should be consistent.
I took measurements from serialize_twitter_string_builder for the table above.
In case of serialize_big_string_to_string performance does not significantly depend on the changes in the implementation, it is here kinda as a control.

Feel free to modify it as you see fit, or to take it over completely in a different PR.
For more details read the first few comments.
Here is a copy/paste of the results.
Click/tap on each to expand.
MSVC:
std::vector<char>
vector with small buffer
clang-cl:
std::vector<char>
vector with small buffer
clang:
std::vector<char>
vector with small buffer
gcc:
std::vector<char>
vector with small buffer
clang libstdc++:
std::vector<char>
vector with small buffer
clang libc++:
std::vector<char>
vector with small buffer
I omitted headers above. Here are examples of them to be complete.
benchmark result headers
Windows:
Ubuntu inside WSL: