
introducing a thread-local parser #2412

Merged: lemire merged 14 commits into master from use_thread_local_parser, Aug 13, 2025
Conversation

@lemire (Member) commented Aug 11, 2025

This modifies @the-moisrex's code slightly so that we can default on a thread-local parser. This should drastically improve performance in some cases.

@the-moisrex (Member)

Is it an attempt to re-use the parser in these situations?

for (auto val : from(json_string_1)) { ... }
for (auto val : from(json_string_2)) { ... }

This would be cool, but I wish we didn't need to allocate it on the heap in the first place, right?

Maybe a thread_local std::optional that gets initialized on demand?

@FranciscoThiesen (Member)

I like this! One thing I miss is having some benchmark/perf numbers so we can assess how much this impacts performance. (Given some analysis I've done locally, I definitely expect this will speed things up, but I am still curious by how much.)

@FranciscoThiesen (Member) left a review comment:

Great work (that will likely lead to significant perf improvement)

@the-moisrex (Member) left a review comment:

Much of the parser itself is already on the heap; this will try to remove one additional layer of indirection.


return parser_instance;
}

Suggested change:

simdjson_inline simdjson_warn_unused std::optional<ondemand::parser> & parser::get_parser_instance() {
  thread_local std::optional<ondemand::parser> parser_instance;

  // If the instance does not exist yet, create it
  if (!parser_instance.has_value()) {
      parser_instance = std::make_optional<ondemand::parser>();
  }

  return parser_instance;
}

#endif // SIMDJSON_CONDITIONAL_INCLUDE

#include <memory>
#include <thread>

Suggested change:

#include <optional>

@lemire (Author) replied:

That's C++17. We compile as C++11.

private:
friend bool release_parser();
friend ondemand::parser& get_parser();
static simdjson_inline simdjson_warn_unused std::unique_ptr<ondemand::parser> & get_parser_instance();

Suggested change:

  static simdjson_inline simdjson_warn_unused std::optional<ondemand::parser> & get_parser_instance();

@lemire (Author) commented Aug 11, 2025

Because this would be cool, but wish we wouldn't need to allocate it on the heap in the first place though, right?

The parser instance itself is small. But it will then allocate quite a bit of memory and that's what we worry about.

@the-moisrex (Member)

@lemire: The parser instance itself is small. But it will then allocate quite a bit of memory and that's what we worry about.

Yes, that's why I suggest std::optional and not just a plain thread_local instance. std::optional does not construct the object until explicitly told to, but it does reserve the storage.

It's just one less allocation, that's all.

@lemire (Author) commented Aug 12, 2025

@the-moisrex

std::optional is C++17.

We can do the same thing portably with a union and a bool.

@lemire (Author) commented Aug 12, 2025

@the-moisrex I'll make a comment with your name, pointing out that it could be implemented differently. It is an internal matter, so we can change it later.

@lemire (Author) commented Aug 12, 2025

@the-moisrex @FranciscoThiesen

We now have a rather complete benchmark. Unfortunately, the ranges approach comes with a significant performance penalty. I think it is probably fine to push it, but with a warning that it will slow things down.

$ ./build/benchmark/from/from_benchmark 
Trial 1:
Running benchmarks on large array...
simdjson classic                         :     0.89 ns/char   764.29 ns/value     1.12 GB/s     3.09 GHz     2.76 cycles/char     9.72 ins./char  8319.46 ins./value     3.53 i/c
simdjson::from<Car>() (no parser)        :     1.05 ns/char   898.21 ns/value     0.95 GB/s     3.09 GHz     3.24 cycles/char    10.82 ins./char  9261.46 ins./value     3.34 i/c
simdjson::from<Car>() (with parser)      :     1.01 ns/char   863.71 ns/value     0.99 GB/s     3.09 GHz     3.11 cycles/char    10.75 ins./char  9201.46 ins./value     3.45 i/c
Running stream benchmarks...
simdjson classic                         :     0.78 ns     1.28 GB/s     3.08 GHz     2.41 cycles/char     9.27 ins./char     3.85 i/c
simdjson with thread local               :     0.77 ns     1.29 GB/s     3.08 GHz     2.38 cycles/char     9.25 ins./char     3.89 i/c
Running benchmarks on tiny input...
simdjson classic                         :     0.98 ns     1.02 GB/s     3.09 GHz     3.03 cycles/char    10.77 ins./char     3.55 i/c
simdjson::from<Car>() (no parser)        :     1.12 ns     0.90 GB/s     3.09 GHz     3.44 cycles/char    11.53 ins./char     3.35 i/c
simdjson::from<Car>() (with parser)      :     1.07 ns     0.93 GB/s     3.09 GHz     3.30 cycles/char    11.48 ins./char     3.48 i/c

The thread-local parser is pretty much free.

@lemire (Author) commented Aug 12, 2025

Here are the numbers on an Apple system:

Running benchmarks on large array...
simdjson classic                         :     0.53 ns/char   453.10 ns/value     1.89 GB/s     4.01 GHz     2.12 cycles/char    10.21 ins./char  8742.65 ins./value     4.81 i/c
simdjson::from<Car>() (no parser)        :     0.61 ns/char   523.42 ns/value     1.64 GB/s     4.03 GHz     2.46 cycles/char    12.62 ins./char 10802.28 ins./value     5.13 i/c
simdjson::from<Car>() (with parser)      :     0.61 ns/char   519.79 ns/value     1.65 GB/s     4.03 GHz     2.45 cycles/char    12.62 ins./char 10802.50 ins./value     5.15 i/c

The last two use the new ranges API. The performance difference is quite obvious: it is about a 2000-instruction penalty to parse a single 'Car' instance. I would say that is quite bad...

But as long as we warn users that it comes with a significant penalty, I think it is OK.

@lemire (Author) commented Aug 12, 2025

@the-moisrex @FranciscoThiesen

I have removed the C++20 ranges examples from our documentation and I have marked it as experimental.

For the time being, until we can determine why it costs thousands of instructions of overhead per value, I would prefer that our users not touch it.

@the-moisrex (Member) commented Aug 13, 2025

Great work @lemire.

I ran the benchmarks a bit on a noisy system; there are a few differences I can see between the classic way and the ranges way.

  1. More error handling in auto_parser than what is being benchmarked in classic.
  2. More boilerplate ranges setup, which I tried to put behind consteval, but it's an API limitation.
  3. More error handling in ranges:

If we remove the no_errors from the default as<Car>(), we'd do better:

  /// Make it an adaptor
  template <std::ranges::range Range>
  auto operator()(Range &&rng) const noexcept {
    return std::forward<Range>(rng) | no_errors | std::views::transform(*this);
  }

This code showcases two of the three problems. For the second point, it shows that we have to construct this filter and transform first. For the third point, it shows that we exclude errors by default, which is like an additional if statement in the loop.

For the first point, there's an extra if statement for error handling in operator T, but at least on my system it does not seem to affect performance much.

We could maybe fix the API by doing something like this:

for (Car car : (from | no_errors | as<Car>)(json_string)) { ... }

// or maybe this:

for (Car car : from | no_errors | as<Car>(json_string))) { ... }

// or maybe this (but I'm not sure if it's possible cleanly):

for (Car car : no_errors | as<Car>(json_string)) { ... }

// Maybe even this:
for (Car car : from | no_errors | to<Car> | json_string) { ... }

// Or this:
for (Car car : (from | no_errors | to<Car>) << json_string) { ... }

// Or this:
for (Car car : (from | no_errors | to<Car>) + json_string) { ... }

// This looks ugly, and I'm not sure about this:
for (Car car : from | no_errors | to<Car> <<= json_string) { ... }
for (Car car : from | no_errors | to<Car> = json_string) { ... }


// But we can do these two without needing the parentheses, though they don't look great:
// Operator precedence doesn't leave us many choices.
for (Car car : from | no_errors | to<Car> && json_string) { ... }
for (Car car : from | no_errors | to<Car> || json_string) { ... }

The point is to mark the operator()(Range) overloads as consteval or constexpr, so that constructing the ranges would not affect performance.

But my system is not a good system for benchmarks, so it might turn out differently.

Also, why prefer as<Car>() with extra parentheses over to<Car>?

Also, I got a segfault when I ran the benchmarks in debug mode, but I couldn't reproduce it. There might be a silent race condition in the benchmarks or something.

@lemire (Author) commented Aug 13, 2025

Also, why prefer as<Car>() with extra parentheses over to<Car>?

Because otherwise, you will get

 error: expected primary-expression before ‘)’ token
   69 |                 for(Car car : simdjson::from(json) | simdjson::as<Car>) {

@lemire (Author) commented Aug 13, 2025

I ran the benchmarks a bit on a noisy system,

If you have a Linux or macOS system and you run the benchmark in privileged mode (i.e., with 'sudo'), then you should see the performance counters (i.e., the number of instructions). Even if your machine is noisy, the instruction count should be telling.

If you share with me a public key, I can give you access to a quiet Linux server.

so the construction of the ranges wouldn't affect the performance.

Here, importantly, we have a per-element overhead that is very large (about 2000 instructions).

@lemire (Author) commented Aug 13, 2025

The ranges work is not ready. I don't think we should publish it. We have insufficient testing and benchmarks.

It should not be slower on a per-element basis. It is fine if setting up the loop takes a bit of extra work, but for long loops, the cost should be amortized.

@the-moisrex (Member)

Also, why preferring as() with extra parenthesis over to?

Because otherwise, you will get

 error: expected primary-expression before ‘)’ token
   69 |                 for(Car car : simdjson::from(json) | simdjson::as<Car>) {

to<Car> not as<Car>.

for(Car car : simdjson::from(json) | simdjson::to<Car>) 

We now have three versions; this will get confusing for users pretty quickly:

template <typename T>
static constexpr to_adaptor<T> to{};

static constexpr to_adaptor<> from{};

template <typename T = void>
using as = to_adaptor<T>;

The ranges work is not ready. I don't think we should publish it. We have insufficient testing and benchmarks.

It should not be slower on a per-element basis. It is fine if setting up the loop takes a few bit of extra work, but for long loops, the cost should be amortized.

I agree. Ranges do have a cost, compilers still have a hard time optimizing them, and utilities like std::views::filter are flat-out broken.

Let me try a few ideas and see if we can match the instruction counts.

@lemire (Author) commented Aug 13, 2025

@the-moisrex

If I haven't done so, I am removing 'as'.

Note that the standard provides the following:

    auto vec = std::views::iota(1, 5)
             | std::views::transform([](int v){ return v * 2; })
             | std::ranges::to<std::vector>();

Specifically std::ranges::to.

@lemire (Author) commented Aug 13, 2025

@the-moisrex Verified, 'as' has been removed:

[Screenshot, 2025-08-13 at 17:04]

We still have the to_adaptor and from but we are not documenting to_adaptor.

@the-moisrex (Member)

If the name conflict can be confusing for users, then I suggest renaming to to as but keeping the nature of to. We can also separate the operator()s that from needs and the ones that as needs into two different structs.


I have this so far:

                for(Car car : auto_parser<>{} | simdjson::to<Car> | json) {
                    dummy = dummy + car.year;
                }

But modifying auto_parser after it's wrapped in transform_view, filter, and whatnot is not that easy.

Calling transform_view.base().base().base() seems way too hacky.

Let me think of something else.

@lemire (Author) commented Aug 13, 2025

If the conflict in name can be confusing for users, then I suggest renaming to to as but keeping the nature of to.

This PR will not include those. We just keep simdjson::from, which seems fine.

@lemire (Author) commented Aug 13, 2025

@the-moisrex To be clearer, we have adaptor_to, I think, but that's an internal name that we can change.

@lemire (Author) commented Aug 13, 2025

@the-moisrex @FranciscoThiesen I am merging this PR.

I think that the thread-local parser is a good step. We also remove ranges for now, but make it easier to add them later.

@lemire lemire merged commit faf921b into master Aug 13, 2025
143 checks passed
@lemire lemire deleted the use_thread_local_parser branch August 13, 2025 22:08
@lemire (Author) commented Aug 13, 2025

Merged.
