
Improve string performance in ondemand by making the string processing runtime dispatched. #1849

Merged
lemire merged 12 commits into master from dlemire/exposing_parse_string_in_parser_instance on Jun 24, 2022

Improve string performance in ondemand by making the string processing runtime dispatched.#1849
lemire merged 12 commits intomasterfrom
dlemire/exposing_parse_string_in_parser_instance

Conversation

@lemire (Member) commented Jun 22, 2022

By default, the current on-demand front-end uses a slow parse_string (our function for unescaping strings) that does not benefit from the processor's best instructions. We can instead make it available from the runtime-dispatched kernels and possibly gain quite a bit of performance in some cases. This should not affect people who compile simdjson for their own processor.

In PR #1847, I had exposed the parse_string function as a generic function, but in this new PR, it is tied to the dom implementation, residing inside our parser instances. This new PR is less intrusive and should provide slightly better performance.

I have also added documentation and a new 'test' for parse.unescape, so that end users can benefit from it.

We do not have many benchmarks where strings are unescaped, but one of them (with short strings) is partial_tweets, and we do see small gains there.

On icelake (AWS, GCC 11), I get...

Before... (two runs)

$ ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"
2022-06-22T19:59:22+00:00
Running ./build/benchmark/bench_ondemand
Run on (2 X 2899.96 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 1280 KiB (x1)
  L3 Unified 55296 KiB (x1)
Load Average: 0.00, 0.03, 0.01
simdjson::dom implementation:      icelake
simdjson::ondemand implementation (stage 1): icelake
simdjson::ondemand implementation (stage 2): fallback
--------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
partial_tweets<simdjson_ondemand>/manual_time     133214 ns       158498 ns         5302 best_bytes_per_sec=4.85288G best_docs_per_sec=7.6845k best_items_per_sec=768.45k bytes=631.515k bytes_per_second=4.41505G/s docs_per_sec=7.50674k/s items=100 items_per_second=750.674k/s [BEST: throughput=  4.85 GB/s doc_throughput=  7684 docs/s items=       100 avg_time=    133213 ns]
$ ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"
2022-06-22T19:59:25+00:00
Running ./build/benchmark/bench_ondemand
Run on (2 X 3495.88 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 1280 KiB (x1)
  L3 Unified 55296 KiB (x1)
Load Average: 0.00, 0.03, 0.01
simdjson::dom implementation:      icelake
simdjson::ondemand implementation (stage 1): icelake
simdjson::ondemand implementation (stage 2): fallback
--------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
partial_tweets<simdjson_ondemand>/manual_time     132884 ns       157104 ns         5317 best_bytes_per_sec=4.86957G best_docs_per_sec=7.71093k best_items_per_sec=771.093k bytes=631.515k bytes_per_second=4.42601G/s docs_per_sec=7.52538k/s items=100 items_per_second=752.538k/s [BEST: throughput=  4.87 GB/s doc_throughput=  7710 docs/s items=       100 avg_time=    132883 ns]

After... (two runs)

$ ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"
2022-06-22T19:55:58+00:00
Running ./build/benchmark/bench_ondemand
Run on (2 X 2899.96 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 1280 KiB (x1)
  L3 Unified 55296 KiB (x1)
Load Average: 0.01, 0.06, 0.02
simdjson::dom implementation:      icelake
simdjson::ondemand implementation (stage 1): icelake
simdjson::ondemand implementation (stage 2): fallback
--------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
partial_tweets<simdjson_ondemand>/manual_time     113838 ns       138738 ns         6196 best_bytes_per_sec=5.69107G best_docs_per_sec=9.01177k best_items_per_sec=901.177k bytes=631.515k bytes_per_second=5.16652G/s docs_per_sec=8.78444k/s items=100 items_per_second=878.444k/s [BEST: throughput=  5.69 GB/s doc_throughput=  9011 docs/s items=       100 avg_time=    113837 ns]
$ ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"
2022-06-22T19:56:10+00:00
Running ./build/benchmark/bench_ondemand
Run on (2 X 2899.96 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 1280 KiB (x1)
  L3 Unified 55296 KiB (x1)
Load Average: 0.00, 0.06, 0.02
simdjson::dom implementation:      icelake
simdjson::ondemand implementation (stage 1): icelake
simdjson::ondemand implementation (stage 2): fallback
--------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
partial_tweets<simdjson_ondemand>/manual_time     114334 ns       139581 ns         6218 best_bytes_per_sec=5.67144G best_docs_per_sec=8.98069k best_items_per_sec=898.069k bytes=631.515k bytes_per_second=5.1441G/s docs_per_sec=8.74632k/s items=100 items_per_second=874.632k/s [BEST: throughput=  5.67 GB/s doc_throughput=  8980 docs/s items=       100 avg_time=    114333 ns]

@lemire lemire requested a review from jkeiser June 22, 2022 20:00
@lemire lemire changed the title from "This should improve string performance in ondemand by making the string processing runtime dispatched." to "Improve string performance in ondemand by making the string processing runtime dispatched." Jun 22, 2022
@lemire lemire merged commit 1a19562 into master Jun 24, 2022
@lemire lemire deleted the dlemire/exposing_parse_string_in_parser_instance branch June 24, 2022 13:57
