Improve string performance in ondemand by making the string processing runtime dispatched. by lemire · Pull Request #1849 · simdjson/simdjson

lemire · 2022-06-22T20:00:17Z

By default, the current On-Demand front-end uses a slow parse_string (our function for unescaping strings) that does not benefit from the processor's best instructions. We can instead make it available from the runtime-dispatched kernels and possibly gain quite a bit of performance in some cases. This should not affect people who compile simdjson for their processor.
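To make "runtime dispatched" concrete, here is a minimal C++ sketch of the general pattern: the CPU is checked once, and every later call goes through a function pointer to the best kernel available. It is not simdjson's code. The kernel names are hypothetical, the kernels only copy bytes (a real string kernel would also decode escapes), and the CPU check uses the GCC/Clang builtin `__builtin_cpu_supports`, which exists only on x86 targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical kernel type: copies `len` bytes of string content from `src`
// to `dst` and returns the number of bytes written. A real string kernel
// would also decode escape sequences; this sketch only shows the selection.
using string_kernel = std::size_t (*)(const char *src, std::size_t len, char *dst);

// Portable kernel that any CPU can run: one byte at a time.
std::size_t kernel_fallback(const char *src, std::size_t len, char *dst) {
  for (std::size_t i = 0; i < len; i++) {
    dst[i] = src[i];
  }
  return len;
}

// Stand-in for a kernel that would be compiled for a newer instruction set
// such as AVX2. It calls memcpy here so that the sketch runs on any machine.
std::size_t kernel_fast(const char *src, std::size_t len, char *dst) {
  std::memcpy(dst, src, len);
  return len;
}

// Runtime CPU check. __builtin_cpu_supports is a GCC/Clang builtin on x86;
// on other targets or compilers we conservatively report "unsupported".
bool cpu_supports_fast_kernel() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("avx2");
#else
  return false;
#endif
}

// The kernel is chosen once, on first use (C++11 guarantees thread-safe
// initialization of a static local); later calls reuse the same pointer.
string_kernel active_string_kernel() {
  static const string_kernel chosen =
      cpu_supports_fast_kernel() ? &kernel_fast : &kernel_fallback;
  return chosen;
}
```

Each call such as `active_string_kernel()(src, len, dst)` costs one indirect call. A build that already targets the host processor can call its best kernel directly, which is presumably why such builds are unaffected by this change.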

In PR #1847, I had exposed the parse_string function as a generic function, but in this new PR, it is tied to the DOM implementation and resides inside our parser instances. This new PR is less intrusive and should provide slightly better performance.
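The design described above, where the string kernel belongs to the runtime-selected implementation and each parser instance holds a pointer to that implementation, can be sketched as follows. All names (`string_kernels`, `select_kernels`, `toy_parser`) are hypothetical and do not match simdjson's classes or signatures; the `cpu_has_fast_isa` flag stands in for real CPU detection, and both kernels simply copy their input so the example stays focused on the wiring.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical interface standing in for an instruction-set-specific
// implementation; the string kernel is a virtual member of it.
struct string_kernels {
  virtual ~string_kernels() = default;
  virtual const char *name() const = 0;
  // Sketch only: both kernels below return the input unchanged; a real
  // kernel would decode escape sequences.
  virtual std::string parse_string(std::string_view src) const = 0;
};

struct fallback_kernels final : string_kernels {
  const char *name() const override { return "fallback"; }
  std::string parse_string(std::string_view src) const override { return std::string(src); }
};

struct fast_kernels final : string_kernels {
  const char *name() const override { return "fast"; }
  std::string parse_string(std::string_view src) const override { return std::string(src); }
};

// Hypothetical selector: `cpu_has_fast_isa` stands in for runtime CPU detection.
// Each implementation is a single static object shared by all parsers.
const string_kernels &select_kernels(bool cpu_has_fast_isa) {
  static const fallback_kernels fallback{};
  static const fast_kernels fast{};
  if (cpu_has_fast_isa) {
    return fast;
  }
  return fallback;
}

// The parser stores a pointer to the selected implementation, so each call
// to unescape() is a single virtual call into that implementation's kernel.
class toy_parser {
 public:
  explicit toy_parser(const string_kernels &impl) : impl_(&impl) {}
  std::string unescape(std::string_view raw) const { return impl_->parse_string(raw); }
  const char *kernel_name() const { return impl_->name(); }

 private:
  const string_kernels *impl_;
};
```

Keeping the kernel inside the implementation object means nothing new has to be exported as a free function; the parser reaches it through a pointer it already holds.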

I have also added documentation as well as a new 'test' for parser.unescape, so that end users can benefit from it.
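For readers unfamiliar with what unescaping involves, here is a scalar C++ sketch that decodes the contents of a JSON string (the bytes between the quotes) into UTF-8, following the escape rules of RFC 8259. It is a hypothetical helper (`unescape_json`), not simdjson's SIMD implementation and not the signature of `parser.unescape`; it also does not reject raw control characters, which a strict parser would.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

using std::size_t;
using std::uint32_t;

// Parses exactly four hex digits starting at s[i]; returns nullopt on error.
static std::optional<uint32_t> hex4(std::string_view s, size_t i) {
  if (i + 4 > s.size()) return std::nullopt;
  uint32_t v = 0;
  for (size_t k = i; k < i + 4; k++) {
    char c = s[k];
    v <<= 4;
    if (c >= '0' && c <= '9') v |= uint32_t(c - '0');
    else if (c >= 'a' && c <= 'f') v |= uint32_t(c - 'a' + 10);
    else if (c >= 'A' && c <= 'F') v |= uint32_t(c - 'A' + 10);
    else return std::nullopt;
  }
  return v;
}

// Appends Unicode code point `cp` to `out` encoded as UTF-8.
static void append_utf8(std::string &out, uint32_t cp) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Decodes the escapes in a JSON string body; nullopt if an escape is invalid.
std::optional<std::string> unescape_json(std::string_view s) {
  std::string out;
  out.reserve(s.size());
  for (size_t i = 0; i < s.size(); i++) {
    char c = s[i];
    if (c != '\\') { out += c; continue; }
    if (++i == s.size()) return std::nullopt;  // dangling backslash
    switch (s[i]) {
      case '"': out += '"'; break;
      case '\\': out += '\\'; break;
      case '/': out += '/'; break;
      case 'b': out += '\b'; break;
      case 'f': out += '\f'; break;
      case 'n': out += '\n'; break;
      case 'r': out += '\r'; break;
      case 't': out += '\t'; break;
      case 'u': {
        auto hi = hex4(s, i + 1);
        if (!hi) return std::nullopt;
        i += 4;  // i now points at the last hex digit
        uint32_t cp = *hi;
        if (cp >= 0xD800 && cp <= 0xDBFF) {
          // High surrogate: must be followed by \uDC00..\uDFFF.
          if (i + 2 >= s.size() || s[i + 1] != '\\' || s[i + 2] != 'u') return std::nullopt;
          auto lo = hex4(s, i + 3);
          if (!lo || *lo < 0xDC00 || *lo > 0xDFFF) return std::nullopt;
          i += 6;
          cp = 0x10000 + ((cp - 0xD800) << 10) + (*lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
          return std::nullopt;  // lone low surrogate
        }
        append_utf8(out, cp);
        break;
      }
      default:
        return std::nullopt;  // unknown escape such as \x
    }
  }
  return out;
}
```

For example, `unescape_json(R"(\u0074\u0065\u0073\u0074)")` yields `"test"`, while a lone high surrogate such as `\ud83d` yields `std::nullopt`.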

We do not have many benchmarks where strings are unescaped, but one of them (with short strings) is partial_tweets, and there we do see gains: the time per document drops from about 133 µs to about 114 µs.

On an Ice Lake machine (AWS, GCC 11), I get the following.

Before (two runs):

```
$ ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"
2022-06-22T19:59:22+00:00
Running ./build/benchmark/bench_ondemand
Run on (2 X 2899.96 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 1280 KiB (x1)
  L3 Unified 55296 KiB (x1)
Load Average: 0.00, 0.03, 0.01
simdjson::dom implementation:      icelake
simdjson::ondemand implementation (stage 1): icelake
simdjson::ondemand implementation (stage 2): fallback
--------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
partial_tweets<simdjson_ondemand>/manual_time     133214 ns       158498 ns         5302 best_bytes_per_sec=4.85288G best_docs_per_sec=7.6845k best_items_per_sec=768.45k bytes=631.515k bytes_per_second=4.41505G/s docs_per_sec=7.50674k/s items=100 items_per_second=750.674k/s [BEST: throughput=  4.85 GB/s doc_throughput=  7684 docs/s items=       100 avg_time=    133213 ns]
$ ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"
2022-06-22T19:59:25+00:00
Running ./build/benchmark/bench_ondemand
Run on (2 X 3495.88 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 1280 KiB (x1)
  L3 Unified 55296 KiB (x1)
Load Average: 0.00, 0.03, 0.01
simdjson::dom implementation:      icelake
simdjson::ondemand implementation (stage 1): icelake
simdjson::ondemand implementation (stage 2): fallback
--------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
partial_tweets<simdjson_ondemand>/manual_time     132884 ns       157104 ns         5317 best_bytes_per_sec=4.86957G best_docs_per_sec=7.71093k best_items_per_sec=771.093k bytes=631.515k bytes_per_second=4.42601G/s docs_per_sec=7.52538k/s items=100 items_per_second=752.538k/s [BEST: throughput=  4.87 GB/s doc_throughput=  7710 docs/s items=       100 avg_time=    132883 ns]
```

After (two runs):

```
$ ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"
2022-06-22T19:55:58+00:00
Running ./build/benchmark/bench_ondemand
Run on (2 X 2899.96 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 1280 KiB (x1)
  L3 Unified 55296 KiB (x1)
Load Average: 0.01, 0.06, 0.02
simdjson::dom implementation:      icelake
simdjson::ondemand implementation (stage 1): icelake
simdjson::ondemand implementation (stage 2): fallback
--------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
partial_tweets<simdjson_ondemand>/manual_time     113838 ns       138738 ns         6196 best_bytes_per_sec=5.69107G best_docs_per_sec=9.01177k best_items_per_sec=901.177k bytes=631.515k bytes_per_second=5.16652G/s docs_per_sec=8.78444k/s items=100 items_per_second=878.444k/s [BEST: throughput=  5.69 GB/s doc_throughput=  9011 docs/s items=       100 avg_time=    113837 ns]
$ ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"
2022-06-22T19:56:10+00:00
Running ./build/benchmark/bench_ondemand
Run on (2 X 2899.96 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 1280 KiB (x1)
  L3 Unified 55296 KiB (x1)
Load Average: 0.00, 0.06, 0.02
simdjson::dom implementation:      icelake
simdjson::ondemand implementation (stage 1): icelake
simdjson::ondemand implementation (stage 2): fallback
--------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
partial_tweets<simdjson_ondemand>/manual_time     114334 ns       139581 ns         6218 best_bytes_per_sec=5.67144G best_docs_per_sec=8.98069k best_items_per_sec=898.069k bytes=631.515k bytes_per_second=5.1441G/s docs_per_sec=8.74632k/s items=100 items_per_second=874.632k/s [BEST: throughput=  5.67 GB/s doc_throughput=  8980 docs/s items=       100 avg_time=    114333 ns]
```
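Averaging the two `manual_time` values on each side gives a rough measure of the gain (two runs per side is a small sample, so these figures are only indicative):

$$
\bar{t}_{\text{before}} = \frac{133214 + 132884}{2} = 133049\ \text{ns}, \qquad
\bar{t}_{\text{after}} = \frac{113838 + 114334}{2} = 114086\ \text{ns}
$$

$$
\frac{\bar{t}_{\text{before}}}{\bar{t}_{\text{after}}} = \frac{133049}{114086} \approx 1.166, \qquad
1 - \frac{\bar{t}_{\text{after}}}{\bar{t}_{\text{before}}} = \frac{18963}{133049} \approx 0.143
$$

That is roughly 14% less time per document, or about 17% more documents per second, in line with the reported bytes_per_second moving from about 4.42 GB/s to about 5.15 GB/s.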

lemire added 6 commits June 21, 2022 17:53

- 667f488 This should improve string performance in ondemand by making the string processing runtime dispatched.
- 7a73230 Minor warning disabling (GCC7)
- 9de1b45 Adding two missing macros.
- 7e1d07e Disabling again for GCC 7
- 4b59f91 Another design.
- 800265e Minor fixes.

lemire requested a review from jkeiser June 22, 2022 20:00

lemire changed the title from "This should improve string performance in ondemand by making the string processing runtime dispatched." to "Improve string performance in ondemand by making the string processing runtime dispatched." Jun 22, 2022

lemire added 6 commits June 22, 2022 16:08

- 1be18a7 Improving comment.
- 4a423fe Documenting the new parser.unescape function.
- b1380f3 Making it somewhat nicer.
- 1b5119a Missing '&&'
- 665ef45 No easy victory
- 6c0c927 Damn it.

lemire merged commit 1a19562 into master Jun 24, 2022

lemire deleted the dlemire/exposing_parse_string_in_parser_instance branch June 24, 2022 13:57

FourierTransformer mentioned this pull request Jun 30, 2022 in Version 2.1.0 FourierTransformer/lua-simdjson#39 (closed)

lemire merged 12 commits into master from dlemire/exposing_parse_string_in_parser_instance
