
[bad design] This should improve string performance in ondemand by making the string processing runtime dispatched.#1847

Closed
lemire wants to merge 4 commits into master from dlemire/exposing_parse_string

Conversation


@lemire lemire commented Jun 21, 2022

The current on-demand front-end will, by default, use a slow parse_string that does not benefit from the processor's best instructions. We can make it available from the runtime-dispatched kernels instead and possibly gain quite a bit of performance in some cases.

This should not affect people who compile simdjson for their processor.
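The idea can be sketched as follows. This is a simplified, hypothetical illustration (the names `string_parser`, `active_parser`, and the CPU probe are invented here; simdjson's real kernel-dispatch machinery differs in detail): each kernel supplies its own string unescaper behind a common interface, and the best one for the running processor is selected once at startup.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical sketch of per-kernel dispatch for string parsing.
// Each CPU kernel provides its own unescaper; the library picks the
// best one for the running processor at startup.
struct string_parser {
  virtual const char* name() const = 0;
  // Copies src into dst up to the closing quote, resolving escapes;
  // returns one past the last byte written.
  virtual char* parse_string(const char* src, char* dst) const = 0;
  virtual ~string_parser() = default;
};

struct fallback_parser final : string_parser {
  const char* name() const override { return "fallback"; }
  char* parse_string(const char* src, char* dst) const override {
    // Byte-at-a-time: portable but slow (only '\n' handled in this sketch).
    while (*src && *src != '"') {
      if (*src == '\\' && src[1] == 'n') { *dst++ = '\n'; src += 2; }
      else { *dst++ = *src++; }
    }
    return dst;
  }
};

struct icelake_parser final : string_parser {
  const char* name() const override { return "icelake"; }
  char* parse_string(const char* src, char* dst) const override {
    // A real kernel would scan 64 bytes per iteration with AVX-512;
    // the scalar logic stands in here so the sketch stays self-contained.
    return fallback_parser{}.parse_string(src, dst);
  }
};

// Runtime selection; real code would query CPUID for the needed features.
static bool cpu_has_avx512() { return false; }  // stand-in probe

const string_parser& active_parser() {
  static const fallback_parser fallback;
  static const icelake_parser icelake;
  static const string_parser& chosen =
      cpu_has_avx512() ? static_cast<const string_parser&>(icelake)
                       : static_cast<const string_parser&>(fallback);
  return chosen;
}
```

Callers go through `active_parser().parse_string(...)` and automatically get the fastest kernel available, which is exactly what a binary compiled for a generic target cannot otherwise do.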

On icelake (AWS, GCC 11), I get...

Before... (two runs)

$ ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"
2022-06-21T23:26:34+00:00
Running ./build/benchmark/bench_ondemand
Run on (2 X 2899.96 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 1280 KiB (x1)
  L3 Unified 55296 KiB (x1)
Load Average: 0.03, 0.40, 0.30
simdjson::dom implementation:      icelake
simdjson::ondemand implementation (stage 1): icelake
simdjson::ondemand implementation (stage 2): fallback
--------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
partial_tweets<simdjson_ondemand>/manual_time     131965 ns       156473 ns         5277 best_bytes_per_sec=4.84324G best_docs_per_sec=7.66924k best_items_per_sec=766.924k bytes=631.515k bytes_per_second=4.45682G/s docs_per_sec=7.57776k/s items=100 items_per_second=757.776k/s [BEST: throughput=  4.84 GB/s doc_throughput=  7669 docs/s items=       100 avg_time=    131965 ns]

$ ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"
2022-06-21T23:24:31+00:00
Running ./build/benchmark/bench_ondemand
Run on (2 X 2899.96 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 1280 KiB (x1)
  L3 Unified 55296 KiB (x1)
Load Average: 0.16, 0.59, 0.34
simdjson::dom implementation:      icelake
simdjson::ondemand implementation (stage 1): icelake
simdjson::ondemand implementation (stage 2): fallback
--------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
partial_tweets<simdjson_ondemand>/manual_time     131984 ns       157684 ns         5328 best_bytes_per_sec=4.84346G best_docs_per_sec=7.66959k best_items_per_sec=766.959k bytes=631.515k bytes_per_second=4.45618G/s docs_per_sec=7.57668k/s items=100 items_per_second=757.668k/s [BEST: throughput=  4.84 GB/s doc_throughput=  7669 docs/s items=       100 avg_time=    131983 ns]

After...

$ ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"
2022-06-21T23:24:25+00:00
Running ./build/benchmark/bench_ondemand
Run on (2 X 3507.25 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 1280 KiB (x1)
  L3 Unified 55296 KiB (x1)
Load Average: 0.18, 0.60, 0.35
simdjson::dom implementation:      icelake
simdjson::ondemand implementation (stage 1): icelake
simdjson::ondemand implementation (stage 2): fallback
--------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
partial_tweets<simdjson_ondemand>/manual_time     114762 ns       139415 ns         6120 best_bytes_per_sec=5.58957G best_docs_per_sec=8.85105k best_items_per_sec=885.105k bytes=631.515k bytes_per_second=5.12491G/s docs_per_sec=8.71369k/s items=100 items_per_second=871.369k/s [BEST: throughput=  5.59 GB/s doc_throughput=  8851 docs/s items=       100 avg_time=    114761 ns]



$ ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"
2022-06-21T23:26:10+00:00
Running ./build/benchmark/bench_ondemand
Run on (2 X 2899.96 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 1280 KiB (x1)
  L3 Unified 55296 KiB (x1)
Load Average: 0.04, 0.43, 0.31
simdjson::dom implementation:      icelake
simdjson::ondemand implementation (stage 1): icelake
simdjson::ondemand implementation (stage 2): fallback
--------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
partial_tweets<simdjson_ondemand>/manual_time     114404 ns       137394 ns         6105 best_bytes_per_sec=5.59606G best_docs_per_sec=8.86132k best_items_per_sec=886.132k bytes=631.515k bytes_per_second=5.14096G/s docs_per_sec=8.74098k/s items=100 items_per_second=874.098k/s [BEST: throughput=  5.60 GB/s doc_throughput=  8861 docs/s items=       100 avg_time=    114403 ns]

So we go from 4.8 GB/s to 5.6 GB/s, a 15% performance boost.

@lemire lemire requested a review from jkeiser June 21, 2022 23:32

lemire commented Jun 22, 2022

I think I did it wrong. Having it as a separate dispatched function probably carries too much overhead. Let us investigate.
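The overhead concern can be illustrated with a toy example (illustrative only, not simdjson code): when the string routine is reached through a runtime-dispatched function pointer, the compiler cannot inline it into the hot loop, so every string pays an indirect call, which can erase the gains for short strings.

```cpp
#include <cstddef>

// Illustrative: the cost of dispatching per string.
// An indirect call through a pointer blocks inlining, so short
// functions pay call overhead on every invocation.
inline size_t count_plain_bytes(const char* s, size_t n) {
  size_t out = 0;
  for (size_t i = 0; i < n; ++i) out += (s[i] != '\\');
  return out;
}

using count_fn = size_t (*)(const char*, size_t);

size_t count_plain_bytes_indirect(const char* s, size_t n) {
  return count_plain_bytes(s, n);  // same work, reached indirectly below
}

size_t process_direct(const char* s, size_t n) {
  // The compiler can inline count_plain_bytes here and optimize the loop.
  return count_plain_bytes(s, n);
}

size_t process_dispatched(const char* s, size_t n, count_fn fn) {
  // fn is only known at runtime: no inlining, one indirect call per string.
  return fn(s, n);
}
```

Both paths compute the same result; the difference is purely in what the optimizer can do, which is why dispatching a whole parsing stage (as the existing kernels do) is cheaper than dispatching one small per-string function.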


lemire commented Jun 22, 2022

Closing without a merge, since this was the wrong design.

@lemire lemire closed this Jun 22, 2022
@lemire lemire changed the title This should improve string performance in ondemand by making the string processing runtime dispatched. [bad design] This should improve string performance in ondemand by making the string processing runtime dispatched. Jun 22, 2022