
[bad design] This should improve string performance in ondemand by making the string processing runtime dispatched.#1847

Closed
lemire wants to merge 4 commits into master from dlemire/exposing_parse_string

Conversation


@lemire lemire commented Jun 21, 2022

The current on-demand front-end will, by default, use a slow parse_string that does not benefit from the processor's best instructions. We can make it available from the runtime-dispatched kernels instead and possibly gain quite a bit of performance in some cases.

This should not affect people who compile simdjson for their processor.
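The idea can be sketched as follows. This is a simplified, hypothetical illustration (the names `string_parser`, `active_parser`, and the CPU probe are invented here; simdjson's real kernel-dispatch machinery differs in detail): each kernel supplies its own string unescaper behind a common interface, and the best one for the running processor is selected once at startup.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical sketch of per-kernel dispatch for string parsing.
// Each CPU kernel provides its own unescaper; the library picks the
// best one for the running processor at startup.
struct string_parser {
  virtual const char* name() const = 0;
  // Copies src into dst up to the closing quote, resolving escapes;
  // returns one past the last byte written.
  virtual char* parse_string(const char* src, char* dst) const = 0;
  virtual ~string_parser() = default;
};

struct fallback_parser final : string_parser {
  const char* name() const override { return "fallback"; }
  char* parse_string(const char* src, char* dst) const override {
    // Byte-at-a-time: portable but slow (only '\n' handled in this sketch).
    while (*src && *src != '"') {
      if (*src == '\\' && src[1] == 'n') { *dst++ = '\n'; src += 2; }
      else { *dst++ = *src++; }
    }
    return dst;
  }
};

struct icelake_parser final : string_parser {
  const char* name() const override { return "icelake"; }
  char* parse_string(const char* src, char* dst) const override {
    // A real kernel would scan 64 bytes per iteration with AVX-512;
    // the scalar logic stands in here so the sketch stays self-contained.
    return fallback_parser{}.parse_string(src, dst);
  }
};

// Runtime selection; real code would query CPUID for the needed features.
static bool cpu_has_avx512() { return false; }  // stand-in probe

const string_parser& active_parser() {
  static const fallback_parser fallback;
  static const icelake_parser icelake;
  static const string_parser& chosen =
      cpu_has_avx512() ? static_cast<const string_parser&>(icelake)
                       : static_cast<const string_parser&>(fallback);
  return chosen;
}
```

Callers go through `active_parser().parse_string(...)` and automatically get the fastest kernel available, which is exactly what a binary compiled for a generic target cannot otherwise do.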

On icelake (AWS, GCC 11), I get...

Before... (two runs)

$ ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"
2022-06-21T23:26:34+00:00
Running ./build/benchmark/bench_ondemand
Run on (2 X 2899.96 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 1280 KiB (x1)
  L3 Unified 55296 KiB (x1)
Load Average: 0.03, 0.40, 0.30
simdjson::dom implementation:      icelake
simdjson::ondemand implementation (stage 1): icelake
simdjson::ondemand implementation (stage 2): fallback
--------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
partial_tweets<simdjson_ondemand>/manual_time     131965 ns       156473 ns         5277 best_bytes_per_sec=4.84324G best_docs_per_sec=7.66924k best_items_per_sec=766.924k bytes=631.515k bytes_per_second=4.45682G/s docs_per_sec=7.57776k/s items=100 items_per_second=757.776k/s [BEST: throughput=  4.84 GB/s doc_throughput=  7669 docs/s items=       100 avg_time=    131965 ns]

$ ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"
2022-06-21T23:24:31+00:00
Running ./build/benchmark/bench_ondemand
Run on (2 X 2899.96 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 1280 KiB (x1)
  L3 Unified 55296 KiB (x1)
Load Average: 0.16, 0.59, 0.34
simdjson::dom implementation:      icelake
simdjson::ondemand implementation (stage 1): icelake
simdjson::ondemand implementation (stage 2): fallback
--------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
partial_tweets<simdjson_ondemand>/manual_time     131984 ns       157684 ns         5328 best_bytes_per_sec=4.84346G best_docs_per_sec=7.66959k best_items_per_sec=766.959k bytes=631.515k bytes_per_second=4.45618G/s docs_per_sec=7.57668k/s items=100 items_per_second=757.668k/s [BEST: throughput=  4.84 GB/s doc_throughput=  7669 docs/s items=       100 avg_time=    131983 ns]

After...

$ ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"
2022-06-21T23:24:25+00:00
Running ./build/benchmark/bench_ondemand
Run on (2 X 3507.25 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 1280 KiB (x1)
  L3 Unified 55296 KiB (x1)
Load Average: 0.18, 0.60, 0.35
simdjson::dom implementation:      icelake
simdjson::ondemand implementation (stage 1): icelake
simdjson::ondemand implementation (stage 2): fallback
--------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
partial_tweets<simdjson_ondemand>/manual_time     114762 ns       139415 ns         6120 best_bytes_per_sec=5.58957G best_docs_per_sec=8.85105k best_items_per_sec=885.105k bytes=631.515k bytes_per_second=5.12491G/s docs_per_sec=8.71369k/s items=100 items_per_second=871.369k/s [BEST: throughput=  5.59 GB/s doc_throughput=  8851 docs/s items=       100 avg_time=    114761 ns]



$ ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"
2022-06-21T23:26:10+00:00
Running ./build/benchmark/bench_ondemand
Run on (2 X 2899.96 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 1280 KiB (x1)
  L3 Unified 55296 KiB (x1)
Load Average: 0.04, 0.43, 0.31
simdjson::dom implementation:      icelake
simdjson::ondemand implementation (stage 1): icelake
simdjson::ondemand implementation (stage 2): fallback
--------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
partial_tweets<simdjson_ondemand>/manual_time     114404 ns       137394 ns         6105 best_bytes_per_sec=5.59606G best_docs_per_sec=8.86132k best_items_per_sec=886.132k bytes=631.515k bytes_per_second=5.14096G/s docs_per_sec=8.74098k/s items=100 items_per_second=874.098k/s [BEST: throughput=  5.60 GB/s doc_throughput=  8861 docs/s items=       100 avg_time=    114403 ns]

So we go from 4.8 GB/s to 5.6 GB/s, a 15% performance boost.

@lemire lemire requested a review from jkeiser June 21, 2022 23:32

lemire commented Jun 22, 2022

I think I did it wrong. Having it as a separate dispatched function probably carries too much overhead. Let us investigate.
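The overhead concern can be illustrated with a toy example (illustrative only, not simdjson code): when the string routine is reached through a runtime-dispatched function pointer, the compiler cannot inline it into the hot loop, so every string pays an indirect call, which can erase the gains for short strings.

```cpp
#include <cstddef>

// Illustrative: the cost of dispatching per string.
// An indirect call through a pointer blocks inlining, so short
// functions pay call overhead on every invocation.
inline size_t count_plain_bytes(const char* s, size_t n) {
  size_t out = 0;
  for (size_t i = 0; i < n; ++i) out += (s[i] != '\\');
  return out;
}

using count_fn = size_t (*)(const char*, size_t);

size_t count_plain_bytes_indirect(const char* s, size_t n) {
  return count_plain_bytes(s, n);  // same work, reached indirectly below
}

size_t process_direct(const char* s, size_t n) {
  // The compiler can inline count_plain_bytes here and optimize the loop.
  return count_plain_bytes(s, n);
}

size_t process_dispatched(const char* s, size_t n, count_fn fn) {
  // fn is only known at runtime: no inlining, one indirect call per string.
  return fn(s, n);
}
```

Both paths compute the same result; the difference is purely in what the optimizer can do, which is why dispatching a whole parsing stage (as the existing kernels do) is cheaper than dispatching one small per-string function.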


lemire commented Jun 22, 2022

Closing without a merge, since this was the wrong design.

@lemire lemire closed this Jun 22, 2022
@lemire lemire changed the title This should improve string performance in ondemand by making the string processing runtime dispatched. [bad design] This should improve string performance in ondemand by making the string processing runtime dispatched. Jun 22, 2022