adding benchmark to extractor by lemire · Pull Request #1 · the-moisrex/simdjson-custom-types

lemire · 2024-09-21T00:40:41Z

This is a PR on top of the extractor PR at simdjson#2247

cmake --build build20  --target bench_ondemand

Preliminary benchmark suggests a small performance loss... The difference is small, but I observe it in two different benchmarks.

Can you check what you get?

partial_tweets

./build20/benchmark/bench_ondemand --benchmark_filter=partial_tweets

partial_tweets<simdjson_ondemand>/manual_time             147613 ns       160362 ns         4571 best_bytes_per_sec=4.35744G best_docs_per_sec=6.89998k best_items_per_sec=689.998k bytes=631.515k bytes_per_second=3.98438G/s docs_per_sec=6.77449k/s items=100 items_per_second=677.449k/s [BEST: throughput=  4.36 GB/s doc_throughput=  6899 docs/s items=       100 avg_time=    147612 ns]
partial_tweets<simdjson_ondemand_extract>/manual_time     160835 ns       173621 ns         4340 best_bytes_per_sec=4.14078G best_docs_per_sec=6.5569k best_items_per_sec=655.69k bytes=631.515k bytes_per_second=3.65681G/s docs_per_sec=6.21754k/s items=100 items_per_second=621.754k/s [BEST: throughput=  4.14 GB/s doc_throughput=  6556 docs/s items=       100 avg_time=    160835 ns]

kostya

./build20/benchmark/bench_ondemand --benchmark_filter=kostya

kostya<simdjson_ondemand>/manual_time           42895521 ns     46789562 ns           16 best_bytes_per_sec=3.23009G best_docs_per_sec=23.525 best_items_per_sec=12.3339M bytes=137.305M bytes_per_second=2.98109G/s docs_per_sec=23.3125/s items=524.288k items_per_second=12.2224M/s [BEST: throughput=  3.23 GB/s doc_throughput=    23 docs/s items=    524288 avg_time=  42895520 ns]
kostya<simdjson_ondemand_extract>/manual_time   45543115 ns     49405825 ns           15 best_bytes_per_sec=3.0295G best_docs_per_sec=22.064 best_items_per_sec=11.5679M bytes=137.305M bytes_per_second=2.80778G/s docs_per_sec=21.9572/s items=524.288k items_per_second=11.5119M/s [BEST: throughput=  3.03 GB/s doc_throughput=    22 docs/s items=    524288 avg_time=  45543115 ns]

lemire · 2024-09-21T00:57:24Z

@the-moisrex The problem might be algorithmic.

The extractor follows this algorithm:

Visit a key in the JSON. Compare it against all target keys until one is found.
Move to the next key, go back to 1.

In some instances, this is fine.

However, look how the reference ondemand code is like in the two instances that I have identified. We do...

'find_field("first key")' this moves you through the JSON up until you have found the key. In JSON files, the keys are often in predictible order and so we might be lucky and hit it first.
we do 'find_field("second key")'... this starts right after 'first key'.

And so on.

When the code is written to match the order of the keys in the JSON... this second approach is going to be more efficient than the extractor code.

lemire · 2024-09-21T01:00:57Z

So, possibly, we could use a hint parameter to choose the algorithm you want to use. If you think you know the order of the keys, you could use one algorithm, otherwise you could use another... ???

the-moisrex · 2024-09-21T09:34:59Z

However, look how the reference ondemand code is like in the two instances that I have identified. We do...

'find_field("first key")' this moves you through the JSON up until you have found the key. In JSON files, the keys are often in predictible order and so we might be lucky and hit it first.
we do 'find_field("second key")'... this starts right after 'first key'.

And so on.

@lemire that was the idea from which I optimized find_all as well; but it wasn't as clean as I wanted it; let me add a feature to the iterator itself that finds the next key instead of searching for an specific key and leave the equality checking to us.

So, possibly, we could use a hint parameter to choose the algorithm you want to use. If you think you know the order of the keys, you could use one algorithm, otherwise you could use another... ???

I don't think a hint is a good idea. Is there a scenario that the current algorithm would perform better?

the-moisrex · 2024-09-21T10:39:10Z

I'm gonna merge this.

the-moisrex · 2024-09-21T10:52:06Z

@lemire See simdjson#2247 (comment)

lemire · 2024-09-21T13:16:29Z

@the-moisrex Conceptually, there might be different algorithms that are best. But I agree that your latest design is quite good and likely difficult to improve upon.

lemire added 2 commits September 20, 2024 20:32

adding benchmark to extractor

949bc51

update

677674d

lemire mentioned this pull request Sep 21, 2024

object::extract: get a bunch of keys in one go simdjson/simdjson#2247

Open

the-moisrex merged commit f7bf592 into the-moisrex:extractor Sep 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

adding benchmark to extractor#1

adding benchmark to extractor#1
the-moisrex merged 2 commits intothe-moisrex:extractorfrom
simdjson:extractor_bench

lemire commented Sep 21, 2024

Uh oh!

lemire commented Sep 21, 2024

Uh oh!

lemire commented Sep 21, 2024

Uh oh!

the-moisrex commented Sep 21, 2024

Uh oh!

the-moisrex commented Sep 21, 2024

Uh oh!

the-moisrex commented Sep 21, 2024

Uh oh!

lemire commented Sep 21, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lemire commented Sep 21, 2024

partial_tweets

kostya

Uh oh!

lemire commented Sep 21, 2024

Uh oh!

lemire commented Sep 21, 2024

Uh oh!

the-moisrex commented Sep 21, 2024

Uh oh!

the-moisrex commented Sep 21, 2024

Uh oh!

the-moisrex commented Sep 21, 2024

Uh oh!

lemire commented Sep 21, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants