Skip to content

adding benchmark to extractor#1

Merged
the-moisrex merged 2 commits intothe-moisrex:extractorfrom
simdjson:extractor_bench
Sep 21, 2024
Merged

adding benchmark to extractor#1
the-moisrex merged 2 commits intothe-moisrex:extractorfrom
simdjson:extractor_bench

Conversation

@lemire
Copy link

@lemire lemire commented Sep 21, 2024

This is a PR on top of the extractor PR at simdjson#2247

cmake --build build20  --target bench_ondemand

Preliminary benchmark suggests a small performance loss... The difference is small, but I observe it in two different benchmarks.

Can you check what you get?

partial_tweets

./build20/benchmark/bench_ondemand --benchmark_filter=partial_tweets
partial_tweets<simdjson_ondemand>/manual_time             147613 ns       160362 ns         4571 best_bytes_per_sec=4.35744G best_docs_per_sec=6.89998k best_items_per_sec=689.998k bytes=631.515k bytes_per_second=3.98438G/s docs_per_sec=6.77449k/s items=100 items_per_second=677.449k/s [BEST: throughput=  4.36 GB/s doc_throughput=  6899 docs/s items=       100 avg_time=    147612 ns]
partial_tweets<simdjson_ondemand_extract>/manual_time     160835 ns       173621 ns         4340 best_bytes_per_sec=4.14078G best_docs_per_sec=6.5569k best_items_per_sec=655.69k bytes=631.515k bytes_per_second=3.65681G/s docs_per_sec=6.21754k/s items=100 items_per_second=621.754k/s [BEST: throughput=  4.14 GB/s doc_throughput=  6556 docs/s items=       100 avg_time=    160835 ns]

kostya

./build20/benchmark/bench_ondemand --benchmark_filter=kostya
kostya<simdjson_ondemand>/manual_time           42895521 ns     46789562 ns           16 best_bytes_per_sec=3.23009G best_docs_per_sec=23.525 best_items_per_sec=12.3339M bytes=137.305M bytes_per_second=2.98109G/s docs_per_sec=23.3125/s items=524.288k items_per_second=12.2224M/s [BEST: throughput=  3.23 GB/s doc_throughput=    23 docs/s items=    524288 avg_time=  42895520 ns]
kostya<simdjson_ondemand_extract>/manual_time   45543115 ns     49405825 ns           15 best_bytes_per_sec=3.0295G best_docs_per_sec=22.064 best_items_per_sec=11.5679M bytes=137.305M bytes_per_second=2.80778G/s docs_per_sec=21.9572/s items=524.288k items_per_second=11.5119M/s [BEST: throughput=  3.03 GB/s doc_throughput=    22 docs/s items=    524288 avg_time=  45543115 ns]

@lemire
Copy link
Author

lemire commented Sep 21, 2024

@the-moisrex The problem might be algorithmic.

The extractor follows this algorithm:

  1. Visit a key in the JSON. Compare it against all target keys until one is found.
  2. Move to the next key, go back to 1.

In some instances, this is fine.

However, look how the reference ondemand code is like in the two instances that I have identified. We do...

  1. 'find_field("first key")' this moves you through the JSON up until you have found the key. In JSON files, the keys are often in predictible order and so we might be lucky and hit it first.
  2. we do 'find_field("second key")'... this starts right after 'first key'.

And so on.

When the code is written to match the order of the keys in the JSON... this second approach is going to be more efficient than the extractor code.

@lemire
Copy link
Author

lemire commented Sep 21, 2024

So, possibly, we could use a hint parameter to choose the algorithm you want to use. If you think you know the order of the keys, you could use one algorithm, otherwise you could use another... ???

@the-moisrex
Copy link
Owner

However, look how the reference ondemand code is like in the two instances that I have identified. We do...

  • 'find_field("first key")' this moves you through the JSON up until you have found the key. In JSON files, the keys are often in predictible order and so we might be lucky and hit it first.
    we do 'find_field("second key")'... this starts right after 'first key'.

And so on.

@lemire that was the idea from which I optimized find_all as well; but it wasn't as clean as I wanted it; let me add a feature to the iterator itself that finds the next key instead of searching for an specific key and leave the equality checking to us.

So, possibly, we could use a hint parameter to choose the algorithm you want to use. If you think you know the order of the keys, you could use one algorithm, otherwise you could use another... ???

I don't think a hint is a good idea. Is there a scenario that the current algorithm would perform better?

@the-moisrex
Copy link
Owner

I'm gonna merge this.

@the-moisrex the-moisrex merged commit f7bf592 into the-moisrex:extractor Sep 21, 2024
@the-moisrex
Copy link
Owner

@lemire See simdjson#2247 (comment)

@lemire
Copy link
Author

lemire commented Sep 21, 2024

@the-moisrex Conceptually, there might be different algorithms that are best. But I agree that your latest design is quite good and likely difficult to improve upon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants