object::find_all: find multiple keys in one go#2235
object::find_all: find multiple keys in one go#2235the-moisrex wants to merge 4 commits intosimdjson:masterfrom
Conversation
Instead of iterating through an object to find multiple keys, or iterating multiple times, we can now use `find_all`.
|
@the-moisrex Looks very exciting. Personally I would embrace it but I am going to invite feedback. |
|
Definitely |
|
|
||
| template <typename... StrT> | ||
| requires details::all_string_views<StrT...> | ||
| [[nodiscard]] simdjson_inline std::array<simdjson_result<value>, sizeof...(StrT)> object::find_all(StrT... keys) && noexcept { |
There was a problem hiding this comment.
Can you tell me more about the [[nodiscard]] usage here. Why is it a problem if the return of this function is not actually used?
There was a problem hiding this comment.
@FranciscoThiesen, Is there a reasonable place where someone would call .find_all and just ignore the result? If that happens, it's pretty much a bug, and [[nodiscard]] helps catch those bugs; and if they actually need to do that, it's better if they use std::ignore = ....find_all(...) to state the intent for everyone.
We could actually add [[nodiscard]] to a lot of places in this library, but since it's a C++11 library, we either have to get macro-creative, or just not do it; but there's no reason for a C++20-only part of the library not to have it.
Same goes for constexpr btw. We can make a lot of machinery in the library constexpr-friendly.
|
The simdjson library makes extensive use of To be honest, I don't think it ever caught a bug, so maybe it is useless. In theory it ought to help users. |
|
One place where warn_unused matters a lot is for methods with side effects that can return an error. For methods without side effects, it's weird not to use the result, but it won't really help you fix any bugs. |
|
Note that this PR is deliberately left open so we can review it. I plan to test it further soon. If we can make it work, I think we should! |
|
@lemire I would've added C++23's Multidimensional subscript operator support, but I thought maybe that we could find better use for that one. We also can have something like this in C++20, but we could simplify that even further with C++26's reflection, so I've hesitated to implement any of this: obj.extract_to({{"make", car.make}, {"model", car.model}, {"year", car.year}});or: extract(car.make, car.model, car.year) = obj.find_all("make", "model", "year");or we could get a bit more clever (you can choose your poison of operator almost however you like as long as the operator precedence allows it): obj.extract() / "make" >> car.make / "model" >> car.model / "year" >> car.year;or: obj.extract("make", "model", "year") >> car.make >> car.model >> car.year; |
|
@the-moisrex Just a ping: do not assume that this PR is dead or anything such thing. I am deliberately leaving it is here because it is so important that I want to carefully study it. I am fully expecting to merge and release this functionality, once we verify it carefully. |
|
@lemire I did do the simplest implementation, but as I showed, there can be even simpler syntaxes for users; each have their own pros and cons. I always get into a paralysis analysis situation when it comes into these things! |
|
@the-moisrex Yes. I also want to make sure that we can use the opportunity to improve the performance. |
|
@the-moisrex I did some work. There is a minor issue that I could fix easily: we would want 'find_all' to work on Another concern, more serious, is that in simdjson, you are expected to be in one object at a time. So let us suppose we want to port the following code (from our benchmarks) to find_all: bool standard_consume(simdjson::ondemand::parser &parser,
simdjson::padded_string &json,
std::vector<tweet> &result) {
result.clear();
auto read_user = [](simdjson::ondemand::object user) {
return twitter_user{user.find_field("id"), user.find_field("screen_name")};
};
auto doc = parser.iterate(json);
for (simdjson::ondemand::object tweet_value : doc.find_field("statuses")) {
result.emplace_back(
tweet{tweet_value.find_field("created_at"),
tweet_value.find_field("id"), tweet_value.find_field("text"),
read_user(tweet_value.find_field("user")),
tweet_value.find_field("retweet_count"),
tweet_value.find_field("favorite_count")});
}
return true;
}Notice how the value matching So we can do something like this... for (simdjson::ondemand::object tweet_value : doc.find_field("statuses")) {
auto [created_at, id, text, in_reply_to_status_id, retweet_count,
favorite_count] =
tweet_value.find_all("created_at", "id", "text",
"in_reply_to_status_id", "retweet_count",
"favorite_count");
auto [user_id, screen_name] =
tweet_value["user"].get_object().find_all("id", "screen_name");
result.emplace_back(tweet{
created_at, id, text,
twitter_user{user_id, screen_name}, retweet_count, favorite_count});
}That's not too bad. A final problem is performance. In the find_all call, we have to build these temporary |
|
@the-moisrex I invited you to a private repository where I wrote a benchmark based on this PR. If you get around to it, it would be great if you could check it out. (I am keeping some speculative work private so as to not create noise.) |
|
Can we close this in favour of #2247? |
|
@lemire sure, but only due to the performance issues, if you know a way to make it faster, then I suggest having both since P.S. I'm also working on changing the The use for that would be if the user wants to append to a vector instead of re-creating a temporary vector and re-add it to another vector. The error message: [1/6] Building CXX object tools/CMakeFiles/json2json.dir/json2json.cpp.o
FAILED: tools/CMakeFiles/json2json.dir/json2json.cpp.o
/usr/bin/clang++ -DSIMDJSON_AVX512_ALLOWED=1 -DSIMDJSON_THREADS_ENABLED=1 -DSIMDJSON_UTF8VALIDATION=1 -I/simdjson/include -I/simdjson/src -I/simdjson/dependencies/.cache/cxxopts/include -g -std=c++17 -Og -fPIC -Werror -Wall -Wextra -Weffc++ -Wsign-compare -Wshadow -Wwrite-strings -Wpointer-arith -Winit-self -Wconversion -Wno-sign-conversion -MD -MT tools/CMakeFiles/json2json.dir/json2json.cpp.o -MF tools/CMakeFiles/json2json.dir/json2json.cpp.o.d -o tools/CMakeFiles/json2json.dir/json2json.cpp.o -c /simdjson/tools/json2json.cpp
/simdjson/tools/json2json.cpp:95:27: error: no member named 'exceptions' in namespace 'cxxopts'
95 | } catch (const cxxopts::exceptions::option_has_no_value& e) {
| ~~~~~~~~~^
1 error generated.
[2/6] Building CXX object tools/CMakeFiles/jsonstats.dir/jsonstats.cpp.o
FAILED: tools/CMakeFiles/jsonstats.dir/jsonstats.cpp.o
/usr/bin/clang++ -DSIMDJSON_AVX512_ALLOWED=1 -DSIMDJSON_THREADS_ENABLED=1 -DSIMDJSON_UTF8VALIDATION=1 -I/simdjson/include -I/simdjson/src -I/simdjson/dependencies/.cache/cxxopts/include -g -std=c++17 -Og -fPIC -Werror -Wall -Wextra -Weffc++ -Wsign-compare -Wshadow -Wwrite-strings -Wpointer-arith -Winit-self -Wconversion -Wno-sign-conversion -MD -MT tools/CMakeFiles/jsonstats.dir/jsonstats.cpp.o -MF tools/CMakeFiles/jsonstats.dir/jsonstats.cpp.o.d -o tools/CMakeFiles/jsonstats.dir/jsonstats.cpp.o -c /simdjson/tools/jsonstats.cpp
/simdjson/tools/jsonstats.cpp:281:27: error: no member named 'exceptions' in namespace 'cxxopts'
281 | } catch (const cxxopts::exceptions::option_has_no_value& e) {
| ~~~~~~~~~^
1 error generated.
[3/6] Building CXX object tools/CMakeFiles/minify.dir/minify.cpp.o
FAILED: tools/CMakeFiles/minify.dir/minify.cpp.o
/usr/bin/clang++ -DSIMDJSON_AVX512_ALLOWED=1 -DSIMDJSON_THREADS_ENABLED=1 -DSIMDJSON_UTF8VALIDATION=1 -I/simdjson/include -I/simdjson/src -I/simdjson/dependencies/.cache/cxxopts/include -g -std=c++17 -Og -fPIC -Werror -Wall -Wextra -Weffc++ -Wsign-compare -Wshadow -Wwrite-strings -Wpointer-arith -Winit-self -Wconversion -Wno-sign-conversion -MD -MT tools/CMakeFiles/minify.dir/minify.cpp.o -MF tools/CMakeFiles/minify.dir/minify.cpp.o.d -o tools/CMakeFiles/minify.dir/minify.cpp.o -c /simdjson/tools/minify.cpp
/simdjson/tools/minify.cpp:114:27: error: no member named 'exceptions' in namespace 'cxxopts'
114 | } catch (const cxxopts::exceptions::option_has_no_value& e) {
| ~~~~~~~~~^
1 error generated.
ninja: build stopped: subcommand failed.
|
Can you give me a reproducible case? The reference to |
It was fixed by removing |
Instead of iterating through an object to find multiple keys, or iterating multiple times, we can now use
find_all.Instead of this:
We can do this now:
There's still a few places where we might need to change like in
documentandvalue; but I first wanted to see if the maintainers wanted anything like this or not.