Skip to content

object::find_all: find multiple keys in one go#2235

Closed
the-moisrex wants to merge 4 commits intosimdjson:masterfrom
the-moisrex:multi-get
Closed

object::find_all: find multiple keys in one go#2235
the-moisrex wants to merge 4 commits intosimdjson:masterfrom
the-moisrex:multi-get

Conversation

@the-moisrex
Copy link
Member

Instead of iterating through an object to find multiple keys, or iterating multiple times, we can now use find_all.

Instead of this:

     for (auto field : obj) {
       simdjson::ondemand::raw_json_string key;
       error = field.key().get(key);
       if (error) {
         return error;
       }
       if (key == "make") {
         error = field.value().get_string(car.make);
         if (error) {
           return error;
         }
       } else if (key == "model") {
         error = field.value().get_string(car.model);
         if (error) {
           return error;
         }
       } else if (key == "year") {
         error = field.value().get(car.year);
         if (error) {
           return error;
         }
     }

We can do this now:

    auto [make, model, year] = obj.find_all("make", "model", "year");
    make.get(car.make);
    model.get(car.model);
    year.get(car.year);

There's still a few places where we might need to change like in document and value; but I first wanted to see if the maintainers wanted anything like this or not.

Instead of iterating through an object to find multiple keys, or iterating multiple times, we can now use `find_all`.
@lemire
Copy link
Member

lemire commented Aug 10, 2024

@the-moisrex Looks very exciting.

Personally I would embrace it but I am going to invite feedback.

@FranciscoThiesen
Copy link
Member

Definitely obj.find_all(...) looks much nicer!


template <typename... StrT>
requires details::all_string_views<StrT...>
[[nodiscard]] simdjson_inline std::array<simdjson_result<value>, sizeof...(StrT)> object::find_all(StrT... keys) && noexcept {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you tell me more about the [[nodiscard]] usage here. Why is it a problem if the return of this function is not actually used?

@the-moisrex

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FranciscoThiesen, Is there a reasonable place where someone would call .find_all and just ignore the result? If that happens, it's pretty much a bug, and [[nodiscard]] helps catch those bugs; and if they actually need to do that, it's better if they use std::ignore = ....find_all(...) to state the intent for everyone.

We could actually add [[nodiscard]] to a lot of places in this library, but since it's a C++11 library, we either have to get macro-creative, or just not do it; but there's no reason for a C++20-only part of the library not to have it.

Same goes for constexpr btw. We can make a lot of machinery in the library constexpr-friendly.

@lemire
Copy link
Member

lemire commented Aug 11, 2024

The simdjson library makes extensive use of simdjson_warn_unused.

To be honest, I don't think it ever caught a bug, so maybe it is useless. In theory it ought to help users.

@jkeiser
Copy link
Member

jkeiser commented Aug 12, 2024

One place where warn_unused matters a lot is for methods with side effects that can return an error. For methods without side effects, it's weird not to use the result, but it won't really help you fix any bugs.

@lemire
Copy link
Member

lemire commented Aug 14, 2024

Note that this PR is deliberately left open so we can review it. I plan to test it further soon.

If we can make it work, I think we should!

@the-moisrex
Copy link
Member Author

the-moisrex commented Aug 14, 2024

@lemire I would've added C++23's Multidimensional subscript operator support, but I thought maybe that we could find better use for that one.

We also can have something like this in C++20, but we could simplify that even further with C++26's reflection, so I've hesitated to implement any of this:

  obj.extract_to({{"make", car.make}, {"model", car.model}, {"year", car.year}});

or:

  extract(car.make, car.model, car.year) = obj.find_all("make", "model", "year");

or we could get a bit more clever (you can choose your poison of operator almost however you like as long as the operator precedence allows it):

  obj.extract() / "make" >> car.make / "model" >> car.model / "year" >> car.year;

or:

  obj.extract("make", "model", "year") >> car.make >> car.model >> car.year;

@lemire
Copy link
Member

lemire commented Sep 3, 2024

@the-moisrex Just a ping: do not assume that this PR is dead or anything such thing. I am deliberately leaving it is here because it is so important that I want to carefully study it.

I am fully expecting to merge and release this functionality, once we verify it carefully.

@the-moisrex
Copy link
Member Author

@lemire I did do the simplest implementation, but as I showed, there can be even simpler syntaxes for users; each have their own pros and cons.

I always get into a paralysis analysis situation when it comes into these things!

@lemire
Copy link
Member

lemire commented Sep 3, 2024

@the-moisrex Yes. I also want to make sure that we can use the opportunity to improve the performance.

@lemire
Copy link
Member

lemire commented Sep 11, 2024

@the-moisrex I did some work.

There is a minor issue that I could fix easily: we would want 'find_all' to work on simdjson_result<value> as well as on simdjson_result<[value](ondemand::object)> instances and not just on ondemand::object instances. That's trivial to solve.

Another concern, more serious, is that in simdjson, you are expected to be in one object at a time.

So let us suppose we want to port the following code (from our benchmarks) to find_all:

bool standard_consume(simdjson::ondemand::parser &parser,
                      simdjson::padded_string &json,
                      std::vector<tweet> &result) {
  result.clear();
  auto read_user = [](simdjson::ondemand::object user) {
    return twitter_user{user.find_field("id"), user.find_field("screen_name")};
  };
  auto doc = parser.iterate(json);
  for (simdjson::ondemand::object tweet_value : doc.find_field("statuses")) {
    result.emplace_back(
        tweet{tweet_value.find_field("created_at"),
              tweet_value.find_field("id"), tweet_value.find_field("text"),
              read_user(tweet_value.find_field("user")),
              tweet_value.find_field("retweet_count"),
              tweet_value.find_field("favorite_count")});
  }
  return true;
}

Notice how the value matching user is an object. This means that we have to handle it separately.

So we can do something like this...

  for (simdjson::ondemand::object tweet_value : doc.find_field("statuses")) {
    auto [created_at, id, text, in_reply_to_status_id, retweet_count,
          favorite_count] =
        tweet_value.find_all("created_at", "id", "text",
                             "in_reply_to_status_id", "retweet_count",
                             "favorite_count");
    auto [user_id, screen_name] =
        tweet_value["user"].get_object().find_all("id", "screen_name");
    result.emplace_back(tweet{
        created_at, id, text, 
        twitter_user{user_id, screen_name}, retweet_count, favorite_count});
  }

That's not too bad.

A final problem is performance. In the find_all call, we have to build these temporary ondemand::value instances, which are basically a pointer inside the document. They are probably stored on stack. This implies a penalty compared to code that consumes the values immediately. In my tests, the penalty is not trivial.

@lemire
Copy link
Member

lemire commented Sep 11, 2024

@the-moisrex I invited you to a private repository where I wrote a benchmark based on this PR. If you get around to it, it would be great if you could check it out.

(I am keeping some speculative work private so as to not create noise.)

@lemire
Copy link
Member

lemire commented Sep 18, 2024

@the-moisrex

Can we close this in favour of #2247?

@the-moisrex
Copy link
Member Author

@lemire sure, but only due to the performance issues, if you know a way to make it faster, then I suggest having both since .find_all returns, but .extract puts it into something meaning we need to have a place to put it with extract, but not in find_all.


P.S. I'm also working on changing the tag_invoke signature (have already done the signature change, but there's an maybe related, maybe unrelated error message comes up) to make it so that we can put its value somewhere instead of having the tag_invoke construct a new object.

The use for that would be if the user wants to append to a vector instead of re-creating a temporary vector and re-add it to another vector.

The branch

The error message:

[1/6] Building CXX object tools/CMakeFiles/json2json.dir/json2json.cpp.o
FAILED: tools/CMakeFiles/json2json.dir/json2json.cpp.o
/usr/bin/clang++ -DSIMDJSON_AVX512_ALLOWED=1 -DSIMDJSON_THREADS_ENABLED=1 -DSIMDJSON_UTF8VALIDATION=1 -I/simdjson/include -I/simdjson/src -I/simdjson/dependencies/.cache/cxxopts/include -g -std=c++17 -Og -fPIC -Werror -Wall -Wextra -Weffc++ -Wsign-compare -Wshadow -Wwrite-strings -Wpointer-arith -Winit-self -Wconversion -Wno-sign-conversion -MD -MT tools/CMakeFiles/json2json.dir/json2json.cpp.o -MF tools/CMakeFiles/json2json.dir/json2json.cpp.o.d -o tools/CMakeFiles/json2json.dir/json2json.cpp.o -c /simdjson/tools/json2json.cpp
/simdjson/tools/json2json.cpp:95:27: error: no member named 'exceptions' in namespace 'cxxopts'
   95 |   } catch (const cxxopts::exceptions::option_has_no_value& e) {
      |                  ~~~~~~~~~^
1 error generated.
[2/6] Building CXX object tools/CMakeFiles/jsonstats.dir/jsonstats.cpp.o
FAILED: tools/CMakeFiles/jsonstats.dir/jsonstats.cpp.o
/usr/bin/clang++ -DSIMDJSON_AVX512_ALLOWED=1 -DSIMDJSON_THREADS_ENABLED=1 -DSIMDJSON_UTF8VALIDATION=1 -I/simdjson/include -I/simdjson/src -I/simdjson/dependencies/.cache/cxxopts/include -g -std=c++17 -Og -fPIC -Werror -Wall -Wextra -Weffc++ -Wsign-compare -Wshadow -Wwrite-strings -Wpointer-arith -Winit-self -Wconversion -Wno-sign-conversion -MD -MT tools/CMakeFiles/jsonstats.dir/jsonstats.cpp.o -MF tools/CMakeFiles/jsonstats.dir/jsonstats.cpp.o.d -o tools/CMakeFiles/jsonstats.dir/jsonstats.cpp.o -c /simdjson/tools/jsonstats.cpp
/simdjson/tools/jsonstats.cpp:281:27: error: no member named 'exceptions' in namespace 'cxxopts'
  281 |   } catch (const cxxopts::exceptions::option_has_no_value& e) {
      |                  ~~~~~~~~~^
1 error generated.
[3/6] Building CXX object tools/CMakeFiles/minify.dir/minify.cpp.o
FAILED: tools/CMakeFiles/minify.dir/minify.cpp.o
/usr/bin/clang++ -DSIMDJSON_AVX512_ALLOWED=1 -DSIMDJSON_THREADS_ENABLED=1 -DSIMDJSON_UTF8VALIDATION=1 -I/simdjson/include -I/simdjson/src -I/simdjson/dependencies/.cache/cxxopts/include -g -std=c++17 -Og -fPIC -Werror -Wall -Wextra -Weffc++ -Wsign-compare -Wshadow -Wwrite-strings -Wpointer-arith -Winit-self -Wconversion -Wno-sign-conversion -MD -MT tools/CMakeFiles/minify.dir/minify.cpp.o -MF tools/CMakeFiles/minify.dir/minify.cpp.o.d -o tools/CMakeFiles/minify.dir/minify.cpp.o -c /simdjson/tools/minify.cpp
/simdjson/tools/minify.cpp:114:27: error: no member named 'exceptions' in namespace 'cxxopts'
  114 |   } catch (const cxxopts::exceptions::option_has_no_value& e) {
      |                  ~~~~~~~~~^
1 error generated.
ninja: build stopped: subcommand failed.

@lemire
Copy link
Member

lemire commented Sep 18, 2024

@the-moisrex

The error message: (...)

Can you give me a reproducible case?

The reference to cxxopts::exceptions::option_has_no_value has been in the project for a long time.

@the-moisrex
Copy link
Member Author

@the-moisrex

The error message: (...)

Can you give me a reproducible case?

The reference to cxxopts::exceptions::option_has_no_value has been in the project for a long time.

It was fixed by removing dependencies/.caches directory; I explained in #2256

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants