Conversation
```
abundance_sum: bool = False,
query_counts: bool = False,
query_coords: bool = False,
graphs: Union[None, List[str]] = None,
```
This might be a bit confusing compared to `self.graphs` in this class. Are we sure we want to keep this naming? Maybe rename the parameter `graphs` in the API to `labels`, or similar?
That aside, how are `MultiGraphClient`s actually used? Do we theoretically expect a situation in which e.g. the first graph serves label A while the second graph serves labels B and C, and we want to query just label A, or just labels A and B?
Judging by the unit test below, this would fail. Is that the outcome we want, rather than e.g. returning empty output (and possibly some kind of warning in the logs) when querying a label that we don't know?
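The empty-output-plus-warning alternative could look roughly like the Python sketch below. `label_to_graph`, `query_one`, and `query_labels` are hypothetical stand-ins for illustration, not the actual client API:

```python
import logging
from typing import Callable, Dict, List, Optional

logger = logging.getLogger(__name__)

def query_labels(label_to_graph: Dict[str, str],
                 labels: Optional[List[str]],
                 query_one: Callable[[str, str], list]) -> Dict[str, list]:
    """Dispatch a query only to the graphs serving the requested labels.

    Unknown labels produce an empty result and a log warning instead of
    an assertion failure. All names here are illustrative.
    """
    if labels is None:
        labels = list(label_to_graph)  # no filter: query all known labels
    results = {}
    for label in labels:
        graph = label_to_graph.get(label)
        if graph is None:
            logger.warning("unknown label %r, returning empty result", label)
            results[label] = []  # empty output instead of failing
        else:
            results[label] = query_one(graph, label)
    return results
```

With `{"A": "graph1", "B": "graph2", "C": "graph2"}`, querying `["A", "D"]` would then return results for A and an empty list (plus a warning) for D.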
Yeah, true. We can try to come up with a different name. Otherwise, `MultiGraphClient` is essentially a `GraphClient` with a `ThreadPool`, and it's not used all that much right now. Maybe it's simple enough to remove; anyone can always add a `ThreadPool` on top at any time.
```
assert(json.size() == result.size());
for (Json::ArrayIndex i = 0; i < result.size(); ++i) {
    for (auto&& value : json[i]["results"]) {
        result[i]["results"].append(std::move(value));
```
Should we put some limit on the final result size, and e.g. throw an exception if we detect that it has become too large to handle? Also, it might be nice to collect some statistics on how much time we spend querying graphs vs. merging results, as the latter is effectively single-threaded due to locking.
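The statistics suggested here (time querying vs. time merging) could be collected along these lines. This is a minimal single-threaded Python sketch; `run_query` and `merge_into` are hypothetical stand-ins for the real per-chunk query and locked merge:

```python
import time

def query_and_merge(chunks, run_query, merge_into):
    """Query each chunk and merge its partial result, timing both phases.

    In the real server the query loop would run on a thread pool; the
    merge phase is the part that is serialized by the lock, so its share
    of total time is what we would want to watch.
    """
    query_time = merge_time = 0.0
    merged = []
    for chunk in chunks:
        t0 = time.monotonic()
        part = run_query(chunk)
        query_time += time.monotonic() - t0
        t0 = time.monotonic()
        merge_into(merged, part)  # effectively single-threaded under a lock
        merge_time += time.monotonic() - t0
    return merged, query_time, merge_time
```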
It just does a `push_back` into a vector, so it should be extremely fast, and space-wise too. I think there is no need to worry about that right now.
Pushing to a vector should be fast, yes; I'm just concerned about the growth rate of the output. If a query is something that occurs everywhere, and it produces, say, 1 MiB of output on one chunk, it would grow to ~2 GiB over 2k+ chunks, and then we also send it back over HTTP, right? Do we have something in place to guarantee scenarios like this don't happen? Or are we fine with them?
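The growth estimate is simple arithmetic, and a size limit of the kind discussed in this thread could be enforced during the merge. Everything below (`MAX_RESULT_BYTES`, `append_checked`) is a hypothetical sketch, not existing code:

```python
# Back-of-envelope: ~1 MiB of output per chunk over ~2k chunks is ~2 GiB,
# all of which would then also travel back over HTTP.
per_chunk = 1 * 1024 * 1024        # 1 MiB
chunks = 2048
total = per_chunk * chunks         # 2 GiB merged result

# Hypothetical guard: abort the merge once the accumulated size passes a cap.
MAX_RESULT_BYTES = 512 * 1024 * 1024  # illustrative limit, not a real config

def append_checked(merged: bytearray, part: bytes,
                   limit: int = MAX_RESULT_BYTES) -> None:
    """Append a partial result, raising once the merged size exceeds `limit`."""
    if len(merged) + len(part) > limit:
        raise RuntimeError("merged result exceeds %d bytes" % limit)
    merged.extend(part)
```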
Speaking of that:
- Is it guaranteed that the order of elements (over which `Json::ArrayIndex` goes) is the same in all chunk results, and not possibly shuffled due to multithreading? I remember it being shuffled in AWS tests when the output was in TSV format, so I had to additionally sort before merging.
- We apply `config.num_top_labels` to all chunks individually, right? But should we maybe also apply it during merging, repeatedly discarding matches outside of `num_top_labels`? Given that we try to simulate a single combined graph on all chunks. This would also ensure the result size stays on the order of magnitude of one chunk when this option is provided.
- AFAIK, when we use `num_top_labels`, individual chunks return labels in sorted order by the number of matches. I assume we might also need to re-sort them at the end after merging here, to preserve this format? Or is it ensured by JSON already?

I also don't know if there are other query parameters that we may need to account for during the merging stage.
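Re-applying the top-k filter at merge time could look like the following minimal Python sketch. The names are illustrative, and per-chunk results are simplified here to `{label: match_count}` maps rather than the real JSON result types:

```python
from collections import Counter

def merge_top_labels(chunk_results, num_top_labels):
    """Merge per-chunk top-k label counts and re-apply the top-k filter.

    Summing match counts per label across chunks and truncating to the
    global top `num_top_labels` keeps the merged output on the order of
    one chunk's size, and `most_common` also restores the sorted-by-count
    order that individual chunks return.
    """
    totals = Counter()
    for chunk in chunk_results:       # each chunk: {label: match_count}
        totals.update(chunk)
    return totals.most_common(num_top_labels)
```

For example, merging `[{"A": 5, "B": 3}, {"B": 4, "C": 1}]` with `num_top_labels=2` yields `[("B", 7), ("A", 5)]`.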
That's a great point, I need to add the `num_top_labels` filter.
It's a bit trickier than I thought because of the many different result types (counts/coords/presence bitmap/etc.).
Let's do this in another PR, so we don't delay merging this one.
I'm thinking of two things:
- Always return sorted results in the server mode
- Filter by `top_labels`
`graphs` in the request json

Usage:

```
./metagraph server_query indexes.txt --mmap --parallel 50
```

where indexes.txt is a csv file with graph names and paths: