Request
For a query like "*NonDFS*", we end up treating the query as a single non-definite variable token containing wildcards. The current logic in QueryToken.cpp treats this kind of token as potentially being any variable type, without checking whether each interpretation is actually possible.
For *NonDFS*, the token clearly could not ever be an IntVar or FloatVar, because it contains characters that could not possibly appear in those kinds of variables.
The current logic ends up turning this into four subqueries against logtypes matching *NonDFS*, *\x11*, *\x12*, *\x13* (i.e. any logtype containing NonDFS, or ANY integer, float, or dictionary variable placeholder), which effectively forces decompression and a scan of the entire dataset.
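To make the fan-out concrete, here is a simplified illustration of the behaviour described above. It is not the actual code in QueryToken.cpp: the function name is hypothetical, it assumes the token begins and ends with `*`, and the only details taken from this issue are the three placeholder bytes \x11, \x12 and \x13.

```cpp
#include <initializer_list>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical illustration (not the real QueryToken.cpp logic): list the
// logtype wildcard patterns that get searched when a token such as
// "*NonDFS*" is treated as static text OR as any kind of variable.
// Assumes the token starts and ends with '*'.
std::vector<std::string> logtype_patterns_for_ambiguous_token(std::string_view token) {
    std::vector<std::string> patterns;

    // Interpretation 1: the token is static text inside the logtype.
    patterns.emplace_back(token);

    // Interpretations 2-4: the token is a variable, so the logtype only
    // contains that variable type's placeholder byte. Each of these patterns
    // matches every logtype that has at least one variable of that type.
    for (char placeholder : {'\x11', '\x12', '\x13'}) {
        patterns.push_back(std::string{"*"} + placeholder + "*");
    }
    return patterns;
}
```

Because the last three patterns do not depend on the token's contents at all, they match a very large share of logtypes, which is why the query degenerates into a near-full scan.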
This example comes from an internally reported performance bug where the query takes approximately 12m30s single-threaded over a ~300GB dataset because the generated subqueries require full decompression and scan. Hacking a fix into subquery generation to ignore the impossible IntVar and FloatVar interpretations brings query completion time down to ~3s.
Possible implementation
Instead of treating non-definite variable tokens containing wildcards as any possible type, we should try to determine which types are actually possible.
This can be achieved by writing helper functions that use heuristics to determine whether it is possible for a wildcard string to be interpreted as a representable IntVar or FloatVar. Note that a heuristic approach can lead to us still interpreting some query tokens as possibly corresponding to IntVar/FloatVar when they're not, but it should significantly cut down on invalid query interpretations in the common case.
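A minimal sketch of what such helpers might look like is below. The function names are hypothetical, and the character sets are assumptions: an integer is taken to consist only of digits and `-`, and a float additionally of `.`. It also assumes `*` and `?` are the wildcards and that `\` escapes the next character. The real encodings likely impose further limits (digit count, value range) that this sketch does not check, so it only rules out tokens that are definitely impossible.

```cpp
#include <cstddef>
#include <string_view>

namespace {
bool is_wildcard(char c) {
    return '*' == c || '?' == c;
}

bool is_digit(char c) {
    return '0' <= c && c <= '9';
}

// Returns true if every literal (non-wildcard) character in `wildcard_str`
// satisfies `allowed`. An escaped character (preceded by '\') is treated as a
// literal, so an escaped '*' counts as a literal asterisk, not a wildcard.
template <typename Predicate>
bool all_literals_satisfy(std::string_view wildcard_str, Predicate allowed) {
    for (std::size_t i = 0; i < wildcard_str.size(); ++i) {
        char c = wildcard_str[i];
        if (is_wildcard(c)) {
            // A wildcard could match characters of the right kind
            continue;
        }
        if ('\\' == c && i + 1 < wildcard_str.size()) {
            // Skip the escape character and examine the escaped literal
            c = wildcard_str[++i];
        }
        if (false == allowed(c)) {
            return false;
        }
    }
    return true;
}
}  // namespace

// Heuristic (assumed character set): false means the token can never match
// an integer variable; true only means it has not been ruled out.
bool could_be_int_var(std::string_view wildcard_str) {
    return all_literals_satisfy(wildcard_str, [](char c) {
        return is_digit(c) || '-' == c;
    });
}

// Heuristic (assumed character set): same idea for float variables, which
// may additionally contain a decimal point.
bool could_be_float_var(std::string_view wildcard_str) {
    return all_literals_satisfy(wildcard_str, [](char c) {
        return is_digit(c) || '-' == c || '.' == c;
    });
}
```

With these checks, both functions return false for `*NonDFS*`, so subquery generation could skip the IntVar and FloatVar interpretations for that token. A token such as `*123*` still passes both checks, which is the residual over-approximation mentioned above.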