feat(core): Eliminate impossible token types for non-definite variable tokens to skip invalid subquery interpretations that are causing unnecessary decompression and scan. #1864

@gibber9809

Description

Request

For a query like "*NonDFS*", we end up treating the query as a single non-definite variable token containing wildcards. The current logic in QueryToken.cpp treats this kind of token as potentially being any variable type, without checking whether each interpretation is actually possible.

For *NonDFS*, the token could never be an IntVar or FloatVar, because it contains characters that cannot appear in those kinds of variables.

The current logic turns this into four subqueries against logtypes matching *NonDFS*, *\x11*, *\x12*, *\x13* (i.e. any logtypes containing NonDFS, or ANY integer, float, or dictionary variable placeholder), which effectively results in decompression and scan of the entire dataset.

This example comes from an internally reported performance bug where the query takes ~12m30s single-threaded over a ~300GB dataset because the generated subqueries require full decompression and scan. Hacking a fix into subquery generation to ignore the impossible IntVar and FloatVar interpretations brings query completion time down to ~3s.

Possible implementation

Instead of treating non-definite variable tokens containing wildcards as any possible type, we should try to determine which types are actually possible.

This can be achieved by writing helper functions that use heuristics to determine whether a wildcard string could be interpreted as a representable IntVar or FloatVar. Note that with a heuristic approach we may still interpret some query tokens as possibly corresponding to IntVar/FloatVar when they cannot, but it should significantly cut down on invalid query interpretations in the common case.
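
A minimal sketch of what such helpers might look like is below. The function names (`could_be_int_var`, `could_be_float_var`, `could_be_numeric_var`) are hypothetical and do not exist in QueryToken.cpp. The sketch assumes an IntVar is an optional leading '-' followed by decimal digits, that a FloatVar additionally contains a single '.', and that '*' and '?' are the only wildcards. It is deliberately conservative: it returns false only when a literal character rules the type out, and does not check things like digit count or overflow.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of the existing codebase).
// Returns false only if some literal (non-wildcard) character in `token`
// makes a numeric interpretation impossible; otherwise returns true.
// Assumptions: '*' matches zero or more characters, '?' matches exactly one,
// and escape sequences are not handled specially (a backslash is simply a
// non-numeric literal, so the token is rejected).
bool could_be_numeric_var(std::string_view token, std::size_t max_decimal_points) {
    std::size_t num_decimal_points = 0;
    // True while every character seen so far is '*', i.e. while the current
    // position could still be the first character of the matched value.
    bool only_stars_so_far = true;

    for (char const c : token) {
        if ('*' == c) {
            continue;
        }
        if ('?' == c) {
            // '?' consumes exactly one character, so a later '-' cannot be
            // the leading sign.
            only_stars_so_far = false;
            continue;
        }
        if ('-' == c) {
            // A sign is only valid as the very first character of the value.
            if (false == only_stars_so_far) {
                return false;
            }
        } else if ('.' == c) {
            if (++num_decimal_points > max_decimal_points) {
                return false;
            }
        } else if (c < '0' || c > '9') {
            // Any other non-digit literal rules out a numeric variable.
            return false;
        }
        only_stars_so_far = false;
    }
    return true;
}

// Integers: no decimal point allowed.
bool could_be_int_var(std::string_view token) {
    return could_be_numeric_var(token, 0);
}

// Floats: at most one literal decimal point.
bool could_be_float_var(std::string_view token) {
    return could_be_numeric_var(token, 1);
}
```

With helpers along these lines, "*NonDFS*" is rejected for both types, so only the logtype and dictionary-variable interpretations would remain, while a token like "*123*" would still be considered as a possible IntVar or FloatVar.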

Labels: enhancement
