[Search Pipelines] Add a processor that provides fine-grained control over what queries are allowed #10938

@msfroh

Is your feature request related to a problem? Please describe.
Years ago, I worked on a search service where my team would vend search capabilities to other teams around the company. Since my team would be woken up in the middle of the night if/when people sent very expensive queries that used all the CPU or memory, we built functionality to block any remotely expensive query by default. We'd just throw an exception saying, "We are not going to let you run that kind of query unless you talk with us and explain your use-case." Based on the particular use-case, we would usually suggest a different, better way to solve the given problem, but sometimes we would allow e.g. a wildcard query on one or more specific fields.

OpenSearch has the search.allow_expensive_queries cluster setting, but that's basically a giant on/off switch that lacks any nuance.

Describe the solution you'd like
I would like to be able to configure what query types (and possibly other features, like aggregations) are allowed, on what fields, and under what circumstances. Ideally, I'd like even more specificity than the safeguards component that my old team built (which was limited to query types or features on individual fields). I'd like to be able to say "Prefix queries are okay on this field, but only if the prefix has length 3 or more".

I think a search pipeline SearchRequestProcessor would be the ideal place to configure something like that.

Essentially, I'm thinking we could define a chain of rules for each query type like:

# Hand-wavy brainstorm idea for defining rules to restrict query/aggregation types.
# Whatever we build will probably look different. Suggestions welcome!

PUT /_search/pipeline/allowed_queries_pipeline
{
  "request_processors": [
    {
      "restrict_queries" : {
        "query_types": {
          "prefix" : [
            {
              "field": "foo",
              "value_length" : ">=3",
              "allow": true
            },
            {
              "field": "bar",
              "value_length" : ">=5",
              "allow": true
            },
            {
              "deny" : true
            }
          ],
          "wildcard": [
            {
              "field" : "low_cardinality_field",
              "allow": true
            },
            { "deny" : true }
          ],
          "range": [
            {
              "field" : "timestamp",
              "range_width" : "> 604800000",
              "deny" : true
            }
          ]
        },
        "aggregation_types": {
          "variable_width_histogram": [
            {
              "num_buckets": "> 5",
              "deny" : true
            },
            { "allow": true }
          ]
        }
      }
    }
  ]
}
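
To make the threshold strings in the brainstorm above (">=3", "> 604800000") more concrete, here is a minimal Python sketch of how such a value could be parsed into a predicate. The function name `parse_threshold` and the set of supported operators are assumptions made for illustration; the proposal does not define a grammar for these expressions, and a real processor would be written in Java.

```python
import operator
import re
from typing import Callable

# Hypothetical operator set; the issue only shows ">=" and ">".
_OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}

# Two-character operators come first in the alternation so that ">=" is not
# read as ">" followed by an invalid number.
_PATTERN = re.compile(r"^\s*(>=|<=|==|>|<)\s*(-?\d+)\s*$")


def parse_threshold(expression: str) -> Callable[[int], bool]:
    """Turn a string such as '>=3' into a predicate over integers."""
    match = _PATTERN.match(expression)
    if match is None:
        raise ValueError(f"Unsupported threshold expression: {expression!r}")
    compare = _OPERATORS[match.group(1)]
    bound = int(match.group(2))
    return lambda value: compare(value, bound)


# Example: the "value_length" condition from the first prefix rule.
long_enough = parse_threshold(">=3")
print(long_enough(len("abc")), long_enough(len("ab")))
```

Rejecting anything that does not match the pattern keeps a malformed rule from silently allowing or denying queries.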

We would make use of QueryBuilderVisitor to traverse the query tree. I'm imagining that we would apply the rules sequentially to each visited query: as soon as an "allow" rule matches, we proceed to the next query; if a "deny" rule matches, we fail with a 4xx error explaining which rule was violated.
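
The first-match evaluation described above could look roughly like the following Python sketch, which walks a query written as plain query DSL dictionaries instead of using QueryBuilderVisitor. Everything named here is hypothetical: `check_query`, `QueryNotAllowed`, and the simplified `min_value_length` rule key (standing in for "value_length": ">=3") are illustrations only. The sketch also assumes that `bool` is the only compound query, and that a query matching no rule is allowed.

```python
from typing import Any, Dict, Iterator, List, Tuple


class QueryNotAllowed(Exception):
    """Stands in for the 4xx error returned to the caller."""


def walk_leaves(query: Dict[str, Any]) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (query_type, body) for each non-bool query in the tree."""
    for query_type, body in query.items():
        if query_type != "bool":
            yield query_type, body
            continue
        for clause in ("must", "should", "filter", "must_not"):
            children = body.get(clause, [])
            if isinstance(children, dict):  # a single clause may be an object
                children = [children]
            for child in children:
                yield from walk_leaves(child)


def rule_matches(rule: Dict[str, Any], field: Any, value: Any) -> bool:
    """A rule matches when every condition it declares holds."""
    if "field" in rule and rule["field"] != field:
        return False
    min_length = rule.get("min_value_length")
    if min_length is not None:
        if not isinstance(value, str) or len(value) < min_length:
            return False
    return True


def check_query(query: Dict[str, Any], rules_by_type: Dict[str, List[dict]]) -> None:
    """Apply each query type's rules in order; the first matching rule decides."""
    for query_type, body in walk_leaves(query):
        field, spec = next(iter(body.items()), (None, None))
        value = spec.get("value") if isinstance(spec, dict) else spec
        for index, rule in enumerate(rules_by_type.get(query_type, [])):
            if not rule_matches(rule, field, value):
                continue
            if rule.get("deny"):
                raise QueryNotAllowed(
                    f"{query_type} query on field {field!r} denied by rule {index}"
                )
            break  # an "allow" rule matched, so move on to the next query


rules = {
    "prefix": [
        {"field": "foo", "min_value_length": 3, "allow": True},
        {"deny": True},
    ]
}
check_query({"bool": {"must": [{"prefix": {"foo": "abc"}}]}}, rules)  # passes
```

A prefix of "ab" on the same field would skip the first rule, hit the catch-all deny rule, and raise `QueryNotAllowed`.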

I like this approach because the "rule language" can start simpler than what is described above and evolve based on need.

Describe alternatives you've considered
We could do something much smaller, like turning search.allow_expensive_queries into an index-level setting, without much code change. Even reworking it to act per-field wouldn't be a huge burden (though the current "allowExpensiveQuery" checks would need to be updated to check per-field). However, I don't see how that approach could practically support something like the property-based checks or checks on aggregations.

Instead of using search pipelines, we could define a whole new API for query restriction rules (maybe in a plugin), but then we would need the full set of CRUD operations on these rules. Since this is logic that we would likely want to run on the coordinator, before fanning out to shards, it feels like search pipelines already provide a pretty good persistence API.

Additional context
Note that a SearchRequestProcessor that implements this functionality could be delivered via a plugin. It doesn't need to be core OpenSearch functionality. (It could theoretically start in a plugin and move into the search-pipeline-common module.)

Labels: Search:Resiliency, enhancement