[BUG] Circuit broken when too many concurrent PPL queries are run #4584

@toepkerd

Query Information

PPL Command/Query:

source=opensearch_dashboards_sample_data_flights
| where Carrier != 'Logstash Airways' AND timestamp > '2024-01-01'
| eval price_distance_ratio = AvgTicketPrice / DistanceKilometers, delay_factor = if(FlightDelayMin > 0, FlightDelayMin, 0), weather_score = if(DestWeather = 'Sunny', 5, if(DestWeather = 'Clear', 4, if(DestWeather = 'Partly Cloudy', 3, if(DestWeather = 'Cloudy', 2, 1))))
| stats sum(AvgTicketPrice) as total_revenue, avg(DistanceKilometers) as avg_distance, max(FlightDelayMin) as max_delay, avg(price_distance_ratio) as avg_price_per_km, sum(delay_factor) as total_delay_factor, avg(weather_score) as avg_weather_score by Carrier, OriginCountry, DestCountry
| eval efficiency_score = total_revenue / (avg_distance * (1 + total_delay_factor) * (6 - avg_weather_score))
| sort -efficiency_score, +avg_distance, -total_revenue
| eval rank = 1
| fields Carrier, OriginCountry, DestCountry, total_revenue, avg_distance, max_delay, avg_price_per_km, efficiency_score, rank
| sort -AvgTicketPrice, +DistanceKilometers, -DistanceMiles
| head 1000000

Expected Result:
The above query succeeds when run on its own against the OSD sample flights data. When 300 concurrent connections are opened to the cluster and a few hundred of these requests are sent per connection, the expected behavior is a more explicit error that does not trip a circuit breaker.
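The load described above can be approximated with a small script. This is a minimal sketch, not the exact tool used for the report: the host, port, and lack of authentication are assumptions, `requests_per_connection=200` stands in for "a few hundred", and `send_ppl`, `worker`, and `run_load` are hypothetical helper names. Each thread simulates one connection, although `urllib` opens a new HTTP connection per request, so this only approximates 300 persistent connections.

```python
import json
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Assumed endpoint: a local, unsecured cluster with the PPL plugin installed.
PPL_URL = "http://localhost:9200/_plugins/_ppl"


def send_ppl(query, url=PPL_URL, timeout=30):
    """POST one PPL query and return the HTTP status code."""
    body = json.dumps({"query": query}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Error responses (for example the circuit breaker failure) land here.
        return err.code


def worker(send, query, n_requests):
    """Simulate one connection issuing n_requests queries back to back."""
    outcomes = Counter()
    for _ in range(n_requests):
        try:
            outcomes[send(query)] += 1
        except Exception as exc:  # connection refused, timeouts, etc.
            outcomes[type(exc).__name__] += 1
    return outcomes


def run_load(query, connections=300, requests_per_connection=200, send=send_ppl):
    """Run `connections` workers concurrently and tally outcomes by status."""
    totals = Counter()
    with ThreadPoolExecutor(max_workers=connections) as pool:
        futures = [
            pool.submit(worker, send, query, requests_per_connection)
            for _ in range(connections)
        ]
        for fut in futures:
            totals.update(fut.result())
    return totals
```

Calling `run_load(query)` with the PPL query above returns a `Counter` keyed by HTTP status code (or exception name), which makes it easy to see when responses switch from successes to errors.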

Actual Result:
The following error is received eventually:

Failed to execute phase [query], all shards failed; shardFailures {[cSDPaLn9RHOEMod0Qwh_JQ][opensearch_dashboards_sample_data_flights][0]: RemoteTransportException[[7167b47a0672ab1d56512d6e68d40c67][x.x.x.x:9300][indices:data/read/search[phase/query]]]; nested: QueryShardException[failed to create query: Failed to compile inline script [{\"langType\":\"v2\",\"script\":\"rO0ABXNyADRvcmcub3BlbnNlYXJjaC5zcWwuZXhwcmVzc2lvbi5mdW5jdGlvbi5GdW5jdGlvbkRTTCQzHWCy3iOeynUCAAVMAA12YWwkYXJndW1lbnRzdAAQTGphdmEvdXRpbC9MaXN0O0wADHZhbCRmdW5jdGlvbnQAQExvcmcvb3BlbnNlYXJjaC9zcWwvZXhwcmVzc2lvbi9mdW5jdGlvbi9...omitted for brevity...\"}] using lang [opensearch_compounded_script]]; nested: CircuitBreakingException[[script] Too many dynamic script compilations within, max: [75/5m]; please use indexed, or scripts with parameters instead; this limit can be changed by the [script.context.filter.max_compilations_rate] setting]; }; CircuitBreakingException[[script] Too many dynamic script compilations within, max: [75/5m]; please use indexed, or scripts with parameters instead; this limit can be changed by the [script.context.filter.max_compilations_rate] setting]"

It also trips a circuit breaker that takes down at least the PPL APIs for a few minutes.
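The `max: [75/5m]` in the message reads as 75 script compilations per 5 minutes. To get a feel for how quickly a burst exhausts that budget, the limit can be modeled as a token bucket. This is a simplified model under stated assumptions, not the OpenSearch implementation: it assumes the limit refills evenly over the window and that each attempt is a fresh compilation (whether every PPL request actually triggers one is not established in this report). `CompilationBucket` and `count_allowed` are hypothetical names.

```python
class CompilationBucket:
    """Simplified token bucket: `capacity` tokens refilled evenly over `window_s` seconds."""

    def __init__(self, capacity=75, window_s=300.0):
        self.capacity = capacity
        self.rate = capacity / window_s  # tokens regained per second
        self.tokens = float(capacity)    # start with a full bucket
        self.last = 0.0                  # time of the previous attempt, in seconds

    def try_compile(self, now):
        """Return True if a compilation at time `now` (seconds) is allowed."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never exceeding the capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens < 1.0:
            return False
        self.tokens -= 1.0
        return True


def count_allowed(n_attempts, interval_s, bucket=None):
    """Count how many of n_attempts, spaced interval_s apart, are allowed."""
    bucket = bucket or CompilationBucket()
    return sum(bucket.try_compile(i * interval_s) for i in range(n_attempts))
```

Under this model, a burst of simultaneous attempts gets 75 compilations through and then roughly one more every 4 seconds, which is consistent with the API staying unavailable for minutes once the budget is spent.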

Dataset Information

Dataset/Schema Type

  • OpenTelemetry (OTEL)
  • Simple Schema for Observability (SS4O)
  • Open Cybersecurity Schema Framework (OCSF)
  • Custom (details below)

Index Mapping

{
  "opensearch_dashboards_sample_data_flights": {
    "mappings": {
      "properties": {
        "AvgTicketPrice": {
          "type": "float"
        },
        "Cancelled": {
          "type": "boolean"
        },
        "Carrier": {
          "type": "keyword"
        },
        "Dest": {
          "type": "keyword"
        },
        "DestAirportID": {
          "type": "keyword"
        },
        "DestCityName": {
          "type": "keyword"
        },
        "DestCountry": {
          "type": "keyword"
        },
        "DestLocation": {
          "type": "geo_point"
        },
        "DestRegion": {
          "type": "keyword"
        },
        "DestWeather": {
          "type": "keyword"
        },
        "DistanceKilometers": {
          "type": "float"
        },
        "DistanceMiles": {
          "type": "float"
        },
        "FlightDelay": {
          "type": "boolean"
        },
        "FlightDelayMin": {
          "type": "integer"
        },
        "FlightDelayType": {
          "type": "keyword"
        },
        "FlightNum": {
          "type": "keyword"
        },
        "FlightTimeHour": {
          "type": "keyword"
        },
        "FlightTimeMin": {
          "type": "float"
        },
        "Origin": {
          "type": "keyword"
        },
        "OriginAirportID": {
          "type": "keyword"
        },
        "OriginCityName": {
          "type": "keyword"
        },
        "OriginCountry": {
          "type": "keyword"
        },
        "OriginLocation": {
          "type": "geo_point"
        },
        "OriginRegion": {
          "type": "keyword"
        },
        "OriginWeather": {
          "type": "keyword"
        },
        "dayOfWeek": {
          "type": "integer"
        },
        "timestamp": {
          "type": "date"
        }
      }
    }
  }
}

Sample Data

{
  "FlightNum": "9HY9SWR",
  "DestCountry": "AU",
  "OriginWeather": "Sunny",
  "OriginCityName": "Frankfurt am Main",
  "AvgTicketPrice": 841.2656419677076,
  "DistanceMiles": 10247.856675613455,
  "FlightDelay": false,
  "DestWeather": "Rain",
  "Dest": "Sydney Kingsford Smith International Airport",
  "FlightDelayType": "No Delay",
  "OriginCountry": "DE",
  "dayOfWeek": 0,
  "DistanceKilometers": 16492.32665375846,
  "timestamp": "2025-09-29T00:00:00",
  "DestLocation": {
    "lat": "-33.94609833",
    "lon": "151.177002"
  },
  "DestAirportID": "SYD",
  "Carrier": "OpenSearch Dashboards Airlines",
  "Cancelled": false,
  "FlightTimeMin": 1030.7704158599038,
  "Origin": "Frankfurt am Main Airport",
  "OriginLocation": {
    "lat": "50.033333",
    "lon": "8.570556"
  },
  "DestRegion": "SE-BD",
  "OriginAirportID": "FRA",
  "OriginRegion": "DE-HE",
  "DestCityName": "Sydney",
  "FlightTimeHour": 17.179506930998397,
  "FlightDelayMin": 0
}

Bug Description

Issue Summary:
Running a high volume of concurrent queries against the cluster should produce a more explicit validation or rate-limiting exception, one that does not disrupt the PPL plugin APIs.

Impact:
The above error made at least the PPL plugin and its APIs unavailable.
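As an illustration of what a more explicit message could carry, the sketch below pulls the limit and the setting name out of the raw error text quoted under Actual Result. It is a client-side illustration only, not a proposed fix for the plugin: `explain_compilation_limit` is a hypothetical helper, and the regular expressions assume the message wording shown in this report.

```python
import re

# Patterns assume the exact wording of the error quoted in this report.
_LIMIT = re.compile(
    r"Too many dynamic script compilations within, max: \[(\d+)/(\d+)([smh])\]"
)
_SETTING = re.compile(r"changed by the \[([\w.]+)\] setting")
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def explain_compilation_limit(error_text):
    """Return the compilation limit details found in error_text, or None."""
    limit = _LIMIT.search(error_text)
    if limit is None:
        return None  # not a script compilation limit error
    count, span, unit = int(limit.group(1)), int(limit.group(2)), limit.group(3)
    setting = _SETTING.search(error_text)
    return {
        "max_compilations": count,
        "window_seconds": span * _UNIT_SECONDS[unit],
        "setting": setting.group(1) if setting else None,
    }
```

A caller could use the returned dictionary to report something like "script compilation limit of 75 per 300 seconds reached" instead of surfacing the full nested shard failure.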

Environment Information

OpenSearch Version:
OpenSearch 3.1

Metadata

Labels: PPL (Piped processing language), bug (Something isn't working)

Status: Done