Skip to content

[BUG] Streamstats command incorrectly treats null as a valid group #4751

@ishaoxy

Description

@ishaoxy

What is the bug?
When using the streamstats by <field> clause, the current implementation treats null values as a valid group key and computes separate statistics for them. Which is not consistent with the expected behavior.

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Set data:
PUT stream_test_null
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "age": {
        "type": "integer"
      },
      "state": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "country": {
        "type": "text"
      },
      "year": {
        "type": "integer"
      },
      "month": {
        "type": "integer"
      }
    }
  }
}
POST stream_test_null/_bulk?refresh=true
{"index":{"_id":"1"}}
{"name":"Jake","age":70,"state":"California","country":"USA","year":2023,"month":4}
{"index":{"_id":"2"}}
{"name":"Hello","age":30,"state":"New York","country":"USA","year":2023,"month":4}
{"index":{"_id":"3"}}
{"name":"John","age":25,"state":"Ontario","country":"Canada","year":2023,"month":4}
{"index":{"_id":"4"}}
{"name":"Jane","age":20,"state":"Quebec","country":"Canada","year":2023,"month":4}
{"index":{"_id":"5"}}
{"name":null,"age":10,"state":null,"country":"Canada","year":2023,"month":4}
{"index":{"_id":"6"}}
{"name":"Kevin","year":2023,"month":4}
  1. Run this query
POST _plugins/_ppl
{
  "query": """
  source=stream_test_null | streamstats avg(age) as avg_age by state
  """
}
  1. See error
{
  "schema": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "country",
      "type": "string"
    },
    {
      "name": "state",
      "type": "string"
    },
    {
      "name": "month",
      "type": "int"
    },
    {
      "name": "year",
      "type": "int"
    },
    {
      "name": "age",
      "type": "int"
    },
    {
      "name": "avg_age",
      "type": "double"
    }
  ],
  "datarows": [
    [
      "Jake",
      "USA",
      "California",
      4,
      2023,
      70,
      70
    ],
    [
      "Hello",
      "USA",
      "New York",
      4,
      2023,
      30,
      30
    ],
    [
      "John",
      "Canada",
      "Ontario",
      4,
      2023,
      25,
      25
    ],
    [
      "Jane",
      "Canada",
      "Quebec",
      4,
      2023,
      20,
      20
    ],
    [
      null,
      "Canada",
      null,
      4,
      2023,
      10,
      10
    ],
    [
      "Kevin",
      null,
      null,
      4,
      2023,
      null,
      10
    ]
  ],
  "total": 6,
  "size": 6
}

The two null value for state are incorrectly considered in one valid group.

What is the expected behavior?

"datarows": [
    [
      "Jake",
      "USA",
      "California",
      4,
      2023,
      70,
      70
    ],
    [
      "Hello",
      "USA",
      "New York",
      4,
      2023,
      30,
      30
    ],
    [
      "John",
      "Canada",
      "Ontario",
      4,
      2023,
      25,
      25
    ],
    [
      "Jane",
      "Canada",
      "Quebec",
      4,
      2023,
      20,
      20
    ],
    [
      null,
      "Canada",
      null,
      4,
      2023,
      10,
      null
    ],
    [
      "Kevin",
      null,
      null,
      4,
      2023,
      null,
      null
    ]
  ]

What is your host/environment?

  • OS: [e.g. iOS]
  • Version [e.g. 22]
  • Plugins

Do you have any screenshots?
If applicable, add screenshots to help explain your problem.

Do you have any additional context?
Add any other context about the problem.

Metadata

Metadata

Assignees

Labels

PPLPiped processing languagebugSomething isn't working

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions