What is the bug?
When using the streamstats by <field> clause, the current implementation treats null values as a valid group key and computes separate statistics for them. Which is not consistent with the expected behavior.
How can one reproduce the bug?
Steps to reproduce the behavior:
- Set data:
PUT stream_test_null
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"age": {
"type": "integer"
},
"state": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"country": {
"type": "text"
},
"year": {
"type": "integer"
},
"month": {
"type": "integer"
}
}
}
}
POST stream_test_null/_bulk?refresh=true
{"index":{"_id":"1"}}
{"name":"Jake","age":70,"state":"California","country":"USA","year":2023,"month":4}
{"index":{"_id":"2"}}
{"name":"Hello","age":30,"state":"New York","country":"USA","year":2023,"month":4}
{"index":{"_id":"3"}}
{"name":"John","age":25,"state":"Ontario","country":"Canada","year":2023,"month":4}
{"index":{"_id":"4"}}
{"name":"Jane","age":20,"state":"Quebec","country":"Canada","year":2023,"month":4}
{"index":{"_id":"5"}}
{"name":null,"age":10,"state":null,"country":"Canada","year":2023,"month":4}
{"index":{"_id":"6"}}
{"name":"Kevin","year":2023,"month":4}
- Run this query
POST _plugins/_ppl
{
"query": """
source=stream_test_null | streamstats avg(age) as avg_age by state
"""
}
- See error
{
"schema": [
{
"name": "name",
"type": "string"
},
{
"name": "country",
"type": "string"
},
{
"name": "state",
"type": "string"
},
{
"name": "month",
"type": "int"
},
{
"name": "year",
"type": "int"
},
{
"name": "age",
"type": "int"
},
{
"name": "avg_age",
"type": "double"
}
],
"datarows": [
[
"Jake",
"USA",
"California",
4,
2023,
70,
70
],
[
"Hello",
"USA",
"New York",
4,
2023,
30,
30
],
[
"John",
"Canada",
"Ontario",
4,
2023,
25,
25
],
[
"Jane",
"Canada",
"Quebec",
4,
2023,
20,
20
],
[
null,
"Canada",
null,
4,
2023,
10,
10
],
[
"Kevin",
null,
null,
4,
2023,
null,
10
]
],
"total": 6,
"size": 6
}
The two null value for state are incorrectly considered in one valid group.
What is the expected behavior?
"datarows": [
[
"Jake",
"USA",
"California",
4,
2023,
70,
70
],
[
"Hello",
"USA",
"New York",
4,
2023,
30,
30
],
[
"John",
"Canada",
"Ontario",
4,
2023,
25,
25
],
[
"Jane",
"Canada",
"Quebec",
4,
2023,
20,
20
],
[
null,
"Canada",
null,
4,
2023,
10,
null
],
[
"Kevin",
null,
null,
4,
2023,
null,
null
]
]
What is your host/environment?
- OS: [e.g. iOS]
- Version [e.g. 22]
- Plugins
Do you have any screenshots?
If applicable, add screenshots to help explain your problem.
Do you have any additional context?
Add any other context about the problem.
What is the bug?
When using the
streamstats by <field>clause, the current implementation treatsnullvalues as a valid group key and computes separate statistics for them. Which is not consistent with the expected behavior.How can one reproduce the bug?
Steps to reproduce the behavior:
The two
nullvalue for state are incorrectly considered in one valid group.What is the expected behavior?
What is your host/environment?
Do you have any screenshots?
If applicable, add screenshots to help explain your problem.
Do you have any additional context?
Add any other context about the problem.