Skip to content

[BUG] Regression (3.1 -> 3.3): PPL join with aggregates on IPs is now returning empty results. #4726

@Swiddis

Description

@Swiddis

What is the bug?
When trying to do an aggregate-join-aggregate on an IP field, it returns no data. Example (mappings & data in repro):

source=web_logs
| stats count() as request_count by client_ip
| join type=inner client_ip [source=ip_geodata]
| stats sum(request_count) as total_requests by country

How can one reproduce the bug?

$ echo '{
    "mappings": {
        "properties": {
            "client_ip": {
                "type": "ip"
            },
            "status": {
                "type": "long"
            },
            "timestamp": {
                "type": "date"
            },
            "url": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            }
        }
    }
}' | xh put localhost:9200/web_logs

$ echo '{
    "mappings": {
        "properties": {
            "city": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "client_ip": {
                "type": "ip"
            },
            "country": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            }
        }
    }
}' | xh put localhost:9200/ip_geodata

$ echo '{"index":{"_id":"1"}}
{"client_ip":"203.0.113.1","timestamp":"2024-04-03T10:01:00Z","url":"/products","status":200}
{"index":{"_id":"2"}}
{"client_ip":"198.51.100.1","timestamp":"2024-04-03T10:07:00Z","url":"/cart","status":200}
{"index":{"_id":"3"}}
{"client_ip":"203.0.113.3","timestamp":"2024-04-03T10:08:00Z","url":"/home","status":200}
{"index":{"_id":"4"}}
{"client_ip":"203.0.113.1","timestamp":"2024-04-03T10:00:00Z","url":"/home","status":200}
{"index":{"_id":"5"}}
{"client_ip":"203.0.113.2","timestamp":"2024-04-03T10:02:00Z","url":"/cart","status":200}
{"index":{"_id":"6"}}
{"client_ip":"198.51.100.1","timestamp":"2024-04-03T10:04:00Z","url":"/about","status":200}
{"index":{"_id":"7"}}
{"client_ip":"203.0.113.1","timestamp":"2024-04-03T10:05:00Z","url":"/checkout","status":200}
{"index":{"_id":"8"}}
{"client_ip":"198.51.100.1","timestamp":"2024-04-03T10:03:00Z","url":"/home","status":200}
{"index":{"_id":"9"}}
{"client_ip":"203.0.113.2","timestamp":"2024-04-03T10:06:00Z","url":"/products","status":200}' | xh post localhost:9200/web_logs/_bulk


$ echo '{"index":{"_id":"1"}}
{"client_ip": "198.51.100.1","country": "Canada","city": "Toronto"}
{"index":{"_id":"2"}}
{"client_ip": "203.0.113.3","country": "UK","city": "London"}
{"index":{"_id":"3"}}
{"client_ip": "203.0.113.1","country": "USA","city": "New York"}
{"index":{"_id":"4"}}
{"client_ip": "203.0.113.2","country": "USA","city": "Seattle"}' | xh post localhost:9200/ip_geodata/_bulk

$ echo '{"query":"source=web_logs | stats count() as request_count by client_ip | join type=inner client_ip [source=ip_geodata]  | stats sum(request_count) as total_requests by country"}' | xh post localhost:9200/_plugins/_ppl
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 141
Content-Type: application/json; charset=UTF-8
X-Opensearch-Version: OpenSearch/3.4.0-SNAPSHOT (opensearch)

{
    "schema": [
        {
            "name": "total_requests",
            "type": "bigint"
        },
        {
            "name": "country",
            "type": "string"
        }
    ],
    "datarows": [],
    "total": 0,
    "size": 0
}

What is the expected behavior?
This should return 3 results:

┏━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ country  ┃total_requests┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ USA      │ 5            │
├──────────┼──────────────┤
│ Canada   │ 3            │
├──────────┼──────────────┤
│ UK       │ 1            │
└──────────┴──────────────┘

This is correctly returned if the respective ip types are changed to keywords. It also works correctly if you simplify by removing the initial count() stage before the join, and just do source | join | stats count() as total_count by country

What is your host/environment?

  • Mainline, JDK 21

Do you have any screenshots?
N/A

Do you have any additional context?
Works on 3.1, stops working on 3.3, haven't yet bissected to when exactly it broke.

Metadata

Metadata

Assignees

Labels

PPLPiped processing languagebugSomething isn't working

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions