uri_parts parse directory structure for extension #105612

@jguay

Elasticsearch Version

8.12.1

Installed Plugins

No response

Java Version

bundled

OS Version

docker

Problem Description

The uri_parts ingest pipeline processor outputs a wrong extension when the URL has no file extension and its path contains dot character(s).

For example, the URL https://www.example.com/path.withdot/filenamewithoutextension yields "extension": "withdot/filenamewithoutextension", whereas no extension should be reported.
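
To make the expected behaviour concrete, here is a minimal Python sketch. It is not the Elasticsearch implementation. `naive_extension` is only a guess at logic that would reproduce the reported value (everything after the last dot in the whole path), and `expected_extension` shows what I would expect instead (look for the dot only in the final path segment). Both function names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit


def naive_extension(url: str) -> Optional[str]:
    """Assumed buggy logic: split on the last dot anywhere in the path."""
    path = urlsplit(url).path
    dot = path.rfind(".")
    return path[dot + 1:] if dot != -1 else None


def expected_extension(url: str) -> Optional[str]:
    """Expected logic: only consider the segment after the last slash."""
    last_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    dot = last_segment.rfind(".")
    return last_segment[dot + 1:] if dot != -1 else None


url = "https://www.example.com/path.withdot/filenamewithoutextension"
print(naive_extension(url))     # withdot/filenamewithoutextension
print(expected_extension(url))  # None
```

With a URL that does have an extension, such as https://www.example.com/path.withdot/filenamewithextension.zip, both functions return "zip", which matches the observation that the problem only shows up when the last segment has no dot.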

Steps to Reproduce

  1. Create pipeline:
PUT /_ingest/pipeline/test-uri-parts
{
    "processors": [
        {
            "uri_parts": {
                "field": "url.original",
                "target_field": "url.parsed"
            }
        }
    ]
}
  2. Simulate pipeline:
POST _ingest/pipeline/test-uri-parts/_simulate
{
  "docs" :
  [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "url": {
          "original": "https://www.example.com/path.withdot/filenamewithoutextension"
        }
      }
    }
    ]
}

The output contains a wrong value for extension (this capture was taken with a URL that has an additional /folder path segment):

{
  "docs": [
    {
      "doc": {
        "_index": "index",
        "_version": "-3",
        "_id": "id",
        "_source": {
          "url": {
            "parsed": {
              "path": "/path.withdot/folder/filenamewithoutextension",
              "extension": "withdot/folder/filenamewithoutextension",
              "original": "https://www.example.com/path.withdot/folder/filenamewithoutextension",
              "scheme": "https",
              "domain": "www.example.com"
            },
            "original": "https://www.example.com/path.withdot/folder/filenamewithoutextension"
          }
        },
        "_ingest": {
          "timestamp": "2024-02-19T09:47:21.38168605Z"
        }
      }
    }
  ]
}

Workaround

  • The issue does not occur when the last path segment has an extension, e.g. https://www.example.com/path.withdot/filenamewithextension.zip, so the following workaround removes the unwanted extension field whenever it contains a slash:
PUT /_ingest/pipeline/test-uri-parts
{
  "processors": [
    {
      "uri_parts": {
        "field": "url.original",
        "target_field": "url.parsed"
      }
    },
    {
      "remove": {
        "field": "url.parsed.extension",
        "if": "ctx?.url?.parsed?.extension != null && ctx?.url?.parsed?.extension.indexOf('/') != -1"
      }
    }
  ]
}
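
For readers who want to check which values the `if` condition above catches, here is a small Python mirror of it. This is only an illustration of the condition's logic, not Painless, and `should_remove_extension` is a hypothetical name.

```python
from typing import Optional


def should_remove_extension(extension: Optional[str]) -> bool:
    """Mirror of the workaround condition: field is present and contains '/'."""
    return extension is not None and "/" in extension


print(should_remove_extension("withdot/filenamewithoutextension"))  # True
print(should_remove_extension("zip"))                               # False
print(should_remove_extension(None))                                # False
```

A real file extension cannot contain a slash, so the check should not drop valid values; it only hides the wrong output and does not fix the parsing itself.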
