ingest-attachment support for per document indexed_chars limit #28942

@dadoonet

Coming from this discussion: https://discuss.elastic.co/t/how-to-control-the-indexed-chars-value-on-a-ingest-attachment-pipeline/123073/4

Today we support a global indexed_chars processor parameter. But in some cases, users would like to set this limit per document.
This used to be supported in the mapper-attachments plugin, which extracted the limit value from a meta field in the document sent for indexing.

Here is my proposal.
We should add an option to read this limit value from the document itself, through a new setting such as indexed_chars_field.

Then we could define a pipeline like this:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information. Used to parse pdf and office files",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars_field" : "size"
      }
    }
  ]
}

Then index either:

PUT index/doc/1?pipeline=attachment
{
  "data": "BASE64"
}

This will use the default value (or the one defined by indexed_chars).

Or:

PUT index/doc/2?pipeline=attachment
{
  "data": "BASE64",
  "size": 1000 
}
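
The fallback order described above can be sketched in a few lines of Python. This is only an illustration of the proposed behavior, not the plugin's implementation: the function name resolve_indexed_chars is hypothetical, and the 100000 default is assumed to match the processor's documented indexed_chars default.

```python
from typing import Any, Mapping, Optional

# Assumption: mirrors the processor-level default for indexed_chars.
DEFAULT_INDEXED_CHARS = 100_000


def resolve_indexed_chars(
    document: Mapping[str, Any],
    indexed_chars: int = DEFAULT_INDEXED_CHARS,
    indexed_chars_field: Optional[str] = None,
) -> int:
    """Return the character limit to apply to one document.

    Hypothetical helper: the per-document field wins when it is configured
    and present; otherwise the processor-level indexed_chars value is used.
    """
    if indexed_chars_field is not None:
        value = document.get(indexed_chars_field)
        if value is not None:
            return int(value)
    return indexed_chars


# The two documents from the examples above, with indexed_chars_field = "size".
doc_1 = {"data": "BASE64"}
doc_2 = {"data": "BASE64", "size": 1000}

limit_1 = resolve_indexed_chars(doc_1, indexed_chars_field="size")
limit_2 = resolve_indexed_chars(doc_2, indexed_chars_field="size")
```

Here doc_1 has no size field and falls back to the processor-level limit, while doc_2 supplies its own.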

I hope to propose a PR for this soon, unless in the meantime someone rejects this feature request or proposes another implementation for it.
