Background
Highlighting and entity extraction are cornerstones of search systems, and yet they currently do not play well with each other in Elasticsearch/Lucene.
- Highlighting provides the evidence of where information was found in large texts.
- Entity extraction derives structured information such as names of people or organisations from unstructured free-text input.
The two techniques are often used together in systems with large amounts of free-text such as news reports. Consider this example search which combines free-text and a structured field derived from the free-text:
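A search of this shape - a free-text clause plus a filter on the extracted `person` field, with highlighting requested - might look like the following (the index name, field names, and query values here are illustrative assumptions, not the original example):

```json
GET articles/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "article_text": "russian attorney" } },
        { "term": { "person": "natalia veselnitskaya" } }
      ]
    }
  },
  "highlight": {
    "fields": {
      "article_text": {}
    }
  }
}
```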

In this particular example highlighting works for the entity Natalia Veselnitskaya but would not for the entity Donald Trump Jr.
Issue - brittle highlighting
Sadly, the structured keyword terms like "person" produced by entity extraction tools rarely exist as tokens in the free-text fields where they were originally discovered. The traceability of this discovery is lost. In the example above the `natalia veselnitskaya` entity only highlights because I carefully constructed the scenario:
- I lowercase-normalized the `person` keyword field's contents
- I applied lowercase and 2-word shingles to the unstructured `text` field
This approach was a good suggestion from @mbarretta, but one which many casual users would overlook, and it is still far from a complete solution. Donald Trump Jr. would require a 3-word shingle analyzer on my text field, and one which knew to preserve the full stop in Jr. - but I don't want to apply 3-word shingles to all text or retain all full stops. This is clearly a brittle strategy.
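For reference, the lowercase-normalized keyword plus 2-word-shingle setup described above might look like this (the index name and the analyzer/normalizer names are illustrative, not from the original scenario):

```json
PUT articles
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      },
      "analyzer": {
        "shingle_2_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "shingle" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "article_text": { "type": "text", "analyzer": "shingle_2_analyzer" },
      "person": { "type": "keyword", "normalizer": "lowercase_normalizer" }
    }
  }
}
```

With this mapping, a `person` term like `natalia veselnitskaya` happens to match the 2-word shingle emitted for the text (the default `shingle` filter produces 2-word shingles alongside unigrams), which is why highlighting works for that entity but not for the three-token `Donald Trump Jr.`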
The irony is that entity extractors such as OpenNLP, Rosette, or even custom regexes have the information required to support highlighting (the extracted entity term and its offset into the original text) but no place to keep this data. Entity extraction really concerns two or more fields: an unstructured source of data and one or more structured fields (person/organisation/location?) in which to deposit findings. Because Analysis is focused on single fields, we are left with no means for entity extractors to store the offsets that provide the traceability of their discoveries - offsets which standard highlighters could use.
Possible Solutions
1) "Internal analysis" - entity extraction performed as Analyzers
(I offer this option only to show how bad this route is...)
If the smarts in entity extractors were performed as part of the Lucene text analysis phase they could potentially emit token streams for both structured and unstructured output fields.
- Advantages - input JSON source is free of low-level term+offset information
- Disadvantages are many. Lucene analysis would need to support multiple output fields, entity extraction logic would have to be written in Java, and it would be processed in-line with added compute expense.
2) "External analysis" - entity extraction performed prior to indexing
In this approach any entity extraction logic is performed outside of core Elasticsearch, e.g. using Python's NLTK, or perhaps by reusing human-annotated content like Wikipedia. The details of discoveries are passed in the JSON sent to Elasticsearch. We would need to allow detailed text-offset information of the type produced by analysis to be passed in from outside - akin to Solr's pre-analyzed field. This information could act as an "overlay" on the tokens normally produced by the analysis of a text field. Maybe text strings in JSON could, like geo fields, be presented in more complex object forms to pass the additional metadata, e.g. instead of:
"article_text": "Donald Trump Jr. met with russian attorney"
we could also support this more detailed form:
"article_text": {
"text": "Donald Trump Jr. met with russian attorney",
"inject_tokens" : [
{
"token": "Donald Trump Jr",
"offset": 0,
"length": 16
}
]
}
A custom Analyzer could fuse the token streams produced by standard analysis of the text with those provided in the `inject_tokens` array.
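The fusion step can be sketched outside of Lucene: both streams are just (token, offset, length) triples, so overlaying them amounts to an ordered merge. A minimal sketch of that merge (the function name and data shapes here are illustrative, not an actual Lucene/Elasticsearch API):

```python
# Sketch: fuse tokens from standard analysis with externally supplied
# "inject_tokens". All names and data shapes are illustrative.

def fuse_token_streams(analyzed, injected):
    """Merge analyzed tokens with injected entity tokens, ordered by
    start offset; on ties, the longer (entity) token comes first, so
    the entity span overlays the base tokens it covers."""
    combined = analyzed + injected
    return sorted(combined, key=lambda t: (t["offset"], -t["length"]))

text = "Donald Trump Jr. met with russian attorney"
# Tokens a plain lowercase analysis might produce (illustrative).
analyzed = [
    {"token": "donald",   "offset": 0,  "length": 6},
    {"token": "trump",    "offset": 7,  "length": 5},
    {"token": "jr",       "offset": 13, "length": 2},
    {"token": "met",      "offset": 17, "length": 3},
    {"token": "with",     "offset": 21, "length": 4},
    {"token": "russian",  "offset": 26, "length": 7},
    {"token": "attorney", "offset": 34, "length": 8},
]
# The injected entity token, as in the JSON example above.
injected = [{"token": "Donald Trump Jr.", "offset": 0, "length": 16}]

fused = fuse_token_streams(analyzed, injected)
```

In a real custom TokenFilter the injected token would be emitted at the matching position with a position increment of 0 so it stacks on top of the base tokens (as synonym-style filters do); sorting by offset here just mimics that overlay.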