Background
Highlighting and entity extraction are cornerstones of search systems, and yet they currently do not play well with each other in Elasticsearch/Lucene.
- Highlighting provides the evidence of where information was found in large texts.
- Entity extraction derives structured information such as names of people or organisations from unstructured free-text input.
The two techniques are often used together in systems with large amounts of free-text such as news reports. Consider this example search which combines free-text and a structured field derived from the free-text:
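A search of this shape - a free-text clause plus a filter on the extracted `person` field, with highlighting requested - might look like the following (the index name, field names, and query values here are illustrative assumptions, not the original example):

```json
GET articles/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "article_text": "russian attorney" } },
        { "term": { "person": "natalia veselnitskaya" } }
      ]
    }
  },
  "highlight": {
    "fields": {
      "article_text": {}
    }
  }
}
```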

In this particular example highlighting works for the entity Natalia Veselnitskaya but would not for the entity Donald Trump Jr.
Issue - brittle highlighting
Sadly, the structured keyword terms like "person" produced by entity extraction tools rarely exist as tokens in the free-text fields where they were originally discovered. The traceability of this discovery is lost. In the example above the `natalia veselnitskaya` entity only highlights because I carefully constructed the scenario:
- I lowercase-normalized the `person` keyword field's contents
- I applied lowercase and 2-word shingles to the unstructured `text` field
This approach was a good suggestion from @mbarretta, but one which many casual users would overlook, and it is still far from a complete solution. Donald Trump Jr. would require a 3-word shingle analyzer on my text field, and one which knew to preserve the full stop in Jr. - but I don't want to apply 3-word shingles to all text or retain all full stops. This is clearly a brittle strategy.
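For reference, the lowercase-normalized keyword plus 2-word-shingle setup described above might look like this (the index name and the analyzer/normalizer names are illustrative, not from the original scenario):

```json
PUT articles
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      },
      "analyzer": {
        "shingle_2_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "shingle" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "article_text": { "type": "text", "analyzer": "shingle_2_analyzer" },
      "person": { "type": "keyword", "normalizer": "lowercase_normalizer" }
    }
  }
}
```

With this mapping, a `person` term like `natalia veselnitskaya` happens to match the 2-word shingle emitted for the text (the default `shingle` filter produces 2-word shingles alongside unigrams), which is why highlighting works for that entity but not for the three-token `Donald Trump Jr.`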
The irony is that entity extractors such as OpenNLP, Rosette, or even custom regexes have the information required to support highlighting (the extracted entity term and its offset into the original text) but no place to keep this data. Entity extraction really concerns two or more fields: an unstructured source of data and one or more structured fields (person/organisation/location?) in which to deposit findings. Because Analysis is focused on single fields, we are left with no means for entity extractors to store the offsets that provide the traceability of their discoveries - offsets which standard highlighters could use.
Possible Solutions
1) "Internal analysis" - entity extraction performed as Analyzers
(I offer this option only to show how bad this route is...)
If the smarts in entity extractors were performed as part of the Lucene text analysis phase they could potentially emit token streams for both structured and unstructured output fields.
- Advantages - input JSON source is free of low-level term+offset information
- Disadvantages are many. Lucene analysis would need to support multiple output fields, entity extraction logic would have to be written in Java, and it would be processed in-line with added compute expense.
2) "External analysis" - entity extraction performed prior to indexing
In this approach any entity extraction logic is performed outside of core Elasticsearch, e.g. using Python's NLTK, or perhaps by reusing human-annotated content like Wikipedia. The details of discoveries are passed in the JSON sent to Elasticsearch. We would need to allow detailed text-offset information of the type produced by analysis to be passed in from outside - akin to Solr's pre-analyzed field. This information could act as an "overlay" on the tokens normally produced by the analysis of a text field. Maybe text strings in JSON could, like geo fields, be presented in more complex object forms to pass the additional metadata, e.g. instead of:
"article_text": "Donald Trump Jr. met with russian attorney"
we could also support this more detailed form:
"article_text": {
"text": "Donald Trump Jr. met with russian attorney",
"inject_tokens" : [
{
"token": "Donald Trump Jr",
"offset": 0,
"length": 16
}
]
}
A custom Analyzer could fuse the token streams produced by standard analysis of the text with those provided in the `inject_tokens` array.
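The fusion step can be sketched outside of Lucene: both streams are just (token, offset, length) triples, so overlaying them amounts to an ordered merge. A minimal sketch of that merge (the function name and data shapes here are illustrative, not an actual Lucene/Elasticsearch API):

```python
# Sketch: fuse tokens from standard analysis with externally supplied
# "inject_tokens". All names and data shapes are illustrative.

def fuse_token_streams(analyzed, injected):
    """Merge analyzed tokens with injected entity tokens, ordered by
    start offset; on ties, the longer (entity) token comes first, so
    the entity span overlays the base tokens it covers."""
    combined = analyzed + injected
    return sorted(combined, key=lambda t: (t["offset"], -t["length"]))

text = "Donald Trump Jr. met with russian attorney"
# Tokens a plain lowercase analysis might produce (illustrative).
analyzed = [
    {"token": "donald",   "offset": 0,  "length": 6},
    {"token": "trump",    "offset": 7,  "length": 5},
    {"token": "jr",       "offset": 13, "length": 2},
    {"token": "met",      "offset": 17, "length": 3},
    {"token": "with",     "offset": 21, "length": 4},
    {"token": "russian",  "offset": 26, "length": 7},
    {"token": "attorney", "offset": 34, "length": 8},
]
# The injected entity token, as in the JSON example above.
injected = [{"token": "Donald Trump Jr.", "offset": 0, "length": 16}]

fused = fuse_token_streams(analyzed, injected)
```

In a real custom TokenFilter the injected token would be emitted at the matching position with a position increment of 0 so it stacks on top of the base tokens (as synonym-style filters do); sorting by offset here just mimics that overlay.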