Elasticsearch version: 2.2.0
Description of the problem including expected versus actual behavior: The mapper-attachments plugin (or ingest-attachment) extracts text from plain text and PDF files, but not from the XML-based Office formats.
Steps to reproduce:
- Install mapper-attachments plugin
- Index a Word (.docx) document
- Look at the logs at DEBUG level
Logs:
[2016-02-29 16:43:39,341][DEBUG][mapper.attachment ] Failed to extract [100000] characters of text for [null]: [Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@51667d8a]
...
Caused by: java.lang.IllegalStateException: access denied ("java.lang.RuntimePermission" "getClassLoader")
at org.apache.xmlbeans.XmlBeans.getContextTypeLoader(XmlBeans.java:336)
...
Caused by: java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "getClassLoader")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
Analysis:
Because recent Office documents are XML-based (.docx, .xlsx, ...), Tika's OOXML parser goes through XMLBeans, which calls getClassLoader. That call is denied by the Java security manager when running inside Elasticsearch, so Tika can no longer read these documents.
Reported by many users at https://discuss.elastic.co/t/no-hits-when-do-a-text-search-in-an-attachment-for-docx-file/41779
Switching to the legacy .doc format works well.
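
For readers unfamiliar with the mechanism behind the stack trace, here is a minimal sketch of how a class loader lookup is usually wrapped in a privileged block under the Java security manager. It is an illustration only, not the fix adopted by the plugin: the class name `PrivilegedClassLoaderSketch` and the helper `contextClassLoader()` are hypothetical, and the sketch assumes the calling code has also been granted `java.lang.RuntimePermission "getClassLoader"` in its security policy. Note that `AccessController` is deprecated in recent JDKs, so this may compile with a warning.

```java
import java.security.AccessController;
import java.security.PrivilegedAction;

public class PrivilegedClassLoaderSketch {

    // Hypothetical helper: performs the lookup inside doPrivileged so that,
    // under a security manager, only this code's own permissions are checked
    // rather than those of every caller further up the stack.
    static ClassLoader contextClassLoader() {
        return AccessController.doPrivileged(
            (PrivilegedAction<ClassLoader>) () ->
                Thread.currentThread().getContextClassLoader());
    }

    public static void main(String[] args) {
        // Without a security manager installed this simply returns the
        // thread's context class loader.
        System.out.println("context class loader: " + contextClassLoader());
    }
}
```

In the failing case above, the lookup happens inside XMLBeans (`XmlBeans.getContextTypeLoader`), which the plugin does not control, so a privileged block alone would not help unless the permission is also granted to the code on the call path.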