Support extraction of all metadata#22339
Conversation
As we have here an ingest processor, we can offer extracting all possible metadata instead of only a small subset.
That makes even more interesting the ingest processor as it can receive for example a picture and it will be possible to extract much more information than before.
This PR adds a new property `raw_metadata` which is not set by default. That means that nothing change for users unless they explicitly ask for `"properties": [ "raw_metadata" ]`.
For example:
```
PUT _ingest/pipeline/attachment
{
"description" : "Extract all metadata",
"processors" : [
{
"attachment" : {
"field" : "data",
"properties": [ "raw_metadata" ]
}
}
]
}
PUT my_index/my_type/my_id?pipeline=attachment
{
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/my_type/my_id
```
gives back:
```json
{
"found": true,
"_index": "my_index",
"_type": "my_type",
"_id": "my_id",
"_version": 1,
"_source": {
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"attachment": {
"raw_metadata": {
"X-Parsed-By": "org.apache.tika.parser.rtf.RTFParser",
"Content-Type": "application/rtf"
}
}
}
}
```
Of course, much more metadata can be extracted. For example, this is what a `docx` Word document can generate:
```
"attachment": {
"raw_metadata": {
"date": "2015-02-20T11:36:00Z",
"cp:revision": "22",
"Total-Time": "6",
"extended-properties:AppVersion": "15.0000",
"meta:paragraph-count": "1",
"meta:word-count": "15",
"dc:creator": "Windows User",
"extended-properties:Company": "JDI",
"Word-Count": "15",
"dcterms:created": "2012-10-12T11:17:00Z",
"meta:line-count": "1",
"Last-Modified": "2015-02-20T11:36:00Z",
"dcterms:modified": "2015-02-20T11:36:00Z",
"Last-Save-Date": "2015-02-20T11:36:00Z",
"meta:character-count": "92",
"Template": "Normal.dotm",
"Line-Count": "1",
"Paragraph-Count": "1",
"meta:save-date": "2015-02-20T11:36:00Z",
"meta:character-count-with-spaces": "106",
"Application-Name": "Microsoft Office Word",
"extended-properties:TotalTime": "6",
"modified": "2015-02-20T11:36:00Z",
"Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"X-Parsed-By": "org.apache.tika.parser.microsoft.ooxml.OOXMLParser",
"creator": "Windows User",
"meta:author": "Windows User",
"meta:creation-date": "2012-10-12T11:17:00Z",
"extended-properties:Application": "Microsoft Office Word",
"meta:last-author": "Luka Lampret",
"Creation-Date": "2012-10-12T11:17:00Z",
"xmpTPg:NPages": "1",
"Character-Count-With-Spaces": "106",
"Last-Author": "Luka Lampret",
"Character Count": "92",
"Page-Count": "1",
"Revision-Number": "22",
"Application-Version": "15.0000",
"extended-properties:Template": "Normal.dotm",
"Author": "Windows User",
"publisher": "JDI",
"meta:page-count": "1",
"dc:publisher": "JDI"
}
}
```
|
@spinscale Could you review it please? |
spinscale
left a comment
There was a problem hiding this comment.
left a few comments, not sure we need that extra nesting into raw_metadata - which is not a very descriptive field
| { | ||
| "attachment" : { | ||
| "field" : "data", | ||
| "properties": [ "raw_metadata" ] |
There was a problem hiding this comment.
from a user perspective: why is this called raw_metadata - doesnt this simply mean all?
There was a problem hiding this comment.
Well. It's coming from https://github.com/dadoonet/fscrawler#disabling-raw-metadata where I'm actually putting all that stuff under a meta.raw field.
There was a problem hiding this comment.
Using "raw" because it's unfiltered and not modified.
| "_source": { | ||
| "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=", | ||
| "attachment": { | ||
| "raw_metadata": { |
There was a problem hiding this comment.
same here from a user perspective: why embed it into raw_metadata? couldnt that be part of the upper level attachment data structure?
There was a problem hiding this comment.
Yes we can change that but I think it can be easier for users to separate that. So they can more easily ignore the whole content of this inner object at index time with enabled: false in the mapping.
| for (Map.Entry<String, Object> entry : rawMetadata.entrySet()) { | ||
| logger.info("assertThat(rawMetadata.get(\"{}\"), is(\"{}\"));", entry.getKey(), entry.getValue()); | ||
| }*/ | ||
| assertThat(rawMetadata.get("date"), is("2015-02-20T11:36:00Z")); |
There was a problem hiding this comment.
assertThat(rawMetadata, hasEntry("date", "2015...");
| return parseBase64Document(getAsBase64(file), processor); | ||
| } | ||
|
|
||
| // Adding this method to more easily write the asciidoc documentation |
There was a problem hiding this comment.
I dont understand this comment
There was a problem hiding this comment.
I just let it here as a comment. It helps developers to easily add a new type of file in the test suite, collects all assertions in the logs, then copy and paste the log in the test case itself.
| By default, the `ingest-attachment` plugin only extracts a subset of the most common metadata. | ||
|
|
||
| If you want to get back all the raw metadata, you can set `properties` to `raw_metadata`. | ||
| It will populate a subfield `raw_metadata` with all key/value pairs found as metadata in the document. |
There was a problem hiding this comment.
should we mention that this can be a lot of fields and it might make more sense to select those one wants? Looks pretty verbose to me
There was a problem hiding this comment.
About the number of fields generated, I agree. That's why I never added this feature to the mapper attachments plugin.
Here, we know that ingest can help to filter out some fields later on.
We can indeed add another property and instead of modifying properties, just have a raw_properties list which is by default [ "_none_" ], but can be [ "_all_" ] or a list of specific properties to be included.
So users would write:
{
"attachment" : {
"field" : "data",
"raw_properties": [ "_all_" ]
}
}WDYT?
|
@spinscale Ping? :) |
|
After reading this a few more times I am not too happy with the way the configuration works. Reasons below
The good thing is, we have the rename processor to easily rename all the fields, but I feel we should rethink the configuration of this processor to be more consistent - the single field vs. field thing is just too confusing to me. Also just using the How about a configuration like The user does not care if we extracted a field from the raw data it is the content-length from another field, so I think we should hide that. |
|
@spinscale I really like your proposal of specifying properties we want to extract. In such a case, setting Really smart. I'm going to update my PR. Thanks! |
|
@spinscale So I updated the PR based on your feedback. People can now do: Which will extract all "raw" metadata plus the content itself. Note that I deprecated old field names to avoid any conflict with raw metadata names. So we are now using It's deprecated so we can still read the "old" property names like I wonder if we should have also a special "hack" where instead of: WDYT? |
|
@spinscale ping? |
|
I still think that |
I can do it. It's actually a decision between flexibility vs complexity. So I'm going to implement what you said. Thanks for the feedback! |
|
@spinscale I pushed a new change. LMK. Thanks! |
| throw new IllegalArgumentException(value + " is not one of the known keys"); | ||
| } | ||
|
|
||
| public static ReservedProperty findDeprecatedProperty(String value) { |
| this.key = key; | ||
| } | ||
|
|
||
| public static ReservedProperty parse(String value) { |
| return properties; | ||
| } | ||
|
|
||
| public Set<ReservedProperty> getReservedProperties() { |
|
@spinscale a friendly reminder here. :) |
|
@dadoonet - apologies for such a long PR processes. If you are able to fix the merge conflicts we will pick this back up and work towards getting this merged in. |
# Conflicts: # plugins/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/AttachmentProcessor.java
|
@jakelandis I did merge the master branch into my branch. |
|
@dadoonet - we have had alot of instability in the builds recently. Things are getting better, can you merge master in again ? |
|
@jakelandis I just merged latest master into my branch. I can see that some errors might not be related to my PR. |
|
Ping? Someone has some spare time to review it? |
|
@dadoonet Thx for this PR, can you merge master in again? |
|
Hi @dadoonet. Sorry for the really long delay on this one. We'd like to push it through. If you're still interested, would you re-merge master and fix the conflicts? Thanks. |
|
I created today #78754 which extracts more standard metadata than before. I'm wondering actually if there is a use case for the current PR. Do people need to extract the "raw" metadata and do specific post-treatment on them? Just asking because if we don't want it anymore (because of #78754), it is useless for me to update this PR again. @masseyke WDYT? |
|
Thanks @dadoonet. That makes sense to me. Let me discuss it with the team to make sure there are no objections. I'm not sure if we would still need this one or not. |
Until now, we have been extracted a few number of fields from the binary files sent to the ingest attachment plugin: * `content`, * `title`, * `author`, * `keywords`, * `date`, * `content_type`, * `content_length`, * `language`. Tika has a list of more standard properties which can be extracted: * `modified`, * `format`, * `identifier`, * `contributor`, * `coverage`, * `modifier`, * `creator_tool`, * `publisher`, * `relation`, * `rights`, * `source`, * `type`, * `description`, * `print_date`, * `metadata_date`, * `latitude`, * `longitude`, * `altitude`, * `rating`, * `comments` This commit exposes those new fields. Related to elastic#22339.
Until now, we have been extracted a few number of fields from the binary files sent to the ingest attachment plugin: * `content`, * `title`, * `author`, * `keywords`, * `date`, * `content_type`, * `content_length`, * `language`. Tika has a list of more standard properties which can be extracted: * `modified`, * `format`, * `identifier`, * `contributor`, * `coverage`, * `modifier`, * `creator_tool`, * `publisher`, * `relation`, * `rights`, * `source`, * `type`, * `description`, * `print_date`, * `metadata_date`, * `latitude`, * `longitude`, * `altitude`, * `rating`, * `comments` This commit exposes those new fields. Related to #22339. Co-authored-by: Keith Massey <keith.massey@elastic.co>
Until now, we have been extracted a few number of fields from the binary files sent to the ingest attachment plugin: * `content`, * `title`, * `author`, * `keywords`, * `date`, * `content_type`, * `content_length`, * `language`. Tika has a list of more standard properties which can be extracted: * `modified`, * `format`, * `identifier`, * `contributor`, * `coverage`, * `modifier`, * `creator_tool`, * `publisher`, * `relation`, * `rights`, * `source`, * `type`, * `description`, * `print_date`, * `metadata_date`, * `latitude`, * `longitude`, * `altitude`, * `rating`, * `comments` This commit exposes those new fields. Related to elastic#22339. Co-authored-by: Keith Massey <keith.massey@elastic.co>
Until now, we have been extracted a few number of fields from the binary files sent to the ingest attachment plugin: * `content`, * `title`, * `author`, * `keywords`, * `date`, * `content_type`, * `content_length`, * `language`. Tika has a list of more standard properties which can be extracted: * `modified`, * `format`, * `identifier`, * `contributor`, * `coverage`, * `modifier`, * `creator_tool`, * `publisher`, * `relation`, * `rights`, * `source`, * `type`, * `description`, * `print_date`, * `metadata_date`, * `latitude`, * `longitude`, * `altitude`, * `rating`, * `comments` This commit exposes those new fields. Related to #22339. Co-authored-by: Keith Massey <keith.massey@elastic.co> Co-authored-by: David Pilato <david@pilato.fr>
As we have here an ingest processor, we can offer extracting all possible metadata instead of only a small subset.
That makes even more interesting the ingest processor as it can receive for example a picture and it will be possible to extract much more information than before.
This PR adds a new property
raw_metadatawhich is not set by default. That means that nothing change for users unless they explicitly ask for"properties": [ "raw_metadata" ].For example:
gives back:
{ "found": true, "_index": "my_index", "_type": "my_type", "_id": "my_id", "_version": 1, "_source": { "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=", "attachment": { "raw_metadata": { "X-Parsed-By": "org.apache.tika.parser.rtf.RTFParser", "Content-Type": "application/rtf" } } } }Of course, much more metadata can be extracted. For example, this is what a
docxWord document can generate: