Support extraction of all metadata by dadoonet · Pull Request #22339 · elastic/elasticsearch

dadoonet · 2016-12-23T14:55:40Z

Since this is an ingest processor, we can offer extraction of all available metadata instead of only a small subset.
That makes the ingest processor even more interesting: it can receive a picture, for example, and extract much more information than before.

This PR adds a new property raw_metadata, which is not set by default. Nothing changes for users unless they explicitly ask for "properties": [ "raw_metadata" ].

For example:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract all metadata",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties": [ "raw_metadata" ]
      }
    }
  ]
}
PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/my_type/my_id

gives back:

{
  "found": true,
  "_index": "my_index",
  "_type": "my_type",
  "_id": "my_id",
  "_version": 1,
  "_source": {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment": {
      "raw_metadata": {
        "X-Parsed-By": "org.apache.tika.parser.rtf.RTFParser",
        "Content-Type": "application/rtf"
      }
    }
  }
}

Of course, much more metadata can be extracted. For example, this is what a docx Word document can generate:

"attachment": {
  "raw_metadata": {
    "date": "2015-02-20T11:36:00Z",
    "cp:revision": "22",
    "Total-Time": "6",
    "extended-properties:AppVersion": "15.0000",
    "meta:paragraph-count": "1",
    "meta:word-count": "15",
    "dc:creator": "Windows User",
    "extended-properties:Company": "JDI",
    "Word-Count": "15",
    "dcterms:created": "2012-10-12T11:17:00Z",
    "meta:line-count": "1",
    "Last-Modified": "2015-02-20T11:36:00Z",
    "dcterms:modified": "2015-02-20T11:36:00Z",
    "Last-Save-Date": "2015-02-20T11:36:00Z",
    "meta:character-count": "92",
    "Template": "Normal.dotm",
    "Line-Count": "1",
    "Paragraph-Count": "1",
    "meta:save-date": "2015-02-20T11:36:00Z",
    "meta:character-count-with-spaces": "106",
    "Application-Name": "Microsoft Office Word",
    "extended-properties:TotalTime": "6",
    "modified": "2015-02-20T11:36:00Z",
    "Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "X-Parsed-By": "org.apache.tika.parser.microsoft.ooxml.OOXMLParser",
    "creator": "Windows User",
    "meta:author": "Windows User",
    "meta:creation-date": "2012-10-12T11:17:00Z",
    "extended-properties:Application": "Microsoft Office Word",
    "meta:last-author": "Luka Lampret",
    "Creation-Date": "2012-10-12T11:17:00Z",
    "xmpTPg:NPages": "1",
    "Character-Count-With-Spaces": "106",
    "Last-Author": "Luka Lampret",
    "Character Count": "92",
    "Page-Count": "1",
    "Revision-Number": "22",
    "Application-Version": "15.0000",
    "extended-properties:Template": "Normal.dotm",
    "Author": "Windows User",
    "publisher": "JDI",
    "meta:page-count": "1",
    "dc:publisher": "JDI"
  }
}


dadoonet · 2016-12-23T14:56:58Z

@spinscale Could you review it please?

spinscale

left a few comments, not sure we need that extra nesting into raw_metadata - which is not a very descriptive field

spinscale · 2016-12-23T21:51:01Z

docs/plugins/ingest-attachment.asciidoc

+    {
+      "attachment" : {
+        "field" : "data",
+        "properties": [ "raw_metadata" ]


from a user perspective: why is this called raw_metadata - doesn't this simply mean all?

Well. It's coming from https://github.com/dadoonet/fscrawler#disabling-raw-metadata where I'm actually putting all that stuff under a meta.raw field.

Using "raw" because it's unfiltered and not modified.

spinscale · 2016-12-23T21:51:49Z

docs/plugins/ingest-attachment.asciidoc

+  "_source": {
+    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
+    "attachment": {
+      "raw_metadata": {


same here from a user perspective: why embed it into raw_metadata? couldn't that be part of the upper level attachment data structure?

Yes, we can change that, but I think keeping it separate is easier for users: they can ignore the whole content of this inner object at index time with enabled: false in the mapping.

spinscale · 2016-12-23T22:16:58Z

...t-attachment/src/test/java/org/elasticsearch/ingest/attachment/AttachmentProcessorTests.java

+        for (Map.Entry<String, Object> entry : rawMetadata.entrySet()) {
+            logger.info("assertThat(rawMetadata.get(\"{}\"), is(\"{}\"));", entry.getKey(), entry.getValue());
+        }*/
+        assertThat(rawMetadata.get("date"), is("2015-02-20T11:36:00Z"));


assertThat(rawMetadata, hasEntry("date", "2015..."));

++. Thanks!

spinscale · 2016-12-23T22:17:41Z

...t-attachment/src/test/java/org/elasticsearch/ingest/attachment/AttachmentProcessorTests.java

+        return parseBase64Document(getAsBase64(file), processor);
+    }
+
+    // Adding this method to more easily write the asciidoc documentation


I don't understand this comment

I just left it here as a comment. It helps developers easily add a new type of file to the test suite: it collects all assertions in the logs, which can then be copied and pasted into the test case itself.

spinscale · 2016-12-23T22:18:24Z

docs/plugins/ingest-attachment.asciidoc

+By default, the `ingest-attachment` plugin only extracts a subset of the most common metadata.
+
+If you want to get back all the raw metadata, you can set `properties` to `raw_metadata`.
+It will populate a subfield `raw_metadata` with all key/value pairs found as metadata in the document.


should we mention that this can be a lot of fields and it might make more sense to select those one wants? Looks pretty verbose to me

About the number of fields generated, I agree. That's why I never added this feature to the mapper attachments plugin.
Here, we know that ingest can help to filter out some fields later on.

We can indeed add another property and instead of modifying properties, just have a raw_properties list which is by default [ "_none_" ], but can be [ "_all_" ] or a list of specific properties to be included.

So users would write:

{ "attachment" : { "field" : "data", "raw_properties": [ "_all_" ] } }

WDYT?

@spinscale ping?

dadoonet · 2017-01-20T09:47:42Z

@spinscale Ping? :)

spinscale · 2017-01-23T09:38:11Z

After reading this a few more times I am not too happy with the way the configuration works. Reasons below:

* There is one configuration property that enables dozens of other ones. This is fundamentally different from the other already existing configuration params.
* There are overlaps between existing properties and the raw field names, i.e. Content-Type, Date - who wins here?
* There are inconsistent field names, i.e. meta: (all lowercase) or Last-Save-Date (hyphenated and capitalized) or just publisher vs. Author - which is just part of the territory of file formats I'd say (we can clean this up with a rename processor though).
* There are fields which seem to have been added by the processor rather than coming from the document. The X-Parsed-By header seems to be such a candidate. Also confusing, but still manageable.

The good thing is, we have the rename processor to easily rename all the fields, but I feel we should rethink the configuration of this processor to be more consistent - the single field vs. field thing is just too confusing to me. properties is already an array, but it could be easily configured to be an array of regular expressions (as this is part of the configuration it can be precompiled or we use Regex.simpleMatchToAutomaton(String ... regexes)).

Also, just using the raw configuration to extract one or two more fields and then removing all the others (of which you don't know all the names because of all the different ways to write the field names) does not sound like a good idea to me.

How about a configuration like

{
  "attachment" : {
    "field" : "data",
    "properties": [ "content", "title", "pdf:*", "*-Count" ]
  }
}

The user does not care whether a field was extracted from the raw data or whether, like the content-length, it was derived from something else, so I think we should hide that.
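
For illustration, wildcard-style `properties` such as the ones proposed above could be resolved roughly as follows. This is a standalone sketch, not code from this PR: `PropertyFilter`, `toRegex` and `filter` are hypothetical names, the metadata is assumed to be a plain `Map<String, String>`, and it uses `java.util.regex` rather than the `Regex` helper mentioned above. The `pdf:PDFVersion` key is only sample data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the PR: keeps only the metadata entries
// whose key matches at least one wildcard pattern from "properties".
final class PropertyFilter {

    // Translate a simple wildcard ("pdf:*", "*-Count") into a regex:
    // '*' matches any run of characters, everything else is matched literally.
    static Pattern toRegex(String wildcard) {
        String[] literals = wildcard.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(literals[i]));
        }
        return Pattern.compile(regex.toString());
    }

    static Map<String, String> filter(Map<String, String> metadata, List<String> properties) {
        // Compile the patterns once, since they come from the configuration.
        List<Pattern> patterns = new ArrayList<>();
        for (String property : properties) {
            patterns.add(toRegex(property));
        }
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            for (Pattern pattern : patterns) {
                if (pattern.matcher(entry.getKey()).matches()) {
                    kept.put(entry.getKey(), entry.getValue());
                    break; // one matching pattern is enough
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Word-Count", "15");
        metadata.put("Line-Count", "1");
        metadata.put("Author", "Windows User");
        metadata.put("pdf:PDFVersion", "1.4"); // sample key for illustration
        System.out.println(filter(metadata, List.of("pdf:*", "*-Count")).keySet());
    }
}
```

With the sample map in `main`, the patterns `pdf:*` and `*-Count` keep `Word-Count`, `Line-Count` and `pdf:PDFVersion` and drop `Author`. A single `*` keeps every key.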

dadoonet · 2017-01-31T08:19:43Z

@spinscale I really like your proposal of specifying properties we want to extract. In such a case, setting "properties": [ "*" ] would extract everything.

Really smart. I'm going to update my PR. Thanks!

dadoonet · 2017-01-31T11:39:34Z

@spinscale So I updated the PR based on your feedback.

People can now do:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract all metadata",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties": [ "*", "_content_" ]
      }
    }
  ]
}

Which will extract all "raw" metadata plus the content itself.

Note that I deprecated old field names to avoid any conflict with raw metadata names. So we are now using _content_, _title_, _author_, _keywords_, _date_, _content_type_, _content_length_, _language_ as "specific" properties.

The old names are only deprecated, so we can still read the "old" property names such as content or title.

I wonder if we should also have a special "hack" where _*_ would match any of the fixed ones.
That way people could write:

        "properties": [ "pdf:*", "_*_" ]

instead of:

        "properties": [ "pdf:*", "_content_", "_title_", "_author_", "_keywords_", "_date_", "_content_type_", "_content_length_", "_language_" ]

WDYT?
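
As a rough illustration of the naming scheme described in this comment (new `_content_`-style names, with the old names still accepted but deprecated), a parser could look like the sketch below. It is not the PR's code: the enum `ReservedField` and its methods are hypothetical stand-ins for the PR's `ReservedProperty`, whose body is not shown in this thread, and the deprecation warning is simply printed to stderr here.

```java
import java.util.Locale;

// Hypothetical enum for illustration only; the PR's own class is ReservedProperty.
enum ReservedField {
    CONTENT, TITLE, AUTHOR, KEYWORDS, DATE, CONTENT_TYPE, CONTENT_LENGTH, LANGUAGE;

    // New-style name, e.g. "_content_type_".
    String key() {
        return "_" + name().toLowerCase(Locale.ROOT) + "_";
    }

    // Old-style (deprecated) name, e.g. "content_type".
    String deprecatedKey() {
        return name().toLowerCase(Locale.ROOT);
    }

    // Accepts both spellings; warns when the deprecated one is used.
    static ReservedField parse(String value) {
        for (ReservedField field : values()) {
            if (field.key().equals(value)) {
                return field;
            }
            if (field.deprecatedKey().equals(value)) {
                System.err.println("[" + value + "] is deprecated, use [" + field.key() + "] instead");
                return field;
            }
        }
        throw new IllegalArgumentException(value + " is not one of the known keys");
    }

    public static void main(String[] args) {
        System.out.println(parse("_title_"));      // new-style name
        System.out.println(parse("content_type")); // deprecated name, still accepted
    }
}
```

Anything that is neither a new-style nor an old-style reserved name, such as a raw metadata key, is rejected by `parse` and would have to be handled by the wildcard matching instead.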

dadoonet · 2017-02-20T22:39:13Z

@spinscale ping?

spinscale · 2017-02-21T08:02:39Z

I still think that "properties": [ "*", "_content_" ] is confusing. Why do I need to match everything, and then some more to really get everything? Should * not match everything plus our own static fields?

dadoonet · 2017-02-21T08:42:40Z

Should * not match everything plus our own static fields?

I can do it. It's actually a trade-off between flexibility and complexity.
That said, in the context of ingest, as we don't add that many static fields, it is easy to remove them manually if they are not needed.

So I'm going to implement what you said.

Thanks for the feedback!

dadoonet · 2017-02-24T18:01:28Z

@spinscale I pushed a new change. LMK. Thanks!

spinscale · 2017-02-27T13:56:16Z

...ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/AttachmentProcessor.java

+            throw new IllegalArgumentException(value + " is not one of the known keys");
+        }
+
+        public static ReservedProperty findDeprecatedProperty(String value) {


spinscale · 2017-02-27T13:56:28Z

...ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/AttachmentProcessor.java

+            this.key = key;
+        }
+
+        public static ReservedProperty parse(String value) {


spinscale · 2017-02-27T13:58:46Z

...ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/AttachmentProcessor.java

        return properties;
    }

+    public Set<ReservedProperty> getReservedProperties() {


no need to be public?

dadoonet · 2018-04-30T07:09:16Z

@spinscale a friendly reminder here. :)

jakelandis · 2019-01-14T16:53:24Z

@dadoonet - apologies for such a long PR process. If you are able to fix the merge conflicts we will pick this back up and work towards getting this merged in.

# Conflicts: # plugins/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/AttachmentProcessor.java

dadoonet · 2019-02-05T14:13:54Z

@jakelandis I did merge the master branch into my branch.
Not sure why the build is failing though.

jakelandis · 2019-02-08T22:59:53Z

@dadoonet - we have had a lot of instability in the builds recently. Things are getting better, can you merge master in again?

dadoonet · 2019-02-11T18:43:40Z

@jakelandis I just merged latest master into my branch. I can see that some errors might not be related to my PR.

dadoonet · 2019-03-12T14:00:39Z

Ping? Does someone have some spare time to review it?

theroch · 2021-02-26T12:30:14Z

@dadoonet Thx for this PR, can you merge master in again?
I'm really interested in this PR

masseyke · 2021-09-29T18:06:07Z

Hi @dadoonet. Sorry for the really long delay on this one. We'd like to push it through. If you're still interested, would you re-merge master and fix the conflicts? Thanks.

dadoonet · 2021-10-06T12:47:17Z

I created today #78754 which extracts more standard metadata than before.

I'm actually wondering whether there is still a use case for the current PR. Do people need to extract the "raw" metadata and do specific post-processing on it?

Just asking because if we don't want it anymore (because of #78754), it is useless for me to update this PR again.

@masseyke WDYT?

masseyke · 2021-10-07T18:25:03Z

Thanks @dadoonet. That makes sense to me. Let me discuss it with the team to make sure there are no objections. I'm not sure if we would still need this one or not.

Until now, we have been extracting only a small number of fields from the binary files sent to the ingest attachment plugin:

* `content`
* `title`
* `author`
* `keywords`
* `date`
* `content_type`
* `content_length`
* `language`

Tika has a list of more standard properties which can be extracted:

* `modified`
* `format`
* `identifier`
* `contributor`
* `coverage`
* `modifier`
* `creator_tool`
* `publisher`
* `relation`
* `rights`
* `source`
* `type`
* `description`
* `print_date`
* `metadata_date`
* `latitude`
* `longitude`
* `altitude`
* `rating`
* `comments`

This commit exposes those new fields.

Related to #22339.

Co-authored-by: Keith Massey <keith.massey@elastic.co>
Co-authored-by: David Pilato <david@pilato.fr>

masseyke · 2021-12-03T15:34:49Z

@dadoonet I don't think we ever really finished the conversation, but we can close this one now since we have #78754, right?

dadoonet · 2021-12-14T15:42:29Z

@dadoonet I don't think we ever really finished the conversation, but we can close this one now since we have #78754, right?

Yes. I think that if there is some demand, we can always revisit this later.

Let's close it.

dadoonet added :Plugin Ingest Attachment >enhancement v5.2.0 v6.0.0-alpha1 labels Dec 23, 2016

dadoonet self-assigned this Dec 23, 2016

dadoonet requested a review from spinscale December 23, 2016 14:55

spinscale requested changes Dec 23, 2016

Use metadata, hasEntry(X, Y) instead of metadata.get(X), is(Y)

877bba9

dadoonet added v5.3.0 and removed v5.2.0 labels Jan 20, 2017

dadoonet added 2 commits January 31, 2017 09:20

Merge branch 'master' into pr/attachment-add-metadata

9850428

Use a list of wanted extracted fields instead of a global raw_metadata

c268247

clintongormley added v5.4.0 and removed v5.3.0 labels Feb 7, 2017

dadoonet added 3 commits February 21, 2017 09:44

Merge branch 'master' into pr/attachment-add-metadata

1166475

Merge branch 'master' into pr/attachment-add-metadata

47c3e84

Using wildcards should apply to both reserved fields and metadata fields

1617f33

spinscale reviewed Feb 27, 2017

rjernst removed the review label Oct 10, 2018

jakelandis added the team-discuss label Dec 20, 2018

jakelandis removed the team-discuss label Jan 14, 2019

dadoonet added 2 commits February 5, 2019 14:29

Merge branch 'master' into pr/attachment-add-metadata

250128f

# Conflicts: # plugins/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/AttachmentProcessor.java

Fix imports

dfa100c

dadoonet removed the v7.0.0 label Feb 5, 2019

Merge branch 'master' into pr/attachment-add-metadata

645b90f

rjernst added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label May 4, 2020

theroch mentioned this pull request Feb 25, 2021

attachment.date uses content created date of docx files nextcloud/fulltextsearch#612

Open

dadoonet mentioned this pull request Oct 6, 2021

Extract more standard metadata from binary files #78754

Merged

dakrone requested review from masseyke and removed request for masseyke October 14, 2021 15:24

masseyke mentioned this pull request Nov 29, 2021

Extract more standard metadata from binary files (#78754) #81106

Merged

dadoonet closed this Dec 14, 2021
