Add a simple JSON analyzer that emits a token for each leaf value. #33795
jtibshirani wants to merge 1 commit into elastic:object-fields
Conversation
Pinging @elastic/es-search-aggs
jtibshirani force-pushed the branch from cb27622 to 8adbd15, then from 8adbd15 to 2d2e6ec, and from 2d2e6ec to ab7bc9f.
romseygeek left a comment:
Looks like a good start! I left some comments around the implementation of the TokenStream, which can be tricky. In general, state should be set up in reset() and released in close(), and you should avoid doing anything in the constructor, because that can confuse reuse.
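To make the lifecycle contract concrete, here is a minimal, JDK-only sketch of the pattern being described: no heavyweight work in the constructor, per-use state created in reset(), resources released in close(). The class `ParserLikeTokenizer` and its method names are purely illustrative (it stands in for a Lucene Tokenizer wrapping a JSON parser; it does not use the real Lucene API).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Illustrative sketch of the reset()/close() contract, not actual Lucene code.
// BufferedReader stands in for the heavyweight per-use state (e.g. a JsonParser).
class ParserLikeTokenizer implements AutoCloseable {
    private Reader input;          // supplied between uses, like Tokenizer.setReader()
    private BufferedReader parser; // per-use state; must NOT be created in the constructor

    // The constructor does no real work: the stream may be reused,
    // so setup here would confuse reuse.
    ParserLikeTokenizer() {}

    void setReader(Reader input) {
        this.input = input;
    }

    // reset() sets up per-use state from the current input.
    void reset() throws IOException {
        parser = new BufferedReader(input);
    }

    // Stand-in for incrementToken(): next token, or null when exhausted.
    String nextToken() throws IOException {
        return parser.readLine();
    }

    // close() releases resources; this does not belong in reset().
    @Override
    public void close() throws IOException {
        if (parser != null) {
            parser.close();
            parser = null;
        }
    }
}
```

A reusing caller would then do setReader(...), reset(), consume tokens, and only call close() when finished with the stream for good.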
JsonTokenizer() {
    super(ATTRIBUTE_FACTORY);
    termAtt = addAttribute(CharTermAttribute.class);
    jsonParser = createParser(input);
This should be done in reset()
Got it -- I now understand the contract around reset()/close().
@Override
public void reset() throws IOException {
    super.reset();
    jsonParser.close();
This should only be done in close()
JsonTokenizer() {
    super(ATTRIBUTE_FACTORY);
    termAtt = addAttribute(CharTermAttribute.class);
You can do this directly on the member variable:
private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
Right, I just prefer to assign instance variables in constructors when possible (I find it clearer and more consistent).
 * as it's package private.
 *
 * TODO: add a version of BaseTokenStreamTestCase.assertAnalyzesTo that sets useCharFilter to false.
 */
Can you open a Lucene issue for this? Meanwhile, I think it's a good idea to copy the implementation of checkResetException() into the test code here, as it checks the contract pretty strictly.
I'll still plan to open an issue, as this seems useful in general.
assertAnalyzesTo(
    "{ \"key\": null }",
    new String[] { "null" });
}
Are we going to be able to tell the difference between a null key, and a string with the value "null"? Do we need to?
I will think about this as I move over to the new approach -- it would be good to have better null handling.
Tokenizer tokenizer = new JsonTokenizer();
return new TokenStreamComponents(tokenizer);
    }
}
Do we actually need a tokenizer? Based on the current plan for object fields, we'd only index keywords, so we could do the JSON parsing on top of Lucene and add regular StringField instances to the documents to index, similarly to KeywordFieldMapper?
Thanks @romseygeek for taking a look! @jpountz I was wondering the same myself, and would be interested in both of your thoughts on this. Here were the main advantages I saw to adding a tokenizer:
I also wonder if having a tokenizer would make it easier to support highlighting (I really need to get more clarity on this piece still…)
FWIW Lucene doesn't require that the same content is indexed and stored: you could add one Lucene StringField that is unstored for every value and one Lucene StoredField that stores the whole json document - this wouldn't be a problem.
It might, but I'm also not clear how highlighting would work on a JSON document. For instance, just ensuring that inserting tags around matches doesn't break the structure of the JSON doc would be challenging?
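The separate stored/indexed idea above can be sketched with standard Lucene document fields: one unstored, indexed StringField per leaf value, plus one StoredField holding the raw JSON. This is a hedged sketch only -- the field names ("json", "json_source") and the hard-coded leaf values are illustrative assumptions; real code would extract the leaves by walking the document with a JSON parser.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;

// Sketch: index each leaf value without storing it, and store the
// whole JSON source once. Field names and leaf values are illustrative.
Document doc = new Document();
String rawJson = "{ \"key\": \"value\", \"num\": 42 }";
for (String leaf : new String[] { "value", "42" }) {
    doc.add(new StringField("json", leaf, Field.Store.NO)); // indexed, not stored
}
doc.add(new StoredField("json_source", rawJson)); // stored, not indexed
```

Searches then match against the per-leaf "json" terms, while the original document can still be retrieved intact from "json_source".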
Thanks for the tip around separate stored and indexed content, I wasn't aware of that. I took some time to try out the alternate approach (parsing the JSON outside of Lucene). I thought it worked well: it was nice to avoid parsing the JSON twice, and to gain more control over validation. I also got a better handle on the requirements for highlighting, and while it would be possible with a 'JSON analyzer' that produces (non-indexed) offsets, the code didn't turn out as cleanly as I hoped. I'm also not sure that highlighting this field is actually a high priority for users, and would want more feedback before we commit to adding it. I have a follow-up question around your highlighting comment, but will move it over to the meta-issue so we can keep the design discussion in one place. For these reasons, I'm planning to close this PR and go with the alternate approach for now. I'm happy we did this code review anyway -- I'm sure one day I will have to work with a tokenizer :)
I'm not too familiar with creating a new analyzer/tokenizer, so I wanted to put something simple up for early feedback. Note that this PR is against the feature branch object-fields.