Add a simple JSON analyzer that emits a token for each leaf value. #33795

Closed

jtibshirani wants to merge 1 commit into elastic:object-fields from jtibshirani:json-analyzer

Conversation

@jtibshirani
Contributor

I'm not too familiar with creating a new analyzer/tokenizer, so I wanted to put something simple up for early feedback. Note that this PR is against the feature branch object-fields.

@jtibshirani jtibshirani added the :Search Foundations/Mapping Index mappings, including merging and defining field types label Sep 18, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@jtibshirani jtibshirani changed the title Add a simple JSON analyzer emits a token for each leaf value. Add a simple JSON analyzer that emits a token for each leaf value. Sep 18, 2018
Contributor

@romseygeek romseygeek left a comment


Looks like a good start! I left some comments around the implementation of the TokenStream, which can be tricky. In general, state should be set up in reset() and closed in close(), and you should avoid doing anything in the constructor, because that can confuse reuse.

JsonTokenizer() {
super(ATTRIBUTE_FACTORY);
termAtt = addAttribute(CharTermAttribute.class);
jsonParser = createParser(input);
Contributor


This should be done in reset()

Contributor Author


Got it -- I now understand the contract around reset/close.

@Override
public void reset() throws IOException {
super.reset();
jsonParser.close();
Contributor


This should only be done in close()
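The reset/close contract discussed in this thread can be sketched without Lucene at all. Below is a minimal, dependency-free model of the lifecycle: the class and method names mirror the shape of Lucene's Tokenizer, but this is not the Lucene (or Jackson) API, and the value scanner is deliberately naive — it handles flat JSON objects only, with no arrays and no escaped quotes.

```java
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Dependency-free model of the reviewer's lifecycle contract:
// constructor does no work, reset() sets up per-document state,
// close() releases resources. NOT the Lucene API.
class LeafValueTokenizer implements Closeable {
    private Reader input;           // supplied per document, as with setReader()
    private BufferedReader reader;  // per-document state: created in reset()
    private String buffered;

    // The constructor touches no input -- per the review comment,
    // doing work here would confuse reuse.
    LeafValueTokenizer() {}

    void setReader(Reader input) {
        this.input = input;
    }

    // reset() sets up consumption state for the current document.
    void reset() throws IOException {
        reader = new BufferedReader(input);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        buffered = sb.toString();
    }

    // Emits one token per leaf value (string, number, boolean, null);
    // object keys are skipped.
    List<String> allTokens() {
        List<String> tokens = new ArrayList<>();
        boolean afterColon = false;
        int i = 0, n = buffered.length();
        while (i < n) {
            char ch = buffered.charAt(i);
            if (Character.isWhitespace(ch)) {
                i++;
            } else if (ch == '"') {
                int end = buffered.indexOf('"', i + 1);
                if (afterColon) {
                    tokens.add(buffered.substring(i + 1, end)); // string value
                }
                afterColon = false; // a quoted string before ':' was a key
                i = end + 1;
            } else if (ch == ':') {
                afterColon = true;
                i++;
            } else if (ch == '{' || ch == '}' || ch == ',') {
                afterColon = false;
                i++;
            } else { // bare literal: number, true, false, null
                int start = i;
                while (i < n && ",} \n\t\r".indexOf(buffered.charAt(i)) < 0) {
                    i++;
                }
                tokens.add(buffered.substring(start, i));
                afterColon = false;
            }
        }
        return tokens;
    }

    // close() releases resources; after close(), reset() must run again
    // before the tokenizer is reused.
    @Override
    public void close() throws IOException {
        if (reader != null) {
            reader.close();
        }
    }
}
```

On the input from the PR's test, `{ "key": null }`, this sketch emits the single token "null"; the point is only to show where each piece of work belongs in the constructor/reset/close lifecycle.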


JsonTokenizer() {
super(ATTRIBUTE_FACTORY);
termAtt = addAttribute(CharTermAttribute.class);
Contributor


You can do this directly on the member variable:

private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class)

Contributor Author


Right, I just prefer to assign instance variables in constructors when possible (I find it clearer/more consistent).

* as it's package private.
*
* TODO: add a version of BaseTokenStreamTestCase.assertAnalyzesTo that sets useCharFilter to false.
*/
Contributor


Can you open a lucene issue for this? Meanwhile, I think it's a good idea to copy the implementation of checkResetException() into the test code here, as it checks the contract pretty strictly.

Contributor Author


I'll still plan to open an issue, as this seems useful in general.

assertAnalyzesTo(
"{ \"key\": null }",
new String[] { "null" });
}
Contributor


Are we going to be able to tell the difference between a null key, and a string with the value "null"? Do we need to?

Contributor Author


I will think about this as I move over to the new approach -- it would be good to have better null handling.

Tokenizer tokenizer = new JsonTokenizer();
return new TokenStreamComponents(tokenizer);
}
}
Contributor


👍

@jpountz
Contributor

jpountz commented Sep 18, 2018

Do we actually need a tokenizer? Based on the current plan for object fields, we'd only index keywords, so we could do the json parsing on top of Lucene and add regular StringField instances to the documents to index, similarly to KeywordFieldMapper?

@jtibshirani
Contributor Author

jtibshirani commented Sep 18, 2018

Thanks @romseygeek for taking a look!

@jpountz I was wondering the same myself, and would be interested in both of your thoughts on this. Here were the main advantages I saw to adding a tokenizer:

  • If the user chooses to set store: true, then the whole JSON blob would come back, as opposed to a flattened list of individual fields. I think this is more what you’d expect as a user. The individual tokens will look even more confusing when we begin to prefix them with the JSON keys.
  • We’ll naturally be able to support object fields supplied as strings ({"headers": "{ \"content-type\": \"application/json\" }"}). Looking through some beats examples, I saw a few cases of stringified JSON like this. This isn’t a very strong argument, since we could also support it by parsing on top of Lucene.

I also wonder if having a tokenizer would make it easier to support highlighting (I really need to get more clarity on this piece still…)

@jpountz
Contributor

jpountz commented Sep 18, 2018

If the user chooses to set store: true, then the whole JSON blob would come back, as opposed to a flattened list of individual fields.

FWIW Lucene doesn't require that the same content is indexed and stored: you could add one Lucene StringField that is unstored for every value and one Lucene StoredField that stores the whole json document - this wouldn't be a problem.
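This split between indexed and stored content can be sketched with a tiny model of a document's fields. The indexed/stored flags below mimic the semantics of Lucene's StringField (indexed, Store.NO) and StoredField (stored only), but the classes are illustrative stand-ins, not the Lucene API, and the leaf values are assumed to come from JSON parsing done on top of Lucene, as the comment suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for a Lucene field: a name/value pair plus
// indexed/stored flags. NOT the Lucene API.
class ModelField {
    final String name;
    final String value;
    final boolean indexed;
    final boolean stored;

    ModelField(String name, String value, boolean indexed, boolean stored) {
        this.name = name;
        this.value = value;
        this.indexed = indexed;
        this.stored = stored;
    }
}

class JsonFieldBuilder {
    // Index one unstored field per leaf value, and store the whole JSON
    // blob exactly once -- so a stored-fields lookup returns the original
    // document, not the flattened value list.
    static List<ModelField> build(String fieldName, List<String> leafValues, String wholeJson) {
        List<ModelField> fields = new ArrayList<>();
        for (String value : leafValues) {
            fields.add(new ModelField(fieldName, value, true, false)); // like StringField, Store.NO
        }
        fields.add(new ModelField(fieldName, wholeJson, false, true)); // like StoredField
        return fields;
    }
}
```

Under this scheme, setting store: true would return wholeJson to the user while queries still match against the individual leaf values.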

I also wonder if having a tokenizer would make it easier to support highlighting

It might, but I'm also not clear how highlighting would work on a json document. For instance, just ensuring that inserting tags around matches doesn't break the structure of the json doc would be challenging?

@jtibshirani
Contributor Author

Thanks for the tip around separate stored + indexed content, I wasn't aware of that.

I took some time to try out the alternate approach (parsing the JSON outside of lucene). I thought it worked well: it was nice to avoid parsing the JSON twice, and to gain more control over validation. I also got a better handle on the requirements for highlighting, and while it would be possible with a 'JSON analyzer' that produces (non-indexed) offsets, the code didn’t turn out as cleanly as I hoped. I’m also not sure that highlighting this field is actually a high priority for users, and would need more feedback before we commit to adding it. I have a follow-up question around your highlighting comment, but will move it over to the meta-issue so we can keep the design discussion in one place.

For these reasons, I’m planning to close this PR and go with the alternate approach for now. I'm happy we did this code review anyways -- I'm sure one day I will have to work with a tokenizer :)

@jtibshirani jtibshirani deleted the json-analyzer branch September 19, 2018 22:40