Add a simple JSON analyzer that emits a token for each leaf value. #33795
jtibshirani wants to merge 1 commit into elastic:object-fields
Conversation
Pinging @elastic/es-search-aggs
jtibshirani force-pushed the branch from cb27622 to 8adbd15, then from 8adbd15 to 2d2e6ec, and from 2d2e6ec to ab7bc9f.
romseygeek left a comment:
Looks like a good start! I left some comments around the implementation of the TokenStream, which can be tricky. In general, state should be set up in reset() and released in close(), and you should avoid doing anything in the constructor, because that can confuse reuse.
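To make the lifecycle contract concrete, here is a minimal, JDK-only sketch of the pattern being described: no heavyweight work in the constructor, per-use state created in reset(), resources released in close(). The class `ParserLikeTokenizer` and its method names are purely illustrative (it stands in for a Lucene Tokenizer wrapping a JSON parser; it does not use the real Lucene API).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Illustrative sketch of the reset()/close() contract, not actual Lucene code.
// BufferedReader stands in for the heavyweight per-use state (e.g. a JsonParser).
class ParserLikeTokenizer implements AutoCloseable {
    private Reader input;          // supplied between uses, like Tokenizer.setReader()
    private BufferedReader parser; // per-use state; must NOT be created in the constructor

    // The constructor does no real work: the stream may be reused,
    // so setup here would confuse reuse.
    ParserLikeTokenizer() {}

    void setReader(Reader input) {
        this.input = input;
    }

    // reset() sets up per-use state from the current input.
    void reset() throws IOException {
        parser = new BufferedReader(input);
    }

    // Stand-in for incrementToken(): next token, or null when exhausted.
    String nextToken() throws IOException {
        return parser.readLine();
    }

    // close() releases resources; this does not belong in reset().
    @Override
    public void close() throws IOException {
        if (parser != null) {
            parser.close();
            parser = null;
        }
    }
}
```

A reusing caller would then do setReader(...), reset(), consume tokens, and only call close() when finished with the stream for good.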
JsonTokenizer() {
    super(ATTRIBUTE_FACTORY);
    termAtt = addAttribute(CharTermAttribute.class);
    jsonParser = createParser(input);
This should be done in reset()
Got it -- I now understand the contract around reset()/close().
@Override
public void reset() throws IOException {
    super.reset();
    jsonParser.close();
This should only be done in close()
JsonTokenizer() {
    super(ATTRIBUTE_FACTORY);
    termAtt = addAttribute(CharTermAttribute.class);
You can do this directly on the member variable:
private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
Right, I just prefer to assign instance variables in constructors when possible (I find it clearer and more consistent).
 * as it's package private.
 *
 * TODO: add a version of BaseTokenStreamTestCase.assertAnalyzesTo that sets useCharFilter to false.
 */
Can you open a Lucene issue for this? Meanwhile, I think it's a good idea to copy the implementation of checkResetException() into the test code here, as it checks the contract pretty strictly.
I'll still plan to open an issue, as this seems useful in general.
assertAnalyzesTo(
    "{ \"key\": null }",
    new String[] { "null" });
}
Are we going to be able to tell the difference between a null key, and a string with the value "null"? Do we need to?
I will think about this as I move over to the new approach -- it would be good to have better null handling.
Tokenizer tokenizer = new JsonTokenizer();
return new TokenStreamComponents(tokenizer);
    }
}
Do we actually need a tokenizer? Based on the current plan for object fields, we'd only index keywords, so we could do the JSON parsing on top of Lucene and add regular StringField instances to the documents to index, similarly to KeywordFieldMapper?
Thanks @romseygeek for taking a look! @jpountz I was wondering the same myself, and would be interested in both of your thoughts on this. Here were the main advantages I saw to adding a tokenizer:
I also wonder if having a tokenizer would make it easier to support highlighting (I really need to get more clarity on this piece still…)
FWIW Lucene doesn't require that the same content is indexed and stored: you could add one Lucene StringField that is unstored for every value and one Lucene StoredField that stores the whole json document - this wouldn't be a problem.
It might, but I'm also not clear how highlighting would work on a JSON document. For instance, just ensuring that inserting tags around matches doesn't break the structure of the JSON doc would be challenging?
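The separate stored/indexed idea above can be sketched with standard Lucene document fields: one unstored, indexed StringField per leaf value, plus one StoredField holding the raw JSON. This is a hedged sketch only -- the field names ("json", "json_source") and the hard-coded leaf values are illustrative assumptions; real code would extract the leaves by walking the document with a JSON parser.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;

// Sketch: index each leaf value without storing it, and store the
// whole JSON source once. Field names and leaf values are illustrative.
Document doc = new Document();
String rawJson = "{ \"key\": \"value\", \"num\": 42 }";
for (String leaf : new String[] { "value", "42" }) {
    doc.add(new StringField("json", leaf, Field.Store.NO)); // indexed, not stored
}
doc.add(new StoredField("json_source", rawJson)); // stored, not indexed
```

Searches then match against the per-leaf "json" terms, while the original document can still be retrieved intact from "json_source".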
Thanks for the tip around separate stored and indexed content, I wasn't aware of that. I took some time to try out the alternate approach (parsing the JSON outside of Lucene). I thought it worked well: it was nice to avoid parsing the JSON twice, and to gain more control over validation. I also got a better handle on the requirements for highlighting, and while it would be possible with a 'JSON analyzer' that produces (non-indexed) offsets, the code didn't turn out as cleanly as I hoped. I'm also not sure that highlighting this field is actually a high priority for users, and would want more feedback before we commit to adding it. I have a follow-up question around your highlighting comment, but will move it over to the meta-issue so we can keep the design discussion in one place. For these reasons, I'm planning to close this PR and go with the alternate approach for now. I'm happy we did this code review anyway -- I'm sure one day I will have to work with a tokenizer :)
I'm not too familiar with creating a new analyzer/tokenizer, so I wanted to put something simple up for early feedback. Note that this PR is against the feature branch object-fields.