    if a:
        for i, t, v in self.lang.get_tokens_unprocessed(a):
            yield index + i, t, v
            index += len(a)
This line was a bug: it should have been outside the for loop.
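To illustrate the comment above, here is a minimal sketch of why the placement of `index += len(a)` matters. It does not use Pygments: `word_tokens` is a hypothetical stand-in for `self.lang.get_tokens_unprocessed`, and the two `offsets_*` helpers are invented for this example only.

```python
def word_tokens(chunk):
    """Hypothetical stand-in tokenizer: yields (offset, type, value) per word."""
    pos = 0
    for word in chunk.split(" "):
        yield pos, "Text", word
        pos += len(word) + 1  # skip the word and the following space


def offsets_buggy(chunks):
    """Mimics the old code: the running index is advanced once per token."""
    index = 0
    out = []
    for a in chunks:
        for i, t, v in word_tokens(a):
            out.append(index + i)
            index += len(a)  # bug: runs for every token in the chunk
    return out


def offsets_fixed(chunks):
    """The running index is advanced once per chunk, after its tokens."""
    index = 0
    out = []
    for a in chunks:
        for i, t, v in word_tokens(a):
            out.append(index + i)
        index += len(a)  # outside the for loop
    return out


chunks = ["ab cd", "ef"]
print(offsets_buggy(chunks))  # [0, 8, 10]
print(offsets_fixed(chunks))  # [0, 3, 5]
```

With more than one token per chunk, the buggy version reports offsets that run past the real positions in the text, which is the symptom the comment describes.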
    while text:
        a, sep1, text = text.partition(self.left)
        if a:
            for i, t, v in self.lang.get_tokens_unprocessed(a):
This wasn't safe: the nested tokenizer needs to see all the text at once, not be served chunks of it one at a time.
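A small sketch of the problem described above, assuming a toy stateful tokenizer rather than a real Pygments lexer. `quote_tokens` is a hypothetical helper that treats everything between double quotes as a string; it is not part of Pygments or of this PR.

```python
from itertools import groupby


def quote_tokens(text):
    """Hypothetical stateful tokenizer: text inside double quotes is 'String'."""
    in_string = False
    labelled = []
    for ch in text:
        if ch == '"':
            labelled.append(("String", ch))  # the quote belongs to the string
            in_string = not in_string
        else:
            labelled.append(("String" if in_string else "Text", ch))
    # Merge runs of identically labelled characters into single tokens.
    pos = 0
    for kind, group in groupby(labelled, key=lambda pair: pair[0]):
        value = "".join(ch for _, ch in group)
        yield pos, kind, value
        pos += len(value)


text = 'say "hello" now'
chunks = ['say "hel', 'lo" now']  # the same text, split inside the string

whole = [(t, v) for _, t, v in quote_tokens(text)]
chunked = [(t, v) for chunk in chunks for _, t, v in quote_tokens(chunk)]

print(whole)    # [('Text', 'say '), ('String', '"hello"'), ('Text', ' now')]
print(chunked)  # [('Text', 'say '), ('String', '"hel'), ('Text', 'lo'), ('String', '" now')]
```

Because the tokenizer's state is reset for every chunk, the second chunk is classified as if the string had never been opened. Any lexer with multi-token constructs (strings, comments, nested states) can be affected in the same way when it is fed fragments.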
b4d1806 to 8f01f2d
Anteru left a comment
Thanks! We have some merge conflicts due to the latest yield from improvements -- could you please take a look and update this as needed?
Conflicts resolved; the changes in #1537 were irrelevant.
I did look at this and it all seems fine, but I don't quite understand the original code. Please bear with me -- I just need to convince myself it works (I'm fairly certain it does no harm :) ).
I wouldn't spend too much time understanding the old code: it had both indentation mistakes and a flawed design...
Fair enough -- it does have tests now, so I doubt it can be any worse than the original one; sorry for taking so long to merge.
Closes gh-1516