Review analyzers #990

Merged
kermitt2 merged 7 commits into master from review-analyzers on Feb 28, 2023
@kermitt2
Collaborator

  • make subtokenization possible by separating digits from letters (based on Unicode character class)
  • review the Korean tokenizer
  • complete retokenization for CJK
  • apply the subtokenization to the citation parser

This makes it possible to obtain numerical tokens even when digits are mixed with letters, for instance in Korean:

Before:
Screenshot from 2023-02-28 17-48-49

After (look at the issue field):
Screenshot from 2023-02-27 13-43-02
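
To illustrate the idea of splitting digits from letters by Unicode class, here is a minimal sketch in Python. It is not the code from this pull request: the names `char_class` and `subtokenize` are hypothetical, and the real analyzers may treat punctuation and other classes differently. It only assumes that a token is cut wherever the Unicode general category switches between decimal digits (`Nd`), letters (`L*`), and anything else.

```python
import unicodedata
from itertools import groupby


def char_class(ch: str) -> str:
    """Map a character to a coarse class from its Unicode general category."""
    category = unicodedata.category(ch)
    if category == "Nd":          # decimal digits in any script
        return "digit"
    if category.startswith("L"):  # letters, including Hangul syllables (Lo)
        return "letter"
    return "other"                # punctuation, symbols, spaces, ...


def subtokenize(token: str) -> list[str]:
    """Split a token at every change of coarse character class (hypothetical helper)."""
    return ["".join(group) for _, group in groupby(token, key=char_class)]


if __name__ == "__main__":
    # A Korean volume/issue string with digits glued to Hangul letters.
    print(subtokenize("제12권3호"))
```

With this kind of split, the digits in a mixed token such as `제12권3호` become standalone tokens, which is what lets a downstream sequence labeller tag an issue number on its own.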

@kermitt2 kermitt2 merged commit b847753 into master Feb 28, 2023
@lfoppiano lfoppiano deleted the review-analyzers branch March 21, 2026 20:57