@mh-northlander
Contributor

@mh-northlander mh-northlander commented Jun 24, 2024

Current tokenization methods analyze the whole input text at once and may run out of memory (OOM) on long input.
This PR adds methods for lazy analysis.
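To illustrate the idea of lazy analysis, here is a minimal sketch that yields sentences one at a time from a `Reader` instead of loading the whole text. It is not the code in this PR: the class `LazySentenceIterator`, its `collect` helper, the chunk size parameter, and the simple boundary rule (split after `。` or a newline) are all assumptions made for the example. Note that I/O errors surface during iteration here, wrapped in `UncheckedIOException`, which is also an assumption.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical sketch: reads a Reader chunk by chunk and yields one sentence at a time. */
public class LazySentenceIterator implements Iterator<String> {
    private final Reader input;
    private final char[] chunk;
    private final StringBuilder pending = new StringBuilder(); // text read but not yet returned
    private boolean eof = false;
    private String next;

    public LazySentenceIterator(Reader input, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.input = input;
        this.chunk = new char[chunkSize];
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            next = advance();
        }
        return next != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String result = next;
        next = null;
        return result;
    }

    /** Returns the next sentence, reading more input only when needed; null at end of input. */
    private String advance() {
        int searchFrom = 0;
        while (true) {
            int end = findBoundary(searchFrom);
            if (end >= 0) {
                String sentence = pending.substring(0, end + 1);
                pending.delete(0, end + 1);
                return sentence;
            }
            if (eof) {
                if (pending.length() == 0) {
                    return null;
                }
                String rest = pending.toString(); // trailing text without a boundary
                pending.setLength(0);
                return rest;
            }
            searchFrom = pending.length(); // only scan newly read characters
            fill();
        }
    }

    /** Assumed boundary rule: a sentence ends at an ideographic full stop or a newline. */
    private int findBoundary(int from) {
        for (int i = from; i < pending.length(); i++) {
            char c = pending.charAt(i);
            if (c == '\u3002' || c == '\n') {
                return i;
            }
        }
        return -1;
    }

    private void fill() {
        try {
            int n = input.read(chunk);
            if (n < 0) {
                eof = true;
            } else {
                pending.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // error appears mid-iteration, not up front
        }
    }

    /** Convenience helper for small inputs and tests. */
    public static List<String> collect(Reader input, int chunkSize) {
        List<String> out = new ArrayList<>();
        new LazySentenceIterator(input, chunkSize).forEachRemaining(out::add);
        return out;
    }
}
```

With this shape, memory use is bounded by the chunk size plus the longest single sentence rather than by the total input length.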

WIP:

  • add test
  • IOTools.readAsMuchAsCan may split a surrogate pair.
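
One possible way to handle the surrogate pair item is to hold back a trailing high surrogate until the next read, so a chunk never ends in the middle of a pair. The sketch below is an illustration only, not the actual `IOTools` code: the class `SurrogateSafeReader` and its `chunks` helper are hypothetical names, and the buffer is assumed to hold at least two chars.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: fills a buffer without ending a chunk on a lone high surrogate. */
public class SurrogateSafeReader {
    private final Reader input;
    private int carried = -1; // high surrogate held back from the previous read, or -1

    public SurrogateSafeReader(Reader input) {
        this.input = input;
    }

    /** Reads as many chars as fit in buf; returns the count, or -1 at end of input. */
    public int read(char[] buf) throws IOException {
        if (buf.length < 2) {
            throw new IllegalArgumentException("buffer must hold at least 2 chars");
        }
        int len = 0;
        if (carried >= 0) {
            buf[len++] = (char) carried; // start with the char held back last time
            carried = -1;
        }
        boolean eof = false;
        while (len < buf.length) {
            int n = input.read(buf, len, buf.length - len);
            if (n < 0) {
                eof = true;
                break;
            }
            len += n;
        }
        // If the buffer is full and ends on a high surrogate, its low half may still
        // be unread, so keep the high half for the next call.
        if (!eof && Character.isHighSurrogate(buf[len - 1])) {
            carried = buf[--len];
        }
        return (eof && len == 0) ? -1 : len;
    }

    /** Convenience helper for tests: splits text into the chunks read() would return. */
    public static List<String> chunks(String text, int bufSize) {
        SurrogateSafeReader reader = new SurrogateSafeReader(new StringReader(text));
        char[] buf = new char[bufSize];
        List<String> out = new ArrayList<>();
        try {
            int n;
            while ((n = reader.read(buf)) >= 0) {
                out.add(new String(buf, 0, n));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

The trade-off in this sketch is that a chunk may be one char shorter than the buffer. A lone high surrogate at the very end of the input is passed through unchanged.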

@mh-northlander
Contributor Author

SonarCloud fails because SentenceSplittingLazyAnalysis duplicates code from SentenceSplittingAnalysis, but I want to keep both as they are for now. I think we will eventually replace SentenceSplittingAnalysis with SentenceSplittingLazyAnalysis.

Maybe we can replace it now, although that would change how the IOException is thrown.

@mh-northlander mh-northlander changed the title WIP: Lazy sentence split and tokenization Lazy sentence split and tokenization Jun 25, 2024
@mh-northlander mh-northlander requested a review from kazuma-t June 25, 2024 02:44
@sonarqubecloud

Quality Gate failed

Failed conditions
3.9% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

@mh-northlander mh-northlander merged commit 26e731b into develop Jun 26, 2024
@mh-northlander mh-northlander deleted the feature/lazy-tokenize-sentences branch June 26, 2024 01:27
@azagniotov

Hello team, @kazuma-t and @mh-northlander, I see a lot of good work here. Is there any update on when a new version of Sudachi will be released to Maven Central? 😃

4 participants