Lazy sentence split and tokenization #231

mh-northlander · 2024-06-24T05:25:06Z

The current tokenization methods analyze the whole input text at once, which may cause an OutOfMemoryError on long input.
This PR adds methods for lazy analysis.
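
To illustrate the idea only, here is a minimal sketch of lazy sentence splitting over a Reader. It is not the code in this PR: the class name LazySentences, the helper splitAll, and the terminator set are all assumptions made for the example. The point is that only the text up to the next sentence boundary is buffered, instead of the whole input.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical illustration; not Sudachi's actual implementation. */
public class LazySentences implements Iterator<String> {
    private final Reader input;
    private final char[] chunk;
    private final StringBuilder pending = new StringBuilder();
    private boolean eof = false;

    public LazySentences(Reader input, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.input = input;
        this.chunk = new char[chunkSize];
    }

    // Index just past the first sentence terminator in the pending text, or -1.
    // The terminator set is an assumption chosen for this example.
    private int sentenceEnd() {
        for (int i = 0; i < pending.length(); i++) {
            char c = pending.charAt(i);
            if (c == '。' || c == '\n') {
                return i + 1;
            }
        }
        return -1;
    }

    @Override
    public boolean hasNext() {
        // Read more only until one full sentence is buffered or the input ends.
        while (!eof && sentenceEnd() < 0) {
            try {
                int n = input.read(chunk);
                if (n < 0) {
                    eof = true;
                } else {
                    pending.append(chunk, 0, n);
                }
            } catch (IOException e) {
                // Iterator methods cannot throw checked exceptions.
                throw new UncheckedIOException(e);
            }
        }
        return pending.length() > 0;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int end = sentenceEnd();
        if (end < 0) {
            end = pending.length(); // trailing text without a terminator
        }
        String sentence = pending.substring(0, end);
        pending.delete(0, end);
        return sentence;
    }

    /** Convenience for demonstration: drains the iterator into a list. */
    public static List<String> splitAll(String text, int chunkSize) {
        List<String> result = new ArrayList<>();
        new LazySentences(new StringReader(text), chunkSize).forEachRemaining(result::add);
        return result;
    }
}
```

In real use the caller would consume the iterator one sentence at a time and tokenize each sentence as it arrives, so memory use is bounded by the longest sentence rather than by the input size.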

WIP:

Add tests.
IOTools.readAsMuchAsCan may split a surrogate pair.
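
The surrogate pair concern can be sketched as follows. This is a hedged example, not the wrapper introduced in this PR: SurrogateSafeChunkReader, readChunk, and chunks are hypothetical names. When a fixed-size read ends on a high surrogate, the example holds that char back and prepends it to the next chunk, so no chunk ends in the middle of a code point.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of surrogate-aware chunked reading. */
public class SurrogateSafeChunkReader {
    private final Reader input;
    private int carried = -1; // high surrogate held back from the previous chunk, or -1

    public SurrogateSafeChunkReader(Reader input) {
        this.input = input;
    }

    /** Returns the next chunk of at most maxChars chars, or null at end of input. */
    public String readChunk(int maxChars) throws IOException {
        if (maxChars < 2) {
            throw new IllegalArgumentException("maxChars must be at least 2");
        }
        char[] buf = new char[maxChars];
        int len = 0;
        if (carried >= 0) {
            buf[len++] = (char) carried;
            carried = -1;
        }
        // Fill the buffer as far as the input allows.
        while (len < maxChars) {
            int n = input.read(buf, len, maxChars - len);
            if (n < 0) {
                break;
            }
            len += n;
        }
        if (len == 0) {
            return null;
        }
        // A full buffer ending in a high surrogate would cut a pair in half,
        // so keep that char for the next chunk instead.
        if (len == maxChars && Character.isHighSurrogate(buf[len - 1])) {
            carried = buf[--len];
        }
        return new String(buf, 0, len);
    }

    /** Convenience for demonstration: reads a whole string as a list of chunks. */
    public static List<String> chunks(String text, int maxChars) {
        SurrogateSafeChunkReader reader = new SurrogateSafeChunkReader(new StringReader(text));
        List<String> result = new ArrayList<>();
        try {
            for (String c = reader.readChunk(maxChars); c != null; c = reader.readChunk(maxChars)) {
                result.add(c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

With a naive two-char read, the input "a😀b" would be cut between the two halves of the emoji; the sketch above returns the emoji as one intact chunk instead.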

mh-northlander · 2024-06-25T02:43:09Z

SonarCloud fails because SentenceSplittingLazyAnalysis duplicates code from SentenceSplittingAnalysis, but I want to keep both as they are. I think we will eventually replace SentenceSplittingAnalysis with SentenceSplittingLazyAnalysis.

Maybe we can replace it now, although that would change how the IOException is thrown.
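
For readers unfamiliar with why laziness affects exception behavior, here is a generic sketch. It does not reproduce either Sudachi class; ExceptionTiming, eagerLines, lazyLines, and failingReader are hypothetical names. An eager method reads everything inside the call and can declare IOException, while a lazy iterator reads on demand, so a failure surfaces later during iteration and has to be unchecked (UncheckedIOException is one possible choice, assumed here).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical illustration of eager versus lazy exception timing. */
public class ExceptionTiming {
    /** A reader that always fails, standing in for a broken input source. */
    public static Reader failingReader() {
        return new Reader() {
            @Override
            public int read(char[] buf, int off, int len) throws IOException {
                throw new IOException("simulated read failure");
            }

            @Override
            public void close() {
            }
        };
    }

    /** Eager: all input is read inside this call, so the IOException is thrown here. */
    public static List<String> eagerLines(Reader input) throws IOException {
        BufferedReader reader = new BufferedReader(input);
        List<String> lines = new ArrayList<>();
        for (String line = reader.readLine(); line != null; line = reader.readLine()) {
            lines.add(line);
        }
        return lines;
    }

    /** Lazy: nothing is read in this call; a failure surfaces from hasNext() or next(). */
    public static Iterator<String> lazyLines(Reader input) {
        BufferedReader reader = new BufferedReader(input);
        return new Iterator<String>() {
            private String nextLine;
            private boolean loaded = false;

            @Override
            public boolean hasNext() {
                if (!loaded) {
                    try {
                        nextLine = reader.readLine();
                    } catch (IOException e) {
                        // Iterator methods cannot declare checked exceptions.
                        throw new UncheckedIOException(e);
                    }
                    loaded = true;
                }
                return nextLine != null;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                loaded = false;
                return nextLine;
            }
        };
    }
}
```

Callers that currently catch IOException around the eager call would need to handle the failure at the point of iteration instead, which is the kind of behavior change a replacement would have to account for.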

sonarqubecloud · 2024-06-25T07:57:48Z

Quality Gate failed

Failed conditions
3.9% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

azagniotov · 2024-06-28T00:26:05Z

Hello Team, @kazuma-t and @mh-northlander, I see a lot of good work here. Is there any update on when a new version of Sudachi will be released to Maven Central? 😃

mh-northlander added 7 commits June 24, 2024 14:09

add a method for lazy split and tokenization

ac99be6

fix javadoc link

2ceca78

add tests for a lazy analysis

a606cb7

Use proper exception class

0b8f660

Add tests for error cases and fix code smells

dfb9e1b

add some more tests

403e7bc

introduce surrogate-aware readable wrapper

323e98a

mh-northlander changed the title ~~WIP: Lazy sentence split and tokenization~~ Lazy sentence split and tokenization Jun 25, 2024

mh-northlander requested a review from kazuma-t June 25, 2024 02:44

rename method

d8e4d16

kazuma-t approved these changes Jun 25, 2024

mh-northlander merged commit 26e731b into develop Jun 26, 2024

mh-northlander deleted the feature/lazy-tokenize-sentences branch June 26, 2024 01:27

azagniotov mentioned this pull request Jun 28, 2024

tokenizeSentences() consume huge memory for text is huge #230

Closed
