Allow collecting tokens with no dictionary entry #929

Merged
killergerbah merged 1 commit into main from collect-ungrouped-segments
Mar 8, 2026

@ShanaryS (Collaborator) commented Mar 8, 2026

fixes #927

The Yomitan parser tokenizes by greedily matching the longest possible dictionary entry, left to right. As a result, some tokens can be non-dictionary runs (everything up to the next set of characters that matches a dictionary entry), and these naturally have no lemmas when we look them up. Such tokens should be created with the token itself as the lemma so that the user can still collect them.
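
To illustrate the behavior described above, here is a minimal sketch of greedy longest-match segmentation. It is not Yomitan's actual code: `segmentLongestMatch`, the `Segment` type, the `maxLength` cap, and the plain `Set` dictionary are all hypothetical, and a real parser would also handle deinflection.

```typescript
// Hypothetical types and names for illustration; not Yomitan's real API.
interface Segment {
    text: string;
    inDictionary: boolean;
}

function segmentLongestMatch(text: string, dictionary: ReadonlySet<string>, maxLength = 8): Segment[] {
    const segments: Segment[] = [];
    let unmatched = '';
    let i = 0;

    while (i < text.length) {
        let match = '';
        // Try the longest candidate first, shrinking until an entry is found.
        for (let len = Math.min(maxLength, text.length - i); len > 0; len--) {
            const candidate = text.slice(i, i + len);
            if (dictionary.has(candidate)) {
                match = candidate;
                break;
            }
        }

        if (match.length === 0) {
            // No entry starts here: extend the current non-dictionary run.
            unmatched += text.charAt(i);
            i += 1;
            continue;
        }

        // A dictionary entry was found: flush any pending non-dictionary run first.
        if (unmatched.length > 0) {
            segments.push({ text: unmatched, inDictionary: false });
            unmatched = '';
        }
        segments.push({ text: match, inDictionary: true });
        i += match.length;
    }

    if (unmatched.length > 0) {
        segments.push({ text: unmatched, inDictionary: false });
    }
    return segments;
}
```

With a toy dictionary containing only 猫 and 好き, the input 猫が好き splits into three segments, and the middle one (が) is the kind of non-dictionary token that previously could not be collected.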

I modified the strategy logic to fall back to exact when the lemma is missing. We can't simply always add the token as its own lemma, because every other caller of lemmatize() still needs the lookup to fail for tokens without a dictionary entry.
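
Roughly, the fallback looks like the sketch below. The names here (`MatchStrategy`, `ResolvedToken`, `resolveForCollection`) and the `lemmatize` signature are assumptions made for illustration and may not match the actual code in this PR. The point is that the fallback lives at the call site, so `lemmatize()` itself keeps returning nothing for unknown tokens.

```typescript
// Hypothetical names; the real strategy type and lemmatize() signature may differ.
type MatchStrategy = 'lemma' | 'exact';

interface ResolvedToken {
    token: string;
    lemmas: string[];
    strategy: MatchStrategy;
}

function resolveForCollection(token: string, lemmatize: (token: string) => string[]): ResolvedToken {
    const lemmas = lemmatize(token);
    if (lemmas.length > 0) {
        return { token, lemmas, strategy: 'lemma' };
    }
    // No lemma found: fall back to exact matching and use the token as its own lemma.
    return { token, lemmas: [token], strategy: 'exact' };
}

// Stub lemmatizer standing in for the real dictionary lookup.
const knownLemmas = new Map<string, string[]>([['食べた', ['食べる']]]);
const stubLemmatize = (token: string): string[] => knownLemmas.get(token) ?? [];

const known = resolveForCollection('食べた', stubLemmatize);
const unknown = resolveForCollection('ー', stubLemmatize);
```

Here `known` keeps its dictionary lemma, while `unknown` falls back to exact with the token itself as the lemma. Calling `stubLemmatize('ー')` directly still returns an empty array, which is the failure other callers rely on.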

@ShanaryS self-assigned this Mar 8, 2026
@ShanaryS added the bug (Something isn't working) label Mar 8, 2026
@ShanaryS force-pushed the collect-ungrouped-segments branch from fd01470 to 9103dcf March 8, 2026 04:58
@cloudflare-workers-and-pages (bot) commented Mar 8, 2026

Deploying asbplayer with Cloudflare Pages

Latest commit: fb2e7ed
Status: ✅  Deploy successful!
Preview URL: https://7776cebf.asbplayer.pages.dev
Branch Preview URL: https://collect-ungrouped-segments.asbplayer.pages.dev

@ShanaryS force-pushed the collect-ungrouped-segments branch from 9103dcf to 04955da March 8, 2026 06:00
@ShanaryS force-pushed the collect-ungrouped-segments branch from 04955da to fb2e7ed March 8, 2026 06:19
@killergerbah (Owner) commented

I just tested this with the Japanese YouTube video mentioned in the original issue, and the ー character in ずーっと is not being rendered on this version of the code.

[image]

@killergerbah (Owner) commented

Ah, never mind, I'm an idiot.

@killergerbah merged commit e357624 into main Mar 8, 2026
2 checks passed
@killergerbah deleted the collect-ungrouped-segments branch March 8, 2026 06:45
@killergerbah added this to the Extension v1.15.0 milestone Mar 8, 2026
@NovaKing001 commented

Yeah, that was a miscommunication on my part. I made those subtitles using whisper.cpp, lol.

Development

Successfully merging this pull request may close these issues.

Korean annotations
