feat(subtitles): color subtitle words based on Anki intervals #813

Merged
killergerbah merged 16 commits into killergerbah:main from ShanaryS:yomitan-anki
Nov 29, 2025
Conversation

@ShanaryS
Collaborator

@ShanaryS ShanaryS commented Oct 29, 2025

fixes #193
fixes #789

This uses the Yomitan API discussed in the issues. It creates a new tab in the UI called Dictionary where all of these settings are stored. They could all be moved into the existing sections, but that seemed like it would be confusing.


Design:

  1. Call AnkiConnect/Yomitan in a way that avoids CORS issues for the extension
  2. Tokenize text with Yomitan
  3. Treat symbols, punctuation, and numbers as known (i.e. the token doesn't contain a Unicode letter-class character)
  4. Lookup tokens in Anki using word fields
    • Use lemmatized token from Yomitan only if inflected version is uncollected
  5. Lookup tokens in Anki using sentence fields only if word fields are uncollected
    • Tokenize the sentence from Anki and match by tokens to reduce false positives
    • Use lemmatized token from Yomitan only if inflected version is uncollected
  6. Color words based on Anki stability/intervals
    • Stability will be used if the user has FSRS enabled; otherwise we fall back to intervals
    • Tokens are marked with the highest status if there are multiple conflicting cards
    • Suspended cards can always be treated normally or as a specific status
    • If only some cards are suspended for a given token, choose the status from the unsuspended ones only
  7. Cache all of these steps at the per token and per subtitle event level
    • Poll Anki on an interval and when asbplayer creates/updates a card to trigger a recheck of uncollected tokens
    • Never update colors based on changed review or suspended status; such changes are unlikely and rechecking for them is wasteful
    • Strike through tokens in red if the Anki or Yomitan connection fails; automatically color tokens when the connection resumes
    • App and SidePanel listen for an event for when the colors are updated
  8. Subtitle events are tokenized on the fly for the showingSubtitles and a buffer of future events
    • Prioritize building for showingSubtitles by cancelling previous work, keeping processing responsive and relevant (e.g. when seeking)
  9. Support coloring subtitles from the App without extension being installed
    • SubtitleController: Extension | SubtitlePlayer: App
    • SubtitlePlayer always handles coloring the website or SidePanel, listens for extension requests if in use
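
The token-classification and status-resolution rules in steps 3 and 6 can be sketched as below. This is an illustrative sketch only; the names, thresholds, and status values are hypothetical, not asbplayer's actual API.

```typescript
type Status = 'unknown' | 'young' | 'mature';
type SuspendedMode = 'normal' | Status; // how the user chose to treat suspended cards

interface Card {
    interval: number; // days; stability would be used instead when FSRS is enabled
    suspended: boolean;
}

// Step 3: a token with no Unicode letter-class character is treated as known.
const isSymbolic = (token: string): boolean => !/\p{L}/u.test(token);

const statusFromInterval = (interval: number, matureCutoff = 21): Status =>
    interval >= matureCutoff ? 'mature' : interval > 0 ? 'young' : 'unknown';

// Step 6: prefer unsuspended cards; mark the token with the highest status
// among the candidate cards when multiple cards conflict.
function resolveStatus(cards: Card[], suspendedMode: SuspendedMode = 'normal'): Status {
    const rank: Record<Status, number> = { unknown: 0, young: 1, mature: 2 };
    const unsuspended = cards.filter((c) => !c.suspended);
    const pool = unsuspended.length > 0 ? unsuspended : cards;
    const statuses = pool.map((c) =>
        c.suspended && suspendedMode !== 'normal' ? suspendedMode : statusFromInterval(c.interval)
    );
    return statuses.reduce<Status>((a, b) => (rank[a] >= rank[b] ? a : b), 'unknown');
}
```

For example, a token backed by a 30-day card and a 2-day card resolves to mature, but if the 30-day card is suspended, only the unsuspended 2-day card is considered and the token resolves to young.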

UI/Options

  • Using naming that is conscious of future expansion
  • Ability to customize per track (similar to subtitle appearance)
  • Ability to customize the color and line thickness
  • Add options on how to treat suspended cards
  • Ability to control how inflections and lemmas are handled
  • Allow searching multiple word/sentence fields for known status
    • Read fields from Anki and present as autofill multi selection dropdown
  • Ability to change card interval mature cutoff
  • Ability to use a wide variety of coloring options
    • Ability to disable applying style if mature
  • Machine translation of options
  • Hide Dictionary tab on previous versions?

Next PR - Using IndexedDB:

We currently cannot match if the subtitle token is inflected while the card in Anki is inflected differently. For example, the subtitle contains standing but the user only has an Anki card with stood. By parsing the Anki fields ahead of time and storing them in IndexedDB, we can also get their lemmas, allowing us to match using the base form stand. Using IndexedDB will also allow much faster lookups since we won't need to query Anki in real time.

IndexedDB also allows us to manually mark words as known without adding them to Anki. Users will also be able to import known words easily. Users can use these words without Anki, or use both, where the manual/imported ones take priority.

The structure will likely use three "tables": token_local, token_anki_word, and token_anki_sentence.
Each of these "tables" will have three "columns": lemma | status | inflections,
where inflections is JSON with key-value pairs of inflection: status.

An example entry:

| Lemma | Status | Inflections |
| --- | --- | --- |
| run | 1 | {"running": 2, "ran": 3} |

The Lemma "column" will be indexed and will be how lookups are performed. When we tokenize a subtitle event, each token will also be lemmatized and then looked up against the database. Users will be able to choose how inflections or lemmas are used for known status. I'll stop here with the details, but I already have a good idea of how to structure and use the data for this task.
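
The record shape and lookup described above might look roughly like this, using a plain Map in place of an IndexedDB object store indexed on the lemma. All names here are hypothetical sketches of the planned design, not existing code.

```typescript
interface KnownEntry {
    lemma: string;
    status: number; // lemma-level status
    inflections: Record<string, number>; // e.g. { running: 2, ran: 3 }
}

// Stand-in for the IndexedDB store, keyed by the indexed lemma "column".
const store = new Map<string, KnownEntry>();
store.set('run', { lemma: 'run', status: 1, inflections: { running: 2, ran: 3 } });

// Look up a subtitle token: try the exact inflection first, then fall back
// to the lemma-level status (mirroring "use the lemmatized token only if the
// inflected version is uncollected").
function lookup(token: string, lemma: string): number | undefined {
    const entry = store.get(lemma);
    if (!entry) return undefined;
    return entry.inflections[token] ?? entry.status;
}
```

So lookup('ran', 'run') resolves through the inflection map, lookup('runs', 'run') falls back to the lemma-level status, and a lemma with no entry stays uncollected.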

Future PRs:

  • Image subtitles?
  • Keybind to hide/show colors
  • Generate a comprehension score based on known words
  • Auto pause after uncollected word
  • Automatically mine all uncollected words with single click

@ShanaryS ShanaryS force-pushed the yomitan-anki branch 9 times, most recently from 7155a24 to ec3818c Compare October 31, 2025 17:56
@NovaKing001

Hey, this is awesome! I’m the one who originally requested this. I’m not too familiar with coding, but I do have a few requests.

Correct me if I’m wrong, but I see that you’re coloring the words based on Anki intervals. Wouldn’t it be better if the user could upload a TSV file where each word is listed with a value assigned according to its maturity level—similar to how LUTE does it?

For example:

| Term | Status |
| --- | --- |
| 別れ | 4 |
| 別れる | 4 |
| 利いた | 1 |
| 利く | 1 |
| 利用 | 5 |
| | 1 |
| 制度 | 1 |

Also, I think it would be useful to have an “ignore” status for any words the user wishes to exclude, such as fictional words or names.

As for changing a word’s maturity level instantly, I’d like it to work similarly to LUTE, where you can hover over a word and press a number key. For example, pressing “1” would turn an uncollected word into an unknown word.

I’d love to contribute to this request as much as possible. If you need ideas, just ask! Thank you!

@ShanaryS ShanaryS force-pushed the yomitan-anki branch 2 times, most recently from 0e61878 to 5da3281 Compare November 1, 2025 07:42
@ShanaryS
Collaborator Author

ShanaryS commented Nov 1, 2025

Correct me if I’m wrong, but I see that you’re coloring the words based on Anki intervals. Wouldn’t it be better if the user could upload a TSV file where each word is listed with a value assigned according to its maturity level—similar to how LUTE does it?

For 99% of users, no. Most will much prefer the automatic fetching and real-time updates from their existing Anki connection to asbplayer. I think your suggestion has value and I'd be interested in implementing it, but that's up to @killergerbah. It should be very straightforward: it would just replace Anki in this workflow, which would only be a few lines of code, and the option would slot in seamlessly. But either way, this PR is big enough as it is; I expect this review will come with a lot of changes and discussion.

Also, I think it would be useful to have an “ignore” status for any words the user wishes to exclude, such as fictional words or names.

I plan to with Manually mark words as known for ASBPlayer (overrides Anki). Will not be in this PR. I know @killergerbah had some discussions with others before that storing this data will take some planning. I personally think any solution is acceptable as long as it's exported with the backup.

As for changing a word’s maturity level instantly, I’d like it to work similarly to LUTE, where you can hover over a word and press a number key. For example, pressing “1” would turn an uncollected word into an unknown word.

Would probably be implemented at the same time as manually marking words.

I’d love to contribute to this request as much as possible. If you need ideas, just ask! Thank you!

Please comment any other ideas that you have. The more discussion that happens now the easier it will be to plan for the future.

@NovaKing001

For 99% of users, no. Most will much prefer the automatic fetching and real-time updates from their existing Anki connection to asbplayer. I think your suggestion has value and I'd be interested in implementing it, but that's up to @killergerbah. It should be very straightforward—it would just replace Anki in this workflow, which would only be a few lines of code, and the option would slot in seamlessly. But either way, this PR is big enough as it is; I expect this review will come with a lot of changes and discussion.

I do think having Anki integration is great, but I’ll leave this anecdote for future consideration and/or implementation.

Not every known word in my target language is in my Anki deck. I have around 11k known words, but only about 250 cards are in Anki—mostly because I’ve deleted and recreated decks over the years. Some words were never even added to Anki due to repeated exposure through reading multiple books.

Thank you for this PR! I’m really looking forward to its development. I’ll stay in touch with any suggestions I might have.

@ShanaryS ShanaryS marked this pull request as ready for review November 1, 2025 15:52
@artjomsR
Contributor

artjomsR commented Nov 2, 2025

@NovaKing001 #770 adds new functionality in the next release that will allow bulk-adding cards, so that might help

Actually, a question from me - does it take into account whether the card is suspended or not? (E.g. I mark a card as suspended after a while to mark it as "known" in my Anki collection)

@ShanaryS
Collaborator Author

ShanaryS commented Nov 2, 2025

Actually, a question from me - does it take into account whether the card is suspended or not? (E.g. I mark a card as suspended after a while to mark it as "known" in my Anki collection)

Kind of. It's based on the card's interval, so if it's above 21 (or the value you set) it will be marked as known. But it would be easy to add an option for it, like this:

Treat suspended Anki cards as:

  • Normal
  • Mature
  • Young
  • Unknown

@killergerbah
Owner

Thanks @ShanaryS looks like a huge body of work. I'll see if I can take a look by this weekend.

Regarding the suggestions above:
Yeah I agree that Anki already provides a natural way to know the word's "maturity." I'm not sure how most normal users would be able to come up with a text file representing the same information but I could be wrong about this if that's how other software works.

Also agree that it makes sense to defer "mark word known" to a later change. Wonder if there's a clean solution that uses Anki so that all the data is one place.

@ShanaryS ShanaryS force-pushed the yomitan-anki branch 2 times, most recently from 0f21ee4 to f48f325 Compare November 5, 2025 03:39
@ShanaryS
Collaborator Author

ShanaryS commented Nov 5, 2025

Also agree that it makes sense to defer "mark word known" to a later change. Wonder if there's a clean solution that uses Anki so that all the data is one place.

The only way I can think of is to create a deck that asbplayer will add cards to, as there is no way to add a card without it being in a deck. But it would just sit in the user's collection, which is probably not ideal. We could also add cards to the mining deck as suspended, but that might interfere with users' workflows.

I think the best option is just to use localStorage or IndexedDB. We only need two things per word: Word: Integer. Realistically, it would be at most ~20 bytes for a single key-value pair, which is only 200 KB for 10,000 words. This would also be the same cache used when a user uploads a text file with known words, so that implementation is free. I think as long as we export it (possibly to a separate JSON), there is no real concern.
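
As a quick sanity check on that size estimate, a word-to-status map can be serialized the way a backup export might be. The key names here are synthetic placeholders; real words would vary in length, but the order of magnitude holds.

```typescript
// Build a hypothetical known-words map of 10,000 entries.
const knownWords: Record<string, number> = {};
for (let i = 0; i < 10_000; i++) {
    knownWords[`word${i}`] = (i % 5) + 1; // illustrative status 1-5
}

// Serialize as a backup export would; each `"wordN":S` entry is only a
// handful of bytes, so the whole export lands around 126 KB here,
// comfortably within the ~200 KB ballpark estimated above.
const exported = JSON.stringify(knownWords);
console.log(`${(exported.length / 1024).toFixed(1)} KB for 10,000 words`);
```
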

@NovaKing001

I'm not sure how most normal users would be able to come up with a text file representing the same information but I could be wrong about this if that's how other software works.

Programs like LUTE, LingQ, Migaku, and Bunpro allow users to export their known vocabulary. This gives users the ability to migrate their progress between platforms.

It doesn’t necessarily need to be a text file. For instance, Migaku has a feature where users can simply copy and paste their known word list.

Of course, this could be an issue for intermediate learners who rely exclusively on Anki to learn vocabulary. I would suggest keeping the known words based on Anki intervals while also adding an option to import words directly from Anki.

An example from Migaku (courtesy of Jouzu Juls from YouTube)


Wonder if there's a clean solution that uses Anki so that all the data is one place.

The only other way I see, other than creating a deck where all your info is stored, is to create an Anki add-on where all that information is handled. However, that would be a major undertaking and quite a hassle to implement, which is why I'd rather go for the solution I mentioned previously.

Here's a quick concept I made that would allow the user to upload their known words list while also being able to import words from Anki.

I have some other ideas, but I think it would be better to create a separate issue that outlines them all and keep this PR focused on implementing the Yomitan API. Thank you!

@killergerbah
Owner

killergerbah commented Nov 6, 2025

@NovaKing001 Thanks, I see now that we would benefit from being able to import word lists from other platforms. And also be able to export our own from asbplayer or Anki. I think the main decision to be made is where we store this word data, if not in Anki. @ShanaryS suggested IndexedDB or local storage. Of those two I would prefer IndexedDB, but I'm still not sure if I prefer that against having our own backend.

@ShanaryS I'm just starting to read the code so let me know if I'm misunderstanding anything. I just want to give some high-level design feedback as early as possible.

  • I see that color state is maintained by the extension. I think users should be able to use this feature from the app without the extension, and without having to load subtitles via the extension. For example, the asbplayer app can easily be used standalone to load both subtitles and/or video. I understand that your approach avoids CORS issues, but given the choice between installing the extension and configuring AnkiConnect settings to use this feature from the app, I think it's less friction to just configure AnkiConnect. Some suggestions for how this could work:
    • The coloring logic can be extracted into the common workspace so that it can be used by both the app and extension independently.
    • If the subtitles originate from the app side the app should color the subtitles. If they originate from the extension side, the extension should color the subtitles. You'll notice that we have some special logic triggered in Player.tsx depending on whether the subtitles are coming from extension or not. Similar logic could be used to implement the decision above.
  • I see that colors are requested separately from the subtitles themselves. Since the SubtitleModel already has an additional field coloredText I think the code will be simplified if the only state to maintain is the subtitles list itself. Which is to say, subtitles and the additional color state can be kept together without passing each one around separately.

@ShanaryS
Collaborator Author

ShanaryS commented Nov 7, 2025

Of those two I would prefer IndexedDB, but I'm still not sure if I prefer that against having our own backend.

I've read your blog post about adding a backend to asbplayer. The features it would allow in a single package would be nice, but I'm not sure there is a huge market for it. To me, the biggest benefit of FOSS is that it's community-driven and builds on top of and with each other. IMO there are not a lot of people who are willing to mine words and review them with flashcards but consider it a dealbreaker if it's not all in one app.

Eventually it would be nice to see, but I think such a thing is probably years away with many hours of work ahead. I think in the short term, being able to mark/import words as known and save them locally without adding them to Anki would capture most of the value in the meantime.

@ShanaryS ShanaryS force-pushed the yomitan-anki branch 2 times, most recently from f99c299 to b820d9f Compare November 7, 2025 16:14
@ShanaryS ShanaryS force-pushed the yomitan-anki branch 2 times, most recently from 8879358 to 37f1ffd Compare November 24, 2025 23:39
Owner

@killergerbah killergerbah left a comment


@ShanaryS I'm not going to request any more changes. I'll plan to merge in the next few days. I think there are some open questions still:

  • Should we disable sentence field targeting until we have a solution that's fast enough? A limit of 100 actually still seems too high. I've been waiting 15 minutes and haven't seen any subtitle get colorized.
  • I think even if we start using IndexedDB, tokenizing an entire deck might take hours. At least on my computer, it can take seconds per tokenize request. We'll need a way to tokenize sentences very quickly. Maybe we could concatenate sentences together...

@ShanaryS
Collaborator Author

Should we disable sentence field targeting until we have a solution that's fast enough? A limit of 100 actually still seems too high. I've been waiting 15 minutes and haven't seen any subtitle get colorized.

It should never take that long for a single color; maybe something else is wrong. But I reduced it to a max of 10 now. I think it's better to keep it with a low limit than to remove it. IndexedDB will completely solve this at runtime, but it will still persist (though manageably) when building.

I think even if we start using IndexedDB, tokenizing an entire deck might take hours. At least on my computer, it can take seconds per tokenize request. We'll need a way to tokenize sentences very quickly. Maybe we could concatenate sentences together...

I've tried pretty much everything to improve it. Sending a single request with the concatenated text is ~3% slower and we lose the realtime updates from individual requests. I've also done numerous changes on Yomitan's side and the only real gain is using an LRU cache. If it's taking seconds, then I imagine your sentence fields have very long sentences? It's usually around 100-400ms for me.

Even if it takes hours to fully build the Anki status, it should be manageable as we only need to do it the first time. We wouldn't need a persistent Anki connection either and would easily be able to resume/background the work. Then we only have to do work on new/edited cards; review or suspension status won't affect this and would be a quick update for the changed cards. When in use, the only time spent will be tokenizing/lemmatizing the subtitles. It will take minutes for a 3-hour movie, but we only need to do it all at once if we are calculating statistics.

Overall I think the final user experience will be perfectly acceptable and well understood by users. They only need the first-time setup once, and all future updates will be done in seconds. The subtitles are colored significantly faster than real time; users only need to wait if we implement statistics (we could even have it update live, as the value at 10% completion is likely similar to 100%).

I'm not going to request any more changes. I'll plan to merge in the next few days

I'll likely do a final review tomorrow. If I have anything else I'll let you know otherwise it's good from my side.

@ShanaryS
Collaborator Author

I've been waiting 15 minutes and haven't seen any subtitle get colorized.

There was a race condition when updating the subtitles, but it should only happen when there is a config error (e.g. no Anki/Yomitan). If lots of colors changed quickly, none of the colors got updated for SubtitlePlayer. This may not be what happened, but I don't think it should ever take more than a couple of seconds for the first one.

@killergerbah
Owner

killergerbah commented Nov 27, 2025

@ShanaryS It doesn't seem that surprising to me. At the previous cap of 100, you could query potentially 100 cards for each common word in a subtitle. Then you would need to tokenize each of those 100 cards. On my computer, it takes on average 5 seconds to tokenize the sentence of one card. That's (100 cards) * (N common words in sentence) * (5 seconds) ~ 500N seconds for a single sentence. Of course I'm assuming the Yomitan cache gets missed every time but with enough cards there would be a lot of misses.

I might be oversimplifying a bit, but my tokenize latencies look like this:


@killergerbah
Owner

By the way, I'll plan to merge by tomorrow morning which is Saturday for me.

@ShanaryS
Collaborator Author

I've actually made some more improvements to tokenize. It's now about 2.5x faster and also runs in parallel now. So depending on where the bottleneck is for you, it might actually not be that long.

@killergerbah
Owner

I've tried pretty much everything to improve it. Sending a single request with the concatenated text is ~3% slower and we lose the realtime updates from individual requests. I've also done numerous changes on Yomitan's side and the only real gain is using an LRU cache. If it's taking seconds, then I imagine your sentence fields have very long sentences? It's usually around 100-400ms for me.

I might have installed too many dictionaries. My sentences aren't that long. I'll have to try experimenting later. Here's one that took 9 seconds:

{text: "誇り高き戦士であるこの私がかわいいとか、そんな浮ついた気持ちになったりしないんだからな", scanLength: 16}

Even if it takes hours to fully build the Anki status, it should be manageable as we only need to do it the first time. We wouldn't need a persistent Anki connection either and would easily be able to resume/background the work. Then we only have to do work on new/edited cards; review or suspension status won't affect this and would be a quick update for the changed cards. When in use, the only time spent will be tokenizing/lemmatizing the subtitles. It will take minutes for a 3-hour movie, but we only need to do it all at once if we are calculating statistics.

Overall I think the final user experience will be perfectly acceptable and well understood by users. They only need the first-time setup once, and all future updates will be done in seconds. The subtitles are colored significantly faster than real time; users only need to wait if we implement statistics (we could even have it update live, as the value at 10% completion is likely similar to 100%).

Yeah, agreed; if we can solve IndexedDB, everything should be good. As a last resort we could always replace the Yomitan API with anything else that implements lemmatize and tokenize.

@ShanaryS
Collaborator Author

I might have installed too many dictionaries. My sentences aren't that long. I'll have to try experimenting later. Here's one that took 9 seconds:

This took 333ms for me which lines up with this kind of length. I have 20 dictionaries enabled and back when I tested it weeks ago it didn't make much of a difference.

I've also made enough improvements to Yomitan to get the tokenize + lemmatize from 5m18s to 2m5s for a 3 hour movie subtitle. I don't think there is much else to do without true multi-threading or an algorithm change. I ideally would have liked 30s or less but this is much more palatable.

@JSchoreels

I've tried pretty much everything to improve it. Sending a single request with the concatenated text is ~3% slower and we lose the realtime updates from individual requests. I've also done numerous changes on Yomitan's side and the only real gain is using an LRU cache. If it's taking seconds, then I imagine your sentence fields have very long sentences? It's usually around 100-400ms for me.

I might have installed too many dictionaries. My sentences aren't that long. I'll have to try experimenting later. Here's one that took 9 seconds:

{text: "誇り高き戦士であるこの私がかわいいとか、そんな浮ついた気持ちになったりしないんだからな", scanLength: 16}

Even if it takes hours to fully build the Anki status, it should be manageable as we only need to do it the first time. We wouldn't need a persistent Anki connection either and would easily be able to resume/background the work. Then we only have to do work on new/edited cards; review or suspension status won't affect this and would be a quick update for the changed cards. When in use, the only time spent will be tokenizing/lemmatizing the subtitles. It will take minutes for a 3-hour movie, but we only need to do it all at once if we are calculating statistics.
Overall I think the final user experience will be perfectly acceptable and well understood by users. They only need the first-time setup once, and all future updates will be done in seconds. The subtitles are colored significantly faster than real time; users only need to wait if we implement statistics (we could even have it update live, as the value at 10% completion is likely similar to 100%).

Yeah agreed, if we can solve IndexedDb everything should be good. As a final resort we could always replace Yomitan API with anything else that implements lemmatize and tokenize.

Locally I'm running Yomitan with MeCab installed and I can tokenize huge documents in a matter of seconds.


Basically we're talking ~2-3 ms per block (~128 chars for now).

If we compare the "simple" and "mecab" tokenizers inside Yomitan, we can process the full Oppenheimer subtitle file in about 6s instead of around 1.5 minutes (by bulking):

# SIMPLE
Summary:
  total blocks processed: 11
  total subtitle entries: 3243
  total API time: 95978.8 ms
  avg per block : 8725.3 ms
  wall-clock     : 95994.0 ms
  overall avg ratio (subtime/proc): 111.922
# MECAB
Summary:
  total blocks processed: 11
  total subtitle entries: 3243
  total API time: 601.2 ms
  avg per block : 54.7 ms
  wall-clock     : 617.1 ms
  overall avg ratio (subtime/proc): 17575.056

The main difference is that Yomitan's simple tokenizer explores the text from left to right, brute-forcing its way through all potential conjugations. Instead, MeCab already gives me something like this:

もう一度、聞くわ。──どうして私を、『嫉妬の魔女』の名で呼ぶの
もう一度	副詞,一般,*,*,*,*,もう一度,モウイチド,モーイチド
、	記号,読点,*,*,*,*,、,、,、
聞く	動詞,自立,*,*,五段・カ行イ音便,基本形,聞く,キク,キク
わ	助詞,終助詞,*,*,*,*,わ,ワ,ワ
。	記号,句点,*,*,*,*,。,。,。
─	記号,一般,*,*,*,*,─,─,─
─	記号,一般,*,*,*,*,─,─,─
どうして	副詞,一般,*,*,*,*,どうして,ドウシテ,ドーシテ
私	名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
、	記号,読点,*,*,*,*,、,、,、
『	記号,括弧開,*,*,*,*,『,『,『
嫉妬	名詞,サ変接続,*,*,*,*,嫉妬,シット,シット
の	助詞,連体化,*,*,*,*,の,ノ,ノ
魔女	名詞,一般,*,*,*,*,魔女,マジョ,マジョ
』	記号,括弧閉,*,*,*,*,』,』,』
の	助詞,連体化,*,*,*,*,の,ノ,ノ
名	名詞,一般,*,*,*,*,名,ナ,ナ
で	助詞,格助詞,一般,*,*,*,で,デ,デ
呼ぶ	動詞,自立,*,*,五段・バ行,基本形,呼ぶ,ヨブ,ヨブ
の	助詞,終助詞,*,*,*,*,の,ノ,ノ

You need to interpret that a little bit, but quite simple post-processing of MeCab's output gets usable results:

const shouldMerge = (
    // 助動詞 or 動詞-接尾 (but not after 記号)
    ((tokenPos === '助動詞' || (tokenPos === '動詞' && tokenPos2 === '接尾')) && last_token.pos !== '記号') ||
    // て/で particle after verb
    (tokenPos === '助詞' && tokenPos2 === '接続助詞' && (term === 'て' || term === 'で') && last_token.pos === '動詞')
);
if (shouldMerge) {
    line.pop();
    term = last_token.term + term;
    reading = last_token.reading + reading;
    source = last_token.source + source;
}

https://github.com/yomidevs/yomitan/blob/9701ef241b29d23e0ed96d77ad9ccae4f628fc6c/ext/js/comm/mecab.js#L226-L236

This then gives you something like this

Testing Parsing for sentence: この世界の片隅に
mecab      : この|世界|の|片隅|に
simple     : この世|界|の|片隅|に

Testing Parsing for sentence: ぐらい上目遣いで言った方がやる気出るぜ?
mecab      : ぐらい|上目遣い|で|言った|方|が|やる気|出る|ぜ|?
simple     : ぐらい|上目遣い|で|言った|方|がや|る|気|出る|ぜ|?

Testing Parsing for sentence: 奇襲でもされたときに君が真っ先にやられると全滅確定
mecab      : 奇襲|で|も|された|とき|に|君|が|真っ先|に|やられる|と|全滅|確定
simple     : 奇襲|でも|された|ときに|君|が|真っ先に|やられる|と|全滅|確定

Testing Parsing for sentence: でも、頑張って
mecab      : でも|、|頑張って
simple     : でも|、|頑張って

Testing Parsing for sentence: そっかそっか。ならま、いいんじゃねーかな
mecab      : そっ|か|そっ|か|。|なら|ま|、|いい|ん|じゃ|ねー|か|な
simple     : そっか|そっか|。|なら|ま|、|いいん|じゃねー|かな

Testing Parsing for sentence: そっかそっか
mecab      : そっ|か|そっ|か
simple     : そっか|そっか

Testing Parsing for sentence: もう一度、聞くわ。──どうして私を、『嫉妬の魔女』の名で呼ぶの
mecab      : もう一度|、|聞く|わ|。|─|─|どうして|私|を|、|『|嫉妬|の|魔女|』|の|名|で|呼ぶ|の
simple     : もう一度|、|聞く|わ|。──|どうして|私|を|、『|嫉妬|の|魔女|』|の|名|で|呼ぶ|の

Testing Parsing for sentence: 立ち止まった少女に人混みをかき分けて歩み寄り
mecab      : 立ち止まった|少女|に|人混み|を|かき分けて|歩み寄り
simple     : 立ち止まった|少女|に|人混み|を|かき分けて|歩み寄り

Testing Parsing for sentence: Jonathanです
mecab      : Jonathan|です
simple     : Jonathan|です

So you see small differences, like how そっか becomes そっ|か, but that doesn't prevent the user from still getting そっか as a lookup result (or when we do a searchTerms on it).

I'm currently also using this branch locally and have plugged my asbplayer into this Yomitan branch. Unfortunately, to make this work on the main branch, this PR needs to be merged first: yomidevs/yomitan-mecab-installer#11, since the MeCab integration is just not working at all at the moment.

Once this one is merged, I could then propose my previous PR to Yomitan itself.

Note that if this is too heavy a setup (which it is, in my opinion), similar results could be obtained by directly embedding tokenizer libraries like kuromoji (https://github.com/takuyaa/kuromoji.js) in asbplayer. Lookups would still have to go to Yomitan of course, since the dictionaries are there, but /tokenize doesn't need them if you use a tokenizer like MeCab/kuromoji.

For now, I'm already happy doing that on my own build, but if it's something that interests you or you want some help trying to integrate things like that in the future, feel free to ping me :)

@killergerbah killergerbah merged commit 24a1d1a into killergerbah:main Nov 29, 2025
1 check passed
@killergerbah
Owner

@JSchoreels I see, so MeCab is way faster. Hope to see your work merged soon. Also hoping that that's the problem I'm experiencing. I'm seeing a wide distribution of latencies on my laptop, which makes me think that there's another problem (besides Yomitan) as well.


Development

Successfully merging this pull request may close these issues.

  • Add communication to Yomitan via the Yomitan API
  • Tokenization/lemmatization integration
