feat(subtitles): color subtitle words based on Anki intervals#813
killergerbah merged 16 commits into killergerbah:main
Conversation
Hey, this is awesome! I’m the one who originally requested this. I’m not too familiar with coding, but I do have a few requests. Correct me if I’m wrong, but I see that you’re coloring the words based on Anki intervals. Wouldn’t it be better if the user could upload a TSV file where each word is listed with a value assigned according to its maturity level—similar to how LUTE does it? For example:
Also, I think it would be useful to have an “ignore” status for any words the user wishes to exclude, such as fictional words or names. As for changing a word’s maturity level instantly, I’d like it to work similarly to LUTE. I’d love to contribute to this request as much as possible. If you need ideas, just ask! Thank you!
For 99% of users, no. Most will much prefer the automatic fetching and real-time updates from their existing Anki connection to asbplayer. I think your suggestion has value and I'd be interested in implementing it, but that's up to @killergerbah. It should be very straightforward: it would just replace Anki in this workflow, which would only be a few lines of code, and the option would slot in seamlessly. But either way, this PR is big enough as it is; I expect this review will come with a lot of changes and discussion.
I plan to with
Would probably be implemented at the same time as manually marking words.
Please comment any other ideas that you have. The more discussion that happens now, the easier it will be to plan for the future.
I do think having Anki integration is great, but I’ll leave this anecdote for future consideration and/or implementation. Not every known word in my target language is in my Anki deck. I have around 11k known words, but only about 250 cards are in Anki—mostly because I’ve deleted and recreated decks over the years. Some words were never even added to Anki due to repeated exposure through reading multiple books. Thank you for this PR! I’m really looking forward to its development. I’ll stay in touch with any suggestions I might have.
@NovaKing001 #770 there's new functionality added in the next release which will allow bulk adding cards, so that might help. Actually, a question from me: does it take into account whether the card is suspended or not? (E.g. I mark a card as suspended after a while to mark it as "known" in my Anki collection.)
Kind of. It bases it on the card's interval, so if it's above 21 (or the value you set) it will be marked as known. But it would be easy to add an option for it, like this: Treat suspended Anki cards as:
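A rough sketch of how such an option could work, as a pure function. All names, the status values, and the `SuspendedPolicy` type are hypothetical; only the 21-day default threshold comes from this thread:

```typescript
// Hypothetical sketch, not asbplayer's actual API.
type WordStatus = 'unknown' | 'learning' | 'known';
type SuspendedPolicy = 'use-interval' | 'known' | 'unknown' | 'ignore';

function statusFromCard(
    intervalDays: number,
    suspended: boolean,
    policy: SuspendedPolicy,
    knownThreshold = 21 // default mentioned in this thread
): WordStatus | 'ignored' {
    if (suspended) {
        if (policy === 'known') return 'known';
        if (policy === 'unknown') return 'unknown';
        if (policy === 'ignore') return 'ignored';
        // 'use-interval' falls through to the normal interval check
    }
    if (intervalDays >= knownThreshold) return 'known';
    return intervalDays > 0 ? 'learning' : 'unknown';
}
```

The key point is that the suspended flag only overrides the interval when the user asks it to.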
Thanks @ShanaryS, looks like a huge body of work. I'll see if I can take a look by this weekend. Regarding the suggestions above: I also agree that it makes sense to defer "mark word known" to a later change. I wonder if there's a clean solution that uses Anki so that all the data is in one place.
The only way I can think of is to create a deck that asbplayer will add cards to, as there is no way to add a card without it being in a deck. But it would just sit in the user's collection, which is probably not the best. We could also just add it to the mining deck as suspended, but this might interfere with users' workflows. I think the best option is just to use
Programs like LUTE, LingQ, Migaku, and Bunpro allow users to export their known vocabulary. This gives users the ability to migrate their progress between platforms. It doesn’t necessarily need to be a text file. For instance, Migaku has a feature where users can simply copy and paste their known word list. An example from Migaku (courtesy of Jouzu Juls from YouTube)
The only other way I see fit, other than creating a deck where all your info is stored, is to create an Anki addon where all that information is handled. However, that would be a major undertaking and quite a hassle to implement, which is why I'd rather go for the solution I mentioned previously. Here's a quick concept I made that would allow the user to upload their known words list while also being able to import words from Anki. I have some other ideas, but I think it would be better to create a separate issue that outlines them all, keeping this PR focused on implementing the Yomitan API. Thank you!
@NovaKing001 Thanks, I see now that we would benefit from being able to import word lists from other platforms. And also be able to export our own from asbplayer or Anki. I think the main decision to be made is where we store this word data, if not in Anki. @ShanaryS suggested IndexedDB or local storage. Of those two I would prefer IndexedDB, but I'm still not sure if I prefer that against having our own backend. @ShanaryS I'm just starting to read the code so let me know if I'm misunderstanding anything. I just want to give some high-level design feedback as early as possible.
I've read your blog post about adding a backend to asbplayer. The features it would allow in a single package would be nice, but I'm not sure there is a huge market for it. To me, the biggest benefit of FOSS is that it's community driven and builds on top of and with each other. IMO there are not a lot of people who are willing to mine words and review them with flashcards but consider it a dealbreaker if it's not all in one app. Eventually it would be nice to see, but I think such a thing is probably years away with many hours of work ahead. I think in the short term, being able to mark/import words as known and save them locally without adding to Anki would capture most of the value in the meantime.
@ShanaryS I'm not going to request any more changes. I'll plan to merge in the next few days. I think there are some open questions still:
- Should we disable sentence field targeting until we have a solution that's fast enough? A limit of 100 actually still seems too high. I've been waiting 15 minutes and haven't seen any subtitle get colorized.
- I think even if we start using IndexedDB, tokenizing an entire deck might take hours. At least on my computer, it can take seconds per tokenize request. We'll need a way to tokenize sentences very quickly. Maybe we could concatenate sentences together...
It should never take that long for a single color, maybe something else is wrong. But I reduced it to a max of
I've tried pretty much everything to improve it. Sending a single request with the concatenated text is

Even if it takes hours to fully build the Anki status, it should be manageable as we only need to do it the first time. We wouldn't need a persistent Anki connection either, and would easily be able to resume/background the work. Then we only have to do work on new/edited cards; review or suspension status won't affect this and would be a quick update for the changed cards. When in use, the only time spent will be tokenizing/lemmatizing the subtitles. It will take minutes for a 3-hour movie, but we only need to do that all at once if we are calculating statistics. Overall I think the final user experience will be perfectly acceptable and well understood by users. They only need the first-time setup once, and all future updates will be done in seconds. The subtitles are colored significantly faster than real-time; users only need to wait if we implement statistics (we could even have it update live, as the value at 10% completion is likely similar to 100%).
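The "only do work on new/edited cards" idea could be sketched like this. The caching scheme and field names are assumptions, not asbplayer's actual code; Anki does expose a per-card modification timestamp, which is what this relies on:

```typescript
// Hypothetical sketch: cache each card's last-seen modification timestamp,
// and only re-tokenize the cards whose timestamp changed (or are new).
interface CardStamp {
    id: number;
    mod: number; // last-modified time (epoch seconds)
}

function cardsNeedingWork(current: CardStamp[], cache: Map<number, number>): number[] {
    return current
        .filter((card) => cache.get(card.id) !== card.mod)
        .map((card) => card.id);
}
```

Review history and suspension status don't bump the fields we care about, so unchanged cards are skipped entirely on subsequent runs.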
I'll likely do a final review tomorrow. If I have anything else I'll let you know; otherwise it's good from my side.
There was a race condition when updating the subtitles, but it should only happen when there is a config error (e.g. no Anki/Yomitan): if lots of colors changed quickly, none of the colors got updated for SubtitlePlayer. This may not be what happened, but I don't think it should ever take more than a couple of seconds for the first one.
@ShanaryS It doesn't seem that surprising to me. At the previous cap of 100, you could query potentially 100 cards for each common word in a subtitle. Then you would need to tokenize each of those 100 cards. On my computer, it takes on average 5 seconds to tokenize the sentence of one card. That's (100 cards) * (N common words in sentence) * (5 seconds) ~ 500N seconds for a single sentence. Of course I'm assuming the Yomitan cache gets missed every time but with enough cards there would be a lot of misses. I might be oversimplifying a bit, but my
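For reference, that back-of-envelope estimate works out like this (worst case, assuming every Yomitan cache lookup misses):

```typescript
// Numbers taken from the discussion above: up to 100 candidate cards per
// common word, ~5 seconds to tokenize each card's sentence, N common words
// in the subtitle line.
const cardsPerWord = 100;
const secondsPerTokenize = 5;

const worstCaseSeconds = (commonWords: number): number =>
    cardsPerWord * commonWords * secondsPerTokenize;
// One common word: 500 s; three common words: 1500 s (25 minutes) for a
// single subtitle line in the worst case.
```

Even with generous cache hits, this explains why a cap of 100 cards could still stall colorization for a long time.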
By the way, I'll plan to merge by tomorrow morning which is Saturday for me. |
I've actually made some more improvements to tokenize. It's now about 2.5x faster and also runs in parallel now. So depending on where the bottleneck is for you, it might actually not take that long.
I might have installed too many dictionaries. My sentences aren't that long. I'll have to try experimenting later. Here's one that took 9 seconds:
Yeah, agreed. If we can solve IndexedDB, everything should be good. As a final resort we could always replace the Yomitan API with anything else that implements
This took

I've also made enough improvements to Yomitan to get the
Locally I'm running Yomitan with MeCab installed and I can tokenize huge documents in a matter of seconds. Basically we're talking ~2-3ms per block (~128 chars for now). If we take the comparison of the "simple" and "mecab" tokenizers inside Yomitan, we can process the full Oppenheimer subtitle file in about 6s instead of around 1.5 minutes (by bulking). The main difference is that Yomitan's simple tokenizer explores the text from left to right, brute-forcing its way through all potential conjugations. Instead, MeCab already gives me something like this: You need to interpret it a little bit, but it's quite simple processing to get the same results as the simple tokenizer. This then gives you something like this: So you see small differences, like how そっか becomes そっ-か, but that doesn't prevent the user from still getting そっか as a lookup result (or when we do a searchTerms on it). I'm currently also using this branch locally and have plugged my asbplayer into this Yomitan branch. Unfortunately, to make this work for the main branch, this PR needs to be merged first: yomidevs/yomitan-mecab-installer#11, since MeCab integration is just not working at all for the moment. Once that one is merged, I could then propose my previous PR to Yomitan itself. Note that if this is too heavy as a setup (which it is, in my opinion), similar results could be obtained by directly embedding tokenizer libraries like kuromoji (https://github.com/takuyaa/kuromoji.js) in asbplayer. Lookups would still have to be done through Yomitan of course, since the dictionaries are there, but /tokenize doesn't need those if you use a tokenizer like MeCab/kuromoji. For now, I'm already happy doing that on my own build, but if it's something that interests you or you want some help trying to integrate things like that in the future, feel free to ping me :)
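As an illustration of the "quite simple processing" mentioned above, here is a sketch of parsing MeCab-style output into surface/lemma tokens. It assumes MeCab's default IPAdic output format (surface form, a tab, then comma-separated features with the dictionary form in the 7th field); it is not the actual code from the Yomitan branch:

```typescript
// Illustrative parser for MeCab's default output format (IPAdic features).
interface MecabToken {
    surface: string;
    lemma: string;
}

function parseMecabOutput(output: string): MecabToken[] {
    const tokens: MecabToken[] = [];
    for (const line of output.split('\n')) {
        if (line === 'EOS' || line.trim() === '') continue;
        const [surface, features = ''] = line.split('\t');
        const fields = features.split(',');
        // The dictionary (base) form is the 7th feature field; MeCab reports
        // '*' when it has no base form, in which case we keep the surface.
        const base = fields[6] && fields[6] !== '*' ? fields[6] : surface;
        tokens.push({ surface, lemma: base });
    }
    return tokens;
}
```

The lemma is what would then be looked up against the dictionaries, so the asbplayer side only needs the surface-to-lemma mapping.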
@JSchoreels I see, so MeCab is way faster. Hope to see your work merged soon. Also hoping that's the problem I'm experiencing. I'm seeing a wide distribution of latencies on my laptop, which makes me think that there's another problem (besides Yomitan) as well.




fixes #193
fixes #789
This uses the Yomitan API discussed in the issues. It creates a new tab in the UI called `Dictionary` where all these settings are stored. They could all be moved to the existing sections, but that would seem confusing.

Design:

- `showingSubtitles` and a buffer of future events
- `showingSubtitles` by cancelling previous for responsive and relevant work (e.g. seeking)
- `SubtitleController`: Extension | `SubtitlePlayer`: App
- `SubtitlePlayer` always handles coloring the website or SidePanel, listens for extension requests if in use

UI/Options:
- `Dictionary` tab on previous versions?

Next PR - Using `IndexedDB`:

We currently cannot match if the subtitle token is inflected while the card in Anki is inflected differently. For example, the subtitle is `standing` but the user only has an Anki card with `stood`. By parsing the Anki fields ahead of time and storing them in `IndexedDB`, we can also get their lemma, allowing us to match using the base form `stand`. Using `IndexedDB` will also allow much faster lookups since we won't need to use Anki in realtime. `IndexedDB` also allows us to manually mark words as known without adding them to Anki. Users will also be able to import known words easily. Users can use these words without Anki, or use both, where the manual/imported ones take priority.

The structure will likely use 3 "tables" like:
- `token_local`
- `token_anki_word`
- `token_anki_sentence`

Each of these "tables" will have 3 "columns":
`lemma` | `status` | `inflections`
Where `inflections` is a JSON with key-value pairs of `inflection: status`. An example entry:
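The example entry appears to have been lost in formatting; a purely illustrative one, following the `lemma | status | inflections` shape described above (statuses and values are my assumptions, not the PR's actual data), might look like:

```typescript
// Hypothetical row for the English lemma "stand"; the field names mirror the
// "columns" described above, the status values are illustrative.
const exampleEntry = {
    lemma: 'stand',
    status: 'known',
    inflections: {
        standing: 'known',
        stood: 'learning',
    },
};
```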
The `lemma` "column" will be indexed and will be how lookups are performed. When we tokenize a subtitle event, each token will also be lemmatized, then looked up against the database. Users will be able to choose how inflections or lemmas are used for known status. I'll stop here with the details, but I already have a good idea of how to structure and use the data for this task.
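A minimal sketch of that lookup flow, using in-memory `Map`s as stand-ins for the IndexedDB object stores. The priority order (manual/imported entries beating Anki-derived ones) comes from the discussion earlier in this thread; the names and status values are assumptions:

```typescript
type Status = 'unknown' | 'learning' | 'known' | 'ignored';

interface Entry {
    status: Status;
    inflections: Record<string, Status>;
}

// Stand-ins for the token_local / token_anki_* object stores, keyed by lemma.
const tokenLocal = new Map<string, Entry>();
const tokenAnkiWord = new Map<string, Entry>();

// Manually marked/imported words take priority over Anki-derived ones.
function lookup(lemma: string, inflection?: string): Status {
    for (const table of [tokenLocal, tokenAnkiWord]) {
        const entry = table.get(lemma);
        if (!entry) continue;
        if (inflection && entry.inflections[inflection] !== undefined) {
            return entry.inflections[inflection];
        }
        return entry.status;
    }
    return 'unknown';
}
```

In real IndexedDB the lemma would be the key (or an index) on each object store, so each step here becomes a single keyed `get`.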
Future PRs: