
feat(dictionary): support local and ignored tokens#853

Merged
killergerbah merged 7 commits into killergerbah:main from ShanaryS:dictionary-local-tokens
Jan 19, 2026

Conversation

@ShanaryS
Collaborator

@ShanaryS ShanaryS commented Jan 9, 2026

fixes #178

This allows collecting tokens locally without using Anki, or overriding statuses set by Anki.

  • Priority for tokens is Local > Anki word > Anki sentence.
  • Local tokens are not per track, unlike Anki tokens (but they are still per profile). This makes it easier for users to switch between tracks and have local tokens behave consistently.
    • If per-track local tokens are desired, they can be trivially added as an opt-in later.
  • Local tokens are mainly collected by hovering over a word and pressing the configured keyboard shortcut.
    • Coloring must be enabled, as it provides the feedback for collection.
    • Default keybinds are Q+[0-5] and Q+I for collecting and ignoring tokens.
  • Local tokens can also be collected using a clipboard import located in the Annotations tab.
    • Users who want to collect without coloring enabled, or to collect a token that was parsed differently, can use this feature.
  • Tokens can be exported and imported for backup. This only applies to local tokens, since Anki tokens can always be regenerated.
    • Imports are purely constructive: they will not delete existing tokens in the db and will use the higher of the existing and imported statuses. The imported states are only used if the token currently has no states.
  • Ignored tokens are treated as fully known; they are used for things such as names or places. Currently there is no special behavior for ignored tokens, but the distinction will be useful when computing statistics.
    • If desired, we can give ignored tokens their own color, but at present I don't see a reason to.
  • The buildAnkiCache button is now disabled when all tracks are disabled, with additional helper text if no Anki fields are configured.
  • ensureStoragePersisted() is used on any user interaction with this feature, for additional assurance that the db will not be deleted.
  • VideoPlayer subtitles are now treated similarly to SubtitleController: they are no longer split by line and instead use white-space: pre-wrap.

In the future, it would be beneficial to have a way to browse the local words by their tokens/lemmas, states, status, etc. This would also be a good place to delete tokens; currently users must hover and mark a token uncollected, or use the clipboard import with the uncollected status.
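The priority order above can be sketched as a simple resolution function. This is an illustrative sketch only, not the actual asbplayer implementation; the type names and the assumption that a missing source is represented by `undefined` are hypothetical.

```typescript
// Hypothetical sketch of the status priority described above:
// Local > Anki word > Anki sentence.
type TokenStatus = number; // e.g. 0 = uncollected .. 5 = known

interface TokenSources {
    local?: TokenStatus;
    ankiWord?: TokenStatus;
    ankiSentence?: TokenStatus;
}

// Return the first defined status in priority order. Nullish coalescing
// keeps a local status of 0 (uncollected) from being skipped.
function resolveStatus({ local, ankiWord, ankiSentence }: TokenSources): TokenStatus | undefined {
    return local ?? ankiWord ?? ankiSentence;
}
```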

@ShanaryS ShanaryS self-assigned this Jan 9, 2026
@ShanaryS ShanaryS added the enhancement New feature or request label Jan 9, 2026
@killergerbah
Owner

Btw, one thing I noticed when testing this is that the default underline style makes it hard to tell the boundaries between consecutively underlined tokens. With this feature, I think it becomes more important to know exactly which token will be saved when the user uses the shortcut. Solvable with a highlight effect? Or spacing between consecutively underlined tokens?

@ShanaryS ShanaryS force-pushed the dictionary-local-tokens branch from de44431 to 911f453 Compare January 12, 2026 23:51
@ShanaryS
Collaborator Author

I added a highlight for the token, but it doesn't seem to work in the SidePanel. Also, as a general note, the keybinds only work when the player is in focus. I think this is desirable, but it can be unintuitive if the user is multitasking. The highlight effect is always shown on hover, though, even if the player is not in focus.

@killergerbah
Owner

Hmm, maybe the highlight is not ideal then. We could just adjust the styles so that they have boundaries between them - at least those styles are connected to the settings already.

}

export enum TokenState {
IGNORED = 0, // If ever adding more states, they should go last (if adding colors for states, use a separate array from tokenStatusColors indexed by TokenState)
Owner

Is it accurate to say that "ignored" is effectively an alias for "fully known?" Asking just for my own information - I guess this is useful if we want to treat "ignored" differently from "fully known" in the future.

Collaborator Author

Yes, they are completely equivalent currently. Their use will become important for statistics, so users can ignore places and names in the known words calculation and perhaps other usages. I opted not to give these their own customizable color due to this but we can add support for it later (and any other states that we add).

I designed the states as an array on the tokens, even though some may be mutually exclusive. Handling those complexities is left to the code, to keep it flexible.

IGNORED = 0, // If ever adding more states, they should go last (if adding colors for states, use a separate array from tokenStatusColors indexed by TokenState)
}

export enum ApplyStrategy {
Owner

If the user accidentally saves a local token, is it not possible to undo that operation right now? I noticed REMOVE is not currently used outside of the dictionary DB. If that's true then maybe we can solve it with a token browser UI. I think you mentioned that at some point already.

Owner

Oh I see you wrote that in the PR description already, lol

Collaborator Author

Yeah, the plan for managing tokens is essentially exposing the db to the users so they can read and update it directly (where applicable). So they type in a string and we search for tokens and lemmas with that substring and display results. It could be organized by source where local tokens can be deleted or have applicable fields editable, but it would also expose what tokens were processed from Anki. Essentially think of the Anki browser but for the dictionary db.

This is fairly complex (on the UI/UX side) and doesn't offer much new functionality. Uncollecting via hover or the clipboard import should be fine until we get around to that.

Owner

Makes sense, can be for later

@ShanaryS
Collaborator Author

ShanaryS commented Jan 16, 2026

I updated all the functions to return objects and also am returning the keys instead of just the count for deletions. This gives more flexibility for future use.

I also changed how existingTokens are handled when importing. Now the existing states will always be preserved exactly, as I think this makes the most sense: if the token already exists, it should be safe to assume that's the state the user wants it in, since they recently collected it. Users may end up requesting more control over the import logic, such as a destructive replace or exposing ApplyStrategy, but for now I think using the highest status and the existing states is the best default.
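The merge semantics described here could be sketched roughly as below. This is a hedged illustration under the assumptions stated in the comment (highest status wins, existing states preserved); the `LocalToken` shape and function name are hypothetical, not the actual asbplayer code.

```typescript
// Illustrative sketch of the constructive import merge described above.
interface LocalToken {
    status: number;   // e.g. a TokenStatus ordinal
    states: number[]; // e.g. TokenState values such as IGNORED
}

function mergeImportedToken(existing: LocalToken | undefined, imported: LocalToken): LocalToken {
    if (existing === undefined) {
        return imported; // brand-new token: take the import as-is
    }
    return {
        status: Math.max(existing.status, imported.status), // highest status wins
        states: existing.states, // existing states are always preserved exactly
    };
}
```

Nothing is ever deleted by an import under this scheme, which keeps it safe to run repeatedly.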

Hmm maybe highlight is not ideal then. Could just adjust the styles so that they have boundaries between - at least those styles are connected to the settings already.

I removed the highlight on hover for now. I think it may be possible to detect whether the element is in focus and only display it then; we can't just use the window focus, as the SubtitlePlayer and Video have separate focuses for the keybinds. In general I don't think it will be a big deal, as users should know what word they are collecting, and we can assume that adjacent words are unlikely to share the same status once users have settled in.

const file = dictionaryDBFileInputRef.current?.files?.[0];
if (file === undefined) return;
await dictionaryProvider.importRecordLocalBulk(
JSON.parse(await file.text()),
Owner

Should we allow more flexibility with the file import? As it is, the user would have to be aware of the JSON schema for the token records, and also pre-lemmatize everything. Meanwhile, import via clipboard uses one token per line. I think users would expect both import from file and import from clipboard to use the same data format. For file import, we could support both - try a JSON parse first, then try splitting by newline.

If we do this, it would make sense to merge the current file import with the clipboard import dialog, as we would want to re-use the lemmatization logic etc.

I can take on this change in a different PR if you agree.
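The suggested fallback could be sketched as below. This is a minimal sketch of the idea only - the function name and the assumed record shape are hypothetical, and the real change would need to route the plain-text branch through the same lemmatization logic as the clipboard import.

```typescript
// Try to parse the file as JSON token records first; otherwise fall back
// to one-token-per-line plain text, matching the clipboard import format.
type ImportedRecords = object[] | string[];

function parseImportFile(text: string): ImportedRecords {
    try {
        const parsed = JSON.parse(text);
        if (Array.isArray(parsed)) {
            return parsed; // assume pre-lemmatized token records
        }
    } catch {
        // not JSON; fall through to plain-text handling
    }
    return text
        .split('\n')
        .map((line) => line.trim())
        .filter((line) => line.length > 0); // these tokens still need lemmatization
}
```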

Collaborator Author

Yeah, I think that makes sense. The original goal of the export/import format was to be handled by asbplayer only; users would be able to convert to it, with the exception of the lemma. I think giving the option to parse the lemmas makes sense.

I want to clarify: the clipboard import does not split by newline, it parses whatever arbitrary text is there. But the user can delimit words by adding newlines so Yomitan won't see them as one word; we send the text as-is. This means, though, that long text will be slow, since we won't be batching - though we could by splitting on the newlines ourselves. I don't expect users to paste more than, say, 100 words though.

let status = item.status;
if (item.status == null || item.status < TokenStatus.UNKNOWN) {
if (!item.states.length) continue; // Status cannot be uncollected unless there is a state
if (status == null) status = TokenStatus.UNCOLLECTED;
Owner

status could be undefined as well?

Collaborator Author

It can't, but if it were it should be treated the same.

@NovaKing001

NovaKing001 commented Jan 19, 2026

@ShanaryS Hey, I've been using this PR for the past week and have a few ideas/suggestions.

Add more shortcuts

Add shortcuts to set the status of all uncollected words in a subtitle line to either 'unknown' or 'known'. This would help speed up the process of adding words.

Treat ignore as its own status

This would allow ignored words to have word readings, which is useful for languages such as Japanese, where proper names have ambiguous readings. I have a names dictionary installed in Yomitan, and having their readings show up would be a plus.

Clipboard issues

I know that you said that you don't expect users to import more than 100 words ... I tried to import 10k known words, and it wouldn't successfully add them. It would preview for a few minutes and allow me to save; however, saving didn't do anything, and no words were actually added. I have to rely on having an Anki deck with my known words marked as suspended to pseudo-import my words.

Highlighting

The highlighting feature was great for identifying false positives. Making it a toggle in settings would be great. The more dictionaries you have installed in Yomitan, the weirder the parses get.
For example, imagine trying to change the status of チャンク, but instead it changes 2チャン. Highlighting would help identify these Yomitan false positives.
image
image

Some characters are not able to have their status changed

Some characters, such as small よ (ょ), are tokenized but are unable to have their statuses changed
image
In the picture above, all of the small characters aren’t able to be changed.

Future feature?

To mitigate false positives, you can manually select the characters and have them be tokenized/corrected
for example
from
image
to
image
This is currently how it works in Lute
Screencast_20260118_235313.webm
It would probably work by segmenting those characters from the subtitle line and sending them separately to Yomitan; however, I don't know how that would work with every instance of those characters. I don't know if it would be feasible/worthwhile. It's probably best to wait for this PR yomidevs/yomitan#2254 to help with Yomitan's false positives/inaccurate parsing.

Thank you for your hard work! I probably missed a few things, let me know what you think. Thanks!

@killergerbah killergerbah merged commit 664ae42 into killergerbah:main Jan 19, 2026
1 check passed
@killergerbah killergerbah added this to the Extension v1.14.0 milestone Jan 19, 2026
@ShanaryS ShanaryS deleted the dictionary-local-tokens branch January 19, 2026 14:20
@ShanaryS
Collaborator Author

@NovaKing001

Add more shortcuts

This seems pretty niche and only gets less relevant as more words are marked.

Treat ignore as its own status

From a technical standpoint I don't think it's a good strategy; these flags on the cards should be independent of status. But we can add a toggle to show readings for ignored words.

Clipboard issues

Yeah, it's not meant for that big of an import - for that you'd use the import words button. It's meant for single words or sentences, maybe a paragraph. If you split it up yourself you'd be able to use it for that. But I find it hard to believe you have 10k words you can copy that are all the same state. It sounds like you are using Anki for these words, so why do you need to import that many?

Also what is the structure of the text you are pasting? Does it have punctuation, spaces, or line breaks?

Highlighting

Yeah, this can be brought back in some way, but some parts of the existing version were unintuitive.

Some characters are not able to have their status changed

These all work fine for me and shouldn't be handled any differently; perhaps you are on an outdated branch. The Yomitan parsing for this is incorrect, though.

To mitigate false positives, you can manually select the characters and have them be tokenized/corrected

I'm not sure how much value this would bring; if you know the tokenization is incorrect, there is no need to mark it. We could only ever apply the override to that specific subtitle event - we won't allow those characters to, say, always be parsed as a word. Language is too context-sensitive for us to make any parsing decisions on our side, so we will just have to rely on better parsers if this algorithm is insufficient.

@NovaKing001

@ShanaryS Thank you for responding

This seems pretty niche and only gets less relevant as more words are marked.

I have over 10,000 words marked, and I still find it tedious trying to mark all the false positives I'm getting. I think I just worded this wrong - what I really meant was adding some shortcuts related to bulk editing. I would hardly call this niche, as Migaku, Lute, and LingQ all have features related to bulk editing on the spot, without having to exit and access a separate word list. Maybe a shortcut could highlight all uncollected words, and the user could then choose a status using their status shortcuts.

The reason I'm asking for these bulk shortcuts is that when someone inevitably adds dynamic autopausing based on word statuses, it'll autopause on known words that are incorrectly parsed as separate unknown words. It will improve the workflow to quickly bulk edit words without having to hover over each word. Of course, this is all my opinion as someone who has used similar tools in the past; saving a couple of seconds really does improve the user experience.

Yeah it's not meant for that big of an import, you use the import words button. It's meant for single words or sentences, maybe a paragraph. If you split it up yourself you be able to use it for that. But I find it hard to believe you have 10k words you can copy and they are all the same state. It sounds like you are using Anki for these words, so why do you need to import that many?

The words I'm importing are each on their own line - just words, no sentences. As for the import button, I noticed that it's for JSON files. Is it able to accept CSV/txt files? I tried to import a txt file before, and nothing seemed to happen. All of my known words come from LUTE, which can export either a txt or a csv. I imported these words into Anki to be able to use them in conjunction with asbplayer.

As for the weird Yomitan parsing, I have no clue why this is happening; it might be due to the number of dictionaries I have installed. I can send you an srt file and cross-check the parsing.

Once again, thank you for adding the Yomitan API. It has greatly improved my workflow and opened up possibilities for many additional features I'd like to discuss in the near future.

@ShanaryS
Collaborator Author

it'll autopause on known words that are incorrectly parsed as separate unknown words

Solving this is currently outside of the design scope. We rely completely on Yomitan parsing, and overriding it is not currently planned. That's why, even if you notice something off and manually try to fix it, having the fix apply to the next subtitle or next session would require us to make decisions about language parsing. The solution is to find alternatives to the standard Yomitan parser (there is a PR for exposing MeCab) that are more accurate.

If you are fine with marking the incorrectly parsed words as known, then it's fine. But bulk marking like this won't be a high priority, as there are more pressing things to focus on.

Is it able to accept CSV/txt files?

If it needs to be parsed in a specific way, then that's probably not going to be a high priority. If it's just a text file, as you say, with words (either manually separated or just sentences), then that will be supported. #857 should help with that.

RonzyOnGIT pushed a commit to RonzyOnGIT/asbplayer that referenced this pull request Feb 2, 2026
* export/import local tokens

* collect tokens locally based on hovered element

* import words from clipboard

* support ignored

* add more helper text for buildAnkiCache and disable if no tracks are enabled

* highlight token on hover

* return objects for dictionary query results
RonzyOnGIT pushed a commit to RonzyOnGIT/asbplayer that referenced this pull request Feb 21, 2026
* export/import local tokens

* collect tokens locally based on hovered element

* import words from clipboard

* support ignored

* add more helper text for buildAnkiCache and disable if no tracks are enabled

* highlight token on hover

* return objects for dictionary query results

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

A function that can mark words we learnt

3 participants