feat(dictionary): support local and ignored tokens #853
killergerbah merged 7 commits into killergerbah:main from
Conversation
Btw, one thing I noticed when testing this is that the default underline style makes it hard to tell the boundaries between consecutively underlined tokens. With this feature, I think it becomes more important to know exactly which token will be saved when the user uses the shortcut. Solvable with a highlight effect? Or spacing between consecutively underlined tokens?
I added a highlight for the token, but it doesn't seem to work for the SidePanel. Also, as a general note, the keybinds only work when the player is in focus. I think this is desirable, but it can be unintuitive if the user is multitasking. The highlight effect is always there when hovered, though, even if the player is not in focus.
Hmm, maybe highlight is not ideal then. We could just adjust the styles so that they have boundaries between them - at least those styles are already connected to the settings.
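One way to get boundaries without a highlight is a small gap inserted only between consecutive underlined tokens. A minimal sketch, assuming tokens render as adjacent spans with per-token styles (all names here are hypothetical, not from the PR):

```typescript
// Hypothetical sketch: add a visible boundary between consecutively
// underlined tokens by padding only when two underlined spans touch.
interface TokenStyle {
    textDecoration: string;
    marginRight: string;
}

const tokenStyle = (underlined: boolean, nextUnderlined: boolean): TokenStyle => ({
    textDecoration: underlined ? 'underline' : 'none',
    // Small gap only when the next token is also underlined, so single
    // underlined tokens keep their normal spacing
    marginRight: underlined && nextUnderlined ? '0.15em' : '0',
});
```

This keeps the styling tied to the existing settings-driven token styles rather than adding a new focus-dependent effect.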
```typescript
export enum TokenState {
    IGNORED = 0, // If ever adding more states, they should go last (if adding colors for states, use a separate array from tokenStatusColors indexed by TokenState)
}
```
Is it accurate to say that "ignored" is effectively an alias for "fully known"? Asking just for my own information - I guess this is useful if we want to treat "ignored" differently from "fully known" in the future.
Yes, they are completely equivalent currently. The distinction will become important for statistics, so users can exclude places and names from the known-words calculation, and perhaps for other uses. I opted not to give these their own customizable color because of this, but we can add support for it later (and for any other states we add).
I designed the states as an array on the tokens, even though some may be mutually exclusive. Handling those complexities is left to the code, for flexibility.
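Since states live in an array on the token and exclusivity is enforced in code, the basic manipulation can be sketched like this (`toggleState` is a hypothetical helper, not from the PR; only `TokenState.IGNORED` is from the diff):

```typescript
enum TokenState {
    IGNORED = 0,
}

// Sketch: toggle a state on a token's states array, treating the array
// as a set. Any mutual-exclusion policy between states would be applied
// here in code, not encoded in the data model.
const toggleState = (states: TokenState[], state: TokenState): TokenState[] =>
    states.includes(state) ? states.filter((s) => s !== state) : [...states, state];
```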
```typescript
export enum ApplyStrategy {
```
If the user accidentally saves a local token, is it not possible to undo that operation right now? I noticed REMOVE is not currently used outside of the dictionary DB. If that's true then maybe we can solve it with a token browser UI. I think you mentioned that at some point already.
Oh I see you wrote that in the PR description already, lol
Yeah, the plan for managing tokens is essentially to expose the db to users so they can read and update it directly (where applicable). They type in a string, we search for tokens and lemmas containing that substring, and display the results. It could be organized by source, where local tokens can be deleted or have applicable fields edited, but it would also expose tokens that were processed from Anki. Essentially, think of the Anki browser but for the dictionary db.
This is fairly complex (on the UI/UX side) and doesn't offer any real new functionality. Uncollecting via hover and the clipboard import should be fine until we get around to it.
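The browser described here could boil down to a substring filter over tokens and lemmas. A rough sketch with an illustrative record shape (field names are assumptions, not the actual db schema):

```typescript
// Hypothetical shape of a stored dictionary record.
interface TokenRecord {
    token: string;
    lemma: string;
    source: 'local' | 'anki';
}

// Sketch of the browser query: substring match against token or lemma,
// leaving grouping by source to the UI layer.
const searchRecords = (records: TokenRecord[], query: string): TokenRecord[] =>
    records.filter((r) => r.token.includes(query) || r.lemma.includes(query));
```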
Makes sense, can be for later
I updated all the functions to return objects and am also returning the keys instead of just a count for deletions. This gives more flexibility for future use. I also changed how existingTokens are handled when importing: existing states will now always be preserved exactly, as I think this makes the most sense. If the token already exists, it should be safe to assume that's the state the user wants it in if they recently collected it. Users may end up requesting more control over the import logic, such as a destructive replace or exposing
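The preserve-existing-states import policy described above, together with returning objects and keys rather than counts, can be sketched as follows (record shape and names are illustrative, not the actual implementation):

```typescript
// Illustrative local token record.
interface LocalToken {
    token: string;
    status: number;
}

// Sketch of the import merge: existing records are kept exactly as stored;
// only genuinely new tokens are written. Returns an object with the affected
// keys, mirroring the object-returning style described above.
const mergeImport = (
    existing: Map<string, LocalToken>,
    imported: LocalToken[]
): { added: string[]; skipped: string[] } => {
    const added: string[] = [];
    const skipped: string[] = [];
    for (const item of imported) {
        if (existing.has(item.token)) {
            skipped.push(item.token); // preserve the existing state untouched
        } else {
            existing.set(item.token, item);
            added.push(item.token);
        }
    }
    return { added, skipped };
};
```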
I removed the highlight on hover for now. I think it may be possible to detect whether the element is in focus and only display it then; we can't just use the window focus as the
```typescript
const file = dictionaryDBFileInputRef.current?.files?.[0];
if (file === undefined) return;
await dictionaryProvider.importRecordLocalBulk(
    JSON.parse(await file.text()),
```
Should we allow more flexibility with the file import? As it is, the user would have to be aware of the JSON schema for the token records and also pre-lemmatize everything. Meanwhile, import via clipboard uses one token per line. I think users would expect both import from file and import from clipboard to use the same data format. For file import, we could support both: try a JSON parse first, then fall back to splitting by newline.
If we do this, it would make sense to merge the current file import with the clipboard import dialog, as we would want to re-use the lemmatization logic, etc.
I can take on this change in a different PR if you agree.
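The suggested fallback (JSON first, then one token per line) might look like this sketch (`parseImportFile` and the result shape are hypothetical):

```typescript
// Sketch: try the JSON record schema first; if parsing fails or the result
// isn't an array of records, treat the file as plain text, one token per line.
type ParsedImport =
    | { kind: 'json'; records: unknown[] }
    | { kind: 'lines'; tokens: string[] };

const parseImportFile = (text: string): ParsedImport => {
    try {
        const records = JSON.parse(text);
        if (Array.isArray(records)) return { kind: 'json', records };
    } catch {
        // not JSON; fall through to newline parsing
    }
    return {
        kind: 'lines',
        tokens: text
            .split('\n')
            .map((t) => t.trim())
            .filter((t) => t.length > 0),
    };
};
```

The `lines` branch would then go through the same lemmatization path as the clipboard import.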
Yeah, I think that makes sense. The original goal of the export/import is to be handled by asbplayer only; users would be able to convert to it, with the exception of the lemma. I think giving the option to parse the lemmas makes sense.
I want to clarify: the clipboard import does not split by newline, it parses whatever arbitrary text is there. But the user can delimit words by adding newlines so Yomitan won't see them as one word; we send the text as-is. This means, though, that long text will be slow since we won't be batching, although we could batch by splitting on the newlines ourselves. I don't expect users to paste more than, say, 100 words though.
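Batching by splitting on the newlines ourselves, as mentioned, could be sketched like this (the batch size and function name are illustrative):

```typescript
// Sketch: split long clipboard text into batches of at most maxBatch lines
// before sending to the parser, instead of one large request.
const batchByNewlines = (text: string, maxBatch: number): string[] => {
    const lines = text.split('\n');
    const batches: string[] = [];
    for (let i = 0; i < lines.length; i += maxBatch) {
        batches.push(lines.slice(i, i + maxBatch).join('\n'));
    }
    return batches;
};
```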
```typescript
let status = item.status;
if (item.status == null || item.status < TokenStatus.UNKNOWN) {
    if (!item.states.length) continue; // Status cannot be uncollected unless there is a state
    if (status == null) status = TokenStatus.UNCOLLECTED;
```
status could be undefined as well?
It can't, but if it were it should be treated the same (the loose `== null` check already covers `undefined`).
@ShanaryS Hey, I've been using this PR for the past week and have a few ideas/suggestions.

**Add more shortcuts**
Add shortcuts to set the status of all uncollected words in a subtitle line to either 'unknown' or 'known'; this would help speed up the process of adding words.

**Treat ignore as its own status**
This would allow ignored words to have word readings, useful for languages such as Japanese, where proper names have ambiguous readings. I have a names dictionary installed in Yomitan, and having their readings show up would be a plus.

**Clipboard issues**
I know that you said that you don't expect users to import more than 100 words ... I tried to import 10k known words, and it wouldn't successfully add them. It would preview for a few minutes and allow me to save; however, saving didn't do anything, and no words were actually added. I have to rely on having an Anki deck with my known words marked as suspended to pseudo-import my words.

**Highlighting**
The highlighting feature was great for identifying false positives. Making it a toggle in settings would be great. The more dictionaries you have installed in Yomitan, the weirder the parses get.

**Some characters are not able to have their status changed**
Some characters, such as small よ (ょ), are tokenized but are unable to have their statuses changed.

**Future feature?**
To mitigate false positives, you could manually select the characters and have them be re-tokenized/corrected.

Thank you for your hard work! I probably missed a few things; let me know what you think. Thanks!
**Add more shortcuts**: This seems pretty niche and only gets less relevant as more words are marked.

**Treat ignore as its own status**: From a technical point of view, I don't think it's a good strategy; these flags on the cards should be independent of status. But we can add a toggle to show readings for ignored words.

**Clipboard issues**: Yeah, it's not meant for that big of an import; for that you'd use the import words button. It's meant for single words or sentences, maybe a paragraph. If you split it up yourself you'd be able to use it for that. But I find it hard to believe you have 10k words you can copy that are all the same state. It sounds like you are using Anki for these words, so why do you need to import that many? Also, what is the structure of the text you are pasting? Does it have punctuation, spaces, or line breaks?

**Highlighting**: Yeah, this can be brought back in some way, but some parts of the existing version were unintuitive.

**Some characters are not able to have their status changed**: These all work fine for me and shouldn't be handled any differently; perhaps you are on an outdated branch. The Yomitan parsing for this is incorrect, though.

**Future feature?**: I'm not sure how much value this would bring; if you know the tokenization is incorrect, there is no need to mark it. We could only ever apply the override to that specific subtitle event; we wouldn't allow those characters to, say, always be parsed as a word. Language is too context-sensitive for us to make any parsing decisions on our side, so we will just have to rely on better parsers if this algorithm is insufficient.
@ShanaryS Thank you for responding.

I have over 10,000 words marked, and I still find it tedious trying to mark all the false positives I'm getting. I think I just worded this wrong: what I really meant was adding some shortcuts related to bulk editing. I would hardly call this niche, as Migaku, Lute, and LingQ all have features for bulk editing on the spot without having to exit and access a separate word list. Maybe a shortcut that could highlight all uncollected words, and the user could then choose a status using their status shortcuts. The reason I'm asking for these bulk shortcuts is that when someone inevitably adds dynamic autopausing based on word statuses, it'll autopause on known words that are incorrectly parsed as separate unknown words. It would improve the workflow to be able to quickly bulk edit words without having to hover over each word. Of course, this is all my opinion as someone who has used similar tools in the past; saving a couple of seconds really does improve the user experience.

The words I'm importing are all on newlines, just words, no sentences. As for the import button, I noticed that it's for JSON files. Is it able to accept CSV/txt files? I tried to import a txt file before, and nothing seemed to happen. All of my known words come from LUTE, which can export as either a txt or a csv. I imported these words into Anki to be able to use them in conjunction with asbplayer. As for the weird Yomitan parsing, I have no clue why this is happening; it might be due to the number of dictionaries I have installed. I can send you an srt file and cross-check the parsing. Once again, thank you for adding the Yomitan API. It has greatly improved my workflow and opened up possibilities for many additional features I'd like to discuss in the near future.
Solving this is currently outside the design scope. We rely completely on Yomitan parsing, and overriding it is not currently planned. That's why, even if you notice something off and manually try to fix it, in order for the fix to apply to the next subtitle or next session we would have to make decisions about language parsing ourselves. The solution is to find alternatives to the standard Yomitan parser (there is a PR for exposing MeCab) that are more accurate. If you are fine with marking the incorrectly parsed words as known, then it's fine. But bulk marking like this won't be a high priority, as there are more pressing things to focus on.

If it needs to be parsed a specific way, then that's probably not going to be high priority. If it's just a text file, as you say, with words (either manually separated or just sentences), then that will be supported. #857 should help with that.
* export/import local tokens
* collect tokens locally based on hovered element
* import words from clipboard
* support ignored
* add more helper text for buildAnkiCache and disable if no tracks are enabled
* highlight token on hover
* return objects for dictionary query results





fixes #178
This allows collecting tokens locally without using Anki or to override statuses set by Anki.
* Status precedence: Local > Anki word > Anki sentence
* `Q+[0-5]` and `Q+I` for collecting and ignoring tokens
* Import words from clipboard (Annotations tab)
* `ensureStoragePersisted()` is used on any user interaction with this feature for additional assurance that the db will not be deleted.
* `VideoPlayer` subtitles are rendered similarly to `SubtitleController`; it will no longer split by line and instead uses `white-space: pre-wrap`

In the future, it would be beneficial to have a way to browse the local words by their token/lemmas, states, status, etc. This would also be a good area to delete tokens; currently users must hover and mark uncollected or use the clipboard import with uncollected.
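The Local > Anki word > Anki sentence precedence can be sketched as a simple priority lookup (names and shapes are illustrative, not the actual implementation):

```typescript
// Sketch: pick the winning status candidate by source precedence,
// where locally collected statuses override Anki-derived ones.
type Source = 'local' | 'anki-word' | 'anki-sentence';

const priority: Record<Source, number> = {
    local: 3,
    'anki-word': 2,
    'anki-sentence': 1,
};

const resolveStatus = <T extends { source: Source }>(candidates: T[]): T | undefined =>
    candidates.reduce<T | undefined>(
        (best, c) => (best === undefined || priority[c.source] > priority[best.source] ? c : best),
        undefined
    );
```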