Feature/mecab [/tokenize] support for mecab by JSchoreels · Pull Request #2254 · yomidevs/yomitan

JSchoreels · 2025-11-30T09:28:22Z

Hello,

For people with mecab installed, I've been working on using mecab as another way to /tokenize sentence.

The main benefit is performance, and its non-greediness which helps not "eating" この世界 into この世+界 or がやる into がや+る (see more examples below). There are some examples of simple vs mecab tokenize with this branch

Input : この世界の片隅に
+	mecab      : この|世界|の|片隅|に
-	simple     : この世|界|の|片隅|に
Input : ぐらい上目遣いで言った方がやる気出るぜ？
+	mecab      : ぐらい|上目遣い|で|言った|方|が|やる気|出る|ぜ|？
-	simple     : ぐらい|上目遣い|で|言った|方|がや|る|気|出る|ぜ|？
Input : 奇襲でもされたときに君が真っ先にやられると全滅確定
+	mecab      : 奇襲|で|も|された|とき|に|君|が|真っ先|に|やられる|と|全滅|確定
-	simple     : 奇襲|でも|された|ときに|君|が|真っ先に|やられる|と|全滅|確定
Input : でも、頑張って
+	mecab      : でも|、|頑張って
-	simple     : でも|、|頑張って
Input : そっかそっか。ならま、いいんじゃねーかな
+	mecab      : そっ|か|そっ|か|。|なら|ま|、|いい|ん|じゃ|ねー|か|な
-	simple     : そっか|そっか|。|なら|ま|、|いいん|じゃねー|かな
Input : そっかそっか
+	mecab      : そっ|か|そっ|か
-	simple     : そっか|そっか
Input : もう一度、聞くわ。──どうして私を、『嫉妬の魔女』の名で呼ぶの
+	mecab      : もう一度|、|聞く|わ|。|─|─|どうして|私|を|、|『|嫉妬|の|魔女|』|の|名|で|呼ぶ|の
	simple     : もう一度|、|聞く|わ|。──|どうして|私|を|、『|嫉妬|の|魔女|』|の|名|で|呼ぶ|の
Input : 立ち止まった少女に人混みをかき分けて歩み寄り
+	mecab      : 立ち止まった|少女|に|人混み|を|かき分けて|歩み寄り
-	simple     : 立ち止まった|少女|に|人混み|を|かき分けて|歩み寄り
Input : Jonathanです
+	mecab      : Jonathan|です
-	simple     : Jonathan|です

As you can see, there are a few differences, typically for things like そっか that suddenly gets splitted into two, but this won't break lookups made on the そ of そっか to match it during real lookups, it simply helps the tools using the /tokenize endpoint to now that those are two different entries.

mecab normally also split things like ます, た form... But I added some logic to make it as close as possible to the existing tokenizer (minus the greediness).

Mecab output :

+もう一度、聞くわ。──どうして私を、『嫉妬の魔女』の名で呼ぶの
もう一度	副詞,一般,*,*,*,*,もう一度,モウイチド,モーイチド
、	記号,読点,*,*,*,*,、,、,、
聞く	動詞,自立,*,*,五段・カ行イ音便,基本形,聞く,キク,キク
わ	助詞,終助詞,*,*,*,*,わ,ワ,ワ
。	記号,句点,*,*,*,*,。,。,。
─	記号,一般,*,*,*,*,─,─,─
─	記号,一般,*,*,*,*,─,─,─
どうして	副詞,一般,*,*,*,*,どうして,ドウシテ,ドーシテ
私	名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
、	記号,読点,*,*,*,*,、,、,、
『	記号,括弧開,*,*,*,*,『,『,『
嫉妬	名詞,サ変接続,*,*,*,*,嫉妬,シット,シット
の	助詞,連体化,*,*,*,*,の,ノ,ノ
魔女	名詞,一般,*,*,*,*,魔女,マジョ,マジョ
』	記号,括弧閉,*,*,*,*,』,』,』
の	助詞,連体化,*,*,*,*,の,ノ,ノ
名	名詞,一般,*,*,*,*,名,ナ,ナ
で	助詞,格助詞,一般,*,*,*,で,デ,デ
呼ぶ	動詞,自立,*,*,五段・バ行,基本形,呼ぶ,ヨブ,ヨブ
の	助詞,終助詞,*,*,*,*,の,ノ,ノ

Handling of merges :

                        const shouldMerge = (
                            // 助動詞 or 動詞-接尾 (but not after 記号)
                            ((tokenPos === '助動詞' || (tokenPos === '動詞' && tokenPos2 === '接尾')) && last_token.pos !== '記号') ||
                            // て/で particle after verb
                            (tokenPos === '助詞' && tokenPos2 === '接続助詞' && (term === 'て' || term === 'で') && last_token.pos === '動詞')
                        );
                        if (shouldMerge) {
                            line.pop();
                            term = last_token.term + term;
                            reading = last_token.reading + reading;
                            source = last_token.source + source;
                        }

Another big perk is how fast it can be to parse huge texts. Even with some optimization like block-tokenizing with the simple endpoint, I was able to parse the full Oppenheimer srt files in about 600ms instead of ~95s with the simple tokenizer.

# SIMPLE
Summary:
  total blocks processed: 11
  total subtitle entries: 3243
  total API time: 95978.8 ms
  avg per block : 8725.3 ms
-  wall-clock     : 95994.0 ms
  overall avg ratio (subtime/proc): 111.922
# MECAB
Summary:
  total blocks processed: 11
  total subtitle entries: 3243
  total API time: 601.2 ms
  avg per block : 54.7 ms
+  wall-clock     : 617.1 ms
  overall avg ratio (subtime/proc): 17575.056

It also allow for near realtime tokenizing sentences, useful for projects like asbplayer or in this case my fork of ebook-reader (ttsu-reader)
https://github.com/user-attachments/assets/3847cd0f-e3b8-41d7-a3e1-e02de35500a5
Each tokenize range between 2-3ms instead of 25-100ms for the simple one on my computer.

To keep things backward compatible, if the parser is not set in the query, I fallback to simple parser. But if you add "parser: mecab", it will use mecab.

Something I could take a look on is to, by default (if no parser are specified), use mecab if the user selected mecab in its user options.

Any thoughts/recommendations are of course welcomed.

Actions ;

To decide : Handling default mecab flag from Yomitan to select the default /tokenize called ?
To fix : Handling of translated dictionnaryes, example :

➜  yomitan-scripts mecab -d ~/workspace/yomitan-mecab-installer/data/unidic-mecab-translate/
出会えた気がしたんだ
出会え	verb,ordinary,*,*,potential,continuative,デアウ,出会う,出会え,デアエ,出会える,デアエル,和,*,*,*,*
た	aux-verb,*,*,*,aux|ta,attributive,タ,た,た,タ,た,タ,和,*,*,*,*
気	noun,common,ordinary,*,*,*,キ,気,気,キ,気,キ,漢,キ濁,基本形,*,*
が	particle,case,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*
し	verb,non independent?,*,*,irregular suru,continuative,スル,為る,し,シ,する,スル,和,*,*,*,*
た	aux-verb,*,*,*,aux|ta,attributive,タ,た,た,タ,た,タ,和,*,*,*,*
ん	particle,acts on the whole phrase,*,*,*,*,ノ,の,ん,ン,ん,ン,和,*,*,*,*
だ	aux-verb,*,*,*,aux|da,terminal,ダ,だ,だ,ダ,だ,ダ,和,*,*,*,*
EOS

…horeels/yomitan into feature/mecab-tokenizer-improvements

JSchoreels · 2025-11-30T20:07:39Z

I've added handling of translated POS like unidic-mecab-translate. There's a small caveat in the fact the "aux-verb" pos1 has been cut to "aux" by the mecab.py script in the https://github.com/yomidevs/yomitan-mecab-installer

I'm pretty satisfied with the results for 40 sentences, 10 I crafted/extracted from books I saw the original tokenize parsing incorrectly some things like がやる into がや+る, and I generated 30 extra sentences. All the cases are in the file attached below, and here the summary :

============================================================
COMPARISON SUMMARY
============================================================

mecab-ipadic vs mecab-unidic:
  Total sentences tested: 40
  Identical results:      31 (77.5%)
  Different results:      9 (22.5%)

mecab (ipadic/unidic) vs scan:
  Total sentences tested: 40
  Identical results:      10 (25.0%)
  Different results:      30 (75.0%)

Full results : tokenize_test.txt

The mecab ipadic/unidic difference are related to granularity of some words for example

もう一度、聞くわ。──どうして私を、『嫉妬の魔女』の名で呼ぶの
もう	adverb,*,*,*,*,*,モウ,もう,もう,モー,もう,モー,和,*,*,*,*
一	noun,numeral,*,*,*,*,イチ,一,一,イチ,一,イチ,漢,*,*,チ促,基本形
度	noun,common,counter?,*,*,*,ド,度,度,ド,度,ド,漢,*,*,*,*

もう一度	副詞,一般,*,*,*,*,もう一度,モウイチド,モーイチド

But this is has to be expected since one dictinonary is more granular than the other.

For the mecab vs scan, sometimes it's just a matter of a view details like punctuations being aggregated with the scan method :

Input : もう一度、聞くわ。──どうして私を、『嫉妬の魔女』の名で呼ぶの
    mecab        : もう一度|、|聞く|わ|。|─|─|どうして|私|を|、|『|嫉妬|の|魔女|』|の|名|で|呼ぶ|の
    scan         : もう一度|、|聞く|わ|。──|どうして|私|を|、『|嫉妬|の|魔女|』|の|名|で|呼ぶ|の

And sometimes it's more about the greediness of the scan method

Input : 会いに行く
    mecab        : 会い|に|行く
    scan         : 会い|に行く

… copula

JSchoreels · 2025-12-10T18:34:37Z

Small update that I'm currently integrating this change with asbplayer latest merged PR killergerbah/asbplayer#813.

I'm still doing my best to achieve as much consistent parsing between ipadic/unidic, but for some stuff I'll have to modify the mecab-api to preserve certain fields to be able to differentiate between だ as a copula or the past tense

よんだ
よん	動詞,自立,*,*,五段・バ行,連用タ接続,よぶ,ヨン,ヨン
だ	助動詞,*,*,*,特殊・タ,基本形,だ,ダ,ダ
EOS
これだ
これ	名詞,代名詞,一般,*,*,*,これ,コレ,コレ
だ	助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ
EOS
これです
これ	名詞,代名詞,一般,*,*,*,これ,コレ,コレ
です	助動詞,*,*,*,特殊・デス,基本形,です,デス,デス

…riginal

NovaKing001 · 2026-02-28T03:15:51Z

Hey was just wondering if this is still being worked on?

JSchoreels · 2026-03-07T21:53:47Z

@NovaKing001

Hello ! I'm using it daily yes, since my own fork of asbplayer and ttsu-reader use it.

I'd say I'm pretty happy with the results as it is right now, even if using mecab as a tokenizer compared to the default method has sometimes drawback, for example tokenizing entries that might not have an entry in the dictionnary used by the user. But this is well compensated by the fact that I rarely encounter any bad parsing for things like particles, conjugations, etc.

It also allow me to tokenize huge content quite quickly, which allowed me to create a sidebar for my fork of ttsu-reader that list all tokens and the numbers of occurrences in a book in a very fast fashion.

There are cons too tho, for example the mecab.py bridge has size limit that means that I can't send enormous epub in one go, which mecab can handle with no problem as a standalone solution.

So yeah, I think it's starting to get mature enough :)

Kuuuube · 2026-03-08T01:58:12Z

This seems fine but I'm not going to give this a test and look to approve while the js tests are still failing.

You can ignore the playwright test fails or link checker but all the others need to pass.

NovaKing001 · 2026-03-08T02:47:16Z

@JSchoreels Thats great! I'll give your fork of ttsu-reader a try, ever since the maintainer of lute got ill, I've been looking for an alternative to an epub reader that can parse while keeping the structure of an epub's images intact.

JSchoreels · 2026-03-08T16:49:14Z

@Kuuuube alright I'll look into those.

Also, I'll probably have to merge another PR for the mecab.py in the https://github.com/yomidevs/yomitan-mecab-installer, there's some fields that were not mapped from some dictionnaries and I had to add those for certain token merging rules !

I'll update this PR and create the other one in the upcoming days !

…enizer-improvements

…horeels/yomitan into feature/mecab-tokenizer-improvements

JSchoreels · 2026-03-09T16:56:16Z

Hello,

Normally npm run test should be good now.

I also created this PR that is a dependency of this one : yomidevs/yomitan-mecab-installer#12

I added a bit more output for the tokenize with mecab, now it returns the lemma and lemma_reading of each token, which would allow clients like asbplayer/ttsu-reader to directly get all the inflection -> lemma with having to re-query Yomitan behind the /tokenize.

It will look something like this :

        const expectedContent = [
            [
                {text: '思', reading: 'おも', lemma: '思い出す', lemmaReading: 'おもいだす'},
                {text: 'い', reading: ''},
                {text: '出', reading: 'だ'},
                {text: 'せなく', reading: ''},
            ],
            [
                {text: 'なった', reading: '', lemma: '成る', lemmaReading: 'なる'},
            ],
        ];

JSchoreels added 2 commits November 21, 2025 18:29

Improves mecab tokenizer

9701ef2

Better handling of auxiliary verbs

2adbdde

JSchoreels requested a review from a team as a code owner November 30, 2025 09:28

JSchoreels added 3 commits November 30, 2025 10:34

Merge branch 'master' into feature/mecab-tokenizer-improvements

5d4bc47

Handling of translated POS like unidic-mecab-translate

1b91851

Merge branch 'feature/mecab-tokenizer-improvements' of github.com:JSc…

8af28a8

…horeels/yomitan into feature/mecab-tokenizer-improvements

JSchoreels added 3 commits November 30, 2025 22:07

Improve handling of copula

9b99202

Improve code structure to handle ipadic/unidic translated

d56f1cc

Add better ta/da handling. IPADIC can't however differentiate it from…

50c1a26

… copula

JSchoreels added 2 commits December 11, 2025 17:33

Improves edge cases

fa181fe

Reduce differences of treatment between unidic translate and unidic o…

4cecea5

…riginal

Kuuuube added kind/enhancement The issue or PR is a new feature or request area/api The issue or PR is related to API between Yomitan and other scripts or applications labels Dec 31, 2025

NovaKing001 mentioned this pull request Jan 19, 2026

feat(dictionary): support local and ignored tokens killergerbah/asbplayer#853

Merged

Make .startWith more robust with ?.

140bd9c

Merge upstream/main

c67ded6

JSchoreels added 3 commits March 9, 2026 14:30

Add lemma, lemma_reading in first subtoken when tokenizing with mecab

8958661

Use lemmaReading instead of lemma_reading in the json

7a8a2e0

Fixes npm run test

36958fe

JSchoreels mentioned this pull request Mar 9, 2026

Name more mecab fields to allow finer rules in Yomitan yomidevs/yomitan-mecab-installer#12

Merged

JSchoreels added 3 commits March 9, 2026 17:52

Merge branch 'master' into feature/mecab-tokenizer-improvements

caa5cd8

Merge remote-tracking branch 'upstream/master' into feature/mecab-tok…

76a76a9

…enizer-improvements

Merge branch 'feature/mecab-tokenizer-improvements' of github.com:JSc…

416290b

…horeels/yomitan into feature/mecab-tokenizer-improvements

Kuuuube approved these changes Mar 9, 2026

View reviewed changes

Kuuuube added this pull request to the merge queue Mar 9, 2026

Merged via the queue into yomidevs:master with commit 78eee11 Mar 9, 2026
19 of 25 checks passed

Kuuuube mentioned this pull request Mar 9, 2026

Add parser option to tokenize api docs yomidevs/yomitan-api#27

Merged

NovaKing001 mentioned this pull request Mar 9, 2026

Add support for mecab parser via the new parser option in yomitan’s api killergerbah/asbplayer#933

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/mecab [/tokenize] support for mecab#2254

Feature/mecab [/tokenize] support for mecab#2254
Kuuuube merged 18 commits intoyomidevs:masterfrom
JSchoreels:feature/mecab-tokenizer-improvements

JSchoreels commented Nov 30, 2025 •

edited

Loading

Uh oh!

JSchoreels commented Nov 30, 2025 •

edited

Loading

Uh oh!

JSchoreels commented Dec 10, 2025

Uh oh!

NovaKing001 commented Feb 28, 2026

Uh oh!

JSchoreels commented Mar 7, 2026 •

edited

Loading

Uh oh!

Kuuuube commented Mar 8, 2026

Uh oh!

NovaKing001 commented Mar 8, 2026

Uh oh!

JSchoreels commented Mar 8, 2026

Uh oh!

JSchoreels commented Mar 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

JSchoreels commented Nov 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JSchoreels commented Nov 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JSchoreels commented Dec 10, 2025

Uh oh!

NovaKing001 commented Feb 28, 2026

Uh oh!

JSchoreels commented Mar 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Kuuuube commented Mar 8, 2026

Uh oh!

NovaKing001 commented Mar 8, 2026

Uh oh!

JSchoreels commented Mar 8, 2026

Uh oh!

JSchoreels commented Mar 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

JSchoreels commented Nov 30, 2025 •

edited

Loading

JSchoreels commented Nov 30, 2025 •

edited

Loading

JSchoreels commented Mar 7, 2026 •

edited

Loading