softmatcha/softmatcha2: A fast and soft pattern search for trillion-scale corpora.

About

SoftMatcha 2 is a tool for searching corpora of up to trillion scale. It finds not only exact matches of a pattern but also similar (soft) matches in the corpus data. In our experiments on an AWS instance with more than 6 TB of text data, the median latency was less than 90 milliseconds and the p95 latency was less than 300 milliseconds.

| Rank | Score | #Match | String
---------------------------------------------------
|    1 | 100.0 |    838 | olympics gold medalist
|    2 |  89.1 |    456 | olympics gold medallist
|    3 |  86.4 | 95,816 | olympic gold medalist
|    4 |  84.8 |    500 | olympics silver medalist
|    5 |  82.0 | 22,506 | olympic gold medallist
|    6 |  80.9 |     75 | olympics silver medallist
|    7 |  79.6 |  8,945 | olympic silver medalist
|    8 |  77.1 |  2,409 | olympic silver medallist
|    9 |  75.1 |    128 | olympics , gold medalist
|   10 |  73.1 |      2 | olympics . gold medalist
|   11 |  71.0 |     63 | olympics gold medalists
|   12 |  69.3 | 10,292 | olympic gold medalists
|   13 |  69.2 |      7 | olympics gold champion
|   14 |  69.1 |      1 | olympics gold olympic
|   15 |  68.9 |      7 | olympics silver medalists
|   16 |  67.7 |    160 | olympic gold champion
|   17 |  67.6 |     11 | olympic gold olympic
|   18 |  67.5 |    467 | olympic silver medalists
|   19 |  66.9 |     15 | olympics , gold medallist
|   20 |  65.6 |    196 | paralympics gold medalist
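
The result table above has a regular layout, so it can be post-processed by a script. The following is a minimal Python sketch, assuming the output keeps exactly the pipe-separated columns shown here (Rank, Score, #Match, String). `Hit` and `parse_result_line` are hypothetical names introduced for illustration and are not part of SoftMatcha 2.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    rank: int
    score: float
    num_match: int
    string: str


def parse_result_line(line: str) -> Optional[Hit]:
    """Parse one row such as '|    3 |  86.4 | 95,816 | olympic gold medalist'.

    Returns None for the header row, the dashed separator, or any other line.
    """
    if not line.startswith("|"):
        return None  # dashed separator or unrelated output
    # Split on the first four pipes only, so a '|' inside the matched string survives.
    parts = line.split("|", 4)
    if len(parts) != 5:
        return None
    _, rank, score, num_match, string = parts
    try:
        # Match counts use thousands separators (e.g. "95,816"), so strip the commas.
        return Hit(int(rank), float(score), int(num_match.replace(",", "")), string.strip())
    except ValueError:
        return None  # header row: "Rank", "Score", ... are not numeric
```

Applied line by line to captured output, this yields one `Hit` per table row, which can then be filtered by score or sorted by match count.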

Quick Start

1. Compilation

The first step is to compile the program using the following commands:

# only for the first time using SoftMatcha 2
$ rm -rf ~/gensim-data/glove-wiki-gigaword-300
$ rm -rf ~/.cache/huggingface/hub/models--facebook--fasttext* # If you use languages other than English

# install/compilation
$ curl -LsSf https://astral.sh/uv/install.sh | sh
$ echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
$ source ~/.bashrc
$ sudo apt-get update
$ sudo apt-get install -y build-essential python3.12-dev pkg-config libicu-dev
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
$ source ~/.cargo/env
$ uv sync
$ uv run maturin develop --release --manifest-path rust/Cargo.toml

2. Build Index: `softmatcha-index`

The next step is to build the index with the following command. The final index size is typically about 10x the size of the raw text for small corpora, but less than 3x for larger corpora.

$ uv run softmatcha-index --index [index directory] [text file]

Example:
$ uv run softmatcha-index --index corpus corpus.txt
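
Before indexing a large file, it can help to check whether the disk has room for the index. The sketch below turns the two ratios quoted above (about 10x for small corpora, under 3x for larger ones) into a rough range. The README gives no threshold separating "small" from "large", so the code reports both figures rather than guessing one. The function names are hypothetical, and the numbers are only an estimate.

```python
import shutil

SMALL_CORPUS_FACTOR = 10  # "~10x" the raw text size, quoted for small corpora
LARGE_CORPUS_FACTOR = 3   # "less than 3x", quoted for larger corpora


def estimate_index_bytes(raw_bytes: int) -> tuple[int, int]:
    """Return a (large-corpus, small-corpus) estimate of the index size in bytes."""
    return raw_bytes * LARGE_CORPUS_FACTOR, raw_bytes * SMALL_CORPUS_FACTOR


def has_room_for_index(raw_bytes: int, index_dir: str = ".") -> bool:
    """Check free disk space against the pessimistic (10x) estimate."""
    _, worst_case = estimate_index_bytes(raw_bytes)
    return shutil.disk_usage(index_dir).free >= worst_case
```

For example, `has_room_for_index(os.path.getsize("corpus.txt"))` (after `import os`) checks the current directory against the 10x figure.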

For faster indexing, we recommend setting the indexing memory usage via --mem_size. For faster search, we also recommend setting the search memory usage via --mem_size_ex.

Note that a large mem_size_ex increases loading time, so we suggest using a lower value (e.g., 100) for small corpora.

$ uv run softmatcha-index --index corpus --mem_size=5000 --mem_size_ex=1000 corpus.txt

Here, 5,000 MB of memory is used for indexing and 1,000 MB for execution (search).
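
When indexing is driven from a script, the command can be assembled as an argument list. This is a minimal sketch that uses only the flags shown above (`--index`, `--mem_size`, `--mem_size_ex`). `build_index_command` is a hypothetical helper, not something shipped with SoftMatcha 2.

```python
import shlex
from typing import Optional


def build_index_command(
    index_dir: str,
    text_file: str,
    mem_size: Optional[int] = None,     # indexing memory in MB, as in the example above
    mem_size_ex: Optional[int] = None,  # search (execution) memory in MB
) -> list[str]:
    """Build the argv list for `uv run softmatcha-index`."""
    cmd = ["uv", "run", "softmatcha-index", "--index", index_dir]
    if mem_size is not None:
        cmd.append(f"--mem_size={mem_size}")
    if mem_size_ex is not None:
        cmd.append(f"--mem_size_ex={mem_size_ex}")
    cmd.append(text_file)
    return cmd


cmd = build_index_command("corpus", "corpus.txt", mem_size=5000, mem_size_ex=1000)
print(shlex.join(cmd))
# prints: uv run softmatcha-index --index corpus --mem_size=5000 --mem_size_ex=1000 corpus.txt
```

The list can be passed to `subprocess.run(cmd, check=True)` to launch indexing. Passing a list rather than a shell string avoids quoting problems with unusual file names.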

3. Search: `softmatcha-search`

Finally, you can search for phrases using the following command:

$ uv run softmatcha-search --index [index directory] [pattern]

Example:
$ uv run softmatcha-search --index corpus "olympics gold medalist"

To adjust the number of outputs, similarity thresholds, or max runtime, use the following options:

Example:
$ uv run softmatcha-search --index corpus --num_candidates=100 --min_similarity=0.2 --max_runtime=20 "olympics gold medalist"
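
To run several queries with the same settings, the search command can be generated per pattern. The sketch below accepts only the three options shown above and rejects anything else, since this README documents no other search flags. `build_search_command` is a hypothetical helper name.

```python
import shlex

# The only search options that appear in this README.
DOCUMENTED_OPTIONS = ("num_candidates", "min_similarity", "max_runtime")


def build_search_command(index_dir: str, pattern: str, **options) -> list[str]:
    """Build the argv list for `uv run softmatcha-search`."""
    unknown = set(options) - set(DOCUMENTED_OPTIONS)
    if unknown:
        raise ValueError(f"options not shown in this README: {sorted(unknown)}")
    cmd = ["uv", "run", "softmatcha-search", "--index", index_dir]
    # Emit flags in a fixed order so the generated commands are reproducible.
    cmd += [f"--{name}={options[name]}" for name in DOCUMENTED_OPTIONS if name in options]
    cmd.append(pattern)  # the pattern stays one argument, spaces included
    return cmd


for pattern in ["olympics gold medalist", "olympic silver medalist"]:
    print(shlex.join(build_search_command("corpus", pattern, max_runtime=20)))
```

Each list can be handed to `subprocess.run(..., capture_output=True, text=True)` to collect the result table for that pattern.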

4. Output Examples: `softmatcha-exact`

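You can also retrieve exact-match examples in keyword-in-context (KWIC) form, that is, each occurrence of the pattern shown with its surrounding text, using the following commands: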

$ uv run softmatcha-exact --index [index directory] [pattern]

Example:
$ uv run softmatcha-exact --index corpus "olympics gold medalist"
$ uv run softmatcha-exact --index corpus --display=20 --padding=200 "olympics gold medalist" # Output up to 20 examples with +/- 200 bytes context
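
To make the meaning of `--display` and `--padding` concrete, here is a small Python sketch of byte-padded KWIC extraction. It is a naive linear scan over an in-memory byte string, written only to illustrate the output shape. It is not how `softmatcha-exact` works internally (which uses the prebuilt index), and the function `kwic` is hypothetical.

```python
def kwic(data: bytes, pattern: str, display: int = 10, padding: int = 20):
    """Return up to `display` (left, match, right) triples with +/- `padding` bytes of context."""
    needle = pattern.encode("utf-8")
    results = []
    if not needle:
        return results  # an empty pattern would match everywhere
    start = 0
    while len(results) < display:
        pos = data.find(needle, start)
        if pos == -1:
            break
        end = pos + len(needle)
        left = data[max(0, pos - padding):pos]
        right = data[end:end + padding]
        # Padding is counted in bytes, so a slice may cut a multi-byte UTF-8
        # character in half; errors="ignore" drops such partial characters.
        results.append((
            left.decode("utf-8", errors="ignore"),
            pattern,
            right.decode("utf-8", errors="ignore"),
        ))
        start = pos + 1  # continue after this hit; overlapping hits are allowed
    return results


text = "she is an olympics gold medalist from Japan".encode("utf-8")
print(kwic(text, "gold", padding=9))
# prints: [('olympics ', 'gold', ' medalist')]
```

Because the padding is measured in bytes rather than characters, the visible context is shorter for text in scripts that use multi-byte UTF-8 encodings.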

Multilingual Support

To search in languages other than English, build an index by specifying the backend model:

$ uv run softmatcha-index --index corpus --backend=fasttext --model=fasttext-ja-vectors corpus.txt
$ uv run softmatcha-index --index corpus --backend=fasttext --model=fasttext-zh-vectors corpus.txt
$ uv run softmatcha-index --index corpus --backend=fasttext --model=fasttext-fr-vectors corpus.txt
$ uv run softmatcha-index --index corpus --backend=fasttext --model=fasttext-de-vectors corpus.txt
$ uv run softmatcha-index --index corpus --backend=fasttext --model=fasttext-it-vectors corpus.txt

Then, perform the search as follows:

$ uv run softmatcha-search --index corpus --backend=fasttext --model=fasttext-ja-vectors "金メダル"
$ uv run softmatcha-search --index corpus --backend=fasttext --model=fasttext-zh-vectors "中国"
$ uv run softmatcha-search --index corpus --backend=fasttext --model=fasttext-fr-vectors "France"
$ uv run softmatcha-search --index corpus --backend=fasttext --model=fasttext-de-vectors "Deutschland"
$ uv run softmatcha-search --index corpus --backend=fasttext --model=fasttext-it-vectors "Italia"
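
In the examples above, the model name follows the pattern `fasttext-<language code>-vectors`, and the same `--backend` and `--model` values are passed to both indexing and search. The sketch below keeps the two commands consistent by deriving them from one language code. It accepts only the five languages shown in this README. Other codes may well be supported, but that is not confirmed here, and the helper names are hypothetical.

```python
# Language codes that appear in the examples above.
README_LANGUAGES = ("ja", "zh", "fr", "de", "it")


def fasttext_flags(lang: str) -> list[str]:
    """Return the backend/model flags for one of the languages shown in the README."""
    if lang not in README_LANGUAGES:
        raise ValueError(f"{lang!r} is not among the README examples: {README_LANGUAGES}")
    return ["--backend=fasttext", f"--model=fasttext-{lang}-vectors"]


def index_and_search_commands(lang: str, index_dir: str, text_file: str, pattern: str):
    """Build matching index and search argv lists that share the same model flags."""
    flags = fasttext_flags(lang)
    index_cmd = ["uv", "run", "softmatcha-index", "--index", index_dir, *flags, text_file]
    search_cmd = ["uv", "run", "softmatcha-search", "--index", index_dir, *flags, pattern]
    return index_cmd, search_cmd


index_cmd, search_cmd = index_and_search_commands("ja", "corpus", "corpus.txt", "金メダル")
print(" ".join(index_cmd))
```

Deriving both commands from a single language code makes it harder to build an index with one model and then search it with another by accident.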

Citation

@article{yoneda-preprint-2026-softmatcha2,
  title         = "{SoftMatcha 2: A Fast and Soft Pattern Matcher for
                   Trillion-scale Corpora}",
  author        = "Yoneda, Masataka and Matsushita, Yusuke and Kamoda, Go and
                   Suenaga, Kohei and Akiba, Takuya and Waga, Masaki and Yokoi,
                   Sho",
  journal       = "arXiv [cs.CL]",
  month         =  "11~" # feb,
  year          =  2026,
  url           = "http://dx.doi.org/10.48550/arXiv.2602.10908",
  archivePrefix = "arXiv",
  primaryClass  = "cs.CL",
  doi           = "10.48550/arXiv.2602.10908"
}

License

This software is developed mainly by Masataka Yoneda and is published under the Apache License 2.0 (see LICENSE). It also contains portions derived from SoftMatcha v1, developed by Hiroyuki Deguchi, which are distributed under the MIT License (see LICENSE-MIT). In particular, within src/softmatcha, the following are based on SoftMatcha v1 and may include modifications made in this repository: tokenizers/, src/, utils/, registry.py, configs.py, functional.py, stopwatch.py, typing.py, and conftest.py.
