softmatcha/softmatcha2

About

SoftMatcha 2 is a tool for searching corpora of up to trillion scale. It finds not only exact matches but also similar matches in the corpus data. In our experiments on an AWS instance with more than 6TB of text data, the median latency was under 90 milliseconds and the p95 latency was under 300 milliseconds.

Example search output:

| Rank | Score | #Match | String
---------------------------------------------------
|    1 | 100.0 |    838 | olympics gold medalist
|    2 |  89.1 |    456 | olympics gold medallist
|    3 |  86.4 | 95,816 | olympic gold medalist
|    4 |  84.8 |    500 | olympics silver medalist
|    5 |  82.0 | 22,506 | olympic gold medallist
|    6 |  80.9 |     75 | olympics silver medallist
|    7 |  79.6 |  8,945 | olympic silver medalist
|    8 |  77.1 |  2,409 | olympic silver medallist
|    9 |  75.1 |    128 | olympics , gold medalist
|   10 |  73.1 |      2 | olympics . gold medalist
|   11 |  71.0 |     63 | olympics gold medalists
|   12 |  69.3 | 10,292 | olympic gold medalists
|   13 |  69.2 |      7 | olympics gold champion
|   14 |  69.1 |      1 | olympics gold olympic
|   15 |  68.9 |      7 | olympics silver medalists
|   16 |  67.7 |    160 | olympic gold champion
|   17 |  67.6 |     11 | olympic gold olympic
|   18 |  67.5 |    467 | olympic silver medalists
|   19 |  66.9 |     15 | olympics , gold medallist
|   20 |  65.6 |    196 | paralympics gold medalist
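
If you want to post-process results like the table above, the rows can be parsed into records. The following is a minimal sketch that assumes the output keeps the exact `| Rank | Score | #Match | String` layout shown here. `Hit` and `parse_hits` are hypothetical helper names and are not part of SoftMatcha 2.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One row of the result table (hypothetical record type)."""
    rank: int
    score: float
    matches: int
    text: str


def parse_hits(output: str) -> list[Hit]:
    """Parse rows such as '|    3 |  86.4 | 95,816 | olympic gold medalist'."""
    hits = []
    for line in output.splitlines():
        # Split on the first four bars only, so a '|' inside the matched
        # string would stay in the last field.
        parts = [p.strip() for p in line.split("|", 4)]
        # Skip the header, the dashed separator, and blank lines.
        if len(parts) != 5 or not parts[1].isdigit():
            continue
        hits.append(
            Hit(
                rank=int(parts[1]),
                score=float(parts[2]),
                matches=int(parts[3].replace(",", "")),  # drop thousands separators
                text=parts[4],
            )
        )
    return hits


if __name__ == "__main__":
    sample = (
        "| Rank | Score | #Match | String\n"
        "---------------------------------------------------\n"
        "|    1 | 100.0 |    838 | olympics gold medalist\n"
        "|    3 |  86.4 | 95,816 | olympic gold medalist\n"
    )
    for hit in parse_hits(sample):
        print(hit)
```

Once parsed, the records can be filtered or aggregated, for example by summing `matches` over all hits above a score cutoff.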

Quick Start

1. Compilation

The first step is to compile the program using the following commands:

# Only needed the first time you use SoftMatcha 2
$ rm -rf ~/gensim-data/glove-wiki-gigaword-300
$ rm -rf ~/.cache/huggingface/hub/models--facebook--fasttext* # If you use languages other than English

# install/compilation
$ curl -LsSf https://astral.sh/uv/install.sh | sh
$ echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
$ source ~/.bashrc
$ sudo apt-get update
$ sudo apt-get install -y build-essential python3.12-dev pkg-config libicu-dev
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
$ source ~/.cargo/env
$ uv sync
$ uv run maturin develop --release --manifest-path rust/Cargo.toml

2. Build Index: softmatcha-index

The next step is to build the index with the following command. The final index is typically about 10x the size of the raw text for small corpora, but less than 3x for larger corpora.

$ uv run softmatcha-index --index [index directory] [text file]

Example:
$ uv run softmatcha-index --index corpus corpus.txt

For faster indexing, we recommend setting the memory used for indexing via --mem_size. For faster search, we also recommend setting the memory used for search via --mem_size_ex.

Note that a large --mem_size_ex value increases loading time, so we suggest a lower value (e.g., 100) for small corpora.

$ uv run softmatcha-index --index corpus --mem_size=5000 --mem_size_ex=1000 corpus.txt

This uses 5,000 MB of memory for indexing and 1,000 MB of memory for search.
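
Before indexing a large file, it can help to check that the target disk has room for the index. The sketch below simply applies the two size ratios quoted at the start of this step (about 10x for small corpora, under 3x for larger ones). The function names are hypothetical, and since the README does not say where "small" ends and "large" begins, the check uses the pessimistic 10x figure.

```python
import os
import shutil

# Ratios taken from the text above; where the regime changes is not specified.
LARGE_CORPUS_FACTOR = 3   # "less than 3x for larger corpora"
SMALL_CORPUS_FACTOR = 10  # "~10x ... for small corpora"


def estimate_index_bytes(raw_bytes: int) -> tuple[int, int]:
    """Return (large-corpus estimate, small-corpus estimate) in bytes."""
    return raw_bytes * LARGE_CORPUS_FACTOR, raw_bytes * SMALL_CORPUS_FACTOR


def has_room_for_index(text_file: str, index_parent: str = ".") -> bool:
    """Check free disk space against the pessimistic (10x) estimate."""
    raw_bytes = os.path.getsize(text_file)
    _, pessimistic = estimate_index_bytes(raw_bytes)
    free_bytes = shutil.disk_usage(index_parent).free
    return free_bytes >= pessimistic


if __name__ == "__main__":
    low, high = estimate_index_bytes(2 * 1024**3)  # a 2 GiB text file
    print(f"expected index size: {low / 1024**3:.0f} to {high / 1024**3:.0f} GiB")
```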

3. Search: softmatcha-search

Finally, you can search for phrases using the following command:

$ uv run softmatcha-search --index [index directory] [pattern]

Example:
$ uv run softmatcha-search --index corpus "olympics gold medalist"

To adjust the number of outputs, the similarity threshold, or the maximum runtime, use the --num_candidates, --min_similarity, and --max_runtime options, respectively:

Example:
$ uv run softmatcha-search --index corpus --num_candidates=100 --min_similarity=0.2 --max_runtime=20 "olympics gold medalist"
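
To call the search from a script, one option is to build the same command line and run it as a subprocess. This is a sketch only: `search_argv` and `run_search` are hypothetical wrappers, they use only the flags shown above, and they assume `uv` is on the PATH and that the script runs from the repository directory.

```python
import subprocess


def search_argv(index_dir, pattern, num_candidates=None,
                min_similarity=None, max_runtime=None):
    """Build the argument list for the softmatcha-search command shown above."""
    argv = ["uv", "run", "softmatcha-search", "--index", index_dir]
    # Only pass options that were set, so the tool's own defaults apply otherwise.
    if num_candidates is not None:
        argv.append(f"--num_candidates={num_candidates}")
    if min_similarity is not None:
        argv.append(f"--min_similarity={min_similarity}")
    if max_runtime is not None:
        argv.append(f"--max_runtime={max_runtime}")
    argv.append(pattern)  # passed as one argument, so no shell quoting is needed
    return argv


def run_search(index_dir, pattern, **options):
    """Run the search and return its standard output as text."""
    result = subprocess.run(
        search_argv(index_dir, pattern, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


if __name__ == "__main__":
    print(search_argv("corpus", "olympics gold medalist",
                      num_candidates=100, min_similarity=0.2, max_runtime=20))
```

Passing the arguments as a list avoids shell quoting problems with patterns that contain spaces or punctuation.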

4. Output Examples: softmatcha-exact

You can also retrieve exact-match examples in context (KWIC, keyword in context) with the following commands:

$ uv run softmatcha-exact --index [index directory] [pattern]

Example:
$ uv run softmatcha-exact --index corpus "olympics gold medalist"
$ uv run softmatcha-exact --index corpus --display=20 --padding=200 "olympics gold medalist" # Output up to 20 examples with +/- 200 bytes context
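
To illustrate what a KWIC listing with a byte-based context window looks like, here is a naive sketch. It scans raw bytes linearly and is not how softmatcha-exact works internally, since that tool uses the prebuilt index. The parameter names mirror --display and --padding, but the default values here are arbitrary.

```python
def kwic(corpus: bytes, pattern: bytes, display: int = 10, padding: int = 50) -> list[str]:
    """Return up to `display` snippets with +/- `padding` bytes of context."""
    snippets = []
    start = corpus.find(pattern)
    while start != -1 and len(snippets) < display:
        lo = max(0, start - padding)
        hi = min(len(corpus), start + len(pattern) + padding)
        # A byte window can cut a multi-byte UTF-8 character in half,
        # so decode leniently instead of raising an error.
        snippets.append(corpus[lo:hi].decode("utf-8", errors="replace"))
        start = corpus.find(pattern, start + 1)
    return snippets


if __name__ == "__main__":
    text = "she is an olympics gold medalist from japan".encode("utf-8")
    for snippet in kwic(text, b"gold", padding=4):
        print(snippet)
```

Because the padding is counted in bytes rather than characters, the visible context is shorter for scripts that use several bytes per character.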

Multilingual Support

To search in languages other than English, build an index by specifying the backend model:

$ uv run softmatcha-index --index corpus --backend=fasttext --model=fasttext-ja-vectors corpus.txt
$ uv run softmatcha-index --index corpus --backend=fasttext --model=fasttext-zh-vectors corpus.txt
$ uv run softmatcha-index --index corpus --backend=fasttext --model=fasttext-fr-vectors corpus.txt
$ uv run softmatcha-index --index corpus --backend=fasttext --model=fasttext-de-vectors corpus.txt
$ uv run softmatcha-index --index corpus --backend=fasttext --model=fasttext-it-vectors corpus.txt

Then, perform the search with the same backend and model options:

$ uv run softmatcha-search --index corpus --backend=fasttext --model=fasttext-ja-vectors "金メダル"
$ uv run softmatcha-search --index corpus --backend=fasttext --model=fasttext-zh-vectors "中国"
$ uv run softmatcha-search --index corpus --backend=fasttext --model=fasttext-fr-vectors "France"
$ uv run softmatcha-search --index corpus --backend=fasttext --model=fasttext-de-vectors "Deutschland"
$ uv run softmatcha-search --index corpus --backend=fasttext --model=fasttext-it-vectors "Italia"
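
Since the index and search commands above repeat the same --backend and --model options, a small helper can keep the two in sync. This sketch covers only the five models listed above. `FASTTEXT_MODELS` and `multilingual_commands` are hypothetical names, and other language codes are deliberately not guessed.

```python
# Models that appear in the commands above, keyed by language code.
FASTTEXT_MODELS = {
    "ja": "fasttext-ja-vectors",
    "zh": "fasttext-zh-vectors",
    "fr": "fasttext-fr-vectors",
    "de": "fasttext-de-vectors",
    "it": "fasttext-it-vectors",
}


def multilingual_commands(lang, index_dir, text_file, pattern):
    """Return (index command, search command) that share one backend and model."""
    model = FASTTEXT_MODELS[lang]  # raises KeyError for languages not listed above
    shared = ["--index", index_dir, "--backend=fasttext", f"--model={model}"]
    index_cmd = ["uv", "run", "softmatcha-index", *shared, text_file]
    search_cmd = ["uv", "run", "softmatcha-search", *shared, pattern]
    return index_cmd, search_cmd


if __name__ == "__main__":
    index_cmd, search_cmd = multilingual_commands("ja", "corpus", "corpus.txt", "金メダル")
    print(" ".join(index_cmd))
    print(" ".join(search_cmd))
```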

Citation

@article{yoneda-preprint-2026-softmatcha2,
  title         = "{SoftMatcha 2: A Fast and Soft Pattern Matcher for
                   Trillion-scale Corpora}",
  author        = "Yoneda, Masataka and Matsushita, Yusuke and Kamoda, Go and
                   Suenaga, Kohei and Akiba, Takuya and Waga, Masaki and Yokoi,
                   Sho",
  journal       = "arXiv [cs.CL]",
  month         =  "11~" # feb,
  year          =  2026,
  url           = "http://dx.doi.org/10.48550/arXiv.2602.10908",
  archivePrefix = "arXiv",
  primaryClass  = "cs.CL",
  doi           = "10.48550/arXiv.2602.10908"
}

License

This software is mainly developed by Masataka Yoneda and published under the Apache License 2.0 (see LICENSE). It also contains portions derived from SoftMatcha v1, developed by Hiroyuki Deguchi, which are distributed under the MIT License (see LICENSE-MIT). In particular, within src/softmatcha, the following are based on SoftMatcha v1 and may include modifications made in this repository: tokenizers/, src/, utils/, registry.py, configs.py, functional.py, stopwatch.py, typing.py, and conftest.py.

