A soft and fast pattern matcher for billion-scale corpora.
You can install via PyPI:
pip install softmatcha
For development purposes, you can install from source via uv:
git clone https://github.com/softmatcha/softmatcha.git
cd softmatcha/
uv sync
or pip:
git clone https://github.com/softmatcha/softmatcha.git
cd softmatcha/
pip install -e ./
Before running pip install, you need to set up the required libraries and environment variables:
brew install pkg-config icu4c
export CFLAGS="-std=c++11"
export PATH="$(brew --prefix)/opt/icu4c/bin:$(brew --prefix)/opt/icu4c/sbin:$PATH"
export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:$(brew --prefix)/opt/icu4c/lib/pkgconfig"
pip install softmatcha
SoftMatcha implements two search types: scan and index.
- Scan: searches texts without indexing or any other preprocessing, like grep; useful for small corpora.
- Index: searches texts with a prebuilt index; works effectively on billion-scale corpora.
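The difference between the two search types can be illustrated with a minimal Python sketch. It is not SoftMatcha's implementation: it uses exact token matching only, and the whitespace tokenizer and the function names (`scan`, `build_index`, `search`) are assumptions made for illustration.

```python
from collections import defaultdict


def tokenize(line):
    # Assumption: lowercase whitespace tokenization; the real tokenizer may differ.
    return line.lower().split()


def scan(lines, pattern):
    """Scan: inspect every line at query time, with no preprocessing."""
    pat = tokenize(pattern)
    if not pat:
        return []
    hits = []
    for lineno, line in enumerate(lines):
        toks = tokenize(line)
        # Slide a window of len(pat) tokens over the line.
        if any(toks[i:i + len(pat)] == pat for i in range(len(toks) - len(pat) + 1)):
            hits.append(lineno)
    return hits


def build_index(lines):
    """Index: record once where each token occurs, as token -> {(line, position)}."""
    index = defaultdict(set)
    for lineno, line in enumerate(lines):
        for pos, tok in enumerate(tokenize(line)):
            index[tok].add((lineno, pos))
    return index


def search(index, pattern):
    """Answer a query from the index alone, without rereading the corpus."""
    pat = tokenize(pattern)
    if not pat:
        return []
    # Start from occurrences of the first token, then keep only those
    # followed by the remaining tokens at consecutive positions.
    candidates = set(index.get(pat[0], set()))
    for offset, tok in enumerate(pat[1:], start=1):
        postings = index.get(tok, set())
        candidates = {(l, p) for (l, p) in candidates if (l, p + offset) in postings}
    return sorted({l for l, _ in candidates})


corpus = [
    "The jazz musician played all night",
    "A jazz club hired the musician",
    "She met the jazz musician",
]
print(scan(corpus, "the jazz musician"))
print(search(build_index(corpus), "the jazz musician"))
```

Scanning costs time proportional to the corpus size on every query, while the index pays that cost once and then touches only the postings of the query tokens, which is why indexing suits large corpora.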
softmatcha-grep searches corpora without indexing:
$ softmatcha-grep "the jazz musician" corpus.txt
The first argument is the pattern string and the second is the file or files to be searched.
Run softmatcha-grep -h to see the other arguments.
softmatcha-index builds a search index from corpora:
$ softmatcha-index --index corpus.idx corpus.txt
softmatcha-search quickly searches patterns with a search index:
$ softmatcha-search --index corpus.idx "the jazz musician"
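When the two steps are driven from a script, the commands can be assembled as argument lists. The sketch below only builds and prints the command lines shown above; the helper names `index_command` and `search_command` are hypothetical, and it uses no flags beyond those documented here.

```python
import shlex


def index_command(index_path, corpus_path):
    # Mirrors: softmatcha-index --index corpus.idx corpus.txt
    return ["softmatcha-index", "--index", index_path, corpus_path]


def search_command(index_path, pattern):
    # Mirrors: softmatcha-search --index corpus.idx "the jazz musician"
    return ["softmatcha-search", "--index", index_path, pattern]


commands = [
    index_command("corpus.idx", "corpus.txt"),
    search_command("corpus.idx", "the jazz musician"),
]
for cmd in commands:
    # shlex.join quotes the pattern so the printed line is safe to paste into a shell.
    print(shlex.join(cmd))
```

To execute the commands rather than print them, each list can be passed to `subprocess.run(cmd, check=True)`, which requires softmatcha to be installed. Passing a list avoids shell quoting problems with multi-word patterns.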
For development purposes,
- --profile=true measures the execution time.
- --log outputs verbose information.
For searchers,
- --backend {gensim,fasttext,transformers}: Backend framework for embeddings.
- --model <NAME>: Name of word embeddings.
- --threshold specifies the threshold for soft matching.
For controlling outputs,
- -n, --line_number prints the line number with each output line.
- -o, --only_matching outputs only the matched patterns.
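To give a sense of what a soft-matching threshold controls, here is a minimal Python sketch. It assumes that two words match when the cosine similarity of their embeddings is at least the threshold. The three-dimensional vectors are invented toy values, the exact-match fallback for unknown words is an assumption, and the function names are hypothetical; none of this is taken from SoftMatcha's code.

```python
import math

# Toy embeddings, invented for illustration only.
EMBEDDINGS = {
    "the": (1.0, 0.0, 0.0),
    "jazz": (0.0, 1.0, 0.0),
    "blues": (0.0, 0.9, 0.1),
    "musician": (0.0, 0.0, 1.0),
    "singer": (0.0, 0.2, 0.9),
    "table": (0.5, 0.5, 0.5),
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similar(a, b, threshold):
    # Assumption: words without an embedding fall back to exact comparison.
    if a not in EMBEDDINGS or b not in EMBEDDINGS:
        return a == b
    return cosine(EMBEDDINGS[a], EMBEDDINGS[b]) >= threshold


def soft_match(pattern, text, threshold):
    """Return every token window in text whose words all soft-match the pattern."""
    pat = pattern.lower().split()
    toks = text.lower().split()
    if not pat:
        return []
    spans = []
    for start in range(len(toks) - len(pat) + 1):
        window = toks[start:start + len(pat)]
        if all(similar(p, t, threshold) for p, t in zip(pat, window)):
            spans.append(" ".join(window))
    return spans


text = "the blues singer sat at the table"
print(soft_match("the jazz musician", text, threshold=0.9))
print(soft_match("the jazz musician", text, threshold=0.99))
```

In this sketch a lower threshold admits looser paraphrases of the pattern, while a threshold close to 1 approaches exact matching.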
- gensim
- fastText
- transformers (embedding layers)
- Naive search: --search naive
- Quick search (default): --search quick
- Inverted index search
If you use this software, please cite:
@inproceedings{
deguchi-iclr-2025-softmatcha,
title={SoftMatcha: A Soft and Fast Pattern Matcher for Billion-Scale Corpus Searches},
author={Deguchi, Hiroyuki and Kamoda, Go and Matsushita, Yusuke and Taguchi, Chihiro and Waga, Masaki and Suenaga, Kohei and Yokoi, Sho},
booktitle={The Thirteenth International Conference on Learning Representations (ICLR 2025)},
year={2025},
url={https://openreview.net/forum?id=Q6PAnqYVpo}
}
This software is mainly developed by Hiroyuki Deguchi and published under the MIT license.