softmatcha/softmatcha

A soft and fast pattern matcher for billion-scale corpora.

Paper | Website & Demo | Citation

Installation

You can install from PyPI:

pip install softmatcha

For development, you can install from source with uv:

git clone https://github.com/softmatcha/softmatcha.git
cd softmatcha/
uv sync

or pip:

git clone https://github.com/softmatcha/softmatcha.git
cd softmatcha/
pip install -e ./

macOS

Before running pip install, you need to set up the required libraries and environment variables:

brew install pkg-config icu4c
export CFLAGS="-std=c++11"
export PATH="$(brew --prefix)/opt/icu4c/bin:$(brew --prefix)/opt/icu4c/sbin:$PATH"
export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:$(brew --prefix)/opt/icu4c/lib/pkgconfig"
pip install softmatcha

Quick start

SoftMatcha implements two search types: scan and index.

  • Scan: searches texts directly, like grep, without building an index or doing any other preprocessing. This is useful for small corpora.
  • Index: searches texts through a prebuilt index, which works efficiently on billion-scale corpora.

Scan: softmatcha-grep

softmatcha-grep searches corpora without indexing:

$ softmatcha-grep "the jazz musician" corpus.txt

The first argument is the pattern string and the second is the file or files to search. Run softmatcha-grep -h to see the other arguments.

Index: softmatcha-index and softmatcha-search

softmatcha-index builds a search index from corpora:

$ softmatcha-index --index corpus.idx corpus.txt

softmatcha-search quickly searches patterns with a search index:

$ softmatcha-search --index corpus.idx "the jazz musician"
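To run the two steps above from a script, a thin wrapper around the command-line tools may be enough. The following is only a sketch: it assumes the softmatcha-index and softmatcha-search commands are on PATH, and the helper names (index_command, search_command, build_and_search) are hypothetical, not part of SoftMatcha. It uses only the flags shown above.

```python
import subprocess


def index_command(index_path, corpus_paths):
    """Build the argv for `softmatcha-index --index <index> <files...>`."""
    return ["softmatcha-index", "--index", index_path, *corpus_paths]


def search_command(index_path, pattern):
    """Build the argv for `softmatcha-search --index <index> <pattern>`."""
    return ["softmatcha-search", "--index", index_path, pattern]


def build_and_search(index_path, corpus_paths, pattern):
    """Build the index once, then search it and return the raw stdout.

    Assumption: both commands are installed and exit non-zero on failure.
    """
    subprocess.run(index_command(index_path, corpus_paths), check=True)
    result = subprocess.run(
        search_command(index_path, pattern),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Passing the pattern as a single list element avoids shell quoting issues for multi-word patterns such as "the jazz musician".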

Options

For development purposes,

  • --profile=true measures the execution time.
  • --log outputs verbose information.

For searchers,

  • --backend {gensim,fasttext,transformers}: Backend framework for embeddings.
  • --model <NAME>: Name of the word embedding model.
  • --threshold: Similarity threshold for soft matching.
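To give an intuition for how a threshold interacts with word embeddings, here is a minimal sketch of threshold-based soft matching. It is not SoftMatcha's implementation: it assumes cosine similarity as the similarity measure, and the tiny embedding table and the names TOY_EMBEDDINGS, similar, and soft_match are made up for illustration.

```python
import math

# Hypothetical toy embeddings. A real run would load vectors from the
# backend and model selected with --backend and --model.
TOY_EMBEDDINGS = {
    "the": [1.0, 0.0, 0.0],
    "jazz": [0.0, 1.0, 0.2],
    "blues": [0.0, 0.9, 0.4],
    "musician": [0.1, 0.2, 1.0],
    "singer": [0.2, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similar(a, b, threshold):
    """Identical tokens always match; unknown tokens never soft-match."""
    if a == b:
        return True
    if a not in TOY_EMBEDDINGS or b not in TOY_EMBEDDINGS:
        return False
    return cosine(TOY_EMBEDDINGS[a], TOY_EMBEDDINGS[b]) >= threshold


def soft_match(pattern, text, threshold):
    """Return token offsets where every pattern token is similar enough."""
    p, t = pattern.split(), text.split()
    return [
        start
        for start in range(len(t) - len(p) + 1)
        if all(similar(p[k], t[start + k], threshold) for k in range(len(p)))
    ]


text = "the blues singer met the jazz musician"
print(soft_match("the jazz musician", text, 0.9))  # soft and exact hits
print(soft_match("the jazz musician", text, 1.0))  # exact hit only
```

In this toy setting, lowering the threshold admits semantically close phrases such as "the blues singer", while a threshold of 1.0 reduces the search to exact matching.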

For controlling outputs,

  • -n, --line_number prints the line number with each output line.
  • -o, --only_matching outputs only matched patterns.

List of implementations

Embeddings

Searchers

Scan: softmatcha-grep

  • Naive search: --search naive
  • Quick search (default): --search quick

Index: softmatcha-index and softmatcha-search

  • Inverted index search
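For readers unfamiliar with the idea, the sketch below shows a generic inverted index for exact phrase lookup. It illustrates the data structure only and makes no claim about SoftMatcha's on-disk index format or its soft-matching extension; build_index and search are hypothetical names.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each token to the set of positions where it occurs."""
    index = defaultdict(set)
    for pos, tok in enumerate(tokens):
        index[tok].add(pos)
    return index


def search(index, pattern_tokens):
    """Return sorted start positions where the pattern occurs as a phrase."""
    if not pattern_tokens:
        return []
    # Start from the positions of the first token, then keep only those
    # starts whose following positions hold the remaining pattern tokens.
    candidates = set(index.get(pattern_tokens[0], ()))
    for offset, tok in enumerate(pattern_tokens[1:], start=1):
        postings = index.get(tok, set())
        candidates = {s for s in candidates if s + offset in postings}
    return sorted(candidates)


corpus = "the jazz musician met the jazz singer".split()
idx = build_index(corpus)
print(search(idx, ["the", "jazz"]))       # [0, 4]
print(search(idx, ["jazz", "musician"]))  # [1]
```

The index is built once in a single pass over the corpus. Each query then touches only the posting sets of the pattern tokens instead of rescanning the text, which is why index-based search suits large corpora.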

Citation

If you use this software, please cite:

@inproceedings{
  deguchi-iclr-2025-softmatcha,
  title={SoftMatcha: A Soft and Fast Pattern Matcher for Billion-Scale Corpus Searches},
  author={Deguchi, Hiroyuki and Kamoda, Go and Matsushita, Yusuke and Taguchi, Chihiro and Waga, Masaki and Suenaga, Kohei and Yokoi, Sho},
  booktitle={The Thirteenth International Conference on Learning Representations (ICLR 2025)},
  year={2025},
  url={https://openreview.net/forum?id=Q6PAnqYVpo}
}

License

This software is mainly developed by Hiroyuki Deguchi and published under the MIT license.
